View Full Version : Boitho - distributed crawler



vaughan
01-03-2006, 06:27 AM
Read about this project here (http://www.boitho.com/dc/)

Join AMD Users when you install the application and create your username / handle.

NeoGen
01-03-2006, 06:56 AM
This is similar to Majestic-12, but I'm still trying to figure out the crawler's options.
I think I did something wrong here because the system got very slow for a while, even though Boitho was using almost no CPU. After shutting Boitho down everything started working normally again.

Evil-Dragon
01-03-2006, 07:12 AM
I took a look at this before it was announced on distributedcomputing.info. IMO it's a bit crap, lol. All it did for me was get robots.txt. Not exactly exciting, is it...

Btw we lost a team member to x grubbers yesterday :(

NeoGen
01-03-2006, 03:48 PM
Does anyone know how often the stats are updated?

Lagu
01-03-2006, 04:15 PM
Hi

So sad we lose Peyoti. I wonder why he changed teams. If people keep disappearing we will have a hard time competing against other teams. Let's hope it is the last time such a thing happens.

Lagu :shock:

Lagu
01-03-2006, 04:18 PM
Hi

I have downloaded the agent but haven't executed it yet. What is the difference between MJ-12 and Boitho?

Lagu :)

Ototero
01-03-2006, 05:02 PM
Lagu,

Ultimately, I think they are both much the same: crawling and/or validating the web.

NeoGen
01-03-2006, 06:21 PM
Boitho makes thumbnail images of the web pages it crawls to show alongside the search engine results.
I guess that translates into more CPU usage than MJ-12. On the other hand, it seems to crawl much more slowly than MJ-12.

NeoGen
01-04-2006, 10:48 AM
There's something in Boitho that makes my system almost crawl to a halt when it's running, but I can't pin it down to anything in particular.
The main processes have normal priority and get almost no CPU, but I'm starting to suspect that the crawler threads are running at above-normal or even high priority. Those aren't shown in Task Manager, though.

I also don't know why I don't have any points. I know I haven't run it much, but I should have at least something there.

runarb
01-06-2006, 10:34 PM
Hi

I am one of the people behind Boitho.


NeoGen, sorry to hear about your problem with the crawler.

In the folder where you installed the Boitho client there should be a file called "ErrorLog.txt". This is a log of all errors. Can you send it to runarb [at] boitho dot com so I can take a look?

The threads run at idle priority. The crawler uses two processes, BGui.exe and BCrawler.exe; both main threads run at normal priority. BCrawler.exe is responsible for the crawling and creates new threads with idle priority as necessary.

The crawler isn't really suitable for running alongside other work while you are using the computer. By default it is configured to only run downloads when the computer hasn't been used for more than 5 minutes. See Tools -> Options -> Crawling Mode.
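As a rough illustration of the idle-priority idea (a generic sketch, not Boitho's actual code), a worker thread on Windows can drop itself to idle priority so it only gets CPU cycles that foreground work doesn't want:

# Generic sketch of a crawler worker dropping itself to idle priority on Windows.
# This only illustrates the concept; it is not code from the Boitho client.
import ctypes
import threading
import time

THREAD_PRIORITY_IDLE = -15  # Win32 constant

def crawl_worker():
    kernel32 = ctypes.windll.kernel32
    # GetCurrentThread() returns a pseudo-handle for the calling thread.
    kernel32.SetThreadPriority(kernel32.GetCurrentThread(), THREAD_PRIORITY_IDLE)
    while True:
        # ...fetch and process URLs here...
        time.sleep(1)

threading.Thread(target=crawl_worker, daemon=True).start()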




Does anyone know how often the stats are updated?

The statistics are live, and they are updated every time your client sends us the pages it has crawled. Pages are sent in when you have crawled 500 of them.

The graphs are updated every 5 minutes.



Boitho makes thumbnail images of the web pages it crawls to show alongside the search engine results. I guess that translates into more CPU usage than MJ-12.

To make the thumbnail it also has to download all the images from the page, not only the HTML. The bandwidth and CPU needed to make a thumbnail of a web page are about 10 times the resources needed just to download the HTML.



All it did for me was get robots.txt. Not exactly exciting, is it...

From time to time we do end up crawling a lot of robots.txt files.

Boitho caches the robots.txt files locally, so we have to fetch a site's robots.txt before we can issue one of its URLs to a node for crawling.

Around the 13th of December we added a lot of new pages from domains we hadn't crawled before (from the .com, .net and .edu lists from VeriSign). Because of this, if you look at the crawler statistics page you can see that we mostly crawled robots.txt files from the 13th to the 21st of December: http://dcsetup.boitho.com/cgi-bin/dc/topCrawlers.cgi
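The general pattern being described looks roughly like this; a minimal sketch with hypothetical names (the "BoithoBot" user agent is made up), not the project's actual server code: cache each host's robots.txt once, then consult it before issuing one of that host's URLs to a node.

# Hypothetical illustration only: robots.txt caching before dispatching URLs.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robots_cache = {}  # host -> RobotFileParser

def can_issue(url, user_agent="BoithoBot"):
    host = urlparse(url).netloc
    parser = _robots_cache.get(host)
    if parser is None:
        parser = RobotFileParser(f"http://{host}/robots.txt")
        parser.read()  # the extra robots.txt fetch that shows up in the stats
        _robots_cache[host] = parser
    return parser.can_fetch(user_agent, url)

# e.g. only hand a URL to a crawler node if can_issue(url) returns True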

NeoGen
01-06-2006, 11:49 PM
Well, the email with the log file has been sent. In the meantime I've tried another tactic:
I uninstalled and reinstalled Boitho. It seems to have started working again, but slowly. After almost 15 minutes it only shows 2 URLs OK and everything else is 0.

EDIT:
I've got it set to run a few crawlers even when I'm at the computer. I'll try shutting down PSP Sieve and see if that's why Boitho is running so slowly.

NeoGen
01-06-2006, 11:58 PM
Damn... it instantly sprang into action!

So Proth Sieve was stealing all the CPU cycles for itself. :?
Lowering Proth Sieve's thread priority (through the command-line options) solved the problem.
Now PSP Sieve and Boitho work happily together! :)

Now to see if my stats update correctly at the site...

PCZ
01-07-2006, 04:45 AM
http://stats.free-dc.org/new/projpage.php?proj=boi

NeoGen
01-07-2006, 04:54 AM
Cool! Thanks PCZ!
My stats are now showing up correctly across all the stats pages :)

The first few thousand were lost, but after the reinstall and sorting out the priority problem, it's working like a charm!

NeoGen
01-07-2006, 05:05 AM
I see that Bok didn't like the way Boitho's stats were sorted and did a little sorting of his own... and that cost Free-DC 3 ranks :lol:

PCZ
01-07-2006, 05:33 AM
Yes, he needs to sort that out.
URLs have a different weight in the stats than robots.txt, and Bok doesn't know the formula.

Hopefully someone from the project can shed some light on this so the stats can be more accurate.
Trouble is they don't have a forum, and info is hard to come by.

NeoGen
01-07-2006, 05:44 AM
They do have a forum. I don't remember anymore where I found out about it, but I've been there.
http://www.boitho.com/forum/

And a blog too:
http://www.boitho.com/blog/

But isn't the formula simply the sum of the pages and the robots.txt? I got that idea when I saw their stats; the current numbers seem to match this theory.

PCZ
01-07-2006, 06:08 AM
It seems you can't just add the URLs and robots.txt together.

NeoGen
01-07-2006, 06:24 AM
I'm pretty sure it's either a straight sum or both get the same weight in the result (50/50).
With some quick calculations in Excel here, I saw that shifting the weights to 60/40 towards either component is enough to make the sorting go wrong (it results in a slightly different table from the one on the site).
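For what it's worth, the same sanity check is easy to do in code; a tiny hypothetical sketch (the team names and counts below are invented, not real Boitho stats) comparing the orderings produced by different weightings:

# Does a given weighting reproduce the site's ranking? The counts are invented.
teams = {
    "Team A": (10000, 1000),   # (crawled pages, robots.txt)
    "Team B": (7000, 5000),
    "Team C": (13000, 0),
}

def ranking(w_pages, w_robots):
    score = lambda t: w_pages * teams[t][0] + w_robots * teams[t][1]
    return sorted(teams, key=score, reverse=True)

print(ranking(0.5, 0.5))  # ['Team C', 'Team B', 'Team A']
print(ranking(0.6, 0.4))  # ['Team C', 'Team A', 'Team B'] -- the order changes

If two weightings give different orderings, at most one of them can match the table on the site.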

runarb
01-07-2006, 09:50 AM
The stats formula is as follows:

Rank = crawledpages + (robots.txt / 6)

But one should probably use the "Place" column rather than generating one's own, so that we can change this later.
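A quick worked example of that formula (with made-up counts, just to show the 1/6 weighting on robots.txt):

# The rank formula as stated above, applied to invented example counts.
def rank(crawled_pages, robots_txt):
    return crawled_pages + robots_txt / 6

print(rank(10000, 3000))  # 10500.0 -- each robots.txt fetch counts for 1/6 of a page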

Ototero
01-07-2006, 10:34 AM
Thanks runarb, I'll change my stats now; they (mine, that is) have only just started, I should add.

NeoGen
01-07-2006, 12:17 PM
Ah, so the crawled pages get full weight but the robots.txt files don't. Geez, that's so logical and it didn't even cross my mind! I was splitting the total weight between the two components; no wonder it wasn't working. :P