View Full Version : Boitho - distributed crawler
01-03-2006, 06:27 AM
Read about this project here (http://www.boitho.com/dc/)
Join AMD Users when you install the application and create your username / handle.
01-03-2006, 06:56 AM
This is similar to Majestic-12, but I'm still trying to figure out the options of the crawler.
I think I did something wrong here, because the system got very slow for a while even though Boitho was barely using any CPU. After shutting down Boitho everything started working normally again.
01-03-2006, 07:12 AM
I took a look at this before it was announced on distributedcomputing.info. IMO it's a bit crap, lol. All it did for me was get robots.txt. Not exactly exciting, is it...
Btw we lost a team member to x grubbers yesterday :(
01-03-2006, 03:48 PM
Does anyone know how often the stats are updated?
So sad we lost Peyoti. I wonder why he changed teams. If people keep disappearing we will have a hard time competing against the other teams. Let's hope it's the last time such a thing happens.
I have downloaded the agent but haven't executed it yet. What is the difference between MJ-12 and Boitho?
01-03-2006, 05:02 PM
Ultimately, I think they are both the same: crawling and/or validating the web.
01-03-2006, 06:21 PM
Boitho makes thumbnail images of the web pages it crawls to show alongside the search engine results.
I guess that translates into more CPU usage than MJ-12. But on the other hand, it seems to crawl much slower than MJ-12.
01-04-2006, 10:48 AM
There's something in Boitho that makes my system almost grind to a halt when it's running, but I can't pinpoint anything in particular.
The main processes have normal priority and get almost no CPU, but I'm starting to suspect that the crawler threads are running at above-normal or even high priority. Those aren't shown in the task manager, though.
And I also don't know why I don't have any points. I know I haven't run it much, but I should at least have something there.
01-06-2006, 10:34 PM
I am one of the people behind Boitho.
NeoGen, sorry to hear about your problem with the crawler.
In the folder where you installed the Boitho client there should be a file called "ErrorLog.txt". This is a log of all errors. Can you send it to runarb [at] boitho dot com, so I can take a look?
The threads run at idle priority. The crawler uses two processes, BGui.exe and BCrawler.exe; both main threads run at normal priority. BCrawler.exe is responsible for the crawling, and creates new threads at idle priority as necessary.
The crawler isn't really suitable for running alongside normal use of the computer. By default it is configured to only download when the computer hasn't been used for more than 5 minutes. See Tools -> Options -> Crawling Mode.
Does anyone know how often the stats are updated?
The statistics are live, and are updated every time your client sends us the pages it has crawled. Pages are sent in once you have crawled 500.
The graphs are updated every 5 minutes.
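The batching behaviour described above (pages submitted in groups of 500, stats updated on each submission) can be sketched roughly like this. All names here are assumptions for illustration, not the actual Boitho client code:

```python
BATCH_SIZE = 500  # pages per submission, per the post above

class PageBatcher:
    """Accumulate crawled pages locally, flushing a batch to the
    server once BATCH_SIZE pages have been collected (hypothetical)."""

    def __init__(self, submit):
        self.submit = submit   # callback that sends one batch to the server
        self.pending = []      # pages crawled but not yet submitted

    def add(self, url):
        self.pending.append(url)
        if len(self.pending) >= BATCH_SIZE:
            self.submit(self.pending)  # stats update happens server-side here
            self.pending = []
```

Under this scheme, stats lag by at most one partial batch: anything still in `pending` hasn't been counted yet.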
Boitho makes thumbnail images of the web pages it crawls to show alongside the search engine results. I guess that translates into more CPU usage than MJ-12.
It also has to download all the images from the site, not only the HTML, to make the thumbnail. The bandwidth and CPU needed to make a thumbnail of a web page are about 10 times the resources needed just to download the HTML.
All it did for me was get robots.txt. Not exactly exciting is it...
From time to time we do end up crawling a lot of robots.txt pages.
Boitho caches the robots.txt files locally, and therefore has to fetch a host's robots.txt before it can issue a URL from that host to a node for crawling.
Around December 13 we added a lot of new pages from domains we hadn't crawled before (from the .com, .net and .edu lists from Verisign). Because of this, if one looks at the crawler statistics page one can see we mostly crawled robots.txt pages from December 13 to 21: http://dcsetup.boitho.com/cgi-bin/dc/topCrawlers.cgi
01-06-2006, 11:49 PM
Well, email with log file is sent. And now I've tried another tactic.
Uninstalled and reinstalled Boitho. It seems to have started working again, but slowly. After almost 15 minutes it only shows 2 URLs OK and everything else at 0.
I've got it set to run a few crawlers even when I'm at the computer. I'll try shutting down PSP Sieve and see if that's why Boitho is running so slowly.
01-06-2006, 11:58 PM
Damn... it instantly started working!
So Proth Sieve was stealing all the CPU cycles for itself. :?
Lowering Proth Sieve's thread priority (through its command-line options) solved the problem.
Now PSP Sieve and boitho work happily together! :)
Now to see if my stats update correctly at the site...
01-07-2006, 04:54 AM
Cool! Thanks PCZ!
My stats are now working well across all stats pages :)
The first few thousands were lost, but after the reinstall and solving the priorities problem, it's working like a charm!
01-07-2006, 05:05 AM
I see that Bok didn't like the way Boitho's stats were sorted and did a little sorting of his own... and that cost Free-DC 3 ranks :lol:
Yes he needs to sort that out.
URLs have a different weight in the stats than robots.txt fetches, and Bok doesn't know the formula.
Hopefully someone from the project can shed some light on this and the stats can be more accurate.
The trouble is they don't have a forum, and info is hard to come by.
01-07-2006, 05:44 AM
They do have a forum. I don't remember anymore where I found out about it, but I've been there.
And a blog too:
But isn't the formula simply the sum of the pages and the robots.txt counts? I got that idea when I saw their stats; the current numbers seem to match this theory.
It seems you can't just add the URLs and robots.txt counts together.
01-07-2006, 06:24 AM
I'm pretty sure it's either a plain sum, or both get the same weight in the result (50/50).
With some quick calculations in Excel, I just saw that shifting the weights to 60/40 toward either component is enough to break the sorting (it produces a slightly different table from the one on the site).
01-07-2006, 09:50 AM
The stats formula is as follows:
Rank = crawledpages + (robots.txt / 6)
But one should probably use the "Place" column rather than generating one's own, so that we can change the formula later.
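The posted formula can be applied directly; here is a quick sketch (team names and totals are made up for illustration) that scores and sorts the way runarb describes:

```python
def rank_score(crawled_pages, robots_txt):
    """Score per the formula posted above: crawled pages count at
    full weight, robots.txt fetches at one sixth."""
    return crawled_pages + robots_txt / 6

# Hypothetical (team, crawled_pages, robots_txt) totals, best-first.
teams = [("A", 1000, 600), ("B", 1050, 0), ("C", 900, 1800)]
ranking = sorted(teams, key=lambda t: rank_score(t[1], t[2]), reverse=True)
```

Note how team B leads on raw crawled pages (1050) but still ranks last once robots.txt fetches are weighted in, which is the kind of reordering the thread observed.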
01-07-2006, 10:34 AM
Thanks runarb, I'll change my stats now; since they (mine, that is) have only just started, I've added it.
01-07-2006, 12:17 PM
Ah, so the crawled pages get full weight but the robots.txt files don't. Geez, that's so logical it didn't even cross my mind! I was splitting the total weight between the two components, no wonder it wasn't working. :P