Boitho - distributed crawler

**runarb** · 01-06-2006, 10:34 PM

Hi

I am one of the people behind Boitho.

NeoGen, sorry to hear about your problem with the crawler.

In the folder where you installed the Boitho client there should be a file called "ErrorLog.txt". This is a log of all errors. Can you send it to me at runarb [at] boitho dot com, so I can take a look?

The crawling threads run at idle priority. The crawler uses two processes, BGui.exe and BCrawler.exe, and the main thread of each runs at normal priority. BCrawler.exe is responsible for the crawling, and creates new threads with idle priority as necessary.

The crawler isn't really suitable for running while you are using the computer. By default it is configured to only run downloads when the computer hasn't been used for more than 5 minutes. See Tools -> Options -> Crawling Mode.
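To make the default concrete, here is a minimal sketch of that kind of idle check. It is not the Boitho client's code. The names `IDLE_THRESHOLD_SECONDS` and `may_download` are made up for illustration, and how the idle time is measured (it differs per operating system) is left out on purpose.

```python
# Hypothetical sketch, not Boitho's implementation.
IDLE_THRESHOLD_SECONDS = 5 * 60  # the "more than 5 minutes" default


def may_download(seconds_since_last_input, threshold=IDLE_THRESHOLD_SECONDS):
    """Return True only when the computer has been idle longer than the threshold."""
    return seconds_since_last_input > threshold
```

The caller would pass in however many seconds have gone by since the last keyboard or mouse activity and only start downloads when this returns True.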

Does anyone know how often the stats are updated?

The statistics are live, and are updated every time your client sends us the pages it has crawled. Pages are sent in when you have crawled 500.
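Roughly, the batching works like the sketch below. This only illustrates the idea of sending pages in once 500 have been crawled. The class `PageBatcher` and the `send` callback are hypothetical names, not part of the real client.

```python
# Hypothetical sketch, not Boitho's implementation.
BATCH_SIZE = 500  # pages are sent in once this many have been crawled


class PageBatcher:
    """Collect crawled pages and hand them to `send` in batches."""

    def __init__(self, send, batch_size=BATCH_SIZE):
        self._send = send            # assumed callback that uploads one batch
        self._batch_size = batch_size
        self._pages = []

    def add(self, page):
        """Store one crawled page and upload when the batch is full."""
        self._pages.append(page)
        if len(self._pages) >= self._batch_size:
            self._send(self._pages)
            self._pages = []         # start a fresh batch
```

With this scheme the statistics would only move in steps of 500 pages per client, which matches what you see on the stats page.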

The graphs are updated every 5 minutes.

Boitho makes thumbnail images of the web pages it crawls to show alongside the search engine results. I guess that translates into more CPU usage than MJ-12.

It also has to download all the images from the site, not only the HTML, to make the thumbnail. The bandwidth and CPU needed to make a thumbnail of a web page are about 10 times more than the resources needed just to download the HTML.

All it did for me was get robots.txt. Not exactly exciting is it...

It happens that we crawl a lot of robots.txt pages from time to time.

Boitho caches the robots.txt files locally, and therefore has to get the robots.txt file for a site before we can issue a URL from that site to a node for crawling.
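As an illustration of why new domains produce a wave of robots.txt jobs, here is a small sketch of that ordering using Python's standard `urllib.robotparser`. It is not how the Boitho server is written. `RobotsGate`, `store_robots`, `next_job` and the user agent "ExampleBot" are all made-up names.

```python
# Hypothetical sketch, not Boitho's implementation.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsGate:
    """Only hand out a URL for crawling once its host's robots.txt is cached."""

    def __init__(self, user_agent="ExampleBot"):
        self._user_agent = user_agent
        self._cache = {}  # host -> parsed robots.txt

    def store_robots(self, host, robots_txt):
        """Cache the robots.txt text that a node fetched for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._cache[host] = parser

    def next_job(self, url):
        """Decide what a node should do for this URL."""
        parts = urlsplit(url)
        parser = self._cache.get(parts.netloc)
        if parser is None:
            # Nothing cached yet: the robots.txt has to be fetched first.
            return ("fetch_robots", f"{parts.scheme}://{parts.netloc}/robots.txt")
        if parser.can_fetch(self._user_agent, url):
            return ("crawl", url)
        return ("skip", url)
```

For a host that has never been seen, the first job is always the robots.txt fetch, so a large batch of new domains means a large batch of robots.txt jobs before any ordinary pages are issued.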

Around 13 December we added a lot of new pages from domains we hadn't crawled before (from the .com, .net and .edu lists from Verisign). Because of this, if one looks at the crawler statistics page one can see that we mostly crawled robots.txt pages from 13 to 21 December: http://dcsetup.boitho.com/cgi-bin/dc/topCrawlers.cgi
