Page 1 of 3
Results 1 to 10 of 22

Thread: Boitho - distributed crawler

  1. #1
    Join Date
    Jul 2003
    Location
    Sydney, Australia
    Posts
    5,662

    Boitho - distributed crawler

    Read about this project here

    Join AMD Users when you install the application and create your username / handle.

  2. #2
    NeoGen's Avatar
    NeoGen is offline AMD Users Alchemist Moderator
    Site Admin
    Join Date
    Oct 2003
    Location
    North Little Rock, AR (USA)
    Posts
    8,451
    This is similar to Majestic-12, but I'm still trying to figure out the crawler's options.
    I think I did something wrong here, because the system got very slow for a while even though Boitho was using almost no CPU. After shutting down Boitho everything started working normally again.

  3. #3
    Join Date
    Oct 2005
    Location
    Birmingham, UK
    Posts
    534
    I took a look at this before it was announced on distributedcomputing.info. IMO it's a bit crap, lol. All it did for me was get robots.txt. Not exactly exciting, is it...

    Btw we lost a team member to x grubbers yesterday

  4. #4
    NeoGen's Avatar
    NeoGen is offline AMD Users Alchemist Moderator
    Site Admin
    Join Date
    Oct 2003
    Location
    North Little Rock, AR (USA)
    Posts
    8,451

  5. #5
    Join Date
    Jan 2005
    Location
    Sundsvall, Sweden
    Posts
    3,532
    Hej

    So sad that we lost Peyoti. I wonder why he changed teams. If people disappear, we will have difficulty competing against other teams. Let's hope it is the last time such a thing happens.

    Lagu :shock:
    Once an AMDuser always an AMD user

  6. #6
    Join Date
    Jan 2005
    Location
    Sundsvall, Sweden
    Posts
    3,532
    Hej

    I have downloaded the agent but have not yet executed it. What is the difference between MJ-12 and Boitho?

    Lagu
    Once an AMDuser always an AMD user

  7. #7
    Join Date
    May 2004
    Location
    Kent, UK
    Posts
    3,511
    Lagu,

    Ultimately, I think they both do the same thing: crawling and/or validating the web.

  8. #8
    NeoGen's Avatar
    NeoGen is offline AMD Users Alchemist Moderator
    Site Admin
    Join Date
    Oct 2003
    Location
    North Little Rock, AR (USA)
    Posts
    8,451
    Boitho makes thumbnail images of the web pages it crawls to show alongside the search engine results.
    I guess that translates into more CPU usage than MJ-12. On the other hand, it seems to crawl much more slowly than MJ-12.

  9. #9
    NeoGen's Avatar
    NeoGen is offline AMD Users Alchemist Moderator
    Site Admin
    Join Date
    Oct 2003
    Location
    North Little Rock, AR (USA)
    Posts
    8,451
    There's something in Boitho that makes my system almost grind to a halt when it's running, but I can't point to anything in particular.
    The main processes have normal priority and get almost no CPU, but I'm starting to suspect that the crawler threads are getting above-normal or even high priority. Those are not shown in the task manager, though.

    And I also don't know why I don't have any points. I know I haven't run it much but I should have at least something there.

  10. #10
    Hi

    I am one of the people behind Boitho.


    NeoGen, sorry to hear about your problem with the crawler.

    In the folder where you installed the Boitho client there should be a file called "ErrorLog.txt". This is a log of all errors. Can you send it to me at runarb [at] boitho dot com, so I can take a look?

    The crawler threads run at idle priority. The crawler uses two processes, BGui.exe and BCrawler.exe, and the main thread of each runs at normal priority. BCrawler.exe is responsible for crawling and creates new idle-priority threads as necessary.

    The crawler isn't really suitable for running alongside you while you are using the computer. By default it is configured to run downloads only when the computer hasn't been used for more than 5 minutes. See Tools -> Options -> Crawling Mode.
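
    For anyone curious what that kind of idle gating can look like, here is a minimal sketch. It is not Boitho's actual code: the class name `IdleGate`, the injected clock and the `note_user_input` hook are all assumptions made for illustration. Only the 5-minute default comes from the post above.

```python
import time

IDLE_THRESHOLD_SECONDS = 5 * 60  # the 5-minute default mentioned in the post


class IdleGate:
    """Hypothetical helper: allows downloads only after a period without user input."""

    def __init__(self, threshold=IDLE_THRESHOLD_SECONDS, clock=time.monotonic):
        self._threshold = threshold
        self._clock = clock              # injectable so the logic can be tested
        self._last_input = clock()

    def note_user_input(self):
        # Call this whenever keyboard or mouse activity is observed.
        self._last_input = self._clock()

    def may_download(self):
        # True only once the idle time is strictly more than the threshold.
        return self._clock() - self._last_input > self._threshold
```

    A real client would feed `note_user_input` from the operating system's input tracking; that part is left out here.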



    Does anyone know how often the stats are updated?
    The statistics are live and are updated every time your client sends us the pages it has crawled. Pages are sent in once you have crawled 500.

    The graphs are updated every 5 minutes.
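
    This also explains why a client that has only run briefly may show no points yet: nothing is reported until a full batch is ready. A rough sketch of that batching behaviour, assuming a hypothetical `PageBatcher` class and an `upload` callback (neither is taken from the Boitho client; only the batch size of 500 is from the post):

```python
from typing import Callable, List

BATCH_SIZE = 500  # batch size stated in the post


class PageBatcher:
    """Hypothetical helper: buffers crawled pages and uploads them in full batches."""

    def __init__(self, upload: Callable[[List[str]], None],
                 batch_size: int = BATCH_SIZE):
        self._upload = upload
        self._batch_size = batch_size
        self._pending: List[str] = []

    def add(self, page: str) -> None:
        self._pending.append(page)
        # Nothing reaches the server (or the stats) until the batch is full.
        if len(self._pending) >= self._batch_size:
            self._upload(self._pending)
            self._pending = []
```

    With this shape, 499 crawled pages produce no upload at all, and the 500th triggers exactly one.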


    Boitho makes thumbnail images of the web pages it crawls to show alongside the search engine results. I guess that translates into more CPU usage than MJ-12.
    It also has to download all the images from the site, not only the HTML, to make the thumbnail. The bandwidth and CPU needed to make a thumbnail of a web page are about 10 times more than the resources needed just to download the HTML.


    All it did for me was get robots.txt. Not exactly exciting is it...
    It happens that we crawl a lot of robots.txt pages from time to time.

    Boitho caches the robots.txt files locally and therefore has to get a site's robots.txt file before we can issue a URL from that site to a node for crawling.

    Around 13 December we added a lot of new pages from domains we hadn't crawled before (from the .com, .net and .edu lists from Verisign). Because of this, if you look at the crawler statistics page you can see that we mostly crawled robots.txt pages from 13 to 21 December: http://dcsetup.boitho.com/cgi-bin/dc/topCrawlers.cgi
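
    To illustrate why a batch of new domains leads to a burst of robots.txt fetches, here is a small sketch of a per-host robots.txt cache built on Python's standard `urllib.robotparser`. It is only an illustration, not how Boitho is implemented: `RobotsCache`, the `fetch_robots` callback and the `example-bot` user agent are made-up names.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Hypothetical helper: fetches each host's robots.txt once, then reuses it."""

    def __init__(self, fetch_robots: Callable[[str], str],
                 user_agent: str = "example-bot"):
        self._fetch_robots = fetch_robots   # returns robots.txt text for a host
        self._user_agent = user_agent
        self._parsers: Dict[str, RobotFileParser] = {}

    def allowed(self, url: str) -> bool:
        host = urlsplit(url).netloc
        parser = self._parsers.get(host)
        if parser is None:
            # First URL seen for this host: robots.txt must be fetched first.
            parser = RobotFileParser()
            parser.parse(self._fetch_robots(host).splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(self._user_agent, url)
```

    Every previously unseen host costs one robots.txt fetch before any of its pages can be handed out, so adding many new domains at once shows up as mostly robots.txt traffic for a while.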
