Majestic 12



rrcrain
12-12-2005, 08:15 PM
Has anyone researched the Majestic-12 project? Given that companies like Google, Lycos, MSN and others already provide the internet with search engines while they crawl, it makes me wonder whether the project is anything more than a means of sucking up internet bandwidth.

Also, consider this: how many URLs are taken down daily, and how many new ones are put online daily? Then ask how this project could possibly hope to build an accurate index before going live with a search engine.

NeoGen
12-12-2005, 08:53 PM
Majestic-12 already has a working search engine; I remember seeing a link to it on the page, but I'm not sure where. What we're helping to build is the search engine's index and database, billions upon billions of crawled webpages.
If a project like this really takes off, its database will not only be larger but also much more accurate than Google's or anyone else's. One big problem I see a lot with Google is that it returns pages that have been gone for years. It's understandable: they have a hard time recrawling their n billion pages on a regular basis.
Majestic-12, on the other hand, if it has a large user base constantly recrawling old pages and discovering new ones, will have a really fresh and accurate database at its disposal.
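To illustrate the idea (just a rough Python sketch, not how the MJ12 crawler actually works - all the names here are made up), a pool of crawlers can keep an index fresh by always revisiting the page that was crawled longest ago, with newly discovered URLs jumping straight to the front:

import heapq
import time

# Priority queue keyed by last-crawl time: the stalest URL comes out first,
# and brand-new URLs (timestamp 0) sort ahead of everything already seen.
queue = []

def schedule(url, last_crawled=0.0):
    heapq.heappush(queue, (last_crawled, url))

def crawl_next():
    last_crawled, url = heapq.heappop(queue)
    # ...fetch the page here, then schedule() every new URL found on it...
    schedule(url, last_crawled=time.time())  # requeue for a later recrawl
    return url

With thousands of users each running a loop like this against a shared queue, old pages get revisited continuously instead of going stale for years.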

NeoGen
12-12-2005, 08:58 PM
Found it! Here's the search engine, still in alpha stage but already working and spitting out links.
http://majestic12.kicks-ass.org:8888/

I think it's working with only a small part of our crawled webpages, as it's still in the testing stage, but it's off to a good start. :)

p.s. - It's working great already. Searching for "AMD Users", we appear at the top of the results! :)

Majestic-12
12-13-2005, 01:23 AM
Thanks NeoGen for answering it :)

rrcrain: you asked 2 distinct questions that I am very happy to answer:

Q: Can the project scale to a world-class search engine's level of URLs?

A: Most certainly. Right now just 60 people crawl 50 mln URLs a day, which means that just 6,000 people could crawl 5,000 mln URLs a day - that's recrawling all of Google's data in a few days. Compare the number of people needed to reach that level with how many people distributed.net needed to crack a single code, and you'll see that it's totally feasible to solve this task with far fewer people than other projects have.
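To make the arithmetic explicit, here is the back-of-the-envelope calculation in Python (the figures are the ones from this post; the snippet is purely illustrative and assumes throughput scales linearly with the number of crawlers):

# Linear scaling assumption: total throughput = people * per-person rate.
URLS_PER_DAY_NOW = 50_000_000   # 50 mln URLs/day
PEOPLE_NOW = 60

per_person_rate = URLS_PER_DAY_NOW / PEOPLE_NOW   # ~833,333 URLs/person/day

TARGET = 5_000_000_000          # 5,000 mln URLs/day
people_needed = TARGET / per_person_rate
print(people_needed)            # -> 6000.0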

Q: Is it all a waste of bandwidth?

A: No, it's not. The crawler scales linearly, and the number of crawled URLs jumped from 4.5 mln in July to 45-60 mln today. Granted, the search engine itself is lagging behind - it has only got 50 mln URLs indexed so far - but I am working very hard to make sure it scales to billions of pages.

Building a WWW-level search engine is an exceptionally hard programming task that has pretty much only been done by huge companies such as Google/Yahoo/Microsoft, and also by Gigablast. It is not easy, but totally doable - I wish I could just jump from 50 mln indexed pages to 5 bln indexed pages, but a fair amount of unglamorous hard work needs to be done before that happens.

I am certainly working hard every day - including weekends - to make sure it happens as soon as possible :)