
Thread: What happens in the near future

  1. #1
    Join Date
    Jan 2005
    Location
    Sundsvall, Sweden
    Posts
    3,532

    What happens in the near future

    Alexc has written this announcement:

    Current situation

    From a crawling point of view we are doing extremely well - 50+ mln pages daily is a very high number, and the 850 mln URLs loaded just 3 weeks ago disappeared like sugar in hot water! Congratulations to all, as it shows the strength of the concept - we are not yet at Google's crawling levels, but we sure achieved far more than Google did in their year 1 of operations - their first known index was 24 mln pages (Source)

    It's all good, but the search engine right now is still at just 45 mln pages - a drop in the ocean compared to the number of crawled pages. BarkerJ summed it up pretty nicely:

    Quote Originally Posted by BarkerJ
    In short, our bottleneck to global dominance is currently getting the search engine to process a large amount of data, not a lack of data.
    Indeed, there are rivers, lakes, seas and oceans of data, but right now it is being stored rather than quickly indexed and added to the search engine for daily use - and of course I want to have a genuinely usable search engine that uses the crawled data.

    So what to do?

    In theory it all looks good - there is a search engine already and it works. In practice, however, it may work for 50 mln pages, but it won't do for 500 mln - the keyword here is scalability: all components of the system should scale well to billions of pages.

    Here are the 4 main stages that make it up:

    crawling -> indexing -> merging -> searching

    Crawling scales exceptionally well - this is the most advanced and automated part of the system, no problems here.

    Indexing - I've done a lot of work to speed it up in October, and I'm happy to say that on the new server "FiddleAbout" (AMD X2 3800, 1 GB dual-channel DDR400) indexing runs very fast and the dual cores scale very well - it looks like it should be able to index 45-50 mln pages per 24 hours. Even better, indexing is already very scalable, so the 3 upcoming boxes will work as part of the whole system, giving us the ability to index 170-200 mln pages per day. This is more than we crawl now and will allow us to catch up with the backlog, so we are doing well in this area.
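    As a quick sanity check of those numbers (purely illustrative - the per-box rate below is just the midpoint of the estimate above):

    [CODE]
    # Back-of-the-envelope check of the indexing capacity quoted above.
    pages_per_box_per_day = 47_500_000   # midpoint of the 45-50 mln/day estimate
    boxes = 4                            # FiddleAbout plus the 3 upcoming boxes

    total = pages_per_box_per_day * boxes
    print(f"{total / 1_000_000:.0f} mln pages/day")  # -> 190 mln, within 170-200 mln
    [/CODE]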

    Merging - the last phase before actual searching, which takes separate indexed barrels and merges them into the searchable index - this area needs big improvements in order to scale to billions of pages. Merging right now is too linear, using no more than 1 CPU - this is to be changed to use multiple CPUs and multiple machines, and this is the key step to scaling the search engine up.
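    To give a feel for the merge step, here is a minimal Python sketch (my illustration, not the project's actual code - the barrel contents and the (term, doc_id) record layout are assumptions):

    [CODE]
    import heapq

    def merge_barrels(barrels):
        # k-way merge of barrels that are each already sorted by term;
        # heapq.merge is lazy, so memory use stays small no matter how
        # big the barrels on disk get.
        return heapq.merge(*barrels, key=lambda record: record[0])

    barrel_a = iter([("apple", 1), ("cat", 3)])
    barrel_b = iter([("apple", 2), ("dog", 5)])
    for term, doc_id in merge_barrels([barrel_a, barrel_b]):
        print(term, doc_id)
    [/CODE]

    To use multiple CPUs or machines, each worker could be given a disjoint term range (say, split by first letter) and run its own merge - one straightforward way around the single-CPU bottleneck.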

    Searching - this needs to be scaled up as well, but luckily the main work is done, and as long as merging is scaled I expect this part to scale up easily.

    So, it's the merging module's scalability that needs to be addressed ASAP.

    So what about recrawling?

    A good search engine is not just a big search engine, but also an up-to-date search engine - some pages change dramatically, and people have become accustomed to searching for recent stuff online. People are very quick to conclude whether a search engine is good or crap - I am pretty harsh in this respect too.

    We can only be reasonably up to date if we recrawl - most small search engines can't afford recrawling, so recrawl is not a dirty word; it's actually good that we can afford to think in terms of freshness of information, not just its quantity.

    As mentioned, there is a lot of data flowing in every day - 1 TB+ of uncompressed data daily - and it has to be well managed (indexed and searched) before we can dig much deeper.

    Here are my considerations for recrawl:

    1) the first 2 bln pages are not as well compressed as they can be now - these pages occupy 3-5 times more space than they would with the current compression, so recrawling them would free up some disks for the index.

    2) the first pages we crawled were nearer the tops of the sites than the current deeper pages - they definitely need to be indexed

    3) some pages are as old as 1 year

    4) recrawl gives me a breathing space of around 2 months to scale the search engine - at 50 mln per day we can redo 3 bln in just 60 days.

    5) new URL loads should have much smarter prioritisation - the current limit on URLs per site is too generic: say, Microsoft.com or GeoCities.com should not be subject to the same limit of 15k URLs per site as a newly formed doorway website full of junk. To be smart about this we need to index at least 1 bln pages and build info on site importance (see the sketch after this list).
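    A minimal sketch of what importance-weighted limits could look like (my illustration only - the log formula and the use of indexed-page counts as the importance signal are assumptions, not the project's design):

    [CODE]
    import math

    BASE_LIMIT = 15_000  # the current flat per-site cap mentioned above

    def url_limit_for_site(indexed_pages: int) -> int:
        # Crude importance signal: how many pages of this site are already
        # indexed. Log scaling keeps budgets bounded while still rewarding
        # large, established sites over fresh doorway-page domains.
        importance = math.log10(indexed_pages + 1)
        return int(BASE_LIMIT * max(1.0, importance))

    print(url_limit_for_site(0))          # 15000 - unknown/new site stays at the base cap
    print(url_limit_for_site(5_000_000))  # ~100000 - a huge established site gets more
    [/CODE]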

    This is why I think we should start recrawling existing buckets rather than loading more billions.

    Exception: national TLDs (top-level domains) will still be added. The reason is that there are so few of them compared to .COMs/.NETs/.ORGs that there is a point in growing them - the numbers are too small to create problems anyway. I hope people are okay with this exception.

    Also, I want to state one important thing here very clearly - the fact that we can recrawl data does NOT diminish in any way the importance of it being crawled in the first 9 months of this project. I have to admit that thoughts of quitting this project in the face of the apparent impossibility of the task at hand crossed my mind a few times, but that ended in April and won't come back - if it were not for the people whose participation helped us come through the hardest first 6-9 months of this project, it would never have reached this stage.

    Will only changed pages be sent back?

    Not right now - this functionality is trickier than it seems, as it would require intelligent indexing/merging of multiple barrels with the same name, taking only the most recent pages from them. This requires all crawled data to be online, which it is not right now, but I expect this to happen in Q1 2006.
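    For illustration, the "take only the most recent pages" merge amounts to roughly this (a Python sketch assuming a (url, timestamp, content) record layout - the real barrel format is not this):

    [CODE]
    def latest_pages(barrels):
        # Collapse multiple crawls of the same URL down to the newest copy.
        # This needs every competing barrel to be readable at merge time -
        # exactly the "all crawled data must be online" constraint above.
        newest = {}
        for barrel in barrels:
            for url, timestamp, content in barrel:
                if url not in newest or timestamp > newest[url][0]:
                    newest[url] = (timestamp, content)
        return newest

    old_barrel = [("http://example.com/", 1130000000, "v1")]
    new_barrel = [("http://example.com/", 1134000000, "v2")]
    print(latest_pages([old_barrel, new_barrel]))  # keeps only the newer "v2" crawl
    [/CODE]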

    The only drawback of not having smart recrawl is that the data sent back will not be smaller, but right now, with Conan active, it is reasonably small - compare a current barrel of 5 MB with the 40-50 MB we typically saw in the early days of the project.

    Conclusion

    It is good that we can afford to look into recrawling data to keep it fresh - many search engines simply can't do that. We are maturing faster than I thought, which is a good thing.

    So what now?

    Simple - old buckets will be marked as "uncrawled" and thus automatically put into recrawling. All your stats will remain the same - they won't be wiped out, as a crawled bucket stays a crawled bucket regardless of whether it's a recrawl or not.

    Lagu
    Once an AMDuser always an AMD user

  2. #2
    NeoGen's Avatar
    NeoGen is offline AMD Users Alchemist Moderator
    Site Admin
    Join Date
    Oct 2003
    Location
    North Little Rock, AR (USA)
    Posts
    8,451

  3. #3
    Join Date
    Oct 2005
    Location
    Birmingham, UK
    Posts
    534
    Recrawling is infinite... the web is growing by the day: new sites are created, old sites die...

    We're recrawling the first 2 billion again now. This is due to them being in the old pre-Conan format, taking up too much disk space and being full of crap that would take forever to index.

    Still, 2 billion looks like a lot, but at the current average of 40 million a day it will only take about 50 days, which is less than 2 months. By Feb this old data will be done, and the search engine will also be scaled up in Jan, so big things are beginning to happen.

    This recrawl gives Alex some time to work without having to worry about loading more URLs onto the system for a while. New URLs will only be loaded for national TLDs (.uk, .de, .nl, etc.).

  4. #4
    Join Date
    Jan 2005
    Location
    Sundsvall, Sweden
    Posts
    3,532
    Hi

    I apologise, but I will remind you all of what Evil-Drogan wrote some weeks ago:

    Another warning: we're planning to move house soon. It will probably be early next year (Jan 5th) and I won't have internet for about 2 weeks, I reckon. So my crawling will be nil, other than Jane's computer crawling about 10,000 - 100,000 a day.

    If anything, we'll be overtaken around then, when I move.

    Sorry guys

    Our member Evil-Drogan is our strongest crawler. Today he earned over 2,200,000 points for our team. We must be aware that we can lose points every day compared to today - we are now 2nd in rank but could drop to third.

    Lagu
    Once an AMDuser always an AMD user

  5. #5
    Join Date
    Oct 2005
    Location
    Birmingham, UK
    Posts
    534
    The time is near now - I've stopped crawling, unfortunately. Contracts have been signed, so I've got to start getting all my stuff packed away. In 1 or 2 weeks we will be moving, which is closer than you think really.

    Hope the team won't miss me too much; I'll be back to crawling again in about a month.

    Keep up the good fight for me while I'm gone.

  6. #6
    Join Date
    May 2004
    Location
    Kent, UK
    Posts
    3,511
    Keep the stress levels down. Moving is simply horrible.

  7. #7
    Join Date
    Nov 2005
    Location
    Central Pennsylvania
    Posts
    4,333

    Majestic 12

    Evil Dragon, one day without your effort is like a month without sunshine!
    We can't wait for your return!

    Mitro and I will be churning away along with our other comrades, who continue to push all the time!

    Thank you for your effort, Evil Dragon, and have a happy move!

    Challenge me, or correct me, but don't ask me to die quietly.

    …Pursuit is always hard, capturing is really not the focus, it’s the hunt ...

  8. #8
    NeoGen's Avatar
    NeoGen is offline AMD Users Alchemist Moderator
    Site Admin
    Join Date
    Oct 2003
    Location
    North Little Rock, AR (USA)
    Posts
    8,451
    These will be two tough weeks for us at Majestic-12, but I trust that we'll be able to pull through still in second rank.

    One thing that shocks me now is seeing someone pumping out even more than PCZ today... :shock:

  9. #9
    Join Date
    Jan 2005
    Location
    Sundsvall, Sweden
    Posts
    3,532
    Hello

    I've been running MJ-12 on 2 computers since yesterday and my crawling is over 1,200,000 URLs - half an Evil-Drogan, so to say. I have an 8 Mbit cable but the average download speed is 1.2 Mbit, and the upload is, as for everyone else, the Achilles heel - far too slow.

    Lagu
    Once an AMDuser always an AMD user

  10. #10
    Join Date
    Nov 2005
    Location
    Central Pennsylvania
    Posts
    4,333

    Majestic 12

    Bad news: my 4200+ is down, and that is the machine I run while the other does my homework operations. Sorry folks, I am out for a while!

    Challenge me, or correct me, but don't ask me to die quietly.

    …Pursuit is always hard, capturing is really not the focus, it’s the hunt ...

