What happens in the near future



Lagu
12-19-2005, 06:55 PM
Alexc has written this announcement:

Current situation

From a crawling point of view we are doing extremely well - 50+ mln pages daily is a very high number, and the 850 mln URLs loaded just 3 weeks ago disappeared like sugar in hot water! Congratulations to all, as it shows the strength of the concept. We are not yet at Google's crawling levels, but we have surely achieved far more than Google did in their first year of operations - the first known index for them was 24 mln pages (Source (http://www-db.stanford.edu/~backrub/google.html))

It's all good, but the search engine right now is still at just 45 mln pages - a drop in the ocean compared to the number of crawled pages. BarkerJ summed it up pretty nicely:


In short, our bottleneck to global dominance is currently getting the search engine to process a large amount of data, not a lack of data.

Indeed, there are rivers, lakes, seas and oceans of data, but it's being stored right now rather than quickly indexed and added to the search engine for daily use - and of course I want to have a genuinely usable search engine that uses the crawled data.

So what to do?

In theory it all looks good - there is a search engine already and it works. In practice, however, it may work for 50 mln pages but it won't do for 500 mln. The keyword here is scalability - all components of the system should scale well to billions of pages.

Here are the 4 main stages that make it up:

crawling -> indexing -> merging -> searching

Crawling scales exceptionally well - this is the most advanced and automated part of the system, and there are no problems here.

Indexing - I did a lot of work to speed it up in October, and I'm happy to say that on the new server "FiddleAbout" (AMD x2 3800, 1 GB dual channel DDR 400) indexing runs very fast and the dual cores scale very well - it looks like it should be able to index 45-50 mln pages per 24 hours. Even better, indexing is already very scalable, so the upcoming 3 boxes will work as part of the whole system, giving us the ability to index 170-200 mln pages per day. This is more than we crawl now and it will allow us to catch up with the backlog, so we are doing well in this area.
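As a rough check of the figures above, here is a minimal sketch of the capacity arithmetic. It assumes, optimistically, that each of the 3 upcoming boxes indexes as fast as FiddleAbout and that throughput adds up linearly; the function names are made up for illustration and are not part of the actual system.

```python
def fleet_indexing_rate(boxes: int, per_box_mln_per_day: float) -> float:
    """Combined indexing rate in mln pages/day, assuming linear scaling."""
    return boxes * per_box_mln_per_day


def daily_surplus(index_mln_per_day: float, crawl_mln_per_day: float) -> float:
    """Mln pages of backlog that can be cleared per day."""
    return index_mln_per_day - crawl_mln_per_day


if __name__ == "__main__":
    # FiddleAbout plus the 3 upcoming boxes, at 45-50 mln pages/day each.
    print(fleet_indexing_rate(4, 45), fleet_indexing_rate(4, 50))  # 180 200
    # Quoted lower bound of 170 mln/day against roughly 50 mln crawled per day.
    print(daily_surplus(170, 50))  # 120
```

Perfect linear scaling gives 180-200 mln per day, so the quoted 170-200 range leaves a little margin at the low end.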

Merging - the last phase before actual searching, which takes separate indexed barrels and merges them into a searchable index. This area needs big improvements in order to scale to billions of pages. Merging right now is too linear, not using more than 1 CPU - this is to be changed to use multiple CPUs and multiple machines, and it is the key step to scaling the search engine up.

Searching - this needs to be scaled up as well, but luckily the main work is done, and so long as merging is scaled I expect this part to scale up easily.

So, it's the merging module's scalability that needs to be addressed ASAP.
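One common way to spread a merge over several CPUs is to split the terms into independent shards and merge each shard separately. The sketch below illustrates that idea only; it is not the project's actual merger. The barrel layout (a dict mapping each term to a sorted list of document ids) and all names are assumptions.

```python
import heapq
import zlib
from typing import Callable, Dict, List, Tuple

Barrel = Dict[str, List[int]]  # assumed layout: term -> sorted doc ids


def shard_of(term: str, n_shards: int) -> int:
    # crc32 is stable across processes, unlike the built-in hash().
    return zlib.crc32(term.encode("utf-8")) % n_shards


def merge_shard(job: Tuple[int, int, List[Barrel]]) -> Barrel:
    """Merge the postings of every term that belongs to one shard."""
    shard_id, n_shards, barrels = job
    terms = {t for b in barrels for t in b if shard_of(t, n_shards) == shard_id}
    merged: Barrel = {}
    for term in terms:
        # k-way merge of the already sorted posting lists.
        postings = heapq.merge(*(b[term] for b in barrels if term in b))
        deduped: List[int] = []
        for doc_id in postings:
            if not deduped or deduped[-1] != doc_id:
                deduped.append(doc_id)
        merged[term] = deduped
    return merged


def merge_barrels(barrels: List[Barrel], n_shards: int = 4,
                  map_fn: Callable = map) -> Barrel:
    """Merge all barrels; shards are independent, so map_fn may run in parallel."""
    jobs = [(shard, n_shards, barrels) for shard in range(n_shards)]
    index: Barrel = {}
    for part in map_fn(merge_shard, jobs):
        index.update(part)
    return index
```

With the default `map_fn=map` this runs on one CPU. Because shards never share a term, passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` instead would let each shard be merged on a separate core, and the same split could in principle be used across machines.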

So what about recrawling?

A good search engine is not just a big search engine, but also an up-to-date search engine - some pages change dramatically, people have got accustomed to searching for recent stuff online, and people are very quick to conclude whether a search engine is good or crap - I am pretty harsh in this respect too.

We can only be reasonably up to date if we recrawl. Most small search engines can't afford recrawling, so recrawl is not a dirty word - it's actually good that we can afford to think in terms of freshness of information, not just its quantity.

As was mentioned, there is a lot of data flowing in every day - 1 TB+ of uncompressed data - and it has to be well managed (indexed and searched) before we can dig much deeper.

Here are my considerations for recrawl:

1) the first 2 bln pages are not as well compressed as they could be now - they occupy 3-5 times more space than they would with current compression, so recrawling them would free up some disks for the index.

2) the first pages we crawled were nearer the top of the sites than the current deeper pages - they definitely need to be indexed

3) some pages are as old as 1 year

4) recrawl gives me breathing space of around 2 months to scale the search engine - at 50 mln per day we can redo 3 bln in just 60 days.

5) new URL loads should have much smarter prioritisation - the current limit on URLs per site is too generic: say, Microsoft.com or GeoCities.com should not be subject to the same limit of 15k URLs per site as a newly formed doorway-page website with lots of junk. In order to be smart we need to index at least 1 bln pages and build info on sites' importance.
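To make point 5 concrete, here is a minimal sketch of a per-site URL cap that grows with site importance instead of a flat 15k for everyone. The importance score (0.0 to 1.0), the minimum and maximum limits, and the function names are all hypothetical; how importance would really be computed is exactly what still needs the 1 bln indexed pages.

```python
from typing import Dict, List


def site_url_limit(importance: float, min_limit: int = 1_000,
                   max_limit: int = 1_000_000) -> int:
    """Interpolate the URL cap between min_limit and max_limit."""
    importance = min(1.0, max(0.0, importance))  # clamp to [0, 1]
    return round(min_limit + (max_limit - min_limit) * importance)


def cap_urls(urls_by_site: Dict[str, List[str]],
             importance_by_site: Dict[str, float]) -> Dict[str, List[str]]:
    """Keep at most site_url_limit() URLs per site; unknown sites score 0."""
    capped = {}
    for site, urls in urls_by_site.items():
        limit = site_url_limit(importance_by_site.get(site, 0.0))
        capped[site] = urls[:limit]
    return capped
```

A site with no known importance gets the smallest cap, so a brand new junk site cannot take as many crawl slots as an established one.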

This is why I think we should start recrawling existing buckets rather than load more billions.

Exception: national TLDs (top level domains) will still be added. There are so few of them compared to .COMs/.NETs/.ORGs that there is a point in growing them, and the numbers are too small to create problems anyway. I hope people are okay with this exception.

Also, I want to state one important thing very clearly - the fact that we can recrawl data does NOT in any way diminish the importance of it having been crawled in the first 9 months of this project. I have to admit that thoughts of quitting this project in the face of the impossibility of the task at hand crossed my mind a few times, but this ended in April and won't come back. If it was not for the people whose participation helped it come through the hardest first 6-9 months, this project would never have reached this stage.

Will only changed pages be sent back?

Not right now - this functionality is trickier than it seems, as it would require intelligent indexing/merging of multiple barrels with the same name, taking only the most recent pages from them. This requires all crawled data to be online, which it is not right now, but I expect this to happen in Q1 2006.

The only drawback of not having smart recrawl is that the data sent back will not be smaller, but right now with Conan active it's reasonably small - compare the current barrel of 5 MB with the 40-50 MB we typically saw in the early days of the project.
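The "take only the most recent pages" step could look roughly like the sketch below. It assumes every crawled page carries its URL and a crawl timestamp and that all versions of a barrel are available at once; the `Page` record and the function name are illustrative, not the real barrel format.

```python
from typing import Dict, Iterable, List, NamedTuple


class Page(NamedTuple):
    url: str
    crawled_at: int  # e.g. a Unix timestamp (assumed field)
    body: str


def latest_pages(barrels: Iterable[List[Page]]) -> Dict[str, Page]:
    """Across several versions of a barrel, keep the newest copy of each URL."""
    newest: Dict[str, Page] = {}
    for barrel in barrels:
        for page in barrel:
            current = newest.get(page.url)
            if current is None or page.crawled_at > current.crawled_at:
                newest[page.url] = page
    return newest
```

The order in which barrels are read does not matter here, which is why all of them have to be reachable before such a merge can run.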

Conclusion

It is good that we can afford to look into recrawling data to keep it fresh - many search engines simply can't do that. We are maturing faster than I thought, which is a good thing :)

So what now?

Simple - old buckets will be marked as "uncrawled", and thus automatically put into recrawling. All your stats will remain the same - they won't be wiped out, as a crawled bucket stays a crawled bucket regardless of whether it's a recrawl or not.

Lagu :)

NeoGen
12-19-2005, 08:40 PM
Summing it up, there will never be a lack of links to crawl :)

Evil-Dragon
12-20-2005, 11:01 AM
Recrawling is infinite... and the web is growing by the day, new sites are created, old sites die...

We're recrawling the first 2 billion again now. This is due to them being in the old pre-Conan format, taking up too much disk space and full of crap that will take forever to index.

Still, 2 billion looks like a lot, but at the moment, with an average of 40 million a day, it will only take about 50 days, which is less than 2 months. By Feb this old data will be done, and the search engine will also be scaled up in Jan, so big things are beginning to happen.

This recrawl gives Alex some time to work without having to worry about loading more URLs onto the system for a while. New URLs will be loaded, but only for national TLDs (.uk, .de, .nl, etc.)
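The recrawl estimates in this thread are simple divisions; a minimal sketch, assuming the daily crawl rate stays constant (the function name is made up for illustration):

```python
def days_to_recrawl(pages: int, pages_per_day: int) -> int:
    """Whole days needed to recrawl `pages` at a constant daily rate."""
    return -(-pages // pages_per_day)  # integer ceiling division


if __name__ == "__main__":
    print(days_to_recrawl(2_000_000_000, 40_000_000))  # 50
    print(days_to_recrawl(3_000_000_000, 50_000_000))  # 60
```

The first line matches the 2 billion at 40 million a day above, the second matches the 3 bln at 50 mln per day from the announcement.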

Lagu
01-04-2006, 10:02 PM
Hej

I apologise, but I will remind you all of what Evil-Dragon wrote weeks ago:

Another warning, we're planning to move house soon. It will probably be early next year (Jan 5th) and i won't have internet for about 2 weeks i reckon. So my crawling will be null, other than Jane's computer crawling about 10,000 - 100,000 a day.

If anything we'll be taken over about then when I move.

Sorry guys

Our member Evil-Dragon is our strongest crawler. Today he earned over 2,200,000 points for our team. We must be aware that we may lose ground every day compared to today. We are now 2nd in rank but could drop to third.

Lagu :-(

Evil-Dragon
01-06-2006, 03:03 PM
The time is near now, i've stopped crawling now unfortunately. Contracts have now been signed so i've got to start getting all my stuff packed away. In 1 or 2 weeks we will be moving, which is closer than you think really.

Hope the team won't miss me too much, and i'll be back to crawling again in about a month.

Keep up the good fight for me while i'm gone.

Ototero
01-06-2006, 04:27 PM
Keep the stress levels down. Moving is simply horrible.

Nflight
01-06-2006, 04:27 PM
Evil Dragon after one day without your effort, it is like a month without sunshine!
We can't wait for your return!

Myself and Mitro will be churning away along with our other comrades who continue to push all the time!

Thank You for your effort Evil Dragon, and have a happy move!

NeoGen
01-06-2006, 06:15 PM
These will be two tough weeks for us at Majestic-12, but I trust that we'll be able to pull through it still in second rank.

One thing that is shocking me now is to see someone is pumping out even more than PCZ today... :shock:

Lagu
01-06-2006, 08:56 PM
Hello

I’ve been running MJ-12 on 2 computers since yesterday and my crawling is over 1,200,000 URLs. Half an Evil-Dragon, so to speak. I have an 8 Mbit cable connection but the average download speed is 1.2 Mbit, and the upload, as for everyone else, is the Achilles heel: far too slow.

Lagu :)

Nflight
01-09-2006, 01:40 AM
Bad news, my 4200+ is down, and that is what I run while the other does my homework operations. Sorry folks, I am out for a while! :-(

mitro
01-09-2006, 02:18 AM
what happened???? :-(

Nflight
01-09-2006, 11:09 AM
Mitro, at this point I am unsure, except that it won't go past the POST. It gets stuck at the Windows emblem with the blue graphics sliding across the screen, in an endless rotation.
I checked the power supply in case that might be the problem, no problem.
I checked the array, no problem, it sees the array.
It is an ASUS board, and when BOINC and Majestic12 were running the hard drive light was on almost continuously.
The network card was always on at startup, and now there is no glimmer of output or input for that matter.
All fans come on so there is nothing wrong with the cooling.
Any suggestions?

mitro
01-09-2006, 01:38 PM
Usually for me that ends up being either hard drive corruption or a bad/wrong video driver. If you have a spare old hard drive I'd try using it and installing windows on it and see if it works. The good news is that it posts and you get to the Windows boot screen. That hopefully means your hardware is ok.

Ototero
01-09-2006, 01:39 PM
I've seen that on my machine (64 3000+).

Have you tried a safe mode boot up?

I think I just reloaded windows on top of itself. :cry:

Nflight
01-09-2006, 01:53 PM
I tried the safe mode boot up, this was done at 5am this morning. No, it did not get me any further than the emblem and the blue graphics.

Although Mitro's suggestion seems like a possibility, as the HDD has been making some noises lately. It was nothing constant, just a sound like a tin pan banging once in a great while!

I bet now that we can have this discussion it was the HDD. Good one there Mitro...Take a bow! You deserve it.

mitro
01-09-2006, 02:00 PM
If safe mode gets you no further it's most likely the HDD. I'll save my bow until it's fixed. :)

Nflight
01-09-2006, 02:09 PM
I spoke to my expert hardware man, and he said there is one other issue that seems to occur on AMD dual cores.
The issue looks like file corruption when running the Windows 64 OS.

He has seen this issue happen enough to point it out to me. He is not sure of the reason or why it happens, but it happens! We will be taking the machine in for a checkup tomorrow morning at 9am. Will report the findings as soon as I know!

Fingers are crossed for good luck!

moving fusion
01-10-2006, 08:17 PM
Hi all,

I am going to give Majestic-12 a go and see what it's all about.

Only on a 512 but no limits... See if i can help the team catch 'X Grubbers Kick Ass'

;)

Lagu
01-10-2006, 08:46 PM
Moving fusion

Thank you for signing up on MJ-12. We need both members and power to that project.

Lagu :)

moving fusion
01-11-2006, 04:06 PM
Lagu,

Looking at the stats page i got 29 - 100,000 (no idea what that means...?) just from last night and some of today.


*****************************
Moving Fusion WCG Ranking 3,670

Ototero
01-11-2006, 04:24 PM
Hey Moving,

The 29 means you are 29th highest Majestic scorer in AMD Users
The 100,000 is the score you submitted in the last 24 hours.

When the colour goes green, you have moved higher.
When it goes red, you have moved lower (lost a position)

Good going