
View Full Version : BOINC Errors in early 2006???



Nflight
01-03-2006, 10:03 PM
The answer lies in the long-winded explanation below. Although it is long, it explains the problems the Berkeley team is working through to resolve the issues on their end, which are the source of our disgruntlement at not being able to access WUs. The URL for this info: http://setiweb.ssl.berkeley.edu/tech_news.php

Dated: January 2, 2006 - 19:15 UTC

Happy New Year! The holiday season has been a bit of a headache, as several nagging problems kept the BOINC backend from running optimally. Luckily, most of us were around town and able to stop/start/reboot/kick things as needed to keep the project rolling as much as possible.

Most of the issues stem from an excessive load on the BOINC database. Remember that the BOINC database is the one that contains all the information pertaining to the distributed computing side of things: users and teams, but also cursory workunit and result data for scheduling and sending/receiving purposes. We are still in the middle of the master database merge (see below for more information about that). The master database is an entirely different database that contains all the scientific products of SETI@home (both Classic and BOINC). So while we are busy merging the old and new scientific databases into one, that work has no bearing on the problems people are having connecting to our servers, posting messages on the forums, etc. The merge process will be continuing for many weeks, in fact.

In a nutshell, the BOINC database issues started when we built up a large "waiting to assimilate" queue in mid-December. Then we got hit with, among other things, an influx of new users, a network outage beyond our control, a failed disk, a full disk, a spate of noisy workunits, and a database crash. All events were handled effectively, but the queues weren't draining as fast as we wished.

The sum of all this ended up being large, unwieldy tables in the BOINC database, as old workunits and results weren't being purged while more entries kept being inserted. All the backend processes that enumerate these tables (the validator, the assimilator, the file_deleter, etc.) slowed down. It got to the point that just doing a "select count(*)" on these tables would take 30 minutes, which is why we shut off the counts on the status page.
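
To see why those counts had to go, here is a rough sketch, assuming a MySQL backend (which BOINC uses); the connection details are placeholders, and only the result table name comes from the post. An exact count has to walk the whole table, while the approximate row count from the table metadata comes back almost instantly and is usually good enough for a status page.

# Rough sketch of why the status-page counts were disabled. Assumes a
# MySQL backend; host/user/password below are placeholders.
import MySQLdb

db = MySQLdb.connect(host="db-host", user="boinc", passwd="secret", db="boinc")
cur = db.cursor()

# Exact count: scans the whole result table. With millions of unpurged
# rows, this is the kind of query that was taking ~30 minutes.
cur.execute("SELECT COUNT(*) FROM result")
print("exact count:", cur.fetchone()[0])

# Approximate count: reads only table metadata and returns quickly.
cur.execute("SHOW TABLE STATUS LIKE 'result'")
print("approximate rows:", cur.fetchone()[4])  # fifth column is Rows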

To help all this, we rebuilt some of the backend processes. Those who pay close attention may have noticed that workunit names have changed over the past week or so. It used to be a tape name followed by four dot-delimited numbers. Now there are five numbers. The new number (which is currently "1" for all workunits) is a scientific configuration setting. Having this number in the name saves us two expensive database reads to look up these configuration settings. This change vastly improved the assimilator throughput, but we were already mired in the problems listed above. Without this change, though, we would have been dead in the water, as the deleters would back up behind the assimilators. We would have then run out of workunit space, the splitters would halt, and no new work would be created.
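
For illustration, a small Python sketch of that naming change; the tape name and the count of numbers come from the description above, but where the configuration number sits within the name and the sample names themselves are assumptions.

# Sketch of the old vs. new workunit name formats described above.
# The position of the configuration number among the five numbers is
# an assumption; the sample names are made up.

def parse_wu_name(name):
    tape, *numbers = name.split(".")
    if len(numbers) == 5:
        # New format: the extra number is the scientific configuration
        # setting, so the backend skips two database reads per workunit.
        return {"tape": tape, "config": int(numbers[0]),
                "fields": [int(n) for n in numbers[1:]]}
    # Old format: four numbers only; the configuration has to be
    # looked up in the database.
    return {"tape": tape, "config": None,
            "fields": [int(n) for n in numbers]}

print(parse_wu_name("12ja06aa.10123.8232.3.14"))    # old style
print(parse_wu_name("12ja06aa.1.10123.8232.3.14"))  # new style, config "1"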

Adding insult to injury, we found the feeder has some kind of bug in it. The feeder is the process that keeps a stash of results in shared memory that the scheduler reads to find out what to send to clients requesting work. Over time the feeder gets less and less able to keep a full stash. Eventually the feeder can't keep up with the scheduler's demand for results, and then clients get "no work available" messages. These clients retry quickly, and these extra connections cause stress on the server, which then starts dropping these connections. So every day or so we've been restarting the feeder to clear out its clogged shared memory segments and that temporarily improves connectivity. We're looking into it.
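
As a toy model of that handoff (the real feeder and scheduler use a shared-memory segment, not a Python queue, and the slot count and function names here are made up):

# Toy model of the feeder keeping a stash that the scheduler drains.
from queue import Queue, Empty, Full

STASH_SLOTS = 100                  # made-up size of the shared stash
stash = Queue(maxsize=STASH_SLOTS)

def feeder_fill(results):
    # Feeder: top up the stash from its database enumeration. The bug
    # described above means this gradually fails to keep the stash full.
    for r in results:
        try:
            stash.put_nowait(r)
        except Full:
            break

def handle_work_request():
    # Scheduler: hand one result to a client asking for work.
    try:
        return stash.get_nowait()
    except Empty:
        # "No work available": the client retries soon, and those extra
        # connections are what stress the server.
        return None

feeder_fill(["result_1", "result_2"])
print(handle_work_request())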

Since the last database compression/backup on Wednesday, we purged about 8 million results from the result table. So we decided to have another compression/backup outage today (Monday) to reap the benefit of a much smaller result table sooner rather than later.
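
A minimal sketch of what purging rows on that scale looks like, again assuming MySQL; the ready_to_purge flag is hypothetical, standing in for the real check that a result's workunit has been assimilated and its files deleted.

# Batched purge sketch: deleting millions of rows in one statement would
# hold locks for a long time, so delete in chunks instead.
import MySQLdb

db = MySQLdb.connect(host="db-host", user="boinc", passwd="secret", db="boinc")
cur = db.cursor()

BATCH = 10000
while True:
    # ready_to_purge is a hypothetical flag, not the real BOINC schema.
    cur.execute("DELETE FROM result WHERE ready_to_purge = 1 LIMIT %s", (BATCH,))
    db.commit()
    if cur.rowcount < BATCH:
        break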

Now that we know what is going on, we can quit banging our heads against the wall and realize it is not our machines that are at fault but theirs!