Is it sending/receiving a WU every second, or what..?
Maybe batching 500-1000 WUs all at once would do the trick..?
My point exactly. I go through the same thing with the BOINC Alpha Project: if I try to run more than one PC on the project, my network gets choked up trying to upload the 4 MB files that only take 1 minute to run. So I only run one PC at a time, and it works okay then because the uploads can keep up with the downloads.
Originally Posted by Nflight
But this sciLINC Project is ridiculous; with the less-than-1-second WUs there's no way the uploads can keep up with the downloads ...
Maybe someone with a faster upload should do those projects.. (8/8)?
(if it's necessary at all..)
There seems to still be problems:
Unable to connect to database - please try again later. Error: 2002 - Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
They managed to corrupt their database, probably because of thousands of WUs coming in from all angles every second.
Database corruption doesn't happen because of the number of concurrent connections... it comes down to how the database server locks the operation requests.
If the database is not properly configured, it can run update and delete operations on data that is being used elsewhere by some other concurrent connection. That causes corruption, because the other connection is trying to work with data that no longer exists or was altered in the meantime.
Usually it's not a big deal in test environments: with only half a dozen people accessing the database, it rarely if ever happens that two people touch the same data at the same time. But switch to a production system with hundreds or thousands of clients all reading and altering data, and it can happen that two or more try to alter the same data at the same time...
Very true NeoGen, but alas that was not our issue either.
Originally Posted by NeoGen
The database was properly configured and we had tested with dozens of clients hitting the server while it ran 20+ instances of httpd, 9 different project daemons and an equal number of periodic tasks, some lasting 8 hours. All of this generated hundreds of queries per second with peaks of a couple thousand.
We were bitten by electrical issues. Several machines in the building where the server is housed completely fell over around the time this occurred.
(Un?)fortunately the SciLINC server stayed up, but system file buffers were corrupted and garbage was written out to the drive. This spanned the Apache log files, the MySQL log files, the Linux kernel log files and unfortunately the tablespace for at least one of our tables, the SciLINC result table.
We have since managed to recover everything else and put the project back up, but we are not feeding new work units nor are we registering new accounts at this point in time.
I would like to personally apologize to those that were affected by the high CPU load that resulted from transferring 2,500 small files all at the same time. And, I would also like to thank everyone that has shown an interest in SciLINC and the research that is being done.
Ouch... I guess not even the best and most well-tuned database system in the world would be prepared for that.
Thankfully I got my account already, so I'll be there when the time comes for another round of workunits.
In the meantime, thanks for the update, and keep up the good work! Your project has an interesting purpose, outside the maths and bio sciences that are so common nowadays; I'm looking forward to seeing it develop.
When do you think the project will be working again?
And do you intend to have an alpha or beta version first?
Last edited by Bubben; 06-20-2007 at 11:59 PM.