Here, for convenience, is the full text of Bruce's message. If you wish to comment on, or reply to, what Bruce has said, follow Mike's link to the original message.
Quote:
Dear Einstein@Home volunteers and contributors,
I thought I would post a description of what went wrong and how it was fixed.
(1) Project performance problems. These were due to our database getting overloaded. It was processing an average of 950 queries per second, with peaks of up to about 3000 queries per second. Ultimately, the overload came down to the way the BOINC locality scheduler works and the fact that our new analysis run did not have many low-frequency workunits. Einstein@Home is the only project that uses the locality scheduler, which is designed to send many workunits for the same data file, only sending a new data file when there is no work left for the previous one. What happened was that many hosts holding low-frequency files (because they were slower than the majority of hosts) requested either more work for those files or NEW workunits, which for them would also have been low-frequency files. When the project ran out of work for these files, the locality scheduler would perform an extremely database-intensive 'crawl' through the database looking for more work. So the slowest 20% of hosts were generating very large numbers of database queries looking for non-existent low-frequency workunits. I fixed this by modifying the algorithm that searches for new work. Anyone interested in the details can look at BOINC CVS next week when I check in the modified code.
The database is now averaging about 60 to 80 queries per second, and the database server and project servers are once again snappy and responsive.
(2) File server problems. Our project uses three file servers, each of which has about 8 TB of RAID-6 disk space. The file servers use Areca 24-port SATA controller cards and Western Digital WD4000YR disks. For a number of months we had been experiencing problems in which a disk would apparently drop from the array and then reappear a few seconds later, prompting a RAID array rebuild. In the end we sent one of our server boxes (approximately 80 kg, worth about 10k USD) by express mail to Taiwan so the Areca engineers could look at it more closely. (Many thanks to these engineers, who have given us first-rate support!) It turned out that our problems were due to the WD4000YR drives: they have a SATA interface chip which, in some revisions of the WD4000YR, is incompatible with an interface chip used on the Areca RAID controller. The incompatibility is only triggered by issuing NCQ commands, so disabling NCQ on the RAID controller fixed the problem. Our two remaining file servers have now been working without issues for more than two weeks.
All of this was further exacerbated by my move to Germany with my family (our kids are 2.5 and 6 years old), which meant that I couldn't give these issues enough attention until now.
Hopefully these problems are behind us! I am grateful to everyone for their patience, and apologize for how long it took to track these things down and deal with them.
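On point (1) above, the expensive part of the old behaviour was the fallback search that kicked in once the data files a host already held had no work left. Below is a minimal, self-contained sketch of the general idea of a bounded fallback; the data structures and names are made up for illustration and are not taken from the BOINC scheduler code, which is the place to look once Bruce checks in his fix.

unsent_work = {
    # unsent workunits keyed by data file (frequency band); purely illustrative
    "h1_0900.25": ["wu_1", "wu_2"],
    "h1_1432.50": ["wu_3"],
    # note: no entries for the low-frequency files that many slow hosts hold
}

def find_work_for_host(host_files, max_fallback_files=10):
    """Prefer work for files the host already has; fall back cheaply."""
    # 1. Cheap path: unsent work attached to a file the host already holds.
    for f in host_files:
        if unsent_work.get(f):
            return f, unsent_work[f].pop()

    # 2. Bounded fallback: rather than crawling every file in the run looking
    #    for leftovers, try only a small number of files that are known to
    #    still have unsent work.
    for f in list(unsent_work)[:max_fallback_files]:
        if unsent_work[f]:
            return f, unsent_work[f].pop()

    # 3. Nothing suitable: tell the host to back off and retry later instead
    #    of issuing further speculative database queries.
    return None

# A slow host holding only a low-frequency file gets an answer after a bounded
# number of lookups, not a full crawl of the result table.
print(find_work_for_host(["h1_0050.00"]))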
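On point (2), the fix was made in the Areca controller's own settings, and the exact controller-side option isn't given in the message. Purely as an illustration of what "disabling NCQ" amounts to, the sketch below shows how command queuing can be switched off for SATA disks that the Linux kernel drives directly, by forcing the per-device queue depth to 1 through sysfs. The device names are placeholders, and this does not reach disks sitting behind a hardware RAID controller.

import pathlib

def queue_depth_path(dev: str) -> pathlib.Path:
    # e.g. dev = "sda"; attribute exposed by the kernel for SCSI/SATA devices
    return pathlib.Path(f"/sys/block/{dev}/device/queue_depth")

def get_queue_depth(dev: str) -> int:
    return int(queue_depth_path(dev).read_text().strip())

def disable_ncq(dev: str) -> None:
    """Force the queue depth to 1, i.e. no native command queuing (needs root)."""
    queue_depth_path(dev).write_text("1\n")

if __name__ == "__main__":
    for dev in ("sda", "sdb"):
        try:
            print(f"{dev}: queue_depth = {get_queue_depth(dev)}")
        except OSError:
            print(f"{dev}: no queue_depth attribute (device absent or not SATA)")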
Cheers,
Gary.