Over the last weekend we tried to push up work generation for ABP2 such that we would process the data as fast as we get it from ARECIBO.
Many of you probably noticed the trouble our server infrastructure had keeping track of so many short-running workunits (the critical runtime I'd expect to be about 1 h).
To avoid these problems I'm currently working on 'bundling' every four current ABP2 workunits into a single new one. This means that for every 'next generation' ABP task your client would download four data files, process them one after the other, upload four result files, but report only a single task (that ran 4x as long as a current task, but would get 4x as much credit, too).
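A rough sketch of the client-side flow described above (the function names are purely illustrative, not actual BOINC API calls):

```python
BUNDLE_SIZE = 4  # four current ABP2 workunits per 'next generation' task

def process(data_file):
    # stand-in for the real ABP2 analysis of one data file
    return f"result_for_{data_file}"

def run_bundled_task(data_files):
    # process the downloaded files one after the other ...
    results = [process(f) for f in data_files]
    # ... yielding one result file per input, but only ONE reported task
    reports = 1
    return results, reports

# four downloads, four uploads, a single report to the scheduler
results, reports = run_bundled_task([f"abp2_{i}.dat" for i in range(BUNDLE_SIZE)])
```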
I'll try to make this backwards compatible to avoid yet another application (ABP3). If all goes well, a set of new App versions will be issued in the next few days that can process both current and next generation work. Behind the scenes I'll replace server-side daemons with versions that can also handle both kinds of results. So with luck the only things you'll notice of this change are new App versions and, later, longer-running ABP2 workunits.
BM
Next ABP generation
So the problem isn't with total bandwidth or processing power, but rather with the rate of small, concurrent requests? Interesting: at first glance this seems like a temporary fix that won't actually solve anything in the long run - but perhaps it's more a case of finding the 'sweet spot' in your configuration, one that any future upgrades will benefit from equally.
The bottleneck is clearly the database. The shorter the tasks run, the more tasks the system needs to keep track of for the same amount of total computing time. The database server can only keep a limited number of results in memory. If there are more, the database becomes limited by the I/O speed of its disks, which is what happened over the weekend.
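A quick back-of-envelope of why shorter tasks mean more database rows. The throughput and turnaround numbers below are purely illustrative assumptions, not project figures:

```python
# Rows the database must track for a fixed amount of computing:
# roughly (tasks issued per day) x (days until a task is reported).
total_compute_hours_per_day = 100_000  # assumed aggregate host throughput
turnaround_days = 7                    # assumed average report turnaround

def results_in_flight(task_runtime_hours):
    tasks_per_day = total_compute_hours_per_day / task_runtime_hours
    return tasks_per_day * turnaround_days

short_tasks = results_in_flight(1.0)  # ~1 h tasks, as today
bundled = results_in_flight(4.0)      # 4x longer bundled tasks
# same total computing, but bundling cuts the in-flight row count by 4
```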
We'll have a new, larger database server in the next weeks, but it might be that there are other bottlenecks in the BOINC system that are currently hidden behind the database limitations. I've already seen the 'db purger' growing a backlog. And it will take some time until the new server is ready to use.
The right way to fix these problems is to reduce the number of tasks the database needs to keep track of, while still doing the same amount of 'work'. This is preliminary in the sense that the number of data files / 'old style tasks' grouped together is bound to the speed of the application. When the app becomes faster due to code improvements, better CUDA usage or a new generation of CPUs, we may raise this number above four again. The code of the app and the server-side components should be flexible enough to handle this.
BM
RE: The bottleneck is
Wouldn't it be simpler to produce a single task of 3 or 4 times the size? This way you'd have fewer tasks, and the longer run times would reduce the server traffic, without the complexity of trying to bundle 4 tasks together.
SETI went through a similar issue and doubled their WU sensitivity. One bottleneck that we tried to address over there was the MySQL logging. A couple of SSDs were used for the log drives, as they seem to cop a hammering, in order to spread the I/O.
BOINC blog
Bernd, You may want to
Bernd,
You may want to take a look at this WU that seems to be having lots of problems.
All crunchers (3) have failed so far, and max errors etc. is set at 20/20/20 ... meaning potentially lots of wasted crunching time.
Seti Classic Final Total: 11446 WU.
RE: Bernd, You may want to
They failed for different reasons. One is a CUDA-related error, one is a 'too many exits' error which is probably related to the problem described here, and only one is a segv that actually accumulated computing time. If there are more of these very same errors at the same point of a workunit, then I'll start to worry.
We do have a webpage that monitors workunits that have collected only client (or validate) errors and no successful result. The notification level is set to 2 x, i.e. a WU needs to have 4 errors to show up there, and this level has worked quite well.
And btw. what's the relation to the subject of this thread?
BM
RE: Wouldn't it be simpler
That's exactly what this bundling is trying to achieve, without the need to e.g. invent new file formats for input and output.
Interesting. But I'm afraid that for the ABP search this wouldn't help us reach our goal. The sensitivity has been carefully adjusted to not miss a signal, and the reason for speeding up the application (ABP1 vs ABP2) is that we want to process data in 'real time' i.e. as fast as it comes in from ARECIBO.
This refers to the innodb transaction log, right?
BM
RE: The sensitivity has
Will "realtime" speed be enough? After all, besides the constant flow of new data from ARECIBO, we have over 1200 hours of data already collected in the archive. They too must be processed (only about 70 hours have been processed so far).
So even at twice realtime speed (on the assumption that "realtime" is about 0.7 hours/day), it will take more than 4 years to "spend" this "cache" and begin to process the data synchronously with ARECIBO.
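Checking that estimate with the numbers above (reading the assumption as: ARECIBO delivers ~0.7 h of data per day, so at 2x realtime speed the other ~0.7 h/day of processing capacity goes toward the backlog):

```python
archive_hours = 1200 - 70   # archived data still waiting to be processed
net_gain_per_day = 0.7      # hours of backlog cleared per day at 2x realtime

days = archive_hours / net_gain_per_day
years = days / 365
# roughly 1600 days, i.e. a bit over 4 years, matching the post
```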
P.S.
By the way, can you tell me how much data corresponds to one ABP2 WU? Just a few seconds? And those few seconds take 2 MB of digital storage, right?
RE: ...we want to process
Um, isn't it on the funding chopping block?
I admit this is a bit old but Radio Telescope And Its Budget Hang in the Balance
Heck, they killed OTH-B for the lack of a few bucks and the same year they almost killed SOSUS though it may be gone by now too for all I know ...
RE: RE: One bottleneck
I'm not too sure which DB they stuck them on in the end. You'd have to ask Eric over at Berkeley which one they put it on.
BOINC blog
RE: By the way, can you
I think the original ARECIBO data files (~4 GB each) correspond to five minutes of observation time. They are pre-processed (dedispersed) so that each results in 628 workunits.
BM