Next ABP generation

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250721154
RAC: 35823
Topic 194759

Over the last weekend we tried to push up work generation for ABP2 such that we would process the data as fast as we get it from ARECIBO.

Many of you probably noticed the trouble our server infrastructure had keeping track of so many short-running workunits (the critical time I'd expect to be about 1h).

To avoid these problems I'm currently working on 'bundling' every four current ABP2 workunits into single new ones. This means that for every 'next generation' ABP task your client would download four data files, process them one after the other, upload four result files, but report only a single task (that ran 4x as long as a current task, but would get 4x as much credit, too).
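A minimal sketch of the bundling idea described above, for illustration only. The names `BundledTask`, `bundle` and `process`, the file names and the credit value are all hypothetical; this is not the actual Einstein@Home or BOINC code, just one way to picture four data files being handled as a single reported task.

```python
from dataclasses import dataclass
from typing import List

BUNDLE_SIZE = 4  # four current ABP2 workunits per new task, as described in the post


@dataclass
class BundledTask:
    """Hypothetical stand-in for one 'next generation' task."""
    data_files: List[str]  # input files the client would download
    credit: float          # credit granted for the whole bundle


def bundle(data_files: List[str], credit_per_file: float,
           size: int = BUNDLE_SIZE) -> List[BundledTask]:
    """Group consecutive data files into tasks of `size` files each."""
    tasks = []
    for start in range(0, len(data_files), size):
        chunk = data_files[start:start + size]
        # Credit scales with the number of files actually in the bundle.
        tasks.append(BundledTask(chunk, credit_per_file * len(chunk)))
    return tasks


def process(task: BundledTask) -> List[str]:
    """Process the files one after the other; one result file per input file."""
    return [name + ".result" for name in task.data_files]


files = [f"beam_{i}.dat" for i in range(8)]  # made-up file names
tasks = bundle(files, credit_per_file=1.0)   # credit value is illustrative
results = process(tasks[0])
print(len(tasks), tasks[0].credit, len(results))
```

With eight input files the server would track two tasks instead of eight, while each task still produces four result files.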

I'll try to make this backwards compatible to avoid yet another application (ABP3). If all goes well, a set of new App versions that can process both current and next generation work will be issued in the next few days. Behind the scenes I'll replace server-side daemons with versions that can also handle both kinds of results. So with luck the only things you'll notice of that change are new App versions and, later, longer running ABP2 workunits.

BM


Ver Greeneyes
Joined: 26 Mar 09
Posts: 140
Credit: 9562235
RAC: 0


So the problem isn't with total bandwidth or processing power, but rather with the rate of small, concurrent requests? Interesting: at first glance this seems like a temporary fix that won't actually solve anything in the long run - but perhaps it's more a case of finding the 'sweet spot' in your configuration, that any future upgrades will benefit from equally.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250721154
RAC: 35823


The bottleneck is clearly the database. The shorter the tasks run, the more tasks the system needs to keep track of for the same amount of total computing time. The database server can only keep a limited number of results in memory. If there are more, the database becomes limited by the I/O speed of its disks, which is what happened over the weekend.
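As a rough illustration of that relationship: for a fixed amount of computing time, the number of tasks the database has to track is inversely proportional to the run time per task. The figures below (total hours and per-task run times) are made-up assumptions, not project statistics; only the ratio matters.

```python
def tasks_to_track(total_cpu_hours: float, hours_per_task: float) -> float:
    """Number of tasks needed to cover a fixed amount of computing time."""
    return total_cpu_hours / hours_per_task


total_cpu_hours = 100_000.0  # hypothetical amount of total computing time
short_run = 0.5              # hypothetical run time of a current task, in hours
bundled_run = 4 * short_run  # a bundle of four runs four times as long

short_tasks = tasks_to_track(total_cpu_hours, short_run)
bundled_tasks = tasks_to_track(total_cpu_hours, bundled_run)

# The ratio is 4 regardless of the assumed totals above.
print(short_tasks, bundled_tasks, short_tasks / bundled_tasks)
```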

We'll have a new, larger database server in the next few weeks, but there may be other bottlenecks in the BOINC system that are currently hidden behind the database limitations. I've already seen the 'db purger' growing a backlog. And it will take some time until the new server is ready to use.

The right way to fix these problems is to reduce the number of tasks the database needs to keep track of, while still doing the same amount of 'work'. This is preliminary in the sense that the number of data files / 'old style tasks' grouped together is tied to the speed of the application. When the app becomes faster due to code improvements, better CUDA usage or a new generation of CPUs, we may raise this number of four again. The code of the app and the server-side components should be flexible enough to handle this.

BM


MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 139002861
RAC: 0


Message 96936 in response to message 96935

Quote:

The bottleneck is clearly the database. The shorter the tasks run, the more tasks the system needs to keep track of for the same amount of total computing time. The database server can only keep a limited number of results in memory. If there are more, the database becomes limited by the I/O speed of its disks, which is what happened over the weekend.

We'll have a new, larger database server in the next weeks, but it might be that there are other bottlenecks in the BOINC system that are currently hidden behind the database limitations. I've already seen the 'db purger' growing a backlog. And it will take some time until the new server is ready to use.

The right way to fix these problems is to reduce the number of tasks the database needs to keep track of, while still doing the same amount of 'work'. This is preliminary in the sense that the amount of datafiles / 'old style tasks' grouped together is bound to the speed of the application. When the app becomes faster due to code improvements, better CUDA usage or a new generation of CPUs, we may raise this number of four again. The code of the app and the server side components should be flexible enough to handle this.

BM

Wouldn't it be simpler to produce a single task of 3 or 4 times the size? This way you'd have fewer tasks and longer run times, which would reduce the server traffic, without the complexity of trying to bundle 4 tasks together.

Seti went through a similar issue and doubled their wu sensitivity. One bottleneck that we tried to address over there was the MySQL logging. A couple of SSDs were used for the log drives, which seem to cop a hammering, in order to spread the I/O.

RandyC
Joined: 18 Jan 05
Posts: 6621
Credit: 111139797
RAC: 0


Bernd,

You may want to take a look at this WU that seems to be having lots of problems.

All crunchers have failed to this point (3) and the max errors, etc is set at 20/20/20...meaning potentially lots of wasted crunching time.

Seti Classic Final Total: 11446 WU.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250721154
RAC: 35823


Message 96938 in response to message 96937

Quote:

Bernd,

You may want to take a look at this WU that seems to be having lots of problems.

All crunchers have failed to this point (3) and the max errors, etc is set at 20/20/20...meaning potentially lots of wasted crunching time.

They failed for different reasons. One is a CUDA-related error, one is a 'too many exits' error which is probably related to the problem described here, and only one is a segv that actually accumulated computing time. If there are more of these very same errors at the same point of a workunit, then I'll start to worry.

We do have a webpage that monitors workunits that have collected only client (or validate) errors and no successful result. The notification level is set to 2 x , i.e. a WU needs to have 4 errors to show up there, and this level worked quite well.

And btw. what's the relation to the subject of this thread?

BM


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250721154
RAC: 35823


Message 96939 in response to message 96936

Quote:
Wouldn't it be simpler to produce a single task of 3 or 4 times the size? This way you'd have fewer tasks and longer run times, which would reduce the server traffic, without the complexity of trying to bundle 4 tasks together.

That's exactly what this bundling is trying to achieve, without the need to e.g. invent new file formats for input and output.

Quote:
Seti went through a similar issue and doubled their wu sensitivity.

Interesting. But I'm afraid that for the ABP search this wouldn't help us reach our goal. The sensitivity has been carefully adjusted to not miss a signal, and the reason for speeding up the application (ABP1 vs ABP2) is that we want to process data in 'real time' i.e. as fast as it comes in from ARECIBO.

Quote:
One bottleneck that we tried to address over there was the MySQL logging. A couple of SSD's were used for the log drives as they seem to cop a hammering, in order to spread the I/O.

This refers to the InnoDB transaction log, right?

BM


Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2217014681
RAC: 494717


Message 96940 in response to message 96939

Quote:

The sensitivity has been carefully adjusted to not miss a signal, and the reason for speeding up the application (ABP1 vs ABP2) is that we want to process data in 'real time' i.e. as fast as it comes in from ARECIBO.


Will "realtime" speed be enough? Besides the constant flow of new data from ARECIBO, we have over 1200 hours of data already collected in the archive. That data must be processed too (only about 70 have been processed so far).
So even at twice realtime speed (assuming "realtime" is ~0.7 hours/day), it will take more than 4 years to work through this "cache" and begin processing data in sync with ARECIBO.
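The estimate above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the post; it assumes the "about 70" already processed is measured in hours, and that at twice realtime speed the backlog shrinks by one "realtime" worth of data per day.

```python
archive_hours = 1200.0    # data already collected, from the post
processed_hours = 70.0    # assumed to be hours; the post only says "about 70"
incoming_per_day = 0.7    # assumed "realtime" rate in hours of data per day
speed_factor = 2.0        # processing at twice realtime

backlog = archive_hours - processed_hours
# Only the capacity beyond the incoming rate reduces the backlog.
drain_per_day = (speed_factor - 1.0) * incoming_per_day
days = backlog / drain_per_day
years = days / 365.25

print(f"{days:.0f} days, about {years:.1f} years")
# prints "1614 days, about 4.4 years"
```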

P.S.
By the way, can you tell me how much data corresponds to one ABP2 WU? Just a few seconds? And those few seconds take 2 MB of digital storage, right?

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0


Message 96941 in response to message 96939

Quote:
...we want to process data in 'real time' i.e. as fast as it comes in from ARECIBO.


Um, isn't it on the funding chopping block?

I admit this is a bit old but Radio Telescope And Its Budget Hang in the Balance

Heck, they killed OTH-B for the lack of a few bucks and the same year they almost killed SOSUS though it may be gone by now too for all I know ...

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 139002861
RAC: 0


Message 96942 in response to message 96939

Quote:
Quote:
One bottleneck that we tried to address over there was the MySQL logging. A couple of SSD's were used for the log drives as they seem to cop a hammering, in order to spread the I/O.

This refers to the innodb transaction log, right?

BM

I'm not too sure which DB they stuck them on in the end. You'd have to ask Eric over at Berkeley which one they put it on.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250721154
RAC: 35823


Message 96943 in response to message 96940

Quote:
By the way, can you tell me how much data corresponds to one ABP2 WU? Just a few seconds? And those few seconds take 2 MB of digital storage, right?


I think the original ARECIBO data files (~4GB each) correspond to five minutes of observation time. Each is pre-processed (dedispersed) into 628 workunits.

BM

