Over the last weekend we tried to push up work generation for ABP2 such that we would process the data as fast as we get it from ARECIBO.
Many of you probably noticed the trouble our server infrastructure had keeping track of so many short-running workunits (the critical runtime I'd expect to be about 1 h).
To avoid these problems I'm currently working on 'bundling' every four current ABP2 workunits into a single new one. This means that for every 'next generation' ABP task your client would download four data files, process them one after the other, upload four result files, but report only a single task (that ran 4x as long as a current task, but would get 4x as much credit, too).
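A rough sketch of the client-side flow described above (the function names are purely illustrative, not actual BOINC API calls):

```python
BUNDLE_SIZE = 4  # four current ABP2 workunits per 'next generation' task

def process(data_file):
    # stand-in for the real ABP2 analysis of one data file
    return f"result_for_{data_file}"

def run_bundled_task(data_files):
    # process the downloaded files one after the other ...
    results = [process(f) for f in data_files]
    # ... yielding one result file per input, but only ONE reported task
    reports = 1
    return results, reports

# four downloads, four uploads, a single report to the scheduler
results, reports = run_bundled_task([f"abp2_{i}.dat" for i in range(BUNDLE_SIZE)])
```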
I'll try to make this backwards compatible to avoid yet another application (ABP3). If all goes well, a set of new App versions will be issued in the next few days that can process both current and next generation work. Behind the scenes I'll replace server-side daemons with versions that can also handle both kinds of results. So with luck the only things you'll notice of this change are new App versions and, later, longer-running ABP2 workunits.
BM
Next ABP generation
So the problem isn't with total bandwidth or processing power, but rather with the rate of small, concurrent requests? Interesting: at first glance this seems like a temporary fix that won't actually solve anything in the long run - but perhaps it's more a case of finding the 'sweet spot' in your configuration, one that any future upgrades will benefit from equally.
The bottleneck is clearly the database. The shorter the tasks run, the more tasks the system needs to keep track of for the same amount of total computing time. The database server can only keep a limited number of results in memory. If there are more, the database becomes limited by the I/O speed of its disks, which is what happened over the weekend.
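A quick back-of-envelope of why shorter tasks mean more database rows. The throughput and turnaround numbers below are purely illustrative assumptions, not project figures:

```python
# Rows the database must track for a fixed amount of computing:
# roughly (tasks issued per day) x (days until a task is reported).
total_compute_hours_per_day = 100_000  # assumed aggregate host throughput
turnaround_days = 7                    # assumed average report turnaround

def results_in_flight(task_runtime_hours):
    tasks_per_day = total_compute_hours_per_day / task_runtime_hours
    return tasks_per_day * turnaround_days

short_tasks = results_in_flight(1.0)  # ~1 h tasks, as today
bundled = results_in_flight(4.0)      # 4x longer bundled tasks
# same total computing, but bundling cuts the in-flight row count by 4
```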
We'll have a new, larger database server in the next weeks, but it might be that there are other bottlenecks in the BOINC system that are currently hidden behind the database limitations. I've already seen the 'db purger' growing a backlog. And it will take some time until the new server is ready to use.
The right way to fix these problems is to reduce the number of tasks the database needs to keep track of, while still doing the same amount of 'work'. This is preliminary in the sense that the number of data files / 'old style tasks' grouped together is bound to the speed of the application. When the app becomes faster due to code improvements, better CUDA usage or a new generation of CPUs, we may raise this number above four again. The code of the app and the server-side components should be flexible enough to handle this.
BM
RE: The bottleneck is
Wouldn't it be simpler to produce a single task of 3 or 4 times the size? This way you'd have fewer tasks, and the longer run times would reduce the server traffic, without the complexity of trying to bundle 4 tasks together.
SETI went through a similar issue and doubled their WU sensitivity. One bottleneck that we tried to address over there was the MySQL logging. A couple of SSDs were used for the log drives, as they seem to cop a hammering, in order to spread the I/O.
BOINC blog
Bernd, You may want to
Bernd,
You may want to take a look at this WU that seems to be having lots of problems.
All crunchers (3) have failed so far, and max errors etc. is set at 20/20/20 ... meaning potentially lots of wasted crunching time.
Seti Classic Final Total: 11446 WU.
RE: Bernd, You may want to
They failed for different reasons. One is a CUDA-related error, one is a 'too many exits' error which is probably related to the problem described here, and only one is a segv that actually accumulated computing time. If there are more of these very same errors at the same point of a workunit, then I'll start to worry.
We do have a webpage that monitors workunits that have collected only client (or validate) errors and no successful result. The notification level is set to 2 x, i.e. a WU needs to have 4 errors to show up there, and this level has worked quite well.
And btw. what's the relation to the subject of this thread?
BM
RE: Wouldn't it be simpler
That's exactly what this bundling is trying to achieve, without the need to e.g. invent new file formats for input and output.
Interesting. But I'm afraid that for the ABP search this wouldn't help us reach our goal. The sensitivity has been carefully adjusted to not miss a signal, and the reason for speeding up the application (ABP1 vs ABP2) is that we want to process data in 'real time' i.e. as fast as it comes in from ARECIBO.
This refers to the innodb transaction log, right?
BM
RE: The sensitivity has
Will "realtime" speed be enough? After all, besides the constant flow of new data from ARECIBO, we have over 1200 hours of data already collected in the archive. They too must be processed (only about 70 hours have been processed so far).
So even at twice realtime speed (on the assumption that "realtime" is about 0.7 hours/day), it will take more than 4 years to "spend" this "cache" and begin to process the data synchronously with ARECIBO.
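Checking that estimate with the numbers above (reading the assumption as: ARECIBO delivers ~0.7 h of data per day, so at 2x realtime speed the other ~0.7 h/day of processing capacity goes toward the backlog):

```python
archive_hours = 1200 - 70   # archived data still waiting to be processed
net_gain_per_day = 0.7      # hours of backlog cleared per day at 2x realtime

days = archive_hours / net_gain_per_day
years = days / 365
# roughly 1600 days, i.e. a bit over 4 years, matching the post
```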
P.S.
By the way, can you tell me how much data corresponds to one ABP2 WU? Just a few seconds? And those few seconds take 2 MB of digital storage, right?
RE: ...we want to process
Um, isn't it on the funding chopping block?
I admit this is a bit old but Radio Telescope And Its Budget Hang in the Balance
Heck, they killed OTH-B for the lack of a few bucks and the same year they almost killed SOSUS though it may be gone by now too for all I know ...
RE: RE: One bottleneck
I'm not too sure which DB they stuck them on in the end. You'd have to ask Eric over at Berkeley which one they put it on.
BOINC blog
RE: By the way, can you
I think the original ARECIBO data files (~4 GB each) correspond to five minutes of observation time. They are pre-processed (dedispersed) so that each results in 628 workunits.
BM