The coming distributed processing explosion

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 317438156

RAC: 370508

RE: I just found a

14 Sep 2010 23:45:32 UTC

Message 99628 in response to message 99627

Quote:

I just found a screenshot I took almost 3 years ago. the floating point speed was 66 TFLOPS.

Well done! If you take the ratio of active hosts last week per TFLOPS, that yields 812.3 in 2007 and 356.8 now. So it took about 2.3 times as many hosts to make up a single TFLOPS then as it does now. That is progress ... :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 317438156

RAC: 370508

I've done another calculation

15 Sep 2010 2:36:25 UTC

Message 99629

I've done another calculation, looking at the ABP progress page, to answer the question: when will we be analysing in 'realtime'? Meaning: when do we catch up with the backlog of Arecibo beams, given that our current processing rate is in excess of 500% of the rate at which the data is acquired? This comes down to setting these two expressions equal and solving for time ( using the figures shown as I speak ):

41621 + 271 * time

( number of beams processed to date plus a time rate of processing per day thereafter .... )

68112 + 49 * time

( the number of beams existing to date plus a time rate of acquisition per day thereafter. If 41621 represents 61.1% of something, then ~ 68112 is that something .... )

where time is in days, counting from today as zero. This yields ~ 119.3, which is close enough to 4 calendar months, thus mid January 2011. I presume this means that, if all relevant matters remain then as they are now, there will be less ABP work to actually do once we catch up.
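For anyone who wants to re-run this as the progress page figures change, here is a minimal sketch of the same calculation. The function and variable names are mine, not anything from the project, and it assumes both rates stay constant, which they almost certainly won't.

```python
from datetime import date, timedelta

# Figures quoted in the post (beams, and beams per day).
processed_now = 41621     # beams processed to date
processing_rate = 271     # beams processed per day
existing_now = 68112      # beams existing to date (41621 / 0.611, roughly)
acquisition_rate = 49     # beams acquired per day


def catch_up_days(processed, rate_proc, existing, rate_acq):
    """Solve processed + rate_proc * t = existing + rate_acq * t for t (days)."""
    if rate_proc <= rate_acq:
        raise ValueError("processing never catches up with acquisition")
    return (existing - processed) / (rate_proc - rate_acq)


days = catch_up_days(processed_now, processing_rate, existing_now, acquisition_rate)

# Day zero is the date of the post, 15 Sep 2010.
catch_up_date = date(2010, 9, 15) + timedelta(days=round(days))

print(f"catch-up in about {days:.1f} days, around {catch_up_date}")
```

With the figures above this gives about 119.3 days, landing in the second week of January 2011.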

Cheers, Mike.

( edit ) But, of course, it's a sweet circumstance to have more horsepower under the hood than needed. As opposed to the converse ... :-)

( edit ) Here's a cheeky one ( as I blitz other assumptions to state an equal detection probability per beam ) : by that day in mid January some ~ 73951 beams will have been analysed. Given that we have found 2 new pulsars ( one fully announced, the other hinted as being in the pipeline ) to date, then that's ~ 20810 beams per new pulsar, and with ~ 32330 more to process by that day, then that's ~ 1.5 new pulsar discoveries awaiting between now and mid January. But I guarantee nothing .... :-)
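The cheeky estimate can be written out the same way. This is only a sketch of the arithmetic in the edit above; the variable names are mine, and the equal detection probability per beam is the same loud assumption as in the post.

```python
# Figures from the post.
processed_now = 41621      # beams processed to date
processing_rate = 271      # beams processed per day
days_to_catch_up = 119.3   # catch-up time estimated earlier in the post
pulsars_found = 2          # new pulsars found to date

# Beams still to be processed before the catch-up day, and the running total.
beams_remaining = processing_rate * days_to_catch_up
beams_at_catch_up = processed_now + beams_remaining

# Assumption: every beam is equally likely to contain a new pulsar.
beams_per_pulsar = processed_now / pulsars_found
expected_new_pulsars = beams_remaining / beams_per_pulsar

print(f"beams analysed by catch-up day: {beams_at_catch_up:.0f}")
print(f"beams per new pulsar so far:    {beams_per_pulsar:.0f}")
print(f"expected new pulsars:           {expected_new_pulsars:.2f}")
```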

tolafoph

Joined: 14 Sep 07

Posts: 122

Credit: 74659937

RAC: 0

RE: This yields ~ 119.3

15 Sep 2010 6:50:00 UTC

Message 99630

Quote:

This yields ~ 119.3 which is close enough to 4 calendar months, thus mid January 2011. I presume this means, if all relevant matters remain then as they are now, that there will be less ABP work to actually do when we do catch up.

I made similar calculations a few weeks ago.
And maybe there will be a new CUDA app by then. The real time crunching of ABP data only needs around 30 TFLOPS.
BOINC shows : NVIDIA GPU 0: GeForce GTX 260 (driver version 25896, CUDA version 3010, compute capability 1.3, 873MB, 537 GFLOPS peak)

So even if I estimate that the actual speed is only 30 GFLOPS for a GTX 260, only 1000 CUDA cards like mine would be necessary.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 317438156

RAC: 370508

RE: I made similar

15 Sep 2010 7:41:07 UTC

Message 99631 in response to message 99630

Quote:

I made similar calculations a few weeks ago.
And maybe there will be a new CUDA app by then. The real time crunching of ABP data only needs around 30 TFLOPS.
BOINC shows : NVIDIA GPU 0: GeForce GTX 260 (driver version 25896, CUDA version 3010, compute capability 1.3, 873MB, 537 GFLOPS peak)

So even if I estimate that the actual speed is only 30 GFLOPS for a GTX 260, only 1000 CUDA cards like mine would be necessary.

Yup. Assuming the E@H available total of ~ 320 TFLOPS is presently allocated 50:50 between GW and ABP units, we only need 160 * ( 49/271 ) ~ 28.9 TFLOPS to keep pace with present Arecibo data production. Mind you, the total E@H TFLOPS estimate is based on overall project RAC, so it doesn't separate CPU FLOPS from GPU FLOPS per se. Still a good ballpark figure though. It's impressive that 'only 1000 modest' video cards can give E@H about 10% of its computational capacity. But beware: while speed is one thing, correctness is still King ...
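Spelled out as a small sketch, with names of my own choosing. The 50:50 split and the 30 GFLOPS sustained per GTX 260 are the assumptions stated in the posts, not measured values.

```python
# Figures and assumptions from the posts above.
total_tflops = 320.0       # E@H total, estimated from overall project RAC
abp_share = 0.5            # assumed 50:50 split between GW and ABP work
processing_rate = 271      # beams processed per day
acquisition_rate = 49      # beams acquired per day

# TFLOPS currently spent on ABP, scaled down to the acquisition rate.
abp_tflops = total_tflops * abp_share
realtime_tflops = abp_tflops * acquisition_rate / processing_rate

# tolafoph's estimate: 1000 cards at an assumed 30 GFLOPS sustained each.
cards = 1000
gflops_per_card = 30.0
cards_tflops = cards * gflops_per_card / 1000.0
fraction_of_total = cards_tflops / total_tflops

print(f"TFLOPS needed for realtime ABP: {realtime_tflops:.1f}")
print(f"{cards} cards deliver {cards_tflops:.0f} TFLOPS, "
      f"{fraction_of_total:.1%} of the project total")
```

So 1000 such cards would just cover the ~ 28.9 TFLOPS needed, at a little under 10% of the project total.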

Cheers, Mike.

tullio

Joined: 22 Jan 05

Posts: 2118

Credit: 61407735

RAC: 0

The real bottleneck is in

15 Sep 2010 8:21:56 UTC

Message 99632

The real bottleneck is in transmission speed. The Allen Telescope Array is getting 1 GB/s of data, yet they have only a 40 Mbit/s link shared between the SETI Institute and Berkeley U.'s Radio Astronomy Laboratory. How they are going to process all the data they get, I do not know.
Tullio
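To put a number on that gap, a quick sketch. It assumes "1 GB/s" means 10^9 bytes per second and that the whole 40 Mbit/s link were available for the data, which it is not since the link is shared.

```python
# Figures from the post; the unit interpretation is an assumption.
acquisition_bytes_per_s = 1e9   # 1 GB/s, taken as 10**9 bytes per second
link_bits_per_s = 40e6          # 40 Mbit/s link

acquisition_bits_per_s = acquisition_bytes_per_s * 8

# How many times faster the data arrives than the link can carry it.
shortfall_factor = acquisition_bits_per_s / link_bits_per_s

print(f"data arrives {shortfall_factor:.0f}x faster than the link can ship it")
```

In other words, each second of observation would need about 200 seconds on the link, so most of the data has to be reduced or processed on site.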

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250528537

RAC: 34004

- The computing power

15 Sep 2010 13:59:24 UTC

Message 99633

- The computing power available is still the limiting factor for the sensitivity at least of the GW search. With more computing power we could run more sensitive searches for GW even on old data. We're still analyzing S5 and haven't touched S6, so I don't think we'll run out of "work" until Advanced LIGO data is available.

- As for the radio pulsar search I think the data available will last for "work" (in terms of BOINC) until the end of the year. We're working on a new "workunit generator" which will enable us to use data e.g. from radio telescopes other than Arecibo to search with the new method.

- The "new server" I wrote about is actually a spare data server from the ATLAS computing cluster; we didn't buy it new. The new setup was made to increase the server capacity for more and faster processing of the binary pulsar search.

- Some years ago we found that having a significant number of tasks that run for less than an hour poses a problem to the BOINC system, in particular to the database. Therefore the current ABP tasks are actually "bundles" of four "micro-tasks" that are processed one after the other, more or less independently. When the processing speed increases further, e.g. because of the next generation CUDA App, we'll increase the "bundle size" to keep the database healthy.

- The intention is that the next generation ABP CUDA Apps will have the same computing requirements as the current one (basically 512MB RAM, single precision only). They should run on Fermis, but not require them.
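The bundling idea can be illustrated with a small sketch. This is not the project's actual workunit generator; the one-hour target comes from the post, while the function name and the example micro-task runtimes are hypothetical.

```python
import math

TARGET_MINUTES = 60  # tasks shorter than about an hour strain the BOINC database


def bundle_size(micro_task_minutes, target_minutes=TARGET_MINUTES):
    """Smallest number of micro-tasks whose combined runtime reaches the target."""
    if micro_task_minutes <= 0:
        raise ValueError("micro-task runtime must be positive")
    return max(1, math.ceil(target_minutes / micro_task_minutes))


# Hypothetical runtimes: as the app gets faster, the bundle has to grow.
for minutes in (20, 15, 5):
    print(f"{minutes:>2} min per micro-task -> bundle of {bundle_size(minutes)}")
```

A micro-task of 15 minutes gives a bundle of four, matching the current setup; a faster app with 5 minute micro-tasks would need a bundle of twelve to stay above the hour.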

tolafoph

Joined: 14 Sep 07

Posts: 122

Credit: 74659937

RAC: 0

I looked into the history of

15 Sep 2010 16:11:57 UTC

Message 99634

I looked into the history of Einstein@home and saw that we have been running S5 tasks since June 2006. So I'm not worried that the project will run out of work after the big pile of Arecibo data is crunched.

Matt Giwer

Joined: 12 Dec 05

Posts: 144

Credit: 6891649

RAC: 0

Let me throw this out for

16 Sep 2010 3:23:47 UTC

Message 99635

Let me throw this out for comment.

All this talk about GPUs is fine.

From what I have seen, the target intro price for the hottest GPU card for game addicts is US$500-600. Not since IBM controlled the BIOS have I paid that kind of money for an entire computer.* And then I see cards for US$50 which are great improvements over the usual motherboard graphics.

Has anyone compared price to performance? I mean I read of two GPUs being better than one. Is that two $50 or two $500? Are you swaggering braggarts ® all gamers or is this about throughput?

There is a basic problem with comparisons in that they are impossible. The common "all else being equal" or the technical "change only one variable at a time" is thwarted by the incredible variety of the "all else" which is never equal.

Obviously the more hyperthreaded cores the better but it is not clear that more than one GPU is better. And how much better is also a question. And what price level GPU is being discussed?

=====
* Not only that old but I taught myself Fortran on an RCA time share computer in 1967.

tullio

Joined: 22 Jan 05

Posts: 2118

Credit: 61407735

RAC: 0

I had to install gcc-fortran,

16 Sep 2010 5:59:10 UTC

Message 99636

I had to install gcc-fortran, blas and lapack this morning just to compile an "octave" package, a kind of open-source Matlab, to run some setiquest programs. All went fine, but I must also install gnuplot.
Tullio

ExtraTerrestria...

Joined: 10 Nov 04

Posts: 770

Credit: 578346872

RAC: 197536

@Matt: the simple answer is..

16 Sep 2010 19:40:59 UTC

Message 99637

@Matt: the simple answer is.. it depends!

For the current Einstein app it's not of much use to discuss GPU performance, as it uses only a rather small fraction of the GPU anyway. You can use one GPU for several instances of this app (see the other current thread), so adding more GPUs is not really going to help (unless you've got really many cores, say a Quad AMD 12 core or so). You need a mid to high end GPU though, in order not to slow down the CPU. The slower the CPU, the slower the GPU can be.

In most other projects it's like this: you've got certain classes of cards which all perform (more or less) according to their maximum FLOP ratings. Memory bandwidth is mostly not important (and the cards tend to have similar bandwidth per FLOP configurations, as this is what they need for gaming). If you take 2 smaller cards whose maximum FLOPS add up to those of one larger card of the same class, then they'll also deliver approximately the same RAC.

What I mean by classes: at Milkyway an ATI can use 2/5 of its single precision FLOPs for crunching. Apart from FLOPs all supported cards don't differ much. On the other hand you've got nVidias, which are in a totally different league. There's GT200a/b (CUDA compute capability level [CC] 1.3), which can use 1/8 of its single precision FLOPs at MW. There's GF100 / Fermi (CC 2.0) which can also use 1/8 of its FLOPs, but is fundamentally different (3 of the old maximum theoretical FLOPs are as good as 2 of the new ones). And now we've got GF104 and GF106 (CC 2.1), which can use 1/12 of their single precision FLOPS and are otherwise somewhat similar to GF100 (new FLOPs).
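As a rough sketch of how these classes translate into numbers: the fractions below are the ones I just listed, the dictionary keys and function name are made up for illustration, and the old-vs-new FLOPs difference between GT200 and Fermi is deliberately left out, so don't compare across those two rows directly.

```python
# Fraction of peak single precision FLOPS usable at Milkyway, per card class.
USABLE_FRACTION_MW = {
    "ATI": 2 / 5,
    "GT200": 1 / 8,        # CC 1.3
    "GF100": 1 / 8,        # CC 2.0 (Fermi, "new" FLOPs)
    "GF104/GF106": 1 / 12, # CC 2.1
}


def usable_gflops(peak_gflops, card_class):
    """Peak single precision GFLOPS scaled by the class's usable fraction."""
    return peak_gflops * USABLE_FRACTION_MW[card_class]


# Example: the GTX 260 (a GT200 card) reported earlier at 537 GFLOPS peak.
print(f"GTX 260 at Milkyway: about {usable_gflops(537, 'GT200'):.0f} GFLOPS usable")
```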

At GPU-Grid a card with CC 1.1 (e.g. the good old G92 chip) needs about 40% more FLOPs to do the same work as a card with CC 1.2 (e.g. GT240) or CC 1.3. The numbers haven't settled yet for the Fermis, i.e. client updates will probably improve their performance further.

So you see that the basic features which the hardware supports (summarized by nVidia in the CC) determine performance as much as raw horsepower does. I know this is getting messy as more and more different chips and cards enter the playing field - but I don't think there's any further shortcut than what I already described.

MrS

Scanning for our furry friends since Jan 2002
