New Einstein@Home Radio Pulsar Search and NVIDIA GPU Code

Einstein@Home is beginning a new round of searching for radio pulsars in short-orbital-period binary systems.

This is accompanied by the release of a new application (called BRP3). The new application is particularly efficient on NVIDIA Graphics Processor Cards (up to a factor of 20 faster than the CPU-only application). In addition, when running on an NVIDIA GPU card, this new application makes very little use of the CPU (typically around 20% CPU use when the GPU is devoted to Einstein@Home).

The NVIDIA GPU application is initially available for Windows and Linux only. We hope to have a Macintosh version available soon. Due to limitations in the NVIDIA drivers, the Linux version still makes heavy use of the CPU. This will be fixed in Spring 2011, when a new version of the NVIDIA Driver is released. Many thanks to NVIDIA technical support for their assistance!

Because we have exhausted the backlog of data from the Arecibo Observatory, this new application is being shipped with data from the Parkes Multibeam Pulsar Survey (from the Parkes Radio Telescope in Australia). In the coming weeks we also expect to start using this new application on fresh Arecibo data taken with the latest 'Mock Spectrometer' back-end.

Questions, problems or bug reports related to this new application and search should be reported in this news item thread as a 'Comment'.

Bruce Allen
Director, Einstein@Home

Comments

Jeroen
Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

RE: Thanks! I got the

Quote:

Thanks!

I got the impression that tasks hang only if started with run_client in the background without an active X desktop, but rarely if ever stall when started with run_manager and the BOINC Manager open. Just a first impression, though.

CU

HB

I'm currently running console-only via a small busybox image and run_client instead of run_manager. I did notice yesterday that it took a few minutes for the CUDA tasks to start up. Initially the tasks were in a paused state, and then after five minutes or so they started running.

telegd
telegd
Joined: 17 Apr 07
Posts: 91
Credit: 10212522
RAC: 0

RE: The cross validation

Quote:
The cross validation problem is not a driver issue, it will be fixed with the next app version.


I figured as much. Thanks for the confirmation.

Quote:
16 bit Suse 11.2 box


I got a smile from your typo...

I haven't done any proper testing, but I notice that (sometimes) the X-Windows process starts taking a lot of CPU time (almost a full core) when the GPU app is running. Happens for my PrimeGrid app too. I am not sure what triggers it and I can't be sure that it never happened on the 260 drivers. However, I don't think I ever saw it before....

mickydl*
mickydl*
Joined: 7 Oct 08
Posts: 39
Credit: 200374822
RAC: 0

I'm running the 270 driver

I've been running the 270 driver for the past two weeks now. Except for a few errors I got while I was still messing around with the app_info.xml, not a single WU has failed, hung, or come back invalid.

It's running on two machines, one with a GeForce 9800 and one with a GTX 470.
The OS is a self-compiled 64-bit LFS Linux with 32-bit compatibility libs installed. No X running on either machine.

Michael

astrocrab
astrocrab
Joined: 28 Jan 08
Posts: 208
Credit: 429202534
RAC: 0

RE: As for the 270 driver:

Quote:


As for the 270 driver: It runs flawlessly on my 16 bit Suse 11.2 box, but I've been getting hanging CUDA apps since updating the driver on my 64 bit Linux box.

Does anybody see the same on a 64 bit Linux w/ 270 driver?

CU
HB

i've been running the 270 driver on several ubuntu 10.10 x64 machines with a gtx 560 for several weeks. no hangs at all.

i thought 16 bit machines went extinct soon after the dinosaurs.

when will the new app be available (estimated)? =)

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 733567153
RAC: 1274099

RE: i thought 16 bit

Quote:


i thought 16 bit machines went extinct soon after the dinosaurs.

Lol!!!!

Oh my dear... but by typos like that you can recognize people who have actually done asm programming on 8-bit processors and thought that 16 bit was heaven :-)

CU
HB

Donald A. Tevault
Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

RE: RE: i thought 16 bit

Quote:
Quote:


i thought 16 bit machines went extinct soon after the dinosaurs.

Lol!!!!

Oh my dear... but by typos like that you can recognize people who have actually done asm programming on 8-bit processors and thought that 16 bit was heaven :-)

CU
HB

Have you ever worked with transistorized computers? They're much more fun than the modern ones.

astrocrab
astrocrab
Joined: 28 Jan 08
Posts: 208
Credit: 429202534
RAC: 0

RE: Oh my dear...but by

Quote:


Oh my dear... but by typos like that you can recognize people who have actually done asm programming on 8-bit processors and thought that 16 bit was heaven :-)

CU
HB

i did some programming on the Z80 =) and today's multi-gigabyte software scares me ))
but where did you get a 16-bit cpu? even the ancient 386 was already 32-bit.

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 733567153
RAC: 1274099

Not wanting to hijack the

Not wanting to hijack the thread, but since you asked:

My first assembly program was on the 6502, the 8-bit CPU of a Commodore VIC 20. Early '80s of the previous century.

Then I had an Intel 8086-based PC (or 8088, I can't remember), which was logically a 16-bit CPU.

:-)

HB

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 319494309
RAC: 424777

RE: Not wanting to hijack

Quote:

Not wanting to hijack the thread, but since you asked:

My first assembly program was on the 6502, the 8-bit CPU of a Commodore VIC 20. Early '80s of the previous century.

Then I had an Intel 8086-based PC (or 8088, I can't remember), which was logically a 16-bit CPU.

:-)

HB


Ah, there are other dinosaurs about that know of what you speak! :-):-)

Yeah, I did 6502 as well, on the C-64. Quite a laugh sorting out its indirection/pointer instructions, as I recall. Pretty well everything bar the power switch was memory mapped, so you got direct access to the lot. I too graduated to the 8088 first and then the 8086, using MASM and then "Programmer's WorkBench" - a good IDE for its day. The learning hump for me was understanding stack frames correctly. The 8088 is internally the same and code-compatible, but it had only an 8-bit memory bus; word alignment was thus a performance issue for the 8086.

16 bit was like : "Really? Wow! Can I have a try? Please ..... "

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ... - Blaise Pascal

... and my other CPU is a Ryzen 5950X :-)

tullio
tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

I have used the Z80 and the

I have used the Z80 and the Z8000, which was 16-bit. The 32-bit Z80000 never appeared, so I switched to the 68010 and the following Motorola chips.
Tullio

Ver Greeneyes
Ver Greeneyes
Joined: 26 Mar 09
Posts: 140
Credit: 9562235
RAC: 0

If you've ever looked at or

If you've ever looked at or created a boot sector for your HDD, you'll know that even today's CPUs still start up in a legacy 16-bit mode :)

Donald A. Tevault
Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

RE: Not wanting to hijack

Quote:

Not wanting to hijack the thread, but since you asked:

My first assembly program was on the 6502, the 8-bit CPU of a Commodore VIC 20. Early '80s of the previous century.

Then I had an Intel 8086-based PC (or 8088, I can't remember), which was logically a 16-bit CPU.

:-)

HB

Some interesting posts here, but not quite on topic. A new thread, perhaps?

Oliver Behnke
Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 23

RE: Update: I checked the

Quote:

Update: I checked the 260.19.36 as well as the 270.18 beta driver. While the first doesn't fix the issue (as expected), the latter does. Well, at least in a first test run that still needs to finish... If all turns out fine, we might publish an unofficial release of our Linux 32-bit app that relies on this new driver. You are free to install and run the new driver/app combo manually as you please (using an appropriate app_info.xml file).

Stay tuned for more...

Yet another update: we will shortly release a Linux CUDA app specifically for use with the NVIDIA 270.xx beta driver. As soon as you install this driver, our server will send you the new app, which behaves like a normal BOINC CUDA app, reducing the CPU consumption as much as possible.

We'll post a tech news item as soon as the new app is released (it's imminent).

Cheers,
Oliver

Einstein@Home Project

Jeroen
Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

RE: Yet another update: we

Quote:


Yet another update: we will shortly release a Linux CUDA app specifically for use with the NVIDIA 270.xx beta driver. As soon as you install this driver, our server will send you the new app, which behaves like a normal BOINC CUDA app, reducing the CPU consumption as much as possible.

We'll post a tech news item as soon as the new app is released (it's imminent).

Cheers,
Oliver

That is great news. Thanks.

Donald A. Tevault
Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

RE: Yet another update: we

Quote:


Yet another update: we will shortly release a Linux CUDA app specifically for use with the NVIDIA 270.xx beta driver. As soon as you install this driver, our server will send you the new app, which behaves like a normal BOINC CUDA app, reducing the CPU consumption as much as possible.

We'll post a tech news item as soon as the new app is released (it's imminent).

Cheers,
Oliver

Cool. As it happens, I installed an nVidia card and the 270 driver in one of my machines only yesterday.

astrocrab
astrocrab
Joined: 28 Jan 08
Posts: 208
Credit: 429202534
RAC: 0

i'm sorry, i don't understand

i'm sorry, i don't understand clearly enough: when can we use the new 1.07 version?

Michael Karlinsky
Michael Karlinsky
Joined: 22 Jan 05
Posts: 888
Credit: 23502182
RAC: 0

RE: i'm sorry, i don't

Quote:
i'm sorry, i don't understand clearly enough: when can we use the new 1.07 version?

1.07 for BRP is an official app, which is downloaded automatically; see the apps page.

Michael

PS: For linux && NVIDIA 270.* beta driver, see Oliver's post below.

astrocrab
astrocrab
Joined: 28 Jan 08
Posts: 208
Credit: 429202534
RAC: 0

RE: PS: For linux &&

Quote:

PS: For linux && NVIDIA 270.* beta driver, see Oliver's post below.


below? i can't see any (

Bernd Machenschalk
Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250665929
RAC: 34674

There are three new Apps, all

There are three new apps, all with version number 1.07. All are built from basically the same code, which should avoid GPU-CPU cross-validation problems.

One is for Windows; one is for Linux, works with all drivers, but uses a full CPU core. These two you should get automatically from now on.

There is a third one for Linux that will work only with driver version 270. If you feel you need to, you can already download the executable from here (it will take some work for you to get it to run). However, as soon as I get to it, I will modify the scheduler so that Linux users who have installed the 270 driver will get this app automatically.

hth

BM

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 733567153
RAC: 1274099

According to Bernd's tech

According to Bernd's tech news item, the Linux app taking advantage of the driver bug fix in the 270.* Linux drivers will come online on Monday, 21 Feb 2011. The Linux and Windows 1.07 CUDA apps being distributed starting today fix all known GPU/CPU cross-validation problems.

HB

Jeroen
Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

I have the new 270 based app

I have the new 270-based app running currently with my 295. I don't have much runtime yet, but so far so good. Here is an updated app_info.xml in case anyone else is interested. CPU load is near zero and the app seems to be performing great from what I have seen so far.

1841 49.3 2.2 83972 90440 ? RNl 17:55 1:24 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270
1843 49.7 2.1 83012 88972 ? RNl 17:55 1:24 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270

load average: 0.02, 0.02, 0.03
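
For reference, here's a minimal sketch of what the nv270 entry in such an app_info.xml might look like. The element names follow the standard BOINC app_info.xml format and the file name is taken from the ps listing above; the avg_ncpus value of 0.2 is just an assumption matching the roughly 20% CPU use reported for this app, and I've omitted any other file entries a complete file would need:

<app_info>
    <app>
        <name>einsteinbinary_BRP3</name>
    </app>
    <file_info>
        <name>einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>einsteinbinary_BRP3</app_name>
        <version_num>107</version_num>
        <avg_ncpus>0.2</avg_ncpus>
        <max_ncpus>1</max_ncpus>
        <plan_class>BRP3cuda32nv270</plan_class>
        <coproc>
            <type>CUDA</type>
            <count>1</count>
        </coproc>
        <file_ref>
            <file_name>einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>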

astrocrab
astrocrab
Joined: 28 Jan 08
Posts: 208
Credit: 429202534
RAC: 0

With this new

With this new einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270 application
i get a floating cpu load of 25-80% (averaging about 50%) instead of a constant 100% with the previous fullCPU app, but the time to complete a WU also rose from ~4000 seconds to ~5000 seconds. =(
this means the fullCPU app works faster than the 270 app.
what am i doing wrong?

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117772635331
RAC: 34791331

RE: I have the new 270

Quote:
I have the new 270-based app running currently with my 295. I don't have much runtime yet, but so far so good ....


I have a lot of hosts running Linux and a few running Windows but no nVidia cards as yet. I have 12 HD4850s on MWAH and I'm hoping that an OpenCL app might appear soon enough so I've been resisting the urge to buy a few nVidia cards. However, I've been keeping track of the CUDA app development and it's pretty hard to resist the urge to put some cards in a few Linux machines, particularly now that the remaining 'impediments' seem to be disappearing rather quickly.

I have (or have access to) hosts running Linux, MacOSX and Windows and my preference is very much towards Linux and MacOSX, with Windows a distant last. They (being unix) suit my style of micromanagement (writing shell scripts, etc.) much better :-). I've just finished browsing your linked app_info.xml and I have a few comments you might be interested in.

  * You've catered for GC1HF, ABP2 and BRP3, but surely you could omit ABP2 since the chances of getting any must be virtually nil.

* Even your most recently returned results are listed as '1.06' - there's no transition to '1.07' showing on the website. Having perused your app_info.xml, I think I can tell you how to correct that. Are the new tasks you most recently downloaded also showing as '1.06' in BOINC Manager? If they are, and if your working app_info.xml is similar to the one in the link, all you need to do is swap the order of the two <app_version> clauses: just put the one with a <version_num> of 107 first and the 106 one second (see the fragment after this list).

* You appear to be still getting validate errors, and some 'inconclusive' matches as well, in your recent returns. It looks like there are still problems with the 1.07 nv270 app.

* Your app_info.xml says that you will be doing '1.06'-branded tasks with the 1.07 app. This is fine, but it also implies that tasks started with 1.06 would have been completed with 1.07. That is also fine in the CPU world (usually), as long as the format of a checkpoint hasn't changed. I don't know about the GPU world, but can you perhaps check whether the tasks now showing as validate errors were started with 1.06 and finished with 1.07? Maybe there's a problem doing that.
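
Here's a hypothetical fragment illustrating the reordering from the second point (most elements omitted for brevity; your actual entries will contain more):

<app_version>
    <app_name>einsteinbinary_BRP3</app_name>
    <version_num>107</version_num>
    <!-- ... remaining elements unchanged ... -->
</app_version>
<app_version>
    <app_name>einsteinbinary_BRP3</app_name>
    <version_num>106</version_num>
    <!-- ... remaining elements unchanged ... -->
</app_version>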

I've got to go right now so I'll add some more to the above list when I get a chance. Not sure when that will be as I've got a few pressing commitments right now.

Cheers,
Gary.

telegd
telegd
Joined: 17 Apr 07
Posts: 91
Credit: 10212522
RAC: 0

RE: With this new

Quote:
With this new einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270 application
i get a floating cpu load of 25-80% (averaging about 50%) instead of a constant 100% with the previous fullCPU app, but the time to complete a WU also rose from ~4000 seconds to ~5000 seconds. =(
this means the fullCPU app works faster than the 270 app.
what am i doing wrong?

I have a couple of valid WUs with the new 1.07 and my completion time has also gone up a little. I just figured the old app used teamwork between the CPU & GPU to get a slightly better time. I don't mind, though - freeing up a core for other work is worth a slight slowdown on the GPU.

My "0.05 CPU" for the new app runs consistently at 20% of an i7-860 core (non-shared). Seems OK to me.

I suppose that means (for people with better cards than mine) that you could run about 4 or 5 GPU apps using one CPU. Just a guess...
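
The arithmetic behind that guess, assuming the ~20% figure holds per task: 100% of one core / 20% per GPU task = 5 concurrent GPU tasks serviced by a single CPU core.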

Jeroen
Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

RE: I have a lot of hosts

Quote:

I have a lot of hosts running Linux and a few running Windows but no nVidia cards as yet. I have 12 HD4850s on MWAH and I'm hoping that an OpenCL app might appear soon enough so I've been resisting the urge to buy a few nVidia cards. However, I've been keeping track of the CUDA app development and it's pretty hard to resist the urge to put some cards in a few Linux machines, particularly now that the remaining 'impediments' seem to be disappearing rather quickly.

Hopefully we will see an OpenCL application to cover the ATI cards as well. Regarding adding CUDA cards, there are some good deals on eBay for the previous two generations of NVIDIA cards, since people are upgrading to the 5xx series.

Quote:

I have (or have access to) hosts running Linux, MacOSX and Windows and my preference is very much towards Linux and MacOSX, with Windows a distant last. They (being unix) suit my style of micromanagement (writing shell scripts, etc.) much better :-).

I have the same preference. I prefer not having more Windows systems on my network than necessary, due to having to keep them updated and secure. These days I boot my Linux image via a PXE server and store the project data on NFS, so as not to need separate disks and OS installs on each system.

Quote:

I've just finished browsing your linked app_info.xml and I have a few comments you might be interested in.
  * You've catered for GC1HF, ABP2 and BRP3, but surely you could omit ABP2 since the chances of getting any must be virtually nil.
    ...
I've got to go right now so I'll add some more to the above list when I get a chance. Not sure when that will be as I've got a few pressing commitments right now.

Thanks for all the comments! I went ahead and updated my app_info.xml file with the suggested changes, including removing ABP2 and reordering the versions for BRP3. I'll keep an eye on the WU processing to check for WUs that fail validation. Prior to the latest apps, I was seeing anywhere from 6 to 24 invalid WUs per day. When I started running the new app yesterday, there were two work units still in progress whose versions I switched. Perhaps it would have been better to finish those with the old app.

Jeroen
Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

RE: With this new

Quote:
With this new einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270 application
i get a floating cpu load of 25-80% (averaging about 50%) instead of a constant 100% with the previous fullCPU app, but the time to complete a WU also rose from ~4000 seconds to ~5000 seconds. =(
this means the fullCPU app works faster than the 270 app.
what am i doing wrong?

I am seeing a similar performance difference between the fullCPU app and the 270 app. This is running one WU per GPU.

FullCPU App: 2954 seconds
270 App: 3674 seconds

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 733567153
RAC: 1274099

Hi! At what niceness level

Hi!

At what niceness level is the 270 app running on your host?
(usually the column "NI" in top).

A small performance drop should be expected in return for the lower CPU utilization, but the reported figures seem a bit too slow. If I remember correctly, the app should run at niceness 10 or so, while the other CPU apps run at nice level 19, to ensure that the CUDA app is a bit more likely to get the CPU once the GPU computations are finished.

Note that if you are using your own app_info.xml file, make sure to set avg_ncpus to a value < 1.0 when using the new nv270 app variant; otherwise BOINC will start it with niceness 19.
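
For example, inside the <app_version> clause (a fragment; the 0.2 is just an arbitrary value below 1.0, roughly matching the CPU use people have reported):

<avg_ncpus>0.2</avg_ncpus>  <!-- below 1.0, so the app is not started at niceness 19 -->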

CU
HB

mickydl*
mickydl*
Joined: 7 Oct 08
Posts: 39
Credit: 200374822
RAC: 0

On my machine the niceness

On my machine the niceness level seems OK. The CUDA app is running with a nice level of 10, everything else with 19.

Michael

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 733567153
RAC: 1274099

Good to know, thanks. I

Good to know, thanks.

I did some "back of the envelope" calculations and I'm now less surprised about the runtime increase. Here's the essence of it:

One BRP3 task consists of 4 sub-units. Each sub-unit tries ca. 12k orbital templates on a Parkes data sample, so every task works through ca. 50k templates.

For every template, several so-called CUDA kernels (code executed on the GPU) have to be started in sequence. I don't know the exact number of kernel invocations, but from what I do know it must be > 10. Maybe more like 20, depending on how the FFT part works.

That means there will be > ca. 500k kernel invocations per task. If you divide the observed slowdown of ca. 1000 seconds (which seems to be pretty independent of GPU speed) by that number, you get an increase in CUDA kernel invocation latency of ca. 2 milliseconds. This is the same order of magnitude as the time slice of a "niced" process in most Linux kernels.
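
Spelled out, the estimate is:

4 sub-units × ~12,000 templates ≈ 48,000 ≈ 50k templates per task
50k templates × >10 kernel launches each ⇒ >500k kernel launches per task
~1000 s extra ÷ ~500,000 launches ≈ 2 ms of extra latency per launch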

Not sure what this means for the project, though. Some people don't like the GPU app to occupy a whole core; others don't mind and insist on maximum productivity. Maybe it would be best to make this configurable somehow.

CU
HB

astrocrab
astrocrab
Joined: 28 Jan 08
Posts: 208
Credit: 429202534
RAC: 0

25% of performance is too

25% of performance is too big a piece to ignore. i think the app should be optimized
1. to make fewer kernel calls
2. to utilise today's powerful GPU core effectively. the GTX 580, 570, 560, 480 and 470 use only 40-50% of the GPU when crunching a single WU, and we must work magic with app_info.xml to increase output and perform manual upgrades to newer versions. we can't build install-and-forget machines.

do you agree?

astrocrab
astrocrab
Joined: 28 Jan 08
Posts: 208
Credit: 429202534
RAC: 0

i mean "to utilise today's

i mean "to utilise today's powerful GPU more effectively."

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 733567153
RAC: 1274099

RE: 25% of performance is a

Quote:

25% of performance is too big a piece to ignore. i think the app should be optimized
1. to make fewer kernel calls
2. to utilise today's powerful GPU core effectively. the GTX 580, 570, 560, 480 and 470 use only 40-50% of the GPU when crunching a single WU, and we must work magic with app_info.xml to increase output and perform manual upgrades to newer versions. we can't build install-and-forget machines.

do you agree?

Well, that's easier said than done :-). You cannot decrease the number of kernel invocations at will; some things have to be computed by one kernel before another can work on the output. I don't see that much potential for optimization here. Maybe it's possible to reduce kernel invocations by (say) 20 to 25% at most, leaving us with a performance difference of 750 s instead of 1000 s per WU.

The other alternative is, of course, to go back to the full-CPU method: sacrifice a full CPU core per GPU task in order to avoid the increased latency in the GPU processing.

CU
HB

Donald A. Tevault
Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

RE: RE: 25% of

Quote:
Quote:

25% of performance is too big a piece to ignore. i think the app should be optimized
1. to make fewer kernel calls
2. to utilise today's powerful GPU core effectively. the GTX 580, 570, 560, 480 and 470 use only 40-50% of the GPU when crunching a single WU, and we must work magic with app_info.xml to increase output and perform manual upgrades to newer versions. we can't build install-and-forget machines.

do you agree?

Well, that's easier said than done :-). You cannot decrease the number of kernel invocations at will; some things have to be computed by one kernel before another can work on the output. I don't see that much potential for optimization here. Maybe it's possible to reduce kernel invocations by (say) 20 to 25% at most, leaving us with a performance difference of 750 s instead of 1000 s per WU.

The other alternative is, of course, to go back to the full-CPU method: sacrifice a full CPU core per GPU task in order to avoid the increased latency in the GPU processing.

CU
HB

Actually, I like the idea of sticking with the full-core method. With modern processors having at least four cores, I don't think that's much of a sacrifice for increased performance.

Edit: Okay, disregard the above. I just saw Bernd's note in the Technical News section.

Jeroen
Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

RE: Hi! At what niceness

Quote:

Hi!

At what niceness level is the 270 app running on your host?
(usually the column "NI" in top).

A small performance drop should be expected in return for the lower CPU utilization, but the reported figures seem a bit too slow. If I remember correctly, the app should run at niceness 10 or so, while the other CPU apps run at nice level 19, to ensure that the CUDA app is a bit more likely to get the CPU once the GPU computations are finished.

Note that if you are using your own app_info.xml file, make sure to set avg_ncpus to a value < 1.0 when using the new nv270 app variant; otherwise BOINC will start it with niceness 19.

CU
HB

I left the niceness at the default, as I am only running the two BRP3 CUDA work units currently. There is nothing else consuming CPU resources at the moment.

Thanks.

art
art
Joined: 3 May 07
Posts: 2
Credit: 37715167
RAC: 12821

I'm curious as to how this

I'm curious as to how this decision was made. Will the increased processing speed compensate for shutting out computers that don't have an NVIDIA GPU, which I assume are a substantial number? Of those now shut out, I'm curious how many will remove the project and not return.

Would it not have been better to hold off on launching the NVIDIA GPU code until the OpenGL code was ready?

Tony DeBari
Tony DeBari
Joined: 29 Apr 05
Posts: 30
Credit: 38576823
RAC: 0

RE: I'm curious as to how

Quote:
I'm curious as to how this decision was made. Will the increased processing speed compensate for shutting out computers that don't have an NVIDIA GPU, which I assume are a substantial number? Of those now shut out, I'm curious how many will remove the project and not return.

No one has been shut out. The BRP3 CPU app is still available to run on computers that do not have a CUDA-capable GPU. It can also run concurrently with the CUDA app on those computers that do. I have one such host, and even though the GPU is tied up with Seti@Home at the moment, the CPU is happily crunching any BRP3 WUs that come its way.

Quote:
Would it not have been better to hold off on launching the NVIDIA GPU code until the OpenGL code was ready?

I'm guessing you meant OpenCL, as OpenGL is for graphics and has nothing to do with distributed computing, except possibly for rendering the graphics in a screen saver. I see no reason why the release of the CUDA app should have been delayed. In the time it will take to finish the OpenCL app, the CUDA app will have crunched many times more WUs than could have been done by CPUs alone. It would have been of no benefit to the project to leave that processing power untapped.

-- Tony D.

art
art
Joined: 3 May 07
Posts: 2
Credit: 37715167
RAC: 12821

Well, something seems to be

Well, something seems to be off kilter. I've not had any new jobs from Einstein@Home on my ATI-based workstation for over a week - only a message indicating I don't have an NVIDIA GPU.

Is it because there are no other jobs, or is there a setting I need to change?

Tony DeBari
Tony DeBari
Joined: 29 Apr 05
Posts: 30
Credit: 38576823
RAC: 0

RE: Well, something seems

Quote:

Well, something seems to be off kilter. I've not had any new jobs from Einstein@Home on my ATI-based workstation for over a week - only a message indicating I don't have an NVIDIA GPU.

Is it because there are no other jobs, or is there a setting I need to change?

That message indicates that the host requested GPU work for your ATI card and the project responded (correctly) that the only GPU work available is for nVidia cards.

The thing to check is whether the host is requesting CPU work at all. It didn't the last time it contacted the E@H server. (The log of the most recent contact is available here.) It's possible that the host is paying back long-term debt to one of the other projects you crunch for - my guess would be Seti@Home, which just had an extended outage and continues to have intermittent work-distribution issues. If that's the case, BOINC will resume asking for E@H work once the debt evens out.

(Mods: Sorry for the thread hijack. This discussion should probably be moved to Cruncher's Corner at this point.)

-- Tony D.

Oliver Behnke
Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 23

FYI, the performance decrease

FYI, the performance decrease was due to a missing optimization step during the build of the 1.07 apps (see this post). Version 1.08 fixes that, and performance should be almost on par with the full-CPU (260.x driver) version.

Cheers,
Oliver

Einstein@Home Project

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 733567153
RAC: 1274099

Just for completeness: I

Just for completeness:

I wrote earlier:

Quote:

As for the 270 driver: It runs flawlessly on my 16 bit Suse 11.2 box, but I've been getting hanging CUDA apps since updating the driver on my 64 bit Linux box.

It seems that was a hardware issue related to my particular box, not related to the driver update at all.
I swapped the graphics card in that box, and then the app no longer hung... it produced results that would not validate :-(.

I rebooted and later re-seated the card... and now it was detected as a PCIe 1x (!!) card, running real sloooooooooow. What .......$&$&$ ????

I re-seated it again with considerable force, gave the box a kick... and now it's working fine again as a PCIe 16x card and validates fine. I won't touch that box again.

CU
HB

Elvis
Elvis
Joined: 6 Oct 06
Posts: 2
Credit: 142113405
RAC: 22480

Hi There! I have resumed

Hi There!

I have resumed the Einstein@Home project and crunching with BOINC after a two-year break, but I still have 80,846 points on Einstein.
I now have a four-core CPU and an ATI Radeon HD 4850 video card.
The BOINC Manager reports at startup that this GPU can produce 1120 GFLOPS peak.

Do you know when Einstein@Home will use ATI video cards?
Why are GPUs "so much" more powerful than CPUs and/or producing so many points compared to CPU calculation?

Thanks, and hurry up with the ATI support! ;-)

Elvis

rados
rados
Joined: 28 Feb 11
Posts: 1
Credit: 185405
RAC: 0

my settings allow BOINC to

my settings allow BOINC to run when the computer has been inactive for 2 min.
that's fine, but when i start to use my computer again everything stops except the Einstein@home cuda32 tasks

they show as stopped in the manager, but i can see the app in the windows task manager and can tell from the noise of the graphics card fan...

induktio
induktio
Joined: 1 Oct 10
Posts: 15
Credit: 10144774
RAC: 0

This does seem very

This does seem very interesting. Currently I am doing CPU-only crunching in a Linux environment, but once these CUDA drivers mature it would be tempting to acquire a GPU to help in the process.

Getting to the point: yesterday I read that Amazon Web Services had begun offering Cluster GPU instances with these specs:

Quote:

The Cluster GPU instance family currently contains a single instance type, the Cluster GPU Quadruple Extra Large with the following specifications:

22 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture)
2 x NVIDIA Tesla "Fermi" M2050 GPUs
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cg1.4xlarge


It's not hard to put two and two together and see the computing-power potential here. Although that is the most expensive instance they offer, the GPUs are supposedly very powerful. One thing I wonder: has this Einstein@Home app been tested to run reliably on the above Tesla M2050 GPU? Do you have any estimates of how quickly it could complete binary pulsar search workunits?

Oliver Behnke
Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 23

RE: One thing I wonder is

Quote:
One thing I wonder: has this Einstein@Home app been tested to run reliably on the above Tesla M2050 GPU? Do you have any estimates of how quickly it could complete binary pulsar search workunits?

We have lots of C2050 cards (same architecture). The speed-up compared to the Xeon CPU-only performance of their host machines is currently roughly a factor of 20.

Oliver

Einstein@Home Project

Jeroen
Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

RE: This does seem very

Quote:

This does seem very interesting. Currently I am doing CPU-only crunching in a Linux environment, but once these CUDA drivers mature it would be tempting to acquire a GPU to help in the process.

One thing I wonder: has this Einstein@Home app been tested to run reliably on the above Tesla M2050 GPU? Do you have any estimates of how quickly it could complete binary pulsar search workunits?

The M2050 has 3 GB of memory. Since it is a Fermi card, you should be able to run at least 3-4 work units at once on each card for improved production. Each work unit needs 300-400 MB of GPU memory.

From searching the stats for other users with Tesla cards, the C2050s are completing work units in 2800-3200 seconds. I am not sure how many work units these GPUs are running at once, though.
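
As a rough memory-only upper bound: 3072 MB ÷ ~400 MB per task ≈ 7-8 tasks per card, but some headroom is needed, so at least 3-4 is the safer claim in practice.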

induktio
induktio
Joined: 1 Oct 10
Posts: 15
Credit: 10144774
RAC: 0

RE: The M2050 has 3GB

Quote:


The M2050 has 3 GB of memory. Since it is a Fermi card, you should be able to run at least 3-4 work units at once on each card for improved production. Each work unit needs 300-400 MB of GPU memory.

From searching the stats for other users with Tesla cards, the C2050s are completing work units in 2800-3200 seconds. I am not sure how many work units these GPUs are running at once, though.

I recall seeing some GeForce GTX 580s completing WUs in ~3000 seconds. I'm not sure how those two architectures compare, but both the C2050 and the GTX 580 seem to have roughly the same number of CUDA cores and memory bandwidth. The GTX 580 is also a Fermi card, so it's likely it was running multiple work units concurrently.

From a practical point of view, the GTX 580 seems to deliver the same performance as the Tesla C2050 at 1/5th of the cost. It doesn't make much sense to buy a Tesla unless one really needs the bigger, ECC-enabled memory (which admittedly is required for some serious work).

Jeroen
Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

RE: I recall seeing some

Quote:


I recall seeing some GeForce GTX 580s completing WUs in ~3000 seconds. I'm not sure how those two architectures compare, but both the C2050 and the GTX 580 seem to have roughly the same number of CUDA cores and memory bandwidth. The GTX 580 is also a Fermi card, so it's likely it was running multiple work units concurrently.

From a practical point of view, the GTX 580 seems to deliver the same performance as the Tesla C2050 at 1/5th of the cost. It doesn't make much sense to buy a Tesla unless one really needs the bigger, ECC-enabled memory (which admittedly is required for some serious work).

My 580s are able to complete 3 tasks at once in around 3500-3600 seconds. I would guess the Tesla Fermi cards would perform similarly, given the similar CUDA core counts. EVGA is coming out with a 3 GB version of the 580 in early April, which I think will be perfect for this project. The 1.5 GB version of the 580 can run four tasks at once in most cases, but this uses up almost all the GPU memory, and in some cases the fourth task will not run due to memory constraints. I am not sure what the price will be on the 3 GB version, though.

mickydl*
mickydl*
Joined: 7 Oct 08
Posts: 39
Credit: 200374822
RAC: 0

Another difference of the

Another difference is that the Teslas provide full double-precision FP performance. Compared to the consumer cards this means:

Tesla: DP speed = 1/2 single-precision speed
GTX 580: DP speed = 1/8 single-precision speed

Not that you need it for Einstein, but if you are planning on using them for other projects as well...

Michael

egg Films Graphics 2
egg Films Graphics 2
Joined: 22 Mar 11
Posts: 1
Credit: 453850
RAC: 0

We just installed Quadro 4000

We just installed Quadro 4000 cards in four 8-core Mac Pros. Can't wait to see the GPU app and the data we can crunch. Hope the GPU app ships soon!

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117772635331
RAC: 34791331

RE: .... Hope the GPU app

Quote:
.... Hope the GPU app ships soon!


Hi,
Welcome to the project.

Depending on which version of OS X you are running, the app may already be available. The latest info I recall seeing about this is here.

I just had a look at the computer you have attached to the project. It's not showing as having a compatible GPU.

Cheers,
Gary.