Einstein@Home is beginning a new round of searching for radio pulsars in short-orbital-period binary systems.
This is accompanied by the release of a new application (called BRP3). The new application is particularly efficient on NVIDIA Graphics Processor Cards (up to a factor of 20 faster than the CPU-only application). In addition, when running on an NVIDIA GPU card, this new application makes very little use of the CPU (typically around 20% CPU use when the GPU is devoted to Einstein@Home).
The NVIDIA GPU application is initially available for Windows and Linux only. We hope to have a Macintosh version available soon. Due to limitations in the NVIDIA drivers, the Linux version still makes heavy use of the CPU. This will be fixed in Spring 2011, when a new version of the NVIDIA Driver is released. Many thanks to NVIDIA technical support for their assistance!
Because we have exhausted the backlog of data from Arecibo Observatory, this new application is being shipped with data from the Parkes Multibeam Pulsar Survey (from the Parkes Radio Telescope in Australia). In the next weeks we expect to also start using this new application on fresh Arecibo data taken with the latest 'Mock Spectrometer' back-end.
Questions, problems or bug reports related to this new application and search should be reported in this news item thread as a 'Comment'.
Bruce Allen
Director, Einstein@Home
Comments
RE: Thanks! I got the
I'm currently running console only via a small busybox image and run_client instead of run_manager. I did notice yesterday that it took a few minutes for the CUDA tasks to start up. Initially the tasks were in a paused state and then after five minutes or so, the tasks started up.
RE: The cross validation
I figured as much. Thanks for the confirmation.
I got a smile from your typo...
I haven't done any proper testing, but I notice that (sometimes) the X-Windows process starts taking a lot of CPU time (almost a full core) when the GPU app is running. Happens for my PrimeGrid app too. I am not sure what triggers it and I can't be sure that it never happened on the 260 drivers. However, I don't think I ever saw it before....
I'm running the 270 driver
I've been running the 270 driver for the past two weeks now. Except for a few errors I got when I was still messing around with app_info.xml, not a single WU has failed, hung, or been invalid.
It's running on two machines, one with a GeForce 9800 and one with a GTX 470.
The OS is a self-compiled LFS Linux 64-bit with 32-bit compat libs installed. No X Windows on either machine.
Michael
RE: As for the 270 driver:
I've been running the 270 driver on several Ubuntu 10.10 x64 machines with GTX 560 cards for several weeks. No hangs at all.
I thought 16 bit machines went extinct soon after the dinosaurs.
When will the new app be available (estimated)? =)
RE: i thought 16 bit
Lol!!!!
Oh my dear... but by typos like that you can recognize people who have actually done asm programming on 8-bit processors and thought that 16 bit was heaven :-)
CU
HB
RE: RE: i thought 16 bit
Have you ever worked with transistorized computers? They're much more fun than the modern ones.
RE: Oh my dear...but by
I did some programming on the Z80 =) and today's multi-gigabyte software scares me ))
But where did you get a 16-bit CPU? Even the ancient 386 was already 32-bit.
Not wanting to hijack the
Not wanting to hijack the thread, but since you asked:
My first assembly program was on the 6502 8-bit CPU of a Commodore VIC 20. Early '80s of the previous century.
Then I had an Intel 8086-based PC (or 8088, can't remember), which was logically a 16-bit CPU.
:-)
HB
RE: Not wanting to hijack
Ah, there are other dinosaurs about that know of what you speak! :-):-)
Yeah, I did 6502 as well, on the C-64. Quite a laugh sorting out its indirection/pointer instructions as I recall. Pretty well everything bar the power switch was memory mapped, so you got direct access to the lot. I too graduated to the 8088 first, then the 8086, using MASM and then "Programmer's WorkBench" - a good IDE for its day. The learning hump for me was understanding stack frames correctly. The 8088 is internally/code-wise the same but has only 8-bit memory bus accesses, and word alignment was thus a performance issue for the 8086.
16 bit was like: "Really? Wow! Can I have a try? Please....."
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
I have used the Z80 and the
I have used the Z80 and the Z8000, which was 16-bit. The 32-bit Z80000 never appeared, and I switched to the 68010 and subsequent chips from Motorola.
Tullio
If you've ever looked at or
If you've ever looked at or created a boot sector for your HDD, even today's CPUs still start up in a legacy 16-bit mode :)
RE: Not wanting to hijack
Some interesting posts here, but not quite on topic. A new thread, perhaps?
RE: Update: I checked the
Yet another update: we will shortly release a Linux CUDA app specifically for use with the NVIDIA 270.xx beta driver. As soon as you install this driver, our server will send you the new app, which behaves like a normal BOINC CUDA app, reducing the CPU consumption as much as possible.
We'll post a tech news item as soon as the new app is released (it's imminent).
Cheers,
Oliver
Einstein@Home Project
RE: Yet another update: we
That is great news. Thanks.
RE: Yet another update: we
Cool. As it happens, I installed an nVidia card and the 270 driver in one of my machines only yesterday.
i'm sorry, i don't understand
I'm sorry, I don't understand clearly enough - when can we use the new 1.07 version?
RE: i'm sorry, i don't
1.07 for BRP is an official app, which downloads automatically; see the apps page.
Michael
PS: For Linux && the NVIDIA 270.* beta driver, see Oliver's post below.
Team Linux Users Everywhere
RE: PS: For linux &&
Below? I can't see any (
There are three new Apps, all
There are three new apps, all with version number 1.07. All are built from basically the same code, which should avoid GPU-CPU cross-validation problems.
One is for Windows; one is for Linux, works with all drivers, but uses a full CPU core. These two you should get automatically from now on.
There is a third one for Linux that will work only with driver version 270. If you feel you need to, you can already download the executable from here (it will take some work for you to get it to run). However, as soon as I get to it I will modify the scheduler so that Linux users who have installed the 270 driver will get this app automatically.
hth
BM
According to Bernd's tech
According to Bernd's tech news item, the Linux app taking advantage of the driver bug fix in the 270.* Linux drivers will come online on Monday 21st Feb 2011. The Linux and Windows 1.07 CUDA apps that are distributed starting today fix all known GPU/CPU cross-validation problems.
HB
I have the new 270 based app
I have the new 270-based app currently running with my 295. I don't have much runtime yet, but so far so good. Here is an updated app_info.xml in case anyone else is interested. CPU load is near zero and the app seems to be performing great from what I have seen so far.
1841 49.3 2.2 83972 90440 ? RNl 17:55 1:24 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270
1843 49.7 2.1 83012 88972 ? RNl 17:55 1:24 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270
load average: 0.02, 0.02, 0.03
With this new
With this new einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270 application
I get a floating CPU load of 25-80% (averaging about 50%) instead of a constant 100% as with the previous full-CPU app, but the time to complete a WU also rose from ~4000 seconds to ~5000 seconds. =(
This means the full-CPU app works faster than the 270 app.
What am I doing wrong?
RE: I have the new 270
I have a lot of hosts running Linux and a few running Windows but no nVidia cards as yet. I have 12 HD4850s on MWAH and I'm hoping that an OpenCL app might appear soon enough, so I've been resisting the urge to buy a few nVidia cards. However, I've been keeping track of the CUDA app development and it's pretty hard to resist the urge to put some cards in a few Linux machines, particularly now that the remaining 'impediments' seem to be disappearing rather quickly.
I have (or have access to) hosts running Linux, Mac OS X and Windows, and my preference is very much towards Linux and Mac OS X, with Windows a distant last. They (being Unix) suit my style of micromanagement (writing shell scripts, etc.) much better :-). I've just finished browsing your linked app_info.xml and I have a few comments you might be interested in.
* You've catered for GC1HF, ABP2 and BRP3, but surely you could omit ABP2 since the chances of getting any must be virtually nil.
* Even your most recently returned results are listed as '1.06' - there's no transition to '1.07' showing on the website. Having perused your app_info.xml, I think I can tell you how to correct that. Are your most recently downloaded new tasks listed in BOINC Manager also showing as '1.06'? If they are, and if your working app_info.xml is similar to the one in the link, all you need to do is swap the order of the two app_version clauses. Just put the one with a version_num of 107 first and the 106 one second (see the sketch after this list).
* You appear to be still getting validate errors and some 'inconclusive' matches as well in your recent returns. Looks like there are still problems with the 1.07 nv270 app.
* Your app_info.xml says that you will be doing '1.06' branded tasks with the 1.07 app. This is fine but it also implies that tasks started with 1.06 would have been completed with 1.07. This is also fine in the CPU world (usually) as long as the format of a checkpoint hasn't changed. I don't know about the GPU world but can you perhaps check if the tasks now showing as validate errors were perhaps started with 1.06 and finished with 1.07? Maybe there's a problem doing that.
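For illustration only, here is a minimal sketch of how the reordered BRP3 section might look, assuming the usual BOINC anonymous-platform app_info.xml layout. The executable name is the one from your ps listing; the avg_ncpus and coproc values are placeholders I've picked for the example, not settings taken from your actual file:

<app_info>
  <app>
    <name>einsteinbinary_BRP3</name>
  </app>
  <file_info>
    <name>einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270</name>
    <executable/>
  </file_info>
  <!-- 1.07 clause first, so newly downloaded tasks are branded 1.07 -->
  <app_version>
    <app_name>einsteinbinary_BRP3</app_name>
    <version_num>107</version_num>
    <plan_class>BRP3cuda32nv270</plan_class>
    <avg_ncpus>0.2</avg_ncpus>
    <max_ncpus>1</max_ncpus>
    <coproc>
      <type>CUDA</type>
      <count>1</count>
    </coproc>
    <file_ref>
      <file_name>einsteinbinary_BRP3_1.07_i686-pc-linux-gnu__BRP3cuda32nv270</file_name>
      <main_program/>
    </file_ref>
  </app_version>
  <!-- 1.06 clause second, kept only so tasks already in progress can finish -->
  <app_version>
    <app_name>einsteinbinary_BRP3</app_name>
    <version_num>106</version_num>
    <!-- same structure as above, referencing the 1.06 executable -->
  </app_version>
</app_info>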
I've got to go right now so I'll add some more to the above list when I get a chance. Not sure when that will be as I've got a few pressing commitments right now.
Cheers,
Gary.
RE: With this new
I have a couple of valid WUs with the new 1.07, and my completion time has also gone up a little. I just figured the old app used teamwork between CPU & GPU to get a slightly better time. I don't mind, though - freeing up a core for other work is worth it even with a slight slowdown on the GPU.
My "0.05 CPU" for the new app runs consistently at 20% of an i7-860 core (non-shared). Seems OK to me.
I suppose that means (for people with better cards than me) that you could run about 4 or 5 GPU apps using one CPU. Just a guess...
RE: I have a lot of hosts
Hopefully we will see an OpenCL application to cover the ATI cards as well. Regarding adding CUDA cards, there are some good deals on eBay for the previous two generations of NVIDIA cards, since people are upgrading to the 5xx series.
I have the same preference. I prefer not having more Windows systems on my network than necessary, due to having to keep them updated and secure. These days I boot my Linux image via a PXE server and store the project data on NFS, so as not to need separate disks and OS installs on each system.
Thanks for all the comments! I went ahead and updated my app_info.xml file with the suggested changes including removing ABP2 and reordering the versions for BRP3. I'll keep an eye on the WU processing to check for WUs that fail validation. Prior to the latest apps, I was seeing anywhere from 6 - 24 invalid WUs per day. When I started running the new app yesterday, there were two work units that were still in process that I switched versions on. Perhaps it would have been better to finish those up with the old app.
RE: With this new
I am seeing a similar performance difference between the full-CPU app and the 270 app. This is running one WU per GPU.
FullCPU App: 2954 seconds
270 App: 3674 seconds
Hi! At what niceness level
Hi!
At what niceness level is the 270 app running on your host?
(usually the column "NI" in top).
A small performance drop should be expected in return for the lower CPU utilization, but the reported figures seem a bit too slow. If I remember correctly, the app should run with niceness 10 or so, while the other CPU apps should run at nice level 19, to ensure that the CUDA app is a bit more likely to get the CPU once GPU computations are finished.
Note that if you are using your own app_info.xml file, make sure to set avg_ncpus to a value < 1.0 when using the new nv270 app variant, because otherwise BOINC will start it with niceness 19.
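To illustrate (the 0.2 is my own example value, not an official recommendation), the relevant fragment of the app_version clause would look something like:

  <app_version>
    <app_name>einsteinbinary_BRP3</app_name>
    <version_num>107</version_num>
    <plan_class>BRP3cuda32nv270</plan_class>
    <!-- keep this below 1.0 so BOINC treats the task as GPU-bound
         and starts it at the higher-priority nice level -->
    <avg_ncpus>0.2</avg_ncpus>
    <!-- coproc and file_ref elements as usual -->
  </app_version>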
CU
HB
On my machine the niceness
On my machine the niceness level seems OK. The CUDA app is running with a nice level of 10, everything else with 19.
Michael
Good to know, thanks. I
Good to know, thanks.
I did some "back of the envelope" calculations and I'm now less surprised about the runtime increase. Here's the essence of it:
One BRP3 task consists of 4 subunits. Each subunit tries ca 12k orbital templates on a Parkes data sample, so every task performs ca 50k templates.
For every template, several so-called CUDA kernels (code executed on the GPU) have to be started in sequence. I don't know the exact number of kernel invocations, but from what I do know it must be > 10. Maybe more like 20, depending on how the FFT part works.
That means there will be > ca 500k kernel invocations per task. If you divide the observed slowdown of ca 1000 seconds (which seems to be pretty independent of GPU speed) by those invocations, you get an increase in CUDA kernel invocation latency of ca 2 milliseconds. This is the same order of magnitude as the time slice of a "niced" process in most Linux kernels.
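Spelling the arithmetic out, with the conservative figure of 10 kernel invocations per template (all numbers are my rough estimates, so treat this as order-of-magnitude only):

\[ 4 \times 12\,000 \times 10 \approx 5 \times 10^{5}\ \text{invocations}, \qquad \frac{1000\ \text{s}}{5 \times 10^{5}} \approx 2\ \text{ms per invocation}. \]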
Not sure what this means for the project tho. Some people don't like the GPU app to occupy a whole core, others don't mind and insist on max productivity. Maybe it would be best to make this configurable somehow.
CU
HB
25% of performance is a too
25% of performance is too big a piece to ignore. I think the app should be optimized:
1. to make fewer kernel calls
2. to utilise today's powerful GPU core effectively. The GTX 580, 570, 560, 480 and 470 use only 40-50% of the GPU when crunching a single WU, and we must work magic with app_info.xml to increase output and perform manual upgrades to newer versions. We can't build install-and-forget machines.
Do you agree?
i mean "to utilise today's
i mean "to utilise today's powerful GPU more effectively."
RE: 25% of performance is a
Well, that's easier said than done :-). You cannot decrease the number of kernel invocations at will; some things have to be computed by one kernel before another can work on the output. I don't see that much potential for optimization here. Maybe it's possible to reduce kernel invocations by (say) 20 to 25% at most, leaving us with a performance difference of 750 s instead of 1000 s per WU.
The other alternative is, of course, to go back to the full-CPU method: sacrifice a full CPU core per GPU task in order to avoid the increased latency in the GPU processing.
CU
HB
RE: RE: 25% of
Actually, I like the idea of sticking with the full-core method. With modern processors having at least four cores, I don't think that that's much of a sacrifice for increased performance.
Edit: Okay, disregard the above. I just saw Bernd's note in the Technical News section.
RE: Hi! At what niceness
I left the niceness at the default, as I am only running the two BRP3 CUDA work units currently. There is nothing else consuming CPU resources at the moment.
Thanks.
I'm curious as to how this
I'm curious as to how this decision was reached. Will the increased processing speed compensate for excluding every computer which doesn't have an NVIDIA GPU, which I would assume is a substantial number? Of those now shut out, I'm curious how many will remove the project and not return.
Would it not have been better to hold off on launching the NVIDIA GPU code until the OpenGL code was ready?
RE: I'm curious as to how
No one has been shut out. The BRP3 CPU app is still available to run on computers that do not have a CUDA-capable GPU. It can also run concurrently with the CUDA app on those computers that do. I have one such host, and even though the GPU is tied up with Seti@Home at the moment, the CPU is happily crunching any BRP3 WUs that come its way.
I'm guessing you meant OpenCL, as OpenGL is for graphics and has nothing to do with distributed computing except possibly for rendering the graphics in a screen saver. I see no reason why the release of the CUDA app should have been delayed. In the time it will take to finish the OpenCL app, the CUDA app will have crunched many times more WUs than could have been done by CPUs alone. It would have been of no benefit to the project to leave that processing power untapped.
-- Tony D.
Well, something seems to be
Well, something seems to be off-kilter. I've not had any new jobs from Einstein@Home on my ATI-based workstation for over a week. Only a message indicating I don't have an NVIDIA GPU.
Is it because there are no other jobs or is there a setting I need to change?
RE: Well, something seems
That message indicates that the host requested GPU work for your ATI card and the project responded (correctly) that the only GPU work available is for nVidia cards.
The thing to check is if the host is requesting CPU work at all. It didn't the last time it contacted the E@H server. (The log of the most recent contact is available here.) It's possible that the host is paying back long-term debt to one of the other projects for which you crunch - my guess would be Seti@Home, which just had an extended outage and continues to have intermittent work distribution issues. If that's the case, BOINC will resume asking for E@H work once the debt evens out.
(Mods: Sorry for the thread hijack. This discussion should probably be moved to Cruncher's Corner at this point.)
-- Tony D.
FYI, the performance decrease
FYI, the performance decrease was due to a missing optimization step during build of the 1.07 apps (see this post). Version 1.08 fixes that and performance should be almost on par with the full CPU (260.x driver) version.
Cheers,
Oliver
Einstein@Home Project
Just for completeness: I
Just for completeness:
I wrote earlier:
It seems that was a hardware issue related to my particular box. Not related to the driver update at all.
I swapped the graphics card in that box, and then the app no longer hung... it produced results that would not validate :-(.
I rebooted and later re-inserted the card....and now it was detected as a PCIe-1x (!!) card, running real sloooooooooow. What .......$&$&$ ????
I reinserted it again with considerable force, gave the box a kick...and now it's working fine again as a PCIe-16x card and validates fine. I won't touch that box again.
CU
HB
Hi There ! I have resume
Hi There !
I have resumed the Einstein@Home project and am crunching with BOINC after a two-year break, but I still have 80,846 points on Einstein.
I now have a four-core CPU and an ATI Radeon HD 4850 video card.
The BOINC manager reports at startup that this GPU can produce 1120 GFLOPS peak.
Do you know when Einstein@Home will use ATI video cards?
Why are GPUs "so much" more powerful than CPUs and/or producing so many points compared to CPU calculation?
Thanks, and hurry up with the ATI support! ;-)
Elvis
my settings allows Boinc to
My settings allow BOINC to run when the computer has been inactive for 2 minutes.
That's fine, but when I start to use my computer again everything stops except the Einstein@Home cuda32 tasks.
They show as stopped in the manager, but I can see them in the Windows Task Manager and can tell from the noise of the graphics card fan...
This does seem very
This does seem very interesting. Currently I am doing CPU-only crunching in a Linux environment but once these CUDA drivers mature it would be tempting to acquire a GPU to help in the process.
Getting to the point: yesterday I read that Amazon Web Services had begun offering Cluster GPU instances with these specs:
It's not hard to put 1+1 together and see the computing power potential here. Although that is the most expensive instance they offer, the GPUs are supposedly very powerful. One thing I wonder is whether this Einstein@Home app has been tested to run reliably on the above Tesla M2050 GPU. Do you have any estimate of how quickly it could complete binary search workunits?
RE: One thing I wonder is
We have lots of C2050 cards (same architecture). The speed-up compared to the Xeon CPU-only performance of their host machines is currently roughly a factor of 20.
Oliver
Einstein@Home Project
RE: This does seem very
The M2050 has 3 GB of memory. Since it is a Fermi card, you should be able to run at least 3-4 work units at once on each card for improved production. Each work unit needs 300-400 MB of GPU memory.
From searching the stats for other users with Tesla cards, the C2050s are completing work units in 2800-3200 seconds. I am not sure how many work units these GPUs are running at once, though.
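For what it's worth, running several tasks per card with an anonymous-platform setup is usually done by giving the CUDA coproc a fractional count in app_info.xml. A minimal sketch (the 0.33 is just an example for three concurrent tasks, not a project recommendation):

  <coproc>
    <type>CUDA</type>
    <!-- 0.33 lets the BOINC client schedule three BRP3 tasks per GPU;
         each task needs roughly 300-400 MB of GPU memory -->
    <count>0.33</count>
  </coproc>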
RE: The M2050 has 3GB
I recall seeing some GeForce GTX 580s completing WUs in ~3000 seconds. I'm not sure how those two architectures compare, but both the C2050 and the GTX 580 seem to have roughly the same number of CUDA cores and memory bandwidth. The GTX 580 is also a Fermi card, so it's likely it was running multiple work units concurrently.
From a practical point of view, the GTX 580 seems to deliver the same performance as the Tesla C2050 at 1/5th of the cost. It doesn't make much sense to buy a Tesla unless one really needs the bigger, ECC-enabled memory (which admittedly is required for some serious work).
RE: I recall seeing some
My 580s are able to complete 3 tasks at once in around 3500-3600 seconds. I would guess the Tesla Fermi cards would perform similarly due to their similar CUDA core counts. EVGA is coming out with a 3 GB version of the 580 in early April, which I think will be perfect for this project. The 1.5 GB version of the 580 can run four tasks at once in most cases, but this uses up almost all the GPU memory, and in some cases the fourth task will not run due to memory constraints. I am not sure what the price will be on the 3 GB version, though.
Another difference of the
Another difference with the Teslas is that they provide the full double-precision FP performance. Compared to the consumer cards this means:
Tesla: double-precision speed = 1/2 single-precision speed
GTX 580: double-precision speed = 1/8 single-precision speed
Not that you need it for Einstein, but if you are planning on using them for other projects as well...
Michael
We just installed Quadro 4000
We just installed Quadro 4000 cards in four 8-core Mac Pros. Can't wait to see the GPU app and the data we can crunch. Hope the GPU app ships soon!
RE: .... Hope the GPU app
Hi,
Welcome to the project.
Depending on what version of OS X you are running, the app is already available. The latest info I recall seeing about this is here.
I just had a look at the computer you have attached to the project. It's not showing as having a compatible GPU.
Cheers,
Gary.