Due to the excellent work of our French volunteer Christophe Choquet we finally have a working OpenCL version of the Gravitational Wave search ("S6CasA") application. Thank you Christophe!
This App version is currently considered 'Beta' and being tested on Einstein@Home. To participate in the Beta test, you need to edit your Einstein@Home preferences, and set "Run beta/test application versions?" to "yes".
It is currently available for Windows (32 bit) and Linux (64 bit) only, and you should have a card that supports double precision FP in hardware.
BM
Copyright © 2024 Einstein@Home. All rights reserved.
Comments
Gravitational Wave search GPU App version
On my Win 7 x64 i5-3210M/GT650M/Intel_Graphics_HD4000 host I'm getting:
I've already got CPU beta CasA 1.06 work from this host, it is on the home venue, and Run beta is set to yes at that venue,
Generally, projects supply 32-bit CUDA apps on 64-bit Windows because there is no need for 64-bit addressing, and 64-bit CUDA apps run slower due to the overhead of 64-bit addressing.
Claggy
Thanks for reporting.
Thanks for reporting.
Should work now.
BM
After editing my prefs and
After editing my prefs and upping the cache a bit I managed to get a few S6 tasks assigned to my GTX 660 Ti. I then suspended the other GPU tasks in the queue to try one out while I'm here to check on things.
All tasks immediately got a computational error with the following in the stderr:
(unknown error) - exit code -1073741515 (0xc0000135)
Usually a sign of a missing file or .dll etc. The only file downloaded was:
einstein_S6CasA_1.06_windows_intelx86__GWopencl-nvidia-Beta.exe
Next step will be to try a driver upgrade to make sure all files are present and accounted for.
Other GPU work (BRP4G and FGRP3) run OK.
Edit: Updated the Nvidia driver to 335.23 via a clean install, but that did not change things; still getting an instant error with the above error message.
Testing on hold until further notice.
RE: After editing my prefs
That can sometimes be unravelled by using dependency walker.
Is the Windows version
Is the Windows version running successfully anywhere?
BM
Got this with dependency
Got this with dependency walker when opening "einstein_S6CasA_1.06_windows_intelx86__GWopencl-nvidia-Beta.exe":
The Swedish phrase "Det går inte att hitta filen" translates to "The file cannot be found".
RE: Is the Windows version
I'll test too.
Sorry, but why this is not
Sorry, but why is this not done at Albert@home?
Got a similar but rather
Got a similar but rather shorter list of missing files with the 32-bit version of dependency walker (bitness matters, with that tool).
Host is host 5744895 - 64-bit Windows 7 with NV GTX 670, driver 335.23 (about 4 weeks ago).
Thanks. Looks like something
Thanks. Looks like something went wrong with the build. While investigating, I'll disable the current Windows Beta App version.
BM
Googling suggests the problem
Googling suggests the problem might be related to missing Microsoft Visual Studio runtime redistributable packages. Are you using either VS 2008 or VS 2010 - if so, which?
(tasks are erroring, as Holmis described, but I'll save some for testing later)
RE: Are you using either VS
None - MinGW.
BM
OK, those API- exports are
OK, those API- exports are probably not relevant, then.
Maybe these are more significant, if you can recognise any of them?
[D? ] DCOMP.DLL
Import Ordinal Hint Function Entry Point
------ ------------- ---- ------------------------ -----------
[OE ] 1017 (0x03F9) N/A N/A Not Bound
[CE ] N/A N/A DCompositionCreateDevice Not Bound
[D? ] GPSVC.DLL
Import Ordinal Hint Function Entry Point
------ ------- ---- ------------------------------------- -----------
[CE ] N/A N/A ProcessGroupPolicyCompletedExInternal Not Bound
[CE ] N/A N/A RsopAccessCheckByTypeInternal Not Bound
[CE ] N/A N/A RsopFileAccessCheckInternal Not Bound
[CE ] N/A N/A RsopSetPolicySettingStatusInternal Not Bound
[CE ] N/A N/A ProcessGroupPolicyCompletedInternal Not Bound
[CE ] N/A N/A RsopResetPolicySettingStatusInternal Not Bound
[D? ] IESHIMS.DLL
Import Ordinal Hint Function Entry Point
------ ------- ---- ------------------------------------ -----------
[CE ] N/A N/A IEShims_Initialize Not Bound
[CE ] N/A N/A IEShims_InDllMainContext Not Bound
[CE ] N/A N/A IEShims_GetOriginatingThreadId Not Bound
[CE ] N/A N/A IEShims_CreateWindowEx Not Bound
[CE ] N/A N/A IEShims_SetRedirectRegistryForThread Not Bound
The problem seems to be that
The problem seems to be that libstdc++-6.dll is not linked statically into the App. The version of the MinGW compiler that I used for the first time apparently requires an additional option for this (-static-libstdc++).
Will build a new App, however I doubt that I can publish it before Monday.
BM
just a heads-up - my work
just a heads-up - my work host recently downloaded 2 S6casA 1.06 (SSE2 Beta) tasks, and this host most certainly is not set to accept beta/test applications. also, it shows as SSE2, not OpenCL...are there now SSE2 Beta tasks that i haven't yet seen mentioned on the boards?
RE: just a heads-up - my
Yes, the CPU beta apps were announced in the Tech news section here.
RE: Sorry, but why this is
I'm also curious about this. Isn't albert@home the beta test platform for einstein@home?
RE: Sorry, but why this is
Albert@Home was originally set up to test server side code, including changes that were necessary to support new applications (e.g. locality scheduling). Historically we were testing Beta App versions on Einstein as "anonymous platform packages" as long as we had only one search (i.e. application). With the introduction of a second search (BRP), maintaining app_info.xml files became a hassle, so we switched to testing new app versions on Albert. This, however, has its own drawbacks:
- Providing the right amount of "work" of the right type (belonging to the App version that we want to test) is getting increasingly difficult. For example, we need to maintain a separate line of code for each workunit generator.
- The computing power and thus throughput on Albert is not very high, so you need to repeatedly reconfigure the system so as not to waste any on applications that you don't want to test right now.
- When you issue a new series of application versions and generate work for it, validation compares results of these only to those of versions from the same series. A comparison to the result of an established version - which is what we really want - happens only in very rare, accidental cases.
- The variation of the different systems attached to Albert is not at all representative of what's running on Einstein.
- Due to the low throughput, especially when tasks are assigned with additional constraints (locality scheduling), feedback (i.e. validation) is very slow, which slows down development.
- App version testing is not independent of the server code. Occasionally the need to test application versions prevented us from testing server side changes on Albert, or at least required us to postpone these.
For us these were enough arguments to shift testing of application versions back to Einstein. Server code and new applications will still be tested on Albert, and I think for some time at least BRP app versions, too.
BM
RE: and you should have a
Ouch - this could mean serious trouble! Does the entire calculation need DP or is it only for a few calculations? In the former case the app will work on all modern GPUs (except Intel), but will be very slow on all but a few high end chips.
You're surely aware of this, but most crunchers are not (as evidenced by the many people running Milkyway on hardware which is really not suitable for that task). Ideally any GPU slow in DP should rather stick with BRP calculations, as long as E@H has enough of those. Leave the DP stuff for the "big guns" and CPUs (ideally with SSE3/4 or AVX1/2).
But I don't know how to handle this properly and educate users. It would be a shame if they only found out about this after their RAC dropped to 1/2, 1/4 or even worse.
MrS
Scanning for our furry friends since Jan 2002
Double Precision is used to
Double precision is used to rebuild a reference grid with correct numerical accuracy. This is called in about 0.2% of the loops.
99.8% of the loops are SP and run a heavily memory-intensive algorithm. It pushes the GPU to its memory I/O limits.
But the rebuild of the grid is about 1/5 of total execution time.
The performance comes from:
- the memory bandwidth. This is by far the limiting factor, and the time to execute a task is linear in GPU memory throughput. Even a 4-year-old card with fast GDDR5 performs well.
- SP performance, if not memory bound. Unlikely to matter, because the inner loop is 4 adds, one compare and 4 single-float loads. Loading a float4 takes ages compared to performing the maths.
DP is just required for accuracy but doesn't play a significant role in execution time.
I tried to reduce memory writes as much as I could to speed up the calculations. A data filtering algorithm is used to reduce memory writes and GPU->CPU transfers to the lowest possible values.
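To illustrate why such a loop is bandwidth-limited rather than compute-limited, here is a rough C sketch (not the actual Einstein@Home kernel; the function name, shift pattern and threshold logic are invented for illustration) of a hot loop with 4 single-float loads, a few adds and one compare per iteration, writing output only for the rare candidates that pass the filter:

```c
#include <stddef.h>

/* Illustrative sketch of a memory-bound detection loop: 4 shifted
   single-float loads and a short add chain per iteration, with a
   compare that triggers a write only for rare candidates. With so
   little arithmetic per byte loaded, memory bandwidth, not SP
   throughput, sets the speed. */
size_t filter_candidates(const float *power, size_t n, size_t shift,
                         float threshold, size_t *out_idx, float *out_val)
{
    size_t found = 0;
    for (size_t i = 0; i + 3 * shift < n; i++) {
        /* 4 loads from shifted locations, summed */
        float stat = power[i] + power[i + shift]
                   + power[i + 2 * shift] + power[i + 3 * shift];
        if (stat > threshold) {      /* 1 compare */
            out_idx[found] = i;      /* rare, filtered write */
            out_val[found] = stat;
            found++;
        }
    }
    return found;
}
```

Each iteration moves 16 bytes in for only a handful of flops, so throughput is set almost entirely by how fast the loads can be served, which is why task time scales with memory bandwidth as Christophe describes.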
Christophe
RE: just a heads-up - my
This needs to be fixed.
[EDIT] Actually that was already fixed [EDIT]
For the moment the beta test app versions are all disabled, so no new work should be distributed for them.
I spotted on one of my Macs that a Mac Intel 32-bit beta app was actually hanging (no progress in two days). If volunteers spot this on their Macs, feel free to abort the task.
But we did get some valuable clues from the beta test so far - thanks to all who (voluntarily or by accident... sorry for that) participated in it. I've seen run times for some CasA units below 800 sec from an NVIDIA card, which is great!! It also demonstrates that (as Christophe explained) the double precision requirement doesn't hurt performance significantly (it was a card that, like most if not all NVIDIA consumer cards, has rather pathetic double precision performance compared to AMD cards of the same price & performance range).
Cheers
HB
RE: just a heads-up - my
There were a few problems in the server scheduler code related to Beta App versions, but these should have been fixed around noon (CEST) on Friday.
When precisely did your host get these tasks? And more importantly: does this still happen?
BM
RE: And more importantly:
That shouldn't be possible at the moment:
My i7-2600K has just picked up normal (CasA) v1.05 (SSE2) work even though Run beta is selected.
Claggy
RE: RE: just a heads-up -
unfortunately i had to restart my machine yesterday, so my BOINC event log got wiped clean and i can no longer pinpoint exactly when those two SSE2 Beta tasks were downloaded. that said, i just scrolled all the way to the bottom of my work buffer, and while several S6casA SSE2 tasks have been downloaded since then, i'm happy to say that i've not gotten any more S6casA SSE2 Beta tasks... so whatever you did fixed the problem and prevented my host from receiving Beta tasks when it shouldn't.
thanks Bernd!
You can still access the
You can still access the event log messages in the file stdoutdae.txt located in the BOINC data directory; if you want even older messages, check the file stdoutdae.old.
Thanks for the detailed
Thanks for the detailed explanation, Christophe! This sounds really good and well developed.. not that I would have expected anything else from Einstein ;)
Concerning the memory bandwidth requirements: it will be interesting to see how much the significantly larger L2 cache of Maxwell helps (assuming the other chips will share this property of GM107).
MrS
Scanning for our furry friends since Jan 2002
Large L2 caches won't
Large L2 caches probably won't help that much in this case (unless the whole dataset can be cached).
The dataset used for the inner loop is 50-100 MB. Each iteration, frequency-shifted data are fetched almost linearly by all available GPU memory controllers.
For a given memory frequency, the best option is a larger bus width. A 384-bit or wider bus will do well.
-c
Just adding one thing to
Just adding one thing about the NVIDIA Maxwell architecture and its 2 MB L2 cache.
Currently, GPU internal caches are rather small, so every tuning guide recommends reading memory at consecutive locations (no gaps).
But for the GW search, we reload the same dataset about 40-100 times, slightly shifted each time.
So I was thinking of changing the order of the internal loops: using random reads inside the L2 cache, with a computation window moving across the 100k-sample dataset.
Once the hardware is released, it would be interesting to see how a change to the code can improve performance.
After all, it works very well on CPUs with big L2 caches, so one could expect the same on a GPU. Worth a try!
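The loop reordering described above amounts to classic cache blocking. A hypothetical C sketch (not the project's code; NSHIFTS, WINDOW and the accumulation are made up for illustration): instead of one full pass over the dataset per shift, each cache-sized window is reused for all shifts before moving on:

```c
#include <stddef.h>

/* Illustrative cache-blocking sketch: process the dataset in windows
   small enough to stay in L2, applying all shifts to each window while
   it is cached, instead of streaming the whole dataset once per shift.
   acc must be zero-initialized by the caller and hold n floats. */
#define NSHIFTS 64
#define WINDOW  (32 * 1024 / sizeof(float))  /* ~32 KB of floats */

void accumulate_blocked(const float *data, size_t n, float *acc)
{
    for (size_t base = 0; base < n; base += WINDOW) {
        size_t end = (base + WINDOW < n) ? base + WINDOW : n;
        /* the window [base, end) is now hot in cache; reuse it for
           every shift before moving to the next window */
        for (size_t s = 0; s < NSHIFTS; s++) {
            for (size_t i = base; i < end; i++) {
                size_t j = i + s;                /* shifted read */
                if (j < n)
                    acc[i] += data[j];
            }
        }
    }
}
```

The naive order (shift loop outermost, full pass over `data` per shift) touches NSHIFTS × n floats in DRAM; the blocked order touches each window from DRAM roughly once, with the repeated shifted reads served from cache.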
-c
That sounds like a good idea,
That sounds like a good idea, if it works well! Especially if it was also at least as fast as the current version on regular GPUs.
BTW: the first Maxwells are available as GTX750/Ti. I'd expect the bigger chips to feature comparable compute to memory bandwidth balance, but don't know if and how the amount of L2 will be scaled.
MrS
Scanning for our furry friends since Jan 2002
I find that description by
I find that description by Christophe very interesting. It would be great to have a summary of basic characteristics for other applications as well, something roughly like this:
Application: S6CasA GPU
Required: DP capability, 1GB GPU RAM
Important: Memory bandwidth
Less important: SP performance, GPU interface speed
Negligible: DP performance
That could make the choice of applications or hardware much easier.
RE: Just adding one thing
Hi,
I have a GTX750ti but no system with win x86 installed.
Will there be a version for win 64 in the near future?
Alexander
The 32 bit executable should
The 32 bit executable should run under 64 bit Win, unless paths or libraries get messed up. Or am I missing something here?
MrS
Scanning for our furry friends since Jan 2002
We launched a new round of
We launched a new round of beta GPU (OpenCL) app versions for the Gravitational Wave search:
Version 1.07
Linux: 64 bit, NVIDIA, AMD/ATI
Windows: 32 bit & 64 bit, NVIDIA, AMD/ATI
Hardware requirements are unchanged. You should only get these versions if "beta test" app versions are enabled in your web preferences.
For the moment we will focus on testing the OpenCL app versions; when those are stable, we will roll out updated beta-test versions of the CPU apps as well.
If you are unsure whether your card supports double precision, I recommend checking Wikipedia's list of GPU cards, e.g.
http://en.wikipedia.org/wiki/Radeon_HD_7000_Series
http://en.wikipedia.org/wiki/Radeon_HD_6000_Series
http://en.wikipedia.org/wiki/Radeon_HD_5000_Series
http://en.wikipedia.org/wiki/GeForce_200_Series
GeForce 400, 500, 600 and 700 cards should ALL support double precision (at least with current drivers).
Those are lists for desktop GPUs, mobile GPUs and GPUs integrated into a CPU/APU have similar Wikipedia entries.
Many thanks for testing,
HBE
I'm running a test on a
I'm running a test on a Tahiti XT now.
Looks weird: the task shows a running state and progress increments, but GPU usage is 0% and the GPU clock is in its lowest state. I've observed only a few small peaks to a slightly higher state, but most of the time it looks idle...
This
This host
http://einsteinathome.org/host/6801076/tasks
runs the GW OpenCL NVIDIA app on an NVIDIA GTX 750 Ti.
The WU is not displayed in the normal tasks view; one needs to click through to the GW (CasA) tasks to make it visible.
Two iGPU Arecibo WUs are also running, so if you want 'clean' results, subtract 5-8% from the displayed crunching time.
Alexander
Edit
This is what gpu-z shows:
https://dl.dropboxusercontent.com/u/50246791/GW%20opencl%20nvidia%20gpu%20usage.PNG
OK, found the problem with
OK, found the problem with the low load - the CPUs were saturated. After freeing some cores, GPU usage is now ~62%.
But the progress jumps up and down. It was already >50% complete, then jumped down to 9, 14, 45, 53%...
----
Reached 99%, then stayed there a while, but after completion it ended up with an error:
Machine http://einsteinathome.org/account/tasks&offset=0&show_names=1&state=0&appid=24
---
Runtime on R9 280X: 685 secs
WIN 7 64, i7 377K ATI 7970
WIN 7 64, i7 377K ATI 7970 driver 14.3 -> Error while computing
http://einsteinathome.org/host/10952185/tasks&offset=0&show_names=1&state=5&appid=24
First result shows a runtime
First result shows a runtime of 1736.59 sec for GTX750ti
First completed unit is in:
First completed unit is in: WUID #187876671, second copy is unsent so it might take a while to get this one verified...
My observations during crunching this task:
* RAM usage: ~400MB
* Task set to use 0.5 GPUs + 1 CPU; it actually used one full core for the whole duration of the task.
* Almost 4 min from the start of the task until the GPU kicked in. This causes newer BOINC versions to start increasing the percentage done until the app actually reports some progress, making the percentage first go up and then reset to what the app reported.
I assume the delay at the beginning is caused by some initial preparation that has to be done at startup. The second task took a shorter time, but I was composing this at the time so I don't know for sure how long it took until the GPU started working.
* GPU load follows a sawtooth pattern with lows of ~55% and highs of 99%, with about 4-5 seconds between the shifts. When the load goes down, the memory controller load goes up from 1% to ~50%. Considering Christophe's earlier comments about the bus width, I believe the 192-bit bus on my 660 Ti is quite a bottleneck.
* When the GPU load hits 99% I experience quite severe screen lag, enough that I wouldn't run this while using the computer to watch video. That's not a problem on this machine when I run the other searches x2 on the GPU.
* And finally the run time: it came in at a whopping 1,744.70 s, compared to the 36,000 s that my CPU (i7, HT on) manages when averaging the 13 latest valid results! Impressive!
Edit: Second task, WUID 187873674 completed in 1,545.44s when not running the iGPU but one core free.
Edit2: First run was also with one core free but one task running on the iGPU.
My first wu successfully
My first WU successfully finished after 450 sec on an ATI R9 280X; GPU utilization showed 82%.
Now waiting for my wingman to see if it validates.
Thanks for the feedback so
Thanks for the feedback so far.
Strange - we see a lot of errors in the phase AFTER the GPU computation has already finished, when results are being written to disk. We'll have to dig a bit deeper to find the root cause.
HB
RE: My first wu
Amazing!!!
The ca. 320 GB/s memory bandwidth of this card seems to help a lot.
HB
First attempt, WU 431061981
First attempt, WU 431061981 ended in error.
Radeon HD 7950 and AMD Phemom II x4 965
Win7 x64
The GPU kicked in at 10% completion and stayed at 70-75% load until 99% completion.
Total time 703s.
Have now first results from
Have now first results from HD5830
http://einsteinathome.org/host/10105883/tasks
1590 - 1640 sec.
Edit: first finished wu from AMD A10 7700K APU is in:
http://einsteinathome.org/host/10283382/tasks
3539 sec !!!
Edit:
HD7850 XT (1536 Shaders) failed after 99%
GT 640 (GK208) finished in 6546 sec.
Might it be worth trying on Intel GPUs?
Linux 64-bit, GT 640 ->
Linux 64-bit, GT 640 -> Error:
http://einsteinathome.org/task/430282776
I've checked this file's permissions; they're as expected (644)...
Four up, four down. All WUs
Four up, four down. All WUs have resulted in errors for me at the end of computation on my AMD machine.
No errors thus far on six
No errors thus far on six completed on a WinXP machine, with each GPU having an E8400 core.
The first one on my GTX 650 Ti completed in 2,591 seconds (validated).
My GTX 660 is now averaging about 1900 seconds, with the GTX 650 Ti running BRP4G/BRP5s. A faster CPU would of course help this a little.
5 of 5 completed successfully
5 of 5 completed successfully on my other machine.
GTX 780Ti and i7-3770k
Win7 x64
Average time = 933 s. Perhaps due to the lower DP capability of Nvidia vs AMD?
How often do these
How often do these checkpoint? I had been running one for over 30 minutes when I exited BOINC, and the WU started over from the beginning.
Thanks.
So far I've run through 8
So far I've run 8 workunits through the NVidia program, and all have processed correctly within an hour or less; I'm just waiting for validation from wingmen.
N.B.
Also, before the SSE2-Beta app was pulled, I got 2 results validated and I'm working on the final 3 workunits now. But compared to the OpenCL program version, the CPU versions take forever!