LOL! Well, I can guess what the Einstein developers will be saying in the general direction of DA on Monday morning, even if they don't say it to his face.
Meanwhile, having got my app_config.xml sorted out, I can confirm that putting --device 0 into the command line for the app_version overcomes the API v7.5.0 issue - one task branded v1.47 Beta picked up cleanly from part-way through, and another has run from start to finish with that combination of settings.
Only a stopgap and a proof of concept, of course. We still need to handle automatic assignment for --device 1, --device 2, etc. where those devices exist.
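For anyone wanting to try the same workaround, the app_config.xml stanza would look roughly like the sketch below. The app name and plan class are assumptions on my part (they follow the pattern in the Beta executable's file name), so check them against what your own event log shows:
[pre]
<app_config>
    <app_version>
        <app_name>einsteinbinary_BRP6</app_name>
        <plan_class>BRP6-Beta-cuda32-nv301</plan_class>
        <cmdline>--device 0</cmdline>
    </app_version>
</app_config>
[/pre]
The client passes whatever is in <cmdline> to the app as command-line options, so this simply pins every BRP6 Beta task to the first CUDA device.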
I know the focus will be on sorting out the issues with the CUDA version of the app, but I thought I'd post some very encouraging findings with the OpenCL app running on HD7850 GPUs. I have a large number of these GPUs in a variety of different hosts and I've been running BRP5 (4 concurrent tasks) on all of them, using app_config.xml to control concurrency. When BRP6 started, I added a BRP6 clause to all the app_config.xml files so that BRP6 would also run 4x to start with. My intention is to experiment with this when things settle down.
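For reference, the clause I'm talking about looks roughly like the sketch below - the cpu_usage figure is a placeholder and the exact app names are assumptions, so take the real names from your client's event log:
[pre]
<app_config>
    <app>
        <name>einsteinbinary_BRP5</name>
        <gpu_versions>
            <gpu_usage>0.25</gpu_usage>
            <cpu_usage>0.2</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>einsteinbinary_BRP6</name>
        <gpu_versions>
            <gpu_usage>0.25</gpu_usage>
            <cpu_usage>0.2</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
[/pre]
A gpu_usage of 0.25 is what lets four tasks share one GPU; cpu_usage just tells the scheduler how much of a CPU core to budget for each GPU task.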
The bulk of my HD7850 GPUs are in quite modern hosts, and I'm seeing nice (but not spectacular) performance improvements when going from 1.39 -> 1.41 -> 1.47. I also have several GPUs in older hardware, and it's the improvement there that is quite spectacular.
The previous BRP5 1.39 app performs very well in systems with a v2+ PCIe bus but quite poorly in v1.x systems. I have 18 hosts, more than 5 years old, with Q8400 quad-core CPUs (v1.x PCIe bus) that had run CPU apps only for their entire lives. Last year I played around with putting HD7850s in a few of these and achieved about 56% of the performance I could get from the same card in an Ivy Bridge i3 with 4 virtual cores - obviously quite a disappointing result.
When I ran the BRP6 1.41 app on these PCIe v1.x hosts, I saw commensurate performance - 33% longer crunch times for 33% bigger tasks giving 33% more credit. I had a suspicion that the BRP6 1.47 Beta app might improve this, so about 6 hours ago I set about deploying the beta app to these hosts and rebranding all the 1.41 tasks in their caches to 1.47. The first four 1.47 tasks, done mostly or entirely with the beta app on the first machine, have just finished, and I'm quite amazed at the results.
[pre]
Search     App Ver  Elapsed time  CPU time   Notes
======     =======  ============  =========  =====
BRP5       1.39     28,500 sec    3,750 sec  Long term averages for BRP5
BRP6       1.41     38,500 sec    4,950 sec  Averages for 20+ tasks for BRP6 v1.41
BRP6-beta  1.47     21,980 sec    2,230 sec  1st task - 18% on v1.41 and 82% on v1.47 app
BRP6-beta  1.47     20,567 sec    2,027 sec  2nd task - 100% on v1.47 app
BRP6-beta  1.47     19,053 sec    1,760 sec  3rd task - 100% on v1.47 app
BRP6-beta  1.47     22,509 sec    2,405 sec  4th task - 100% on v1.47 app
BRP6-beta  1.47     19,099 sec    2,072 sec  1st v1.47 task on different host but identical hardware
BRP6-beta  1.47     17,880 sec      664 sec  Pentium dual core G3258 (Haswell refresh) with HD7850 4x
[/pre]
The nice thing is that these old clunkers are now able to get performance out of the HD7850 that is much closer to what a Haswell-refresh-based system can do. I just grabbed a recently completed result from such a machine and put it in the above table for comparison.
Cheers,
Gary.
Thanks Gary!
I also noticed my ancient Core 2 CPU with version 1 PCIe seems to be running a whole lot faster with my Radeon 7970. :)
Bill
Hi!
Thanks for the feedback. Looks good.
However, the runtime is more data-dependent now, so there are "lucky" and "not-so-lucky" workunits with respect to runtime. The overall speedup may therefore be smaller than these examples suggest.
I have another idea up my sleeve that could improve the situation a bit for the "not-so-lucky" WUs. That will come later, though; first we need to fix this CUDA device assignment problem.
Cheers
HB
Keep an eye on (Windows) host 5744895 - all Parkes tasks are being run under Beta v1.47 as discussed, two-at-a-time on a GTX 670.
The variability in runtime is very marked, and correlates with high CPU usage as well. Two 'high CPU' tasks running together are sufficient to slow down the Arecibo BRP4 tasks running on the Intel HD 4000 at the same time.
Variable runtime on this scale is a particular problem, affecting work fetch and caching, while this project is still running DCF. The best of luck in sorting it out with your new idea - more power to your elbow (if that English idiom translates safely into German).
I am having problems with these. A GTX 970 and a GTX 660 Ti on an i7-3770K.
All fail after 2.5 seconds with:
7.4.27
Recursion too deep; the stack overflowed.
(0x3e9) - exit code 1001 (0x3e9)
Activated exception handling...
[22:49:54][11804][INFO ] Starting data processing...
[22:49:54][11804][ERROR] No suitable CUDA device available!
[22:49:54][11804][ERROR] Demodulation failed (error: 1001)!
22:49:54 (11804): called boinc_finish(1001)
New (BRP6 Beta) CUDA version 1.49 is out. This is meant to fix the API problem; there is no change to the actual computation code.
BM
Works for me, thank you.
Hi Bernd
Just got a v1.49 on host 10698787 that failed at 1 second...
Stderr output
7.4.36
(unknown error) - exit code -1073741819 (0xc0000005)
Activated exception handling...
[08:57:44][4948][INFO ] Starting data processing...
-------------------
Error occured on Tuesday, March 3, 2015 at 08:57:44.
C:\ProgramData\BOINC\projects\einstein.phys.uwm.edu\einsteinbinary_BRP6_1.49_windows_intelx86__BRP6-Beta-cuda32-nv301.exe caused an Access Violation at location 00000000 Reading from location 00000000.
Registers:
eax=00000000 ebx=0028fda0 ecx=76a998da edx=030846a0 esi=00000000 edi=007aae87
eip=00000000 esp=0027c1a0 ebp=00000000 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202
Call stack:
00000000
After an initial download failure (quickly resolved), I have one running under BOINC v7.4.36 on 5744895 - no sign of error.
But I still have the --device command line in place to finish a running v1.47, so it's not a true test yet. More later.