Gamma-ray pulsar binary search #1 on GPUs

Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2213591387
RAC: 390827


Christian Beer wrote:
We just switched over to a new dataset with an increased "payload" of science. These tasks are designed to run 5 times longer than the previous tasks. I'm going to monitor runtimes a bit over the next days and plan to refine the settings for flops_estimation and credit calculation a bit.

Could you also change the app speed? I don't mean the real running speed, but the internal BOINC value that is used to derive the runtime estimate from flops_estimation for a particular app.

I see a huge mismatch between the Gamma-ray pulsar GPU search and the other Einstein@Home sub-projects (especially the Multi-Directed Gravitational Wave Search), and wild DCF swings as a result.
For example, with a short batch of FGRPB1G WUs with flops_estimation = 105,000 GFLOPs, I got real run times in the 15-20 minute range, while the initial run-time estimate was > 1.5 hours. After a lot of FGRPB1G WUs were done, BOINC lowered the DCF to the ~0.2 range, so the run-time estimates and the real run times were brought to the same level.
But with such a low DCF of 0.2, other tasks, for example O1MD1G (flops_estimation = 144,000 GFLOPs), get an unrealistically low runtime estimate of about 1.5 hours, while the real run times are in the 7-9 hour range. This leads to a situation where BOINC downloads far too many O1MD1G tasks (roughly 5 times more than are actually needed to fill the CPU queue).
After completing a few O1MD1G WUs with actual runtimes at least 5 times longer than the estimates, BOINC resets the DCF to > 1 and can fall into "panic mode": now that all the excess jobs downloaded in the previous step have realistic runtime estimates, it realizes it probably cannot finish them all in time (before the deadline).
But with each FGRPB1G task finished, the DCF (and with it the estimated run times) goes down again; after a while the DCF is back in the 0.2-0.25 range and the whole cycle starts over.
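(A rough sanity check, using the estimation formula from the P.S. below and the numbers quoted here: an O1MD1G task with flops_estimation = 144,000 GFLOPs against an app rating of ~5.7 GFLOPS gives a nominal estimate of about 25,000 s, roughly 7 h; multiplied by a DCF of 0.2 that becomes about 1.4 h, which matches the unrealistic ~1.5 h estimate, while the real run time stays at 7-9 hours.)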

I think this is because the FGRPB1G app really is that fast, so flops_estimation and credit should stay the same. We have a good reference point: the same WUs are also done on CPUs, with the same flops_estimation and the same credit rating, but MUCH longer run times, because CPU computation is much slower.
I do not see any reason why one and the same piece of work should be valued differently based only on which tool (CPU or GPU) was used to complete it.
I do not know exactly how it is set, but it shows up in the BOINC client_state.xml files:

<app_version>
    <app_name>einstein_O1MD1G</app_name>
    <version_num>100</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>5739338488.370526</flops>
    <plan_class>AVX</plan_class>
    <api_version>7.1.0</api_version>
</app_version>

 

<app_version>
    <app_name>hsgamma_FGRPB1G</app_name>
    <version_num>105</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>3188521382.428070</flops>
    <plan_class>FGRPSSE</plan_class>
    <api_version>7.3.0</api_version>
</app_version>

<app_version>
    <app_name>hsgamma_FGRPB1G</app_name>
    <version_num>117</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>0.500000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>22319649676.996490</flops>
    <plan_class>FGRPopencl-ati</plan_class>
    <api_version>7.3.0</api_version>
    <file_ref>
        <file_name>hsgamma_FGRPB1G_1.17_windows_x86_64__FGRPopencl-ati.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>

The GPU app gets only a 22 GFLOPS rating, so it is rated as just 4-7 times faster than the CPU versions (3.1-5.7 GFLOPS), while the real difference is more like 20-25 times (the same tasks run 6-8 hours on a CPU and only 15-20 minutes on the GPU).

And this is the real reason for the incorrect runtime estimates and the DCF swings. Instead of the DCF being lowered for the entire project, or flops_estimation being lowered for the WUs, we need a higher <flops> value for this specific app (hsgamma_FGRPB1G).
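(Another rough check from the numbers above: an FGRPB1G GPU task with flops_estimation = 105,000 GFLOPs finishing in 15-20 minutes works out to an effective rate of roughly 90-115 GFLOPS, i.e. about 4-5 times the ~22 GFLOPS currently stored in <flops>, which is consistent with the DCF settling around 0.2-0.25.)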

 P.S.

AFAIK BOINC uses this formula:

Estimated run-time = flops_estimation / flops * DCF

flops_estimation - can be set per individual WU on the server side

flops - taken from the application (app version) settings (not sure how this is set)

DCF - calculated by the local BOINC client for each attached project (one value for the entire project)
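
For anyone who wants to play with these numbers, here is a minimal sketch of that formula, assuming it is applied exactly as written above (the helper name is made up; the <flops> values are the rounded ones from the client_state.xml excerpts):

    # Rough sketch of the runtime estimate described above (hypothetical helper, not BOINC source).
    def estimated_runtime_s(flops_estimation_gflops, app_flops, dcf):
        # flops_estimation is given in GFLOPs on the server side; <flops> is in FLOPS in client_state.xml
        return flops_estimation_gflops * 1e9 / app_flops * dcf

    # FGRPB1G GPU task: the estimate only matches the observed 15-20 min after the DCF drops to ~0.2
    print(estimated_runtime_s(105_000, 22_319_649_677, 0.2) / 60)    # ~15.7 minutes
    # O1MD1G CPU task: the same low DCF makes a ~7 h task look like ~1.4 h
    print(estimated_runtime_s(144_000, 5_739_338_488, 0.2) / 3600)   # ~1.4 hours
    print(estimated_runtime_s(144_000, 5_739_338_488, 1.0) / 3600)   # ~7.0 hours without DCF scaling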

Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0


I averaged the run time of 10 samples each of the smaller tasks and the new larger tasks. This includes samples processed on my AMD 7970 and R9-290x cards in Linux.

What I am seeing is that the run time for the larger tasks is approximately 3.4 times longer on the 7970 and 3.43 times longer on the R9-290x compared to the smaller tasks from before.

I forgot to mention last time that the progress counter is working well with the latest Linux and Windows applications. The counter runs up to 89.997% and then halts for a short while before completion. The CPU core load generally is at 25-35% per task and goes up to 50-60% during the final processing at the 89.997% mark.

I have also found that GPU load is more sporadic when running 1 task per GPU and bounces around in the range of 0-100% according to aticonfig. When running 2 tasks per GPU, the GPU load generally is in the range of 92-100%.

LunaticM
Joined: 6 Dec 15
Posts: 3
Credit: 16953703
RAC: 0


On my system with 3 tasks running simultaneously on a GTX 1070, the run time increased from about 730 s to 2500 s. I would recommend multiplying the granted credit by 3-4.
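(For reference, that is 2500 s / 730 s ≈ 3.4, so a 3-4x credit multiplier would roughly match the increase in run time.)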

 

Also, even with 3 tasks running, the GPU utilization still won't max out; sometimes it drops to 50%.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 188455145
RAC: 247758


I fixed the Credit issue. You should now also get 5 times the credit compared to before, sorry for that. I'm looking into the DCF/speedup issue now. I'm also running some benchmark tasks on a GTX 750 Ti and comparing them to the BRP4G search. I'm trying to adjust credit so it can be compared to BRP4G. I already get stable runtimes of 1 h for BRP4G (1x) on this system.

The overall problem is that the FGRPB1G search behaves a little differently from BRP4G, so we have to turn the knobs a little from time to time and see what the result is. If some of you could monitor your DCF and report any changes, that would be great.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 188455145
RAC: 247758


Other changes I just made while waiting for the benchmarks to finish: I doubled the speedup of all FGRPB1G apps, so you should see an effect on DCF within the next few days.

I'm looking into deploying Beta apps that are the same as the current 1.17 versions but have a lower minimum VRAM requirement so GPUs with more than 766 MB VRAM can be tested. Currently the limit is 1GB VRAM because a task typically uses 750 MB and we want some buffer for normal operations.

freestman
Joined: 16 Jun 08
Posts: 33
Credit: 1993852370
RAC: 90468


Why does my Nvidia card (GTX 760) run so slowly?


poppageek
Joined: 13 Aug 10
Posts: 259
Credit: 2473733872
RAC: 0


Thank you CB and BM for all your hard work and keeping us well informed.

 

I hope you have a chance over the holidays to rest and enjoy yourselves. :-)

 

Happy Holidays everyone!!

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0


Christian Beer wrote:
I'm looking into deploying Beta apps that are the same as the current 1.17 versions but have a lower minimum VRAM requirement so GPUs with more than 766 MB VRAM can be tested. Currently the limit is 1GB VRAM because a task typically uses 750 MB and we want some buffer for normal operations.

Thanks Christian, I'll keep an eye out for these. There are a number of different values for VRAM being reported, so I'm not sure which one is the reference value.

https://einsteinathome.org/host/4918234/log

      [version] OpenCL GPU RAM required min: 1071644672.000000, supplied: 743047168

boinc.log reports

22-Dec-2016 09:38:53 [---] CUDA: NVIDIA GPU 0: GeForce GTX 460 (driver version 367.57, CUDA version 8.0, compute capability 2.1, 701MB, 291MB available, 978 GFLOPS peak)

22-Dec-2016 09:38:53 [---] CUDA: NVIDIA GPU 1: GeForce GTX 460 (driver version 367.57, CUDA version 8.0, compute capability 2.1, 709MB, 683MB available, 978 GFLOPS peak)

22-Dec-2016 09:38:53 [---] OpenCL: NVIDIA GPU 0: GeForce GTX 460 (driver version 367.57, device version OpenCL 1.1 CUDA, 701MB, 291MB available, 978 GFLOPS peak)

22-Dec-2016 09:38:53 [---] OpenCL: NVIDIA GPU 1: GeForce GTX 460 (driver version 367.57, device version OpenCL 1.1 CUDA, 709MB, 683MB available, 978 GFLOPS peak)

and clinfo reports

  Global memory size                              735379456 (701.3MiB)
  Max memory allocation                           183844864 (175.3MiB)

  Global memory size                              743047168 (708.6MiB)
  Max memory allocation                           185761792 (177.2MiB)

and the Nvidia Control Panel reports

  Total Memory                            768 MB (which is the normal advertised value)

  Total Dedicated Memory matches the BOINC values (701/708 MB)

 

So it looks like it's the maximum of the two cards' clinfo values.
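
A quick conversion of the byte counts quoted above (just arithmetic on the numbers in this post, sketched in Python) supports that reading:

    # Convert the scheduler-log and clinfo byte counts above to MiB.
    MIB = 1024 ** 2
    print(1071644672 / MIB)   # 1022.0 MiB -> "OpenCL GPU RAM required min", i.e. the ~1 GB limit
    print(743047168 / MIB)    # 708.6 MiB  -> "supplied", matching GPU 1's clinfo global memory size
    print(735379456 / MIB)    # 701.3 MiB  -> GPU 0's clinfo value, the smaller of the two cards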

Jonathan Jeckell
Joined: 11 Nov 04
Posts: 114
Credit: 1341977519
RAC: 941


Weird.  I haven't noticed a significant increase in invalid work units on either my Hackintosh or my genuine MacBookPro NVIDIA cards running CUDA 8 on macOS Sierra. 

 

But I am seeing that the CPU units seem to run 12-20% slower on the Macs than on Linux. In the case of the Hackintosh, this was observed on the exact same hardware configuration.

TimeLord04
Joined: 8 Sep 06
Posts: 1442
Credit: 72378840
RAC: 0


Jonathan Jeckell wrote:

Weird.  I haven't noticed a significant increase in invalid work units on either my Hackintosh or my genuine MacBookPro NVIDIA cards running CUDA 8 on macOS Sierra. 

 

But I am seeing that the CPU units seem to run 12-20% slower on the Macs than on Linux. In the case of the Hackintosh, this was observed on the exact same hardware configuration.

Jonathan,

 

Running CUDA, you won't notice the problem. CUDA is a different process from OpenCL, and the bug is strictly with the OpenCL units. If you are using TBar's CUDA75 app, or the special CUDA75 app by Petri33, at SETI, then there will be few to no invalids at SETI. Also, there was talk about having SETI "block" OpenCL units from going to Macs with Darwin 15.4.0 or newer; I don't know if that got implemented...

 

If, however, you are crunching OpenCL units, you WILL eventually see more and more invalids in your results list.

 

TL

TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join SETI Refugees
