Gamma-ray pulsar binary search #1 on GPUs

Mad_Max
Joined: 2 Jan 10
Posts: 153
Credit: 2134779636
RAC: 446606


Christian Beer wrote:
We just switched over to a new dataset with an increased "payload" of science. These tasks are designed to run 5 times longer than the previous tasks. I'm going to monitor runtimes a bit over the next days and plan to refine the settings for flops_estimation and credit calculation a bit.

Can you also change the app speed? I don't mean the real running speed, but the internal BOINC value that is used to derive the runtime estimate from flops_estimation for a particular app.

I see a huge mismatch between the Gamma-ray pulsar GPU app and other Einstein@Home subprojects (especially the Multi-Directed Gravitational Wave Search), along with wild DCF swings.
For example, with a short batch of FGRPB1G WUs with flops_estimation = 105,000 GFLOPs, I got real run-times in the 15-20 minute range, while the initial run-time estimate was > 1.5 hours. After a lot of FGRPB1G WUs were done, BOINC lowered the DCF to the ~0.2 range, so the run-time estimates and the real run-times were brought to the same level.
But with such a low DCF of 0.2, other tasks, for example O1MD1G (flops_estimation = 144,000 GFLOPs), get unrealistically low runtime estimates in the ~1.5 hour range, while the real run-times are in the 7-9 hour range. This leads to a situation where BOINC downloads too many O1MD1G tasks (~5 times more than is actually needed to fill the CPU queue).
And after completing a few O1MD1G WUs with actual runtimes at least 5 times longer than the estimates, BOINC resets the DCF to >1 and can fall into "panic mode": once all the excess jobs downloaded in the previous step get realistic runtime estimates, it realizes it probably cannot finish them all before the deadline.
But with each FGRPB1G task finished, the DCF (and so the estimated run-times) goes down again, and after a while the DCF is back in the 0.2-0.25 range and the whole cycle starts over.

I think this is because the FGRPB1G GPU app really is that fast, so flops_estimation and credit should stay the same. We have a good reference point: the same WUs are also done on CPUs, with the same flops_estimation and the same credit, but MUCH longer run-times because the CPU computation is much slower.
I do not see any reason why the same work should be valued differently based only on which tool (CPU or GPU) was used to complete it.
I do not know how this is set on the server side, but on the client it shows up in the BOINC client_state.xml file:

<app_version>
    <app_name>einstein_O1MD1G</app_name>
    <version_num>100</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>5739338488.370526</flops>
    <plan_class>AVX</plan_class>
    <api_version>7.1.0</api_version>
</app_version>

 

<app_version>
    <app_name>hsgamma_FGRPB1G</app_name>
    <version_num>105</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>3188521382.428070</flops>
    <plan_class>FGRPSSE</plan_class>
    <api_version>7.3.0</api_version>
</app_version>

<app_version>
    <app_name>hsgamma_FGRPB1G</app_name>
    <version_num>117</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>0.500000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>22319649676.996490</flops>
    <plan_class>FGRPopencl-ati</plan_class>
    <api_version>7.3.0</api_version>
    <file_ref>
        <file_name>hsgamma_FGRPB1G_1.17_windows_x86_64__FGRPopencl-ati.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>

The GPU app gets only a ~22 GFLOPS rating, so only 4-7 times faster than the CPU versions (3.1-5.7 GFLOPS), while the real difference is more like ~20-25 times (the same tasks run 6-8 hours on a CPU and only 15-20 minutes on a GPU).

And this is the real reason for the incorrect runtime estimates and the DCF swings. Instead of lowering the DCF for the entire project, or lowering flops_estimation for the WUs, we need a higher <flops></flops> value for this specific app (hsgamma_FGRPB1G).
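To illustrate, here is a rough back-of-the-envelope sketch in Python (just the numbers from above; the 17.5 min value is my assumption, the midpoint of the observed 15-20 min run-times, so treat it as an approximation):

# what <flops> value would match the observed GPU run-times?
flops_estimation = 105_000e9   # 105,000 GFLOPs per FGRPB1G task (server-side value)
observed_runtime = 17.5 * 60   # midpoint of my 15-20 min GPU run-times, in seconds
implied_flops = flops_estimation / observed_runtime
print(implied_flops / 1e9)     # ~100 GFLOPS, vs the ~22 GFLOPS in client_state.xml

So the app's <flops> rating would need to be roughly 4-5 times higher than it is now to match reality.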

 P.S.

AFAIK BOINC uses this formula:

Estimated run-time = flops_estimation / flops * DCF

flops_estimation - can be set at the individual WU level on the server side

flops - taken from the application settings (not sure how this is set)

DCF - calculated by the local BOINC client for each connected project (one value for the entire project)
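Plugging my numbers into this formula shows the mismatch (a rough Python sketch; the flops values are the ones from my client_state.xml above, and DCF = 0.2 is roughly what I observe after a run of fast GPU tasks):

def estimated_runtime(flops_estimation, flops, dcf):
    # AFAIK this is how BOINC estimates run-time: flops_estimation / flops * DCF
    return flops_estimation / flops * dcf

dcf = 0.2  # after many fast FGRPB1G results
# FGRPB1G on GPU: 105,000 GFLOPs task, app rated ~22.3 GFLOPS
print(estimated_runtime(105_000e9, 22.32e9, dcf) / 60)    # ~16 minutes - matches reality
# O1MD1G on CPU: 144,000 GFLOPs task, app rated ~5.7 GFLOPS
print(estimated_runtime(144_000e9, 5.74e9, dcf) / 3600)   # ~1.4 hours - but real run-times are 7-9 hours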

Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 556


I averaged the run time of 10 samples each of the smaller tasks and the new larger tasks. This includes samples processed on my AMD 7970 and R9-290x cards in Linux.

What I am seeing is that the run time for the larger tasks is approximately 3.4 times longer on the 7970 and 3.43 times longer on the R9-290x compared to the smaller tasks from before.

I forgot to mention last time that the progress counter is working well with the latest Linux and Windows applications. The counter runs up to 89.997% and then halts for a short while before completion. The CPU core load generally is at 25-35% per task and goes up to 50-60% during the final processing at the 89.997% mark.

I have also found that GPU load is more sporadic when running 1 task per GPU and bounces around in the range of 0-100% according to aticonfig. When running 2 tasks per GPU, the GPU load generally is in the range of 92-100%.

LunaticM
Joined: 6 Dec 15
Posts: 3
Credit: 16953703
RAC: 0


On my system with 3 tasks running simultaneously on a GTX 1070, the run time increased from about 730 s to 2500 s. I would recommend multiplying the granted credits by 3-4 (2500/730 ≈ 3.4).

 

Also, even with 3 tasks running, the GPU utilization still doesn't max out; sometimes it drops to 50%.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 118575213
RAC: 112854


I fixed the credit issue. You should now also get 5 times the credit compared to before, sorry for that. I'm looking into the DCF/speedup issue now. I'm also running some benchmark tasks on a GTX 750 Ti and comparing them to the BRP4G search. I'm trying to adjust credit so it can be compared to BRP4G. I already get stable runtimes of 1 h for BRP4G (1x) on this system.

The overall problem is that the FGRPB1G search behaves a little differently from BRP4G, so we have to turn the knobs a bit from time to time and see what the result is. If some of you could monitor your DCF and report any changes, that would be great.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 118575213
RAC: 112854


Other changes I made while waiting for the benchmarks to finish: I doubled the speedup of all FGRPB1G apps, so you should see an effect on DCF within the next few days.

I'm looking into deploying Beta apps that are the same as the current 1.17 versions but have a lower minimum VRAM requirement so GPUs with more than 766 MB VRAM can be tested. Currently the limit is 1GB VRAM because a task typically uses 750 MB and we want some buffer for normal operations.

freestman
Joined: 16 Jun 08
Posts: 33
Credit: 1973467335
RAC: 17641


Why does the Nvidia card run so slowly? (GTX 760)


poppageek
Joined: 13 Aug 10
Posts: 259
Credit: 2473733122
RAC: 0


Thank you CB and BM for all your hard work and keeping us well informed.

 

I hope you have a chance over the holidays to rest and enjoy yourselves. :-)

 

Happy Holidays everyone!!

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0


Christian Beer wrote:
I'm looking into deploying Beta apps that are the same as the current 1.17 versions but have a lower minimum VRAM requirement so GPUs with more than 766 MB VRAM can be tested. Currently the limit is 1GB VRAM because a task typically uses 750 MB and we want some buffer for normal operations.

Thanks Christian, I'll keep an eye out for these. There are a number of different values being reported for VRAM, so I'm not sure which one is the reference value.

https://einsteinathome.org/host/4918234/log

      [version] OpenCL GPU RAM required min: 1071644672.000000, supplied: 743047168

boinc.log reports

22-Dec-2016 09:38:53 [---] CUDA: NVIDIA GPU 0: GeForce GTX 460 (driver version 367.57, CUDA version 8.0, compute capability 2.1, 701MB, 291MB available, 978 GFLOPS peak)

22-Dec-2016 09:38:53 [---] CUDA: NVIDIA GPU 1: GeForce GTX 460 (driver version 367.57, CUDA version 8.0, compute capability 2.1, 709MB, 683MB available, 978 GFLOPS peak)

22-Dec-2016 09:38:53 [---] OpenCL: NVIDIA GPU 0: GeForce GTX 460 (driver version 367.57, device version OpenCL 1.1 CUDA, 701MB, 291MB available, 978 GFLOPS peak)

22-Dec-2016 09:38:53 [---] OpenCL: NVIDIA GPU 1: GeForce GTX 460 (driver version 367.57, device version OpenCL 1.1 CUDA, 709MB, 683MB available, 978 GFLOPS peak)

and clinfo reports

  Global memory size                              735379456 (701.3MiB)
  Max memory allocation                           183844864 (175.3MiB)

  Global memory size                              743047168 (708.6MiB)
  Max memory allocation                           185761792 (177.2MiB)

and Nvidia Control panel

  Total Memory                            768MB (which is the normal advertised value)

  Total Dedicated Memory matches the BOINC values (701/708 MB)

 

So it looks like it's the maximum (over the two cards) of the clinfo values.
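For reference, a quick sketch (Python) converting these byte values to MiB and comparing them against the minimum from the scheduler log above:

MiB = 1024 * 1024
required = 1071644672   # "OpenCL GPU RAM required min" from the scheduler log
supplied = 743047168    # "supplied" value = the larger of the two clinfo global memory sizes
print(required / MiB)   # ~1022 MiB, i.e. the current ~1 GB minimum
print(supplied / MiB)   # ~708.6 MiB, so these 768 MB cards fall short of it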

Jonathan Jeckell
Joined: 11 Nov 04
Posts: 114
Credit: 1341945207
RAC: 1


Weird.  I haven't noticed a significant increase in invalid work units on either my Hackintosh or my genuine MacBookPro NVIDIA cards running CUDA 8 on macOS Sierra. 

 

But I am seeing that the CPU units seem to run 12-20% slower on the Macs than on Linux. In the case of the Hackintosh, this was observed on the exact same hardware configuration.

TimeLord04
Joined: 8 Sep 06
Posts: 1442
Credit: 72378840
RAC: 0


Jonathan Jeckell wrote:

Weird.  I haven't noticed a significant increase in invalid work units on either my Hackintosh or my genuine MacBookPro NVIDIA cards running CUDA 8 on macOS Sierra. 

 

But I am seeing that the CPU units seem to run 12-20% slower on the Macs than on Linux. In the case of the Hackintosh, this was observed on the exact same hardware configuration.

Jonathan,

 

Running CUDA, you won't notice the problem. CUDA is a different process from OpenCL; the bug is strictly with OpenCL units. If you are using TBar's CUDA75 app, or the special CUDA75 app by Petri33, at SETI, then there will be few to no invalids at SETI. Also, there was talk about having SETI "block" OpenCL units from going to Macs with Darwin 15.4.0 or newer. I don't know if that got implemented...

 

If, however, you are crunching OpenCL units, you WILL eventually see more and more invalids in your results list.

 

TL

TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join SETI Refugees
