Gamma-ray pulsar binary search #1 on GPUs

LunaticM
Joined: 6 Dec 15
Posts: 3
Credit: 16953703
RAC: 0

The granted credit of my last result is abnormal... any suggestion why?

Task ID    Computer  Sent (UTC)             Reported (UTC)         Status                   Run time (s)  CPU time (s)  Claimed credit  Granted credit
266821383  12460666  22 Dec 2016, 18:02:32  22 Dec 2016, 21:19:05  Completed and validated      2,519.50      2,402.97           24.97        1,365.00
266744089  12460666  22 Dec 2016, 03:20:56  22 Dec 2016, 18:02:31  Completed and validated      2,694.85      2,395.45           24.89        3,465.00
266744036  12460666  22 Dec 2016, 03:20:56  22 Dec 2016, 16:57:50  Completed and validated      2,682.26      2,358.56           24.51        3,465.00

All three tasks: Gamma-ray pulsar binary search #1 on GPUs v1.17 (FGRPopencl-nvidia) windows_x86_64.
Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2213614720
RAC: 389951

LunaticM wrote:

On my system with 3 tasks running simultaneously on a GTX 1070, the run time increased from about 730 s to 2,500 s. I would recommend multiplying the granted credit by 3-4.

Also, even with 3 tasks running, the GPU utilization still won't max out; sometimes it drops to 50%.

Yes, my observations and statistics are about the same: on AMD GPUs the new WU batch runs 3-3.5 times longer.

I also looked at the WU internal data, and it looks like the old WUs each processed 350 binary points from the Fermi data set, while each WU in the new batch processes 1255 binary points.

So 1255/350 = 3.58x more scientific data is processed per WU, and ~3.5 times longer runtimes is good linear scaling.

Granted credit should probably be set at the ~3.58x level too; 5x more credit looks like a moderate overvaluation.
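
A quick back-of-the-envelope check of that scaling (a sketch in Python; the numbers come from this thread, and the ~700-credit figure for the old WUs is the rough per-task value Christian Beer mentions further down):

```python
# Sketch: scale the old per-task credit by the increase in science payload.
old_points = 350        # binary points per old-style WU
new_points = 1255       # binary points per new-batch WU
old_credit = 700        # rough credit per old WU (~700, per Christian Beer below)

scale = new_points / old_points
print(f"payload scaling:        {scale:.2f}x")              # ~3.59x
print(f"credit at that scaling: {old_credit * scale:.0f}")  # ~2510
print(f"credit at the 5x level: {old_credit * 5}")          # 3500, close to the 3,465 granted
```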

Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2213614720
RAC: 389951

Christian Beer wrote:

Other changes I just made while I wait for the benchmarks to finish: I doubled the speedup of all FGRPB1G apps, so you should see an effect on DCF within the next few days.

If some of you could monitor your DCF and report any changes, that would be great.

Yes, I see the <flops></flops> value rising roughly 2x for <app_name>hsgamma_FGRPB1G</app_name>.

In my case it went from ~22 GFLOPS to ~45 GFLOPS. DCF is correcting too, but not by much, because you also increased the flops_estimation for the new WU batch from 105,000 GFLOPs to 525,000 GFLOPs (5x), while the real runtimes are only ~3.5 times longer.

So the net result is that DCF only goes up from ~0.2 to ~0.3 while BOINC runs FGRPB1G (and DCF is still >1 while BOINC runs CPU tasks from E@H).

Looks like we need at least another ~3x increase in the speedup (the estimate of the FGRPB1G app speed) if you want to keep DCF near ~1 and matched to the other E@H subprojects, or a lower flops_estimation for the FGRPB1G WUs.
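
A minimal sketch of why DCF only climbs from ~0.2 to ~0.3, assuming the raw runtime estimate is rsc_fpops_est divided by the app <flops> value and that DCF settles near actual_runtime / raw_estimate (the real client applies smoothing, so this is only an approximation):

```python
# Rough model of the DCF shift reported above; the input numbers are from this post,
# the estimate formula (rsc_fpops_est / flops) and the DCF equilibrium
# (actual / raw estimate) are simplifying assumptions about the BOINC client.

old_fpops_est = 105_000e9   # FLOPs per old WU (105,000 GFLOPs)
new_fpops_est = 525_000e9   # FLOPs per new WU (525,000 GFLOPs, 5x)
old_flops = 22e9            # app <flops> before the speedup doubling
new_flops = 45e9            # app <flops> after

old_dcf = 0.2                                      # observed equilibrium before the change
old_runtime = old_dcf * old_fpops_est / old_flops  # implied actual runtime, ~955 s
new_runtime = 3.5 * old_runtime                    # new WUs run ~3.5x longer

new_dcf = new_runtime / (new_fpops_est / new_flops)
print(f"expected new DCF: ~{new_dcf:.2f}")         # ~0.29, matching the observed ~0.3
```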

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7223684931
RAC: 1001898

Christian Beer wrote:
If some of you could monitor your DCF and report any changes, that would be great.

I have four machines running this work.  I've spent some hours since you wrote this request checking the DCF now and again, and recording the min and max values seen on each machine.

In all four cases there is reason to expect the DCF value to breathe.  Three of the four run purely GPU work, only of this type, but each has two GPU units, of somewhat unmatched speed.  I run the GPUs at 2X, and as none of the machines has more than four (even virtual) cores, I currently run zero CPU tasks on them.  The fourth machine has only a single GPU, but as it has eight virtual cores (four physical) I do run a couple of GW CPU tasks on it in addition to the two FGRPB1 tasks.  That one, of course, "breathes" in DCF depending on how many GPU tasks have finished since the most recent CPU task. 

So here are the answers:

DCF min  DCF max  GPU 1  GPU 2  CPU tasks
0.65     1.08     1050   none   2
0.46     0.61     1060   1050   none
0.57     0.73     970    750Ti  none
0.20     0.25     1070   1060   none

Before the recent change the 1070 + 1060 machine got DCF values as low as not much above 0.1!

I suspect DCF results from machines with a single GPU, not running CPU tasks, and not running work from other projects will stabilize to a tighter range, and may be more useful for the intended purpose. Still, I suspect the results may vary quite a lot with the particular GPU model, and also enough to matter with host platform characteristics. More reports are needed.

 

Stranger7777
Joined: 17 Mar 05
Posts: 436
Credit: 429823520
RAC: 78147

All my GPUs are still waiting for the 32-bit version, since no other GPU work is available for them now.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 188455838
RAC: 247221

I'm not sure if plan_class changes are instantaneous or take some time, so I'll wait some more before I change anything. archae86's values look a lot like what I would have expected from doubling the speedup. Maybe Mad_Max's values will also stabilize around 0.4.

As I said, I've done some benchmarks in the meantime on a system with a GTX 750 Ti (and an idle CPU). The BRP4G workunits took almost exactly 1 h to finish. Using the p_fpops value of this system and the formulas in the scheduler, I calculated a theoretical speedup of 20 for the BRP4G app, while the real speedup was set to 15. Since this had been running for quite some time and worked well, I'm trying to emulate this ratio for FGRPB1 too, although the new search uses much more CPU than BRP4G, which is not considered in the formulas used to calculate estimated runtimes. The theoretical speedup of FGRPB1G is 29 (the runtime on the reference system was 1 h 18 min), while it is currently set to 14. Keeping the same theoretical-to-real ratio as BRP4G, my next step is to set the speedup to 20 and see what that does to your estimates and DCF calculation. But that will not happen today.
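
For illustration, here is how those numbers fit together, assuming the "same ratio" rule is applied literally (a sketch; the actual scheduler formulas and p_fpops values are not shown in the post):

```python
# Sketch of the proposed speedup change; all four input numbers are from the post,
# the scaling rule (keep BRP4G's theoretical-to-real ratio) is as described there.

brp4g_theoretical   = 20  # from p_fpops and the ~1 h GTX 750 Ti benchmark
brp4g_real          = 15  # value that had been in use for BRP4G
fgrpb1g_theoretical = 29  # from the 1 h 18 min reference run
fgrpb1g_current     = 14  # value in use at the time of the post

ratio = brp4g_real / brp4g_theoretical   # 0.75
proposed = fgrpb1g_theoretical * ratio   # ~21.8, set to 20 in practice
print(f"proposed FGRPB1G speedup: ~{proposed:.1f} (to be set to 20, up from {fgrpb1g_current})")
```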

To also clear up the credit issue: the values we used in the beginning (~700 per task) were just rough estimates and rather arbitrary. When we increased the science payload by 5x we also increased the credit by 5x (to ~3500) without checking runtimes. A check in the validator then prevented this new value from being used, and credit was clamped to 700 per task. Soon after I fixed that, I finished the benchmark and calculated that the new credit-per-task value, based on BRP4G (1000 credits for 1 h of GPU time), should be much lower. Since the amount of credit is set when a workunit is created, there are now three different credit values in the system. I fixed that by clamping the value to 1365 in the validator just now. So in terms of credit we should be back at the same level we had with the BRP4G search. Of course you need to give the system some time until your RAC shows that and stabilizes again.
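
A rough consistency check of where the 1365 clamp sits, assuming credit scales linearly with GPU runtime at the BRP4G rate of ~1000 credits per hour (the exact clamp value is set in the validator and may reflect other factors):

```python
# Sketch: compare the runtime implied by the 1365-credit clamp
# with the 1 h 18 min FGRPB1G reference runtime mentioned above.

credit_rate = 1000              # credits per hour of GPU time (BRP4G baseline)
clamp_credit = 1365             # new per-task value set in the validator
reference_runtime_h = 78 / 60   # 1 h 18 min reference run

print(f"runtime implied by the clamp: {clamp_credit / credit_rate:.2f} h")  # ~1.37 h
print(f"reference runtime:            {reference_runtime_h:.2f} h")         # 1.30 h
```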

The issue with wrong estimated time and DCF is still under investigation.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Thanks Christian, will the FGRPSSE (CPU tasks) values remain the same? 

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 188455838
RAC: 247221

AgentB wrote:
Thanks Christian, will the FGRPSSE (CPU tasks) values remain the same? 

We didn't touch the CPU search so there is no need to change anything there.

rbpeake
Joined: 18 Jan 05
Posts: 266
Credit: 1132317797
RAC: 755693

Just curious, what is the difference between the CPU and the GPU searches, scientifically?  Thanks!

Filipe
Joined: 10 Mar 05
Posts: 186
Credit: 406537509
RAC: 361659

Stranger7777 wrote:
All my GPUs are still waiting for the 32-bit version, since no other GPU work is available for them now.

 

+1

 
