On my system with 3 tasks running simultaneously on a GTX 1070, the runtime increased from about 730s to 2500s. I would recommend multiplying the granted credits by 3-4.
Also, even with 3 tasks running, GPU utilization still won't max out - sometimes it drops to 50%.
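In numbers, the suggested multiplier follows directly from the runtime ratio (a quick sketch using only the figures quoted above):

```python
# Rough sketch: derive a credit multiplier from the observed runtime change
# on a GTX 1070 running 3 tasks at once (figures quoted above).
old_runtime_s = 730   # old WU batch, seconds per task
new_runtime_s = 2500  # new WU batch, seconds per task

ratio = new_runtime_s / old_runtime_s
print(f"runtime ratio: {ratio:.2f}x")  # ~3.42x, hence the 3-4x suggestion
```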
Yes, my observations and statistics are about the same - on AMD GPUs the new WU batch runs 3-3.5 times longer.
I also opened the WU internal data, and it looks like the old WUs each processed 350 binary points from the Fermi data set, while the new batch processes 1255 binary points per WU.
So 1255/350 = 3.58x more scientific data processed per WU, and ~3.5 times longer runtimes is good linear scaling.
Granted credit should probably be set at the ~3.58x level too; x5 more credit looks like a moderate overvaluation.
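A minimal sketch of that scaling argument, using only the numbers quoted above (nothing here is official project policy):

```python
# Science payload per WU, from the WU internal data quoted above.
old_points = 350
new_points = 1255

payload_factor = new_points / old_points  # ~3.59x more data per WU
runtime_factor = 3.5                      # observed runtime increase (AMD GPUs)
credit_factor_used = 5.0                  # what the project applied

print(f"payload grew {payload_factor:.2f}x, runtime grew ~{runtime_factor}x")
# Credit scaled by 5x while the work only grew ~3.6x -> ~40% overvaluation:
print(f"overvaluation: {credit_factor_used / payload_factor:.2f}x")
```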
Other changes I just made while I wait for the benchmarks to finish: I doubled the speedup of all FGRPB1G apps, so you should see an effect on DCF within the next few days.
If some of you could monitor your DCF and report any changes, that would be great.
Yes, I see the <flops></flops> count rising ~2 times for <app_name>hsgamma_FGRPB1G</app_name>.
In my case it went from ~22 GFLOPS to ~45 GFLOPS. DCF is correcting too, but not by much - because you also increased the flops estimation for the new WU batch from 105,000 GFLOPs to 525,000 GFLOPs (x5) while real runtimes are only ~3.5 times longer.
So DCF only goes up from ~0.2 to ~0.3 in total while BOINC runs FGRPB1G (and DCF is still >1 while BOINC runs CPU tasks from E@H).
Looks like we need at least another 3x increase in speedup (the estimate of FGRPB1G app speed) if you want to keep DCF near ~1 and matched to the other E@H subprojects. Or lower the flops estimation for FGRPB1G WUs.
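The reported DCF shift is consistent with DCF tracking actual runtime over estimated runtime, where the estimate is the WU's fpops estimate divided by the app's advertised speed. A back-of-the-envelope check, assuming exactly the factors quoted above:

```python
# Back-of-the-envelope check of the reported DCF shift, assuming
# DCF ~ actual_runtime / estimated_runtime and
# estimated_runtime = rsc_fpops_est / app_flops.
flops_est_factor = 525_000 / 105_000  # WU fpops estimate grew 5x
app_speed_factor = 45 / 22            # <flops> roughly doubled (~2.05x)
runtime_factor = 3.5                  # observed runtime increase

# Estimated runtime grew 5x / ~2x = ~2.4x while actual runtime grew 3.5x,
# so DCF should rise by roughly 3.5 / 2.4:
dcf_factor = runtime_factor / (flops_est_factor / app_speed_factor)
print(f"expected DCF change: x{dcf_factor:.2f}")  # ~1.43, i.e. 0.2 -> ~0.29
```

That lands almost exactly on the observed move from ~0.2 to ~0.3.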
I have four machines running this work. I've spent some hours since you wrote this request checking the DCF now and again, and recording the min and max values seen on each machine.
In all four cases there is reason to expect the DCF value to breathe. Three of the four run purely GPU work, only of this type, but each has two GPU units, of somewhat unmatched speed. I run the GPUs at 2X, and as none of the machines has more than four (even virtual) cores, I currently run zero CPU tasks on them. The fourth machine has only a single GPU, but as it has eight virtual cores (four physical) I do run a couple of GW CPU tasks on it in addition to the two FGRPB1 tasks. That one, of course, "breathes" in DCF depending on how many GPU tasks have finished since the most recent CPU task.
So here are the answers:
Before the recent change the 1070 + 1060 machine got DCF values as low as not much above 0.1!
I suspect DCF results from machines with a single GPU, not running CPU tasks, and not running work from other projects will stabilize to a tighter range, and may be more useful for the intended purpose. Still, I suspect the results may vary quite a lot with particular GPU model, and also enough to matter with host platform characteristics. More reports are needed.
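The "breathing" on the mixed CPU/GPU host is what you'd expect from BOINC's asymmetric DCF update, which (as I recall the client logic - this is an approximation, not the exact code) jumps straight up when a task runs much longer than estimated but only drifts about 10% of the way down after each fast task:

```python
# Toy simulation of why a shared per-project DCF "breathes" on a host that
# mixes task types. Approximation of the BOINC client rule (from memory,
# not the exact code): DCF corrects upward immediately, downward slowly.
def update_dcf(dcf, runtime_ratio):
    # runtime_ratio = actual runtime / uncorrected estimate
    if runtime_ratio > 1.1 * dcf:             # ran much longer than estimated
        return runtime_ratio                  # jump up immediately
    return dcf + 0.1 * (runtime_ratio - dcf)  # drift down 10% of the gap

dcf = 1.0
history = []
# One underestimated CPU task (ratio ~1.5), then six fast GPU tasks (~0.3):
for ratio in [1.5, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]:
    dcf = update_dcf(dcf, ratio)
    history.append(round(dcf, 3))
print(history)  # spikes to 1.5, then decays a little per GPU task
```

Each finished CPU task re-inflates the shared DCF, and the GPU tasks then slowly pull it back down - the saw-tooth archae86 describes.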
I'm not sure if plan_class changes are instantaneous or take some time. So I'll wait some more before I change anything. Archae's values look a lot like what I would have expected from doubling the speedup. Maybe Mad_Max's values will also stabilize around 0.4.
As I said, I've done some benchmarks in the meantime on a system with a GTX 750 Ti (and an idle CPU). The BRP4G workunits took almost exactly 1h to finish. Using the p_fpops value of this system and the formulas in the scheduler, I calculated a theoretical speedup of 20 for the BRP4G app, while the real speedup was set to 15. Since this ran for quite some time and worked well, I'm trying to emulate this ratio for FGRPB1 too, although the new search uses much more CPU than BRP4G, which is not considered in the formulas used to calculate estimated runtimes. The theoretical speedup of FGRPB1G is 29 (runtime on the reference system was 1h 18min), while it is currently set to 14. Keeping the same theoretical-to-real ratio as in BRP4G, my next step is to set the speedup to 20 and see what that does to your estimates and DCF calculation. But that will not happen today.
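In other words (my reading of the reasoning above, not the actual scheduler code), the BRP4G theoretical-to-real ratio is used as a calibration factor for the new app:

```python
# Sketch of the speedup calibration described above (not the scheduler code).
# "theoretical speedup" = how much faster than the host's CPU benchmark
# (p_fpops) the GPU app actually crunched the WU's estimated fpops.

# BRP4G on the GTX 750 Ti reference system: theoretical 20, real setting 15.
brp4g_theoretical = 20
brp4g_real = 15
calibration = brp4g_real / brp4g_theoretical  # 0.75

# FGRPB1G: theoretical speedup 29 (1h 18min on the reference system).
fgrpb1g_theoretical = 29
proposed = fgrpb1g_theoretical * calibration
print(f"proposed FGRPB1G speedup: {proposed:.2f}")  # 21.75, rounded down to 20
```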
To also clear up the credit issue: the values we used in the beginning (~700 per task) were just rough estimates and rather arbitrary. When we increased the science payload by 5x we also increased the credit by 5x (to ~3500) without checking runtimes. A check in the validator then prevented this new value from being used, and credit was clamped to 700 per task. Soon after I fixed that, I finished the benchmark and calculated that the new credit-per-task value, based on BRP4G (1000 credits for 1h of GPU time), should be much lower. Since the amount of credit is assigned when a workunit is created, there are now three different credit values in the system. I fixed that by clamping the value to 1365 in the validator just now. So in terms of credit we should be back at the same level we had with the BRP4G search. Of course you need to give the system some time until your RAC shows that and stabilizes again.
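For those following along, the new clamp can be sanity-checked against the BRP4G rate stated above (1000 credits per hour of GPU time on the reference system), using the 1h 18min FGRPB1G reference runtime; the exact 1365 figure presumably includes factors not covered by this simple sketch:

```python
# Sanity check of the new per-task credit against the BRP4G rate.
# Assumes a flat 1000 credits per reference-GPU-hour, as stated above.
credits_per_hour = 1000
fgrpb1g_runtime_h = 78 / 60  # 1h 18min on the reference system

rate_based = credits_per_hour * fgrpb1g_runtime_h
print(f"rate-based credit: {rate_based:.0f}")  # 1300, near the 1365 clamp
```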
The issue with wrong estimated time and DCF is still under investigation.
The granted credit of my last result is abnormal... any suggestion why?
All my GPUs are still waiting for the 32-bit version, since no other GPU work is available for them now.
Thanks Christian, will the FGRPSSE (CPU tasks) values remain the same?
We didn't touch the CPU search so there is no need to change anything there.
Just curious, what is the difference between the CPU and the GPU searches, scientifically? Thanks!
Stranger7777 wrote:
All my GPUs are still waiting for the 32-bit version, since no other GPU work is available for them now.

+1