We just switched over to a new dataset with an increased "payload" of science. These tasks are designed to run 5 times longer than the previous tasks. I'm going to monitor runtimes over the next few days and plan to refine the settings for flops_estimation and credit calculation.
Can you also change the app speed? I mean not the real running speed, but the internal BOINC value that is used to derive the runtime estimate from flops_estimation for a particular app?
I see a huge mismatch between the Gamma-ray pulsar search on GPU and other Einstein@Home subprojects (especially the Multi-Directed Gravitational Wave Search), and wild DCF swings.
For example, with a short batch of FGRPB1G WUs with flops_estimation = 105,000 GFLOPs, I got real run-times in the 15-20 minute range, while the initial run-time estimate was > 1.5 hours. After a lot of FGRPB1G WUs were done, BOINC lowered the DCF to the ~0.2 range, so the run-time estimates and the real run-times were brought to the same level.
But with such a low DCF of 0.2, other tasks, for example O1MD1G (flops_estimation = 144,000), get unrealistically low runtime estimates in the ~1.5 hour range, while the real run-times are in the 7-9 hour range. This leads to a situation where BOINC downloads too many O1MD1G tasks (~5 times more than is actually required to fill the CPU queue).
And after completing a few O1MD1G WUs with actual runtimes at least 5 times longer than the estimated values, BOINC resets the DCF to >1 and can fall into "panic mode": once all the excess jobs downloaded in the previous step get realistic runtime estimates, it realizes it probably cannot finish them all in time (before the deadline).
But with each FGRPB1G task finished, the DCF (and so the estimated run-times) goes down again; after some time the DCF is back in the 0.2-0.25 range and the whole cycle starts over.
I think this is because the FGRPB1G app is really fast, so flops_estimation and credit should stay the same. We have a good reference: the same WUs are also done on CPU, with the same flops_estimation and the same credit rating, but MUCH longer run-times because CPU computation is much slower.
I do not see any reason why one and the same piece of work should be valued differently based only on which device (CPU or GPU) was used to complete it.
I do not know how to set it, but it can be seen in the BOINC client_state.xml file:
<app_version>
    <app_name>einstein_O1MD1G</app_name>
    <version_num>100</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>5739338488.370526</flops>
    <plan_class>AVX</plan_class>
    <api_version>7.1.0</api_version>
</app_version>
<app_version>
    <app_name>hsgamma_FGRPB1G</app_name>
    <version_num>105</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>3188521382.428070</flops>
    <plan_class>FGRPSSE</plan_class>
    <api_version>7.3.0</api_version>
</app_version>
<app_version>
    <app_name>hsgamma_FGRPB1G</app_name>
    <version_num>117</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>0.500000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>22319649676.996490</flops>
    <plan_class>FGRPopencl-ati</plan_class>
    <api_version>7.3.0</api_version>
    <file_ref>
        <file_name>hsgamma_FGRPB1G_1.17_windows_x86_64__FGRPopencl-ati.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>
The GPU app got only a 22 GFLOPS rating, so it is rated only 4-7 times faster than the CPU versions (3.1-5.7 GFLOPS), while the real difference is more like ~20-25 times (roughly the same tasks take 6-8 hours on CPU and only 15-20 minutes on GPU).
And this is the real reason for the incorrect runtime estimates and the DCF swings. Instead of lowering the DCF for the entire project, or lowering flops_estimation for the WUs, we need a higher <flops></flops> value for this specific app (hsgamma_FGRPB1G).
P.S.
AFAIK BOINC uses this formula:
Estimated run-time = flops_estimation / flops * DCF
flops_estimation - can be set at the individual WU level on the server side
flops - taken from the application settings (not sure how this is set)
DCF - calculated by the local BOINC client for each connected project (one value for the entire project)
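Plugging in the <flops> values above (rough arithmetic, assuming I have the formula right):

FGRPB1G on GPU: 105,000 GFLOPs / 22.3 GFLOPS ≈ 4,700 s ≈ 1.3 h estimated with DCF = 1, vs 15-20 min real
FGRPB1G on GPU with DCF = 0.2: ≈ 940 s ≈ 16 min, which matches the real run-time
O1MD1G on CPU (AVX): 144,000 GFLOPs / 5.74 GFLOPS ≈ 25,000 s ≈ 7 h, but with DCF = 0.2 the estimate drops to ≈ 1.4 h, far below the real 7-9 h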
I averaged the run times of 10 samples each of the smaller tasks and of the new larger tasks. This includes samples processed on my AMD 7970 and R9-290x cards in Linux.
What I am seeing is that the run time for the larger tasks is approximately 3.4 times longer on the 7970 and 3.43 times longer on the R9-290x compared to the smaller tasks from before.
I forgot to mention last time that the progress counter is working well with the latest Linux and Windows applications. The counter runs up to 89.997% and then halts for a short while before completion. The CPU core load generally is at 25-35% per task and goes up to 50-60% during the final processing at the 89.997% mark.
I have also found that GPU load is more sporadic when running 1 task per GPU and bounces around in the range of 0-100% according to aticonfig. When running 2 tasks per GPU, the GPU load generally is in the range of 92-100%.
On my system with 3 tasks running simultaneously on a GTX 1070, the run time increased from about 730 s to 2500 s, i.e. roughly 3.4 times longer. I would recommend multiplying the granted credit by 3-4.
Also, even with 3 tasks running, the GPU utilization still doesn't max out; sometimes it drops to 50%.
I fixed the Credit issue. You should now also get 5 times the Credit compared to before, sorry for that. I'm looking into the DCF/speedup issue now. I'm also running some benchmark tasks on a GTX 750 Ti and comparing them to the BRP4G search. I'm trying to adjust credit so it can be compared to BRP4G. I already got stable runtimes of 1 h for BRP4G (1x) on this system.
The overall problem is that the FGRPB1G search behaves a little bit differently than BRP4G, so we have to turn the knobs a little from time to time and see what the result is. If some of you could monitor your DCF and report any changes, that would be great.
Other changes I just did while I wait for the benchmarks to finish: I doubled the speedup of all FGRPB1G apps, so you should see an effect on DCF within the next few days.
I'm looking into deploying Beta apps that are the same as the current 1.17 versions but have a lower minimum VRAM requirement so GPUs with more than 766 MB VRAM can be tested. Currently the limit is 1GB VRAM because a task typically uses 750 MB and we want some buffer for normal operations.
Why did my Nvidia card (GTX 760) run so slowly?
Thank you CB and BM for all your hard work and keeping us well informed.
I hope you have a chance over the holidays to rest and enjoy yourselves. :-)
Happy Holidays everyone!!
Thanks Christian, I'll keep an eye out for these. There are a number of different values for VRAM being reported, so I'm not sure which one is the reference value.
https://einsteinathome.org/host/4918234/log
boinc.log reports
and clinfo reports
So it looks like it's the maximum (of the two cards) clinfo value.
Weird. I haven't noticed a significant increase in invalid work units on either my Hackintosh or my genuine MacBookPro NVIDIA cards running CUDA 8 on macOS Sierra.
But I am seeing that the CPU units seem to run 12-20% slower on the Macs than on Linux. In the case of the Hackintosh, this was observed on the exact same hardware configuration.
Jonathan,
Running CUDA, you won't notice the problem. CUDA is a different process than OpenCL; the bug is strictly with OpenCL units. If you are using TBar's CUDA75 app, or the Special CUDA75 app by Petri33, at SETI, then there will be few to no invalids at SETI. Also, there was talk about having SETI "block" OpenCL units from going to Macs with Darwin 15.4.0 or newer. I don't know if that got implemented...
If, however, you are crunching OpenCL units, you WILL eventually see more and more invalids in your Results List.
TL
TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join SETI Refugees