from what I can tell, there are almost no significant memory transfers happening during the recalc time.
with 1.14, GPU memory bus load drops to 0-1% during recalc (core utilization stays around 50-75%), whereas it's ~80-100% during the main analysis sections.
_________________________________________________________________________
What makes this part slow is the rather random memory access, which means that running multiple instances in parallel that are in the same phase would indeed likely slow things down significantly. Actually, we only test and measure on single-task runs.
BM
Ian&Steve C. wrote:from
)
Yep, that's the problem. The data actually read from memory is pretty small, but it's distributed randomly and not at all consecutive, so the GPU takes a lot of time to address it with no actual transfer happening.
BM
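To picture the access pattern Bernd describes, here is a minimal CUDA sketch (illustrative only, not code from the actual app): in the first kernel neighbouring threads read consecutive addresses, so a warp is served by a few wide memory transactions; in the second, each thread chases a scattered index, so transactions carry mostly unused bytes and the cores sit waiting on latency. That matches the reported low bus load alongside 50-75% core utilization.

#include <cuda_runtime.h>
#include <cstdio>

// Coalesced: thread i reads element i; a warp touches one contiguous segment.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Gather: thread i reads in[idx[i]] with scattered indices; a warp can touch
// up to 32 different cache lines, so little useful data moves per transaction.
__global__ void gather(const float *in, const int *idx, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

int main() {
    const int n = 1 << 24;
    float *in, *out; int *idx;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    for (int i = 0; i < n; ++i) {
        in[i]  = 1.0f;
        idx[i] = (int)((i * 2654435761u) % n);  // scatter reads pseudo-randomly
    }
    int threads = 256, blocks = (n + threads - 1) / threads;
    coalesced<<<blocks, threads>>>(in, out, n);   // fast: bandwidth-limited
    gather<<<blocks, threads>>>(in, idx, out, n); // slow: latency-limited
    cudaDeviceSynchronize();
    printf("done: out[0]=%f\n", out[0]);
    cudaFree(in); cudaFree(out); cudaFree(idx);
    return 0;
}

Profiling the two kernels should show the gather variant with near-zero useful bus throughput despite busy scheduling, much like the recalc phase described above.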
Bernd Machenschalk wrote: Could it be that 1.14 is slower than 1.08 (only) if you run multiple instances/tasks in parallel?
I usually run multiples, and that yields a better effective production rate. Multiples aren't necessarily the reason for worse production.
On my host with 4x Titan Vs:
1x task with v1.08 = not tested
1x task (no MPS) on the v1.14 app = 15-16min runtime
3x task (MPS @ 70%) on the v1.08 app = about 31min runtime, 10.3min effective
3x task (MPS @ 70%) on the v1.14 app = about 24min runtime, 8min per task effective. that's a 2x increase in production vs 1x task config.
v1.14 is ~30% faster than v1.08 on this system
On my host with 6x RTX 3080Ti:
I didn't run 1x, since I know my setup is more productive with multiples
4x task (MPS @ 40%) on the v1.08 app = about 24min runtime, 6min per task effective.
4x task (MPS @ 40%) on the v1.14 app = about 30min runtime, 7.5min per task effective.
3x task (MPS @ 70%) on the v1.08 app = about 22min runtime, 7.3min per task effective
v1.14 is ~20% slower than v1.08 on this system
I saw a similar slowdown on my GTX 1060 6GB test bench system, with 1.08 still being faster.
This dichotomy in observed behavior between systems is what had me puzzled, and it was the basis for my question regarding the precision used, since the Titan V has strong FP64 performance and the 3080Ti doesn't. Maybe it's the latency of the Titan V's HBM? Not sure.
_________________________________________________________________________
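For reference on how such multi-task setups are built (a sketch under assumptions, not taken from this thread): the "MPS @ 70%" figures refer to CUDA's Multi-Process Service with a per-client SM limit, and the task multiplicity comes from BOINC's app_config.xml.

# Start the CUDA MPS control daemon with clients limited to ~70% of the SMs.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=70
nvidia-cuda-mps-control -d
# ... run the tasks ...
# Shut the daemon down again when finished.
echo quit | nvidia-cuda-mps-control

And a hypothetical app_config.xml for 3 tasks per GPU (the app name "einstein_O3AS" is an assumption; check client_state.xml for the real one):

<app_config>
  <app>
    <name>einstein_O3AS</name>
    <gpu_versions>
      <gpu_usage>0.33</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>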
This version is just the pits. The original app had my RTX 3080 running nearly twice as fast as it does now. Weirdly, the RTX 3080 ran almost twice as fast as my RTX 3060 back then, but is now only about 20% faster. In other words, the 3080 took a much bigger hit than the 3060 for some reason.
yeah, I'm not sure why, but 1.14 does better on my Titan Vs and 1.08 works better on my 3080Tis, all with very similar CPUs. So I'm sticking with that config.
_________________________________________________________________________
How do I run or even get one of the older versions? Is the original one that took more memory but ran faster still compatible?
Thank you.
Ben Scott wrote: How do I run or even get one of the older versions?
You can download the 1.08 app from the link in the first post of this thread.
The OLD old version of the app, which used more memory (and was OpenCL, not CUDA), is not compatible with recent work, which is designed to run in this two-stage method. You can't run the old app with the new tasks.
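For completeness, the generic BOINC mechanism for running a manually downloaded app version is an anonymous-platform app_info.xml in the project directory. The sketch below is a rough, hypothetical example only (app name, executable name, and plan class are placeholders); follow the instructions that come with the download for the real file names, and note that anonymous platform replaces the stock apps for the applications it lists.

<app_info>
  <app>
    <name>einstein_O3AS</name>
  </app>
  <file_info>
    <name>einstein_O3AS_1.08_x86_64-pc-linux-gnu__GW-cuda</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>einstein_O3AS</app_name>
    <version_num>108</version_num>
    <plan_class>GW-cuda</plan_class>
    <coproc>
      <type>NVIDIA</type>
      <count>1</count>
    </coproc>
    <file_ref>
      <file_name>einstein_O3AS_1.08_x86_64-pc-linux-gnu__GW-cuda</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>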
_________________________________________________________________________
I promoted 1.14 out of Beta status and re-issued 1.08 as 1.15 (Beta). You can still choose which version you get (1.14 with RecalcGPU, or 1.15 with RecalcCPU) by means of the "Beta work" switch, but its meaning is now reversed from before. The logic behind this is that people who just run BOINC with more or less default settings (including no Beta work) and don't actively manage their configuration will only run one such task per GPU, and for those 1.14 should work better.
BM
So there's no difference between the 1.08 & 1.15 apps?