CUDA application for the O3ASHF search

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4081
Credit: 48691639571
RAC: 34778012

From what I can tell, there are almost no significant memory transfers happening during the recalc time.

With 1.14, GPU memory bus load drops to 0-1% during recalc (core utilization stays around 50-75%), whereas it's ~80-100% during the main analysis sections.

_________________________________________________________________________

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 252259945
RAC: 35005

What makes this part slow is the rather random memory access, which means that running multiple instances in parallel that are in the same phase would indeed likely slow things down significantly. Actually, we only test and measure on single-task runs.

BM

Bernd Machenschalk

Ian&Steve C. wrote:

From what I can tell, there are almost no significant memory transfers happening during the recalc time.

With 1.14, GPU memory bus load drops to 0-1% during recalc (core utilization stays around 50-75%), whereas it's ~80-100% during the main analysis sections.

Yep, that's the problem. The data actually read from memory is pretty small, but it's distributed randomly and not at all consecutive, so the GPU takes a lot of time to address it with no actual transfer happening.
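
One way to picture the difference between consecutive and scattered reads is to count how many memory sectors a single warp-wide load has to touch. The sketch below is a rough Python model, not the app's code: the warp size of 32 matches NVIDIA GPUs, while the 32-byte sector size, the 4-byte element size, the array length and the random pattern are assumptions chosen for illustration. It only counts sectors and does not model latency at all.

```python
import random

WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
SECTOR_BYTES = 32   # assumed granularity of a global-memory transaction
ELEM_BYTES = 4      # assumed: each thread reads one 4-byte value

def sectors_touched(indices):
    """Number of distinct memory sectors one warp-wide load has to fetch."""
    return len({(i * ELEM_BYTES) // SECTOR_BYTES for i in indices})

def efficiency(indices):
    """Useful bytes divided by bytes actually moved for this load."""
    useful = len(indices) * ELEM_BYTES
    moved = sectors_touched(indices) * SECTOR_BYTES
    return useful / moved

rng = random.Random(0)                              # fixed seed, purely illustrative
n_elems = 1 << 24                                   # hypothetical array of 16M values
consecutive = list(range(WARP_SIZE))                # thread t reads element t
scattered = rng.sample(range(n_elems), WARP_SIZE)   # thread t reads a random element

print("consecutive:", sectors_touched(consecutive), "sectors, efficiency", efficiency(consecutive))
print("scattered:  ", sectors_touched(scattered), "sectors, efficiency", efficiency(scattered))
```

In this model the consecutive load needs only a handful of sectors and uses every byte it fetches, while the scattered load touches close to one sector per thread and throws most of each sector away.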

BM

Ian&Steve C.

Bernd Machenschalk wrote:

Could it be that 1.14 is slower than 1.08 (only) if you run multiple instances/tasks in parallel?



I usually run multiples, and that gives a better effective production rate. Multiples aren't necessarily the reason for worse production.

On my host with 4x Titan Vs:
1x task with v1.08 = not tested
1x task (no MPS) on the v1.14 app = 15-16min runtime
3x task (MPS @ 70%) on the v1.08 app = about 31min runtime, 10.3min per task effective
3x task (MPS @ 70%) on the v1.14 app = about 24min runtime, 8min per task effective. That's a 2x increase in production vs. the 1x task config.
v1.14 is ~30% faster than v1.08 on this system


On my host with 6x RTX 3080Ti:
I didn't run 1x, since I know my setup is more productive with multiples.
4x task (MPS @ 40%) on the v1.08 app = about 24min runtime, 6min per task effective
4x task (MPS @ 40%) on the v1.14 app = about 30min runtime, 7.5min per task effective
3x task (MPS @ 70%) on the v1.08 app = about 22min runtime, 7.3min per task effective
v1.14 is ~20% slower than v1.08 on this system
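
The "effective" figures above are simply the runtime divided by the number of tasks sharing a GPU. A quick sketch of that arithmetic in Python, for anyone who wants to plug in their own numbers (the function names are made up for illustration, and 15.5 min is an assumed midpoint of the 15-16min range):

```python
def effective_minutes(runtime_min, concurrent_tasks):
    """Wall-clock minutes per finished task when several tasks share one GPU."""
    return runtime_min / concurrent_tasks

def throughput_ratio(new, old):
    """Tasks per hour of `new` relative to `old`; each is (runtime_min, concurrent_tasks)."""
    return effective_minutes(*old) / effective_minutes(*new)

# (runtime in minutes, tasks per GPU), taken from the figures above
titan_v = {
    "1x v1.14": (15.5, 1),   # assumed midpoint of 15-16min
    "3x v1.08": (31, 3),
    "3x v1.14": (24, 3),
}
rtx_3080ti = {
    "4x v1.08": (24, 4),
    "4x v1.14": (30, 4),
    "3x v1.08": (22, 3),
}

for host, configs in (("Titan V", titan_v), ("3080Ti", rtx_3080ti)):
    for name, (runtime, n) in configs.items():
        print(host, name, round(effective_minutes(runtime, n), 1), "min per task")

titan_ratio = throughput_ratio(titan_v["3x v1.14"], titan_v["3x v1.08"])
ampere_ratio = throughput_ratio(rtx_3080ti["4x v1.14"], rtx_3080ti["4x v1.08"])
print("Titan V, v1.14 vs v1.08 throughput:", round(titan_ratio, 2))
print("3080Ti, v1.14 vs v1.08 throughput:", round(ampere_ratio, 2))
```

A ratio above 1 means v1.14 finishes more tasks per hour than v1.08 in that configuration, and a ratio below 1 means fewer.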

I saw a similar slowdown on my GTX 1060 6GB test bench system, with 1.08 still being faster.


This dichotomy in observed behavior between systems is what had me puzzled, and it was the basis for my question regarding the precision used, since the Titan V has strong FP64 performance and the 3080Ti doesn't. Maybe it's the latency of the Titan V's HBM? Not sure.
 

_________________________________________________________________________

Ben Scott
Joined: 30 Mar 20
Posts: 54
Credit: 1762442979
RAC: 2930240

This version is just the pits. The original app had my RTX 3080 running nearly twice as fast as it does now. Weirdly, the RTX 3080 ran almost twice as fast as my RTX 3060 back then but is now only about 20% faster. In other words, the 3080 took a much bigger hit than the 3060 for some reason.

 

Ian&Steve C.

Yeah, I'm not sure why, but 1.14 does better on my Titan Vs and 1.08 works better on my 3080Tis, all with very similar CPUs, so I'm sticking with that config.

_________________________________________________________________________

Ben Scott

How do I run or even get one of the older versions? Is the original one that took more memory but ran faster still compatible?

 

Thank you.

Ian&Steve C.

Ben Scott wrote:

How do I run or even get one of the older versions? Is the original one that took more memory but ran faster still compatible?

 

Thank you.



You can download the 1.08 app from the link in the first post of this thread.

The OLD old version of the app, which used more memory (and was OpenCL, not CUDA), is not compatible with recent work, which is designed to run in this two-stage method. You can't run the old app with the new tasks.

_________________________________________________________________________

Bernd Machenschalk

I promoted 1.14 out of Beta status and re-issued 1.08 as 1.15 (Beta). You can still decide which version you get (1.14 with RecalcGPU or 1.15 with RecalcCPU) by means of the "Beta work" switch, but it is now reversed compared to before. The logic behind this is that people who just run BOINC with more or less default settings (including no Beta work) and don't actively manage their configuration will only run one such task per GPU, and for those 1.14 should work better.

BM

JohnDK
Joined: 25 Jun 10
Posts: 120
Credit: 2623174106
RAC: 562512

So there's no difference between the 1.08 & 1.15 apps?
