CUDA application for the O3ASHF search

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4153

Credit: 50070054888

RAC: 42285667

13 Mar 2024 14:25:30 UTC

Message 223230

From what I can tell, there are almost no significant memory transfers happening during the recalc phase.

With 1.14, GPU memory bus load drops to 0-1% during recalc (core utilization stays around 50-75%), whereas it's ~80-100% during the main analysis sections.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253664394

RAC: 34929

13 Mar 2024 14:48:00 UTC

Message 223233

What makes this part slow is the rather random memory access, which means that running multiple instances in parallel that are in the same phase would indeed likely slow things down significantly. Actually, we only test and measure single-task runs.

Bernd Machenschalk

13 Mar 2024 14:56:00 UTC

Message 223234 in response to message 223230

Ian&Steve C. wrote:

From what I can tell, there are almost no significant memory transfers happening during the recalc phase.

With 1.14, GPU memory bus load drops to 0-1% during recalc (core utilization stays around 50-75%), whereas it's ~80-100% during the main analysis sections.

Yep, that's the problem. The data actually read from memory is pretty small, but it's distributed randomly and not at all consecutive, so the GPU spends a lot of time addressing it while almost no actual data transfer happens.
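Bernd's point about scattered reads can be illustrated with a toy model. The Python sketch below is not code from the app; it simply counts how many distinct 128-byte memory segments one 32-thread warp touches when every thread reads one 4-byte value, either from consecutive addresses or from random ones. The warp size, element size and segment size are simplified assumptions (real GPUs also have caches and 32-byte sectors), and all function names are hypothetical.

```python
import random

WARP_SIZE = 32       # threads per warp (assumed, typical for NVIDIA GPUs)
ELEM_BYTES = 4       # each thread reads one float32 (assumed)
SEGMENT_BYTES = 128  # simplified memory transaction size (assumed)


def segments_touched(indices):
    """Number of distinct SEGMENT_BYTES-sized segments hit by one warp's reads."""
    return len({(i * ELEM_BYTES) // SEGMENT_BYTES for i in indices})


def warp_indices(pattern, warp_id, n_elems, rng):
    """Element indices read by the threads of one warp (hypothetical access patterns)."""
    if pattern == "consecutive":
        base = warp_id * WARP_SIZE
        return [base + t for t in range(WARP_SIZE)]
    if pattern == "random":
        return [rng.randrange(n_elems) for _ in range(WARP_SIZE)]
    raise ValueError(f"unknown pattern: {pattern}")


def transactions_per_useful_byte(pattern, n_warps=1000, n_elems=1 << 24, seed=0):
    """Segment bytes fetched divided by bytes the threads actually asked for."""
    rng = random.Random(seed)
    total_segments = sum(
        segments_touched(warp_indices(pattern, w, n_elems, rng))
        for w in range(n_warps)
    )
    useful_bytes = n_warps * WARP_SIZE * ELEM_BYTES
    return total_segments * SEGMENT_BYTES / useful_bytes


for p in ("consecutive", "random"):
    print(p, round(transactions_per_useful_byte(p), 2))
```

With consecutive reads one segment serves the whole warp (ratio 1.0). With random reads over a large array each thread usually needs its own segment, so the warp waits on about 32 separate accesses for the same 128 useful bytes. If those accesses are latency-bound, the GPU can spend most of its time waiting on addressing while comparatively little data moves, which is consistent with the low bus load reported above.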

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4153

Credit: 50070054888

RAC: 42285667

Bernd Machenschalk

13 Mar 2024 15:25:30 UTC

Message 223236 in response to message 223223

Bernd Machenschalk wrote:

Could it be that 1.14 is slower than 1.08 (only) if you run multiple instances/tasks in parallel?

I usually run multiples, and that gives a better effective production rate. Multiples aren't necessarily the reason for worse production.

On my host with 4x Titan Vs:
1x task on the v1.08 app = not tested
1x task (no MPS) on the v1.14 app = 15-16 min runtime
3x tasks (MPS @ 70%) on the v1.08 app = about 31 min runtime, 10.3 min per task effective
3x tasks (MPS @ 70%) on the v1.14 app = about 24 min runtime, 8 min per task effective. That's a 2x increase in production vs. the 1x task config.
v1.14 is ~30% faster than v1.08 on this system.

On my host with 6x RTX 3080 Ti:
I didn't run 1x, since I know my setup is more productive with multiples.
4x tasks (MPS @ 40%) on the v1.08 app = about 24 min runtime, 6 min per task effective
4x tasks (MPS @ 40%) on the v1.14 app = about 30 min runtime, 7.5 min per task effective
3x tasks (MPS @ 70%) on the v1.08 app = about 22 min runtime, 7.3 min per task effective
v1.14 is ~20% slower than v1.08 on this system.
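The "effective" figures above are the runtime divided by the number of concurrent tasks, and the version comparisons follow from the ratio of those per-task times. A small Python sketch of that arithmetic, assuming each "Nx tasks" runtime was measured on one GPU running N tasks at once (the function names are just for illustration):

```python
def effective_minutes(runtime_min: float, concurrent_tasks: int) -> float:
    """Wall-clock minutes per finished task when N tasks share one GPU."""
    return runtime_min / concurrent_tasks


def throughput_ratio(old_eff_min: float, new_eff_min: float) -> float:
    """Tasks/hour of the new setup relative to the old one (>1 means better)."""
    return old_eff_min / new_eff_min


# Approximate runtimes quoted above, in minutes (assumed to be per GPU).
titan_108 = effective_minutes(31, 3)            # 3 tasks/GPU, MPS @ 70%
titan_114 = effective_minutes(24, 3)            # 3 tasks/GPU, MPS @ 70%
titan_114_single = effective_minutes(15.5, 1)   # midpoint of 15-16 min, no MPS
ti_108 = effective_minutes(24, 4)               # 4 tasks/GPU, MPS @ 40%
ti_114 = effective_minutes(30, 4)               # 4 tasks/GPU, MPS @ 40%

print(f"Titan V, 1.14 vs 1.08:     {throughput_ratio(titan_108, titan_114):.2f}x")
print(f"Titan V, 1.14 3x vs 1x:    {throughput_ratio(titan_114_single, titan_114):.2f}x")
print(f"RTX 3080 Ti, 1.14 vs 1.08: {throughput_ratio(ti_108, ti_114):.2f}x")
```

With these inputs it prints 1.29x (the ~30% gain on the Titan Vs), 1.94x (the ~2x from running three tasks instead of one) and 0.80x (the ~20% drop on the 3080 Tis).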

I saw a similar slowdown on my GTX 1060 6GB test bench system, with 1.08 still being faster.

This dichotomy in observed behavior between systems is what had me puzzled, and it was the basis for my question regarding the precision used, since the Titan V has strong FP64 performance and the 3080 Ti doesn't. Maybe it's the latency of the Titan V's HBM? Not sure.

Ben Scott

Joined: 30 Mar 20

Posts: 54

Credit: 1863596790

RAC: 2856482

17 Mar 2024 22:04:42 UTC

Message 223354

This version is just the pits. The original app had my RTX 3080 running nearly twice as fast as it does now. Weirdly, the RTX 3080 ran almost twice as fast as my RTX 3060 back then but is now only about 20% faster. In other words, the 3080 took a much bigger hit than the 3060 for some reason.

Ian&Steve C.

17 Mar 2024 22:38:12 UTC

Message 223356

Yeah, I'm not sure why, but 1.14 does better on my Titan Vs and 1.08 works better on my 3080 Tis, all with very similar CPUs, so I'm sticking with that config.

Ben Scott

17 Mar 2024 23:09:35 UTC

Message 223357

How do I run or even get one of the older versions? Is the original one that took more memory but ran faster still compatible?

Thank you.

Ian&Steve C.

18 Mar 2024 0:23:06 UTC

Message 223359 in response to message 223357

Ben Scott wrote:

How do I run or even get one of the older versions? Is the original one that took more memory but ran faster still compatible?

Thank you.

You can download the 1.08 app from the link in the first post of this thread.

The OLD old version of the app, which used more memory (and was OpenCL, not CUDA), is not compatible with recent work, which is designed to run in this two-stage method. You can't run the old app with the new tasks.

Bernd Machenschalk

28 Mar 2024 9:17:25 UTC

Message 223686

I promoted 1.14 out of Beta status and re-issued 1.08 as 1.15 (Beta). You can still decide which version you get (1.14 with RecalcGPU or 1.15 with RecalcCPU) by means of the "Beta work" switch, but now reversed compared to before. The logic behind this is that people who just run BOINC with more or less default settings (including no Beta work) and don't actively manage their configuration will only run one such task per GPU, and for them 1.14 should work better.

JohnDK

Joined: 25 Jun 10

Posts: 121

Credit: 2670947322

RAC: 1445535

28 Mar 2024 15:26:49 UTC

Message 223693

So there's no difference between the 1.08 & 1.15 apps?
