I hope this will be a good place for anyone to discuss the video cards we are using to process E@H data. My (new to me) current constraint is no more than 2 GPUs per system, so I thought I would start a forum thread that doesn't exclude anyone with fewer than 3 GPUs :) (as my other one did at first).
I now have a pair of RX 5700s under Windows 10 that seem to be peaking at just past 900,000 RAC per GPU. That is with two tasks per GPU. I have just bumped it to 3 to see if I can squeeze out a bit more RAC.
I believe I am getting this level of performance due to the version 1.28 Gamma Ray app that Petri/Ian&SteveC./Bernard have introduced for Windows.
Tom M
A Proud member of the O.F.A. (Old Farts Association).
I looked at your first 5 pages of valid data and sorted the estimated completion times in ascending order:
6 values, 334-350: I am guessing this represents a single work unit.
80 values, 551-585: Can I assume a pair of concurrent work units?
13 values, 814-830: This is probably the time to do 3 concurrent work units?
It looks to me that running 3 work units concurrently takes about 822/3 = 274 seconds per task.
If I did the math right and am correct about your concurrency, then running 3 concurrent tasks finishes each task about 70 seconds faster than running one at a time (roughly 342 vs. 274 seconds).
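For anyone who wants to repeat this arithmetic on their own results, here is a minimal sketch. It assumes the three ranges above really correspond to 1, 2 and 3 concurrent tasks, and it uses the midpoint of each range as the typical completion time. The function and variable names are made up for illustration; none of this comes from BOINC itself.

```python
def effective_seconds_per_task(batch_seconds, concurrency):
    """Wall-clock seconds 'charged' to each task when `concurrency`
    tasks share one GPU and each takes `batch_seconds` to finish."""
    return batch_seconds / concurrency


# Completion-time ranges quoted above, keyed by the ASSUMED concurrency.
observed_ranges = {1: (334, 350), 2: (551, 585), 3: (814, 830)}

per_task = {}
for concurrency, (low, high) in observed_ranges.items():
    midpoint = (low + high) / 2  # assumption: midpoint is representative
    per_task[concurrency] = effective_seconds_per_task(midpoint, concurrency)

for concurrency, seconds in per_task.items():
    saved = per_task[1] - seconds
    print(f"{concurrency}x: {seconds:.0f} s per task "
          f"({saved:.0f} s saved vs. running one at a time)")
```

With these midpoints the per-task figures come out at 342, 284 and 274 seconds, so most of the gain already shows up at 2x and the step from 2x to 3x is much smaller.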
I just tried two concurrent tasks on my NVIDIA P102-100 and my completion time more than doubled. So for me there is no benefit to running more than one task at a time on the equivalent of a GTX 1080 Ti in Ubuntu with driver 470.
elapsed time (cpu time)
00:10:35 (00:10:33) 1C + 0.5NV (d2) 99.69 Reported: OK
00:09:46 (00:09:44) 1C + 0.5NV (d0) 99.66 Reported: OK
00:10:41 (00:10:40) 1C + 0.5NV (d1) 99.84 Reported: OK
00:10:36 (00:10:33) 1C + 0.5NV (d2) 99.53 Reported: OK
00:10:41 (00:10:38) 1C + 0.5NV (d1) 99.53 Reported: OK
00:10:08 (00:10:07) 1C + 1NV (d0) 99.84 Reported: OK
00:10:19 (00:10:17) 1C + 0.5NV (d0) 99.68 Reported: OK
00:07:15 (00:07:13) 1C + 1NV (d2) 99.54 Reported: OK
00:04:38 (00:04:36) 1C + 1NV (d1) 99.28 Reported: OK
00:04:39 (00:04:37) 1C + 1NV (d1) 99.28 Reported: OK
00:04:41 (00:04:39) 1C + 1NV (d2) 99.29 Reported: OK
00:04:40 (00:04:38) 1C + 1NV (d0) 99.29 Reported: OK
00:04:39 (00:04:37) 1C + 1NV (d1) 99.28 Reported: OK
nvidia-smi showed twice as much video memory in use, but GPU utilization only went from 96% to 100%, so the card was obviously already maxed out.
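To put a number on "more than doubled", here is a rough sketch using the elapsed times from the list above. It assumes the 0.5NV rows are the two-at-a-time runs and that the last five 1NV rows are the steady one-at-a-time runs. The 00:10:08 and 00:07:15 rows are left out on the guess that they overlapped the switch between settings. The helper name is only for illustration.

```python
from statistics import mean


def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Elapsed times copied from the task list above.
two_at_once = ["00:10:35", "00:09:46", "00:10:41",
               "00:10:36", "00:10:41", "00:10:19"]   # rows marked 0.5NV
one_at_a_time = ["00:04:38", "00:04:39", "00:04:41",
                 "00:04:40", "00:04:39"]             # steady rows marked 1NV

mean_2x = mean(to_seconds(t) for t in two_at_once)
mean_1x = mean(to_seconds(t) for t in one_at_a_time)

# Two tasks finish per 2x run, so each one is charged half the elapsed time.
effective_2x = mean_2x / 2
slowdown = mean_2x / mean_1x

print(f"1x: {mean_1x:.1f} s per task")
print(f"2x: {mean_2x:.1f} s elapsed, {effective_2x:.1f} s effective per task")
print(f"Elapsed time grew by a factor of {slowdown:.2f}")
```

Under those assumptions the effective time per task at 2x is longer than the plain 1x time, which matches the conclusion that doubling up does not pay off on this card.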
AMD cards see a benefit from running more tasks at a time, as the CPU component is fairly low, around 20-30%. That lets you run many tasks at once on a single CPU thread.
Nvidia tasks use 100% of a CPU thread per task. Running multiples on Nvidia cards not only wastes CPU resources, but also results in lower overall GPU production.
1x is better for Nvidia with Gamma Ray tasks.
Gravitational Wave tasks can see a small benefit with 2x but only if you properly stagger them to cover the CPU-only portion of the computation.
_________________________________________________________________________
Ian&Steve C.
They do. But it doesn't really affect performance whether you run the task with a full core available or with the CPU sharing the core between SMT/Hyperthreading and other tasks. I tested this by resetting some nice values and putting my 4-core system with HT under a load of 24.00. No change in GPU task runtime. I only saw performance slow down when starting new CPU tasks (the "warmup phase" before they actually start getting work done), which seems more demanding and looks like a memory bandwidth issue.
I disagree. I've seen tasks slow down when the CPU is 100% occupied and additional GPU tasks are trying to run. Some projects are more affected than others.
I make it a point to always allocate a full CPU thread to every Nvidia GPU task. I even leave 1-2 threads doing nothing so that background tasks have some threads to use and don't impact BOINC computation.
_________________________________________________________________________
It really depends on your scheduler. I use the zen kernel, which (among other things, like the bfq IO scheduler) reduces latencies in the scheduler. It really keeps the important threads at full capacity during loads >= cpu_count. That might be the difference, considering I also have rather old cores going.
My new home server with a 5900X is coming this week, so I will also check this with 24 CPU threads + VMs + IO at the same time. Proxmox ships with a stable or LTS kernel, but as I'm test-running everything anyway, I may just grab different kernels too.