I discovered that with recent data files, running multiple WUs concurrently actually reduces performance.
On 2/1, with my GTX 1080 Ti running 3 WUs, the elapsed time was 1268s, i.e. 423s per WU. I don't have records for running 1 WU or 2 WUs, but those were definitely longer per WU.
With the recent data files the crunch time is significantly longer; however, when I switch to a single WU the crunch time becomes much better. Here's the data.
Concurrency | Elapsed time (s) | Time/WU (s) |
3 | 1668 | 556 |
2 | 1140 | 570 |
1 | 455 | 455 |
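In throughput terms (derived from the GTX 1080 Ti table above), 1x is clearly ahead:
1x: 3600 s / 455 s ≈ 7.9 WU per hour
3x: 3 × 3600 s / 1668 s ≈ 6.5 WU per hour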
On my host with GTX 1080 I observed the same behavior.
Concurrency | Elapsed time (s) | Time/WU (s) |
3 | 2381 | 794 |
2 | 1680 | 840 |
1 | 680 | 680 |
However on GTX 980 Ti and Radeon VII, running 3 WUs is still optimal. Here's the data for Radeon VII.
Concurrency | Elapsed time (s) | Time/WU (s) |
3 | 495 | 165 |
2 | 356 | 178 |
1 | 208 | 208 |
All of these hosts have enough free CPU cores to support the GPU tasks.
Does anyone observe the same phenomenon?
Updated on 2/28:
I finally found the cause of the problem. SLI is the culprit.
When SLI is enabled, on my host with 6700K and 2 GTX 1080 Ti running 2 CPU tasks and 6 GPU tasks, the CPU utilization of hsgamma_FGRPB1G is 4~5%. The CPU time is about one third of the run time.
When SLI is disabled, the CPU utilization of hsgamma_FGRPB1G is 11~12%. The CPU time is close to the run time. The GPUs don't have to wait for the CPU.
I don't know how SLI affects this.
Updated on 3/1:
After disabling SLI, my initial conclusion remains true. Running 3x with SLI off isn't as bad as running 3x with SLI on. But it's still worse than running 1x.
On my host with 6700K and GTX 1080 Ti, with SLI on:
Task | Concurrency | Run Time (s) | CPU Time (s) | Time/WU (s) |
LATeah1049M | 1 | 452 | 440 | 452 |
LATeah1049M | 3 | 1641 | 514 | 547 |
With SLI off:
Task | Concurrency | Run Time (s) | CPU Time (s) | Time/WU (s) |
LATeah1049M | 3 | 1494 | 1400 | 498 |
LATeah1049M | 2 | 1090 | 1019 | 545 |
LATeah1049M | 1 | 466 | 432 | 466 |
Others are welcome to post their findings.
When I used to run Windows I ran 3 per card, as the times were faster than single work units.
There are a lot of factors that influence this: which cards, which motherboard, which CPU, RAM speed, OS.
The last thing you need to make sure of is that all the work units come from the same source. Since we know that some work units are faster than others, it doesn't make sense to compare a fast work unit to 2 slow + 1 fast on a GPU.
Linux, on the other hand, is the fastest no matter what you are running.
The data were taken within ten days. The data files range from LATeah1043L to LATeah1049M; they don't have much variation in complexity or crunch time. I am interested in others' results.
What is your GPU utilization with 1x task?
By "All of these GPUs have enough CPU resources." do you mean there is an open core for the GPU tasks are did you actually permanently set the CPU affinity for CPU and GPU tasks? It makes a difference. If I let windows handle the affinity with like 75%/2 open CPU threads the GPU utilization is worse than if I set the affinity with Process Lasso and keep all other CPU processes away from the GPU exe processes.
I mention this because the tasks on host 12699683 show CPU run time variation whether you're running 1 task or multiple.
These tasks have nearly identical CPU times even though the 2nd task looks to have run at 3x.
https://einsteinathome.org/task/827048827
https://einsteinathome.org/task/830422253
When running 1x the GPU utilization is a constant 90%. When running 2x/3x the utilization is 95%+.
I leave empty threads for the GPU tasks. I don't think "CPU run time variation" is the problem; if the GPU doesn't get enough CPU support, the utilization drops.
https://einsteinathome.org/host/12699683/tasks/4/40?sort=desc&order=Run+time
When I ran the 3x tasks reported on Feb 27th on this page, I suspended all CPU tasks, but the CPU time was still much less than the run time, and the run time of 2220s is still longer than 3 × 660s when running 1x.
Or does the cpu_usage set in app_config not only affect how BOINC arranges tasks, but also how much CPU time the GPU tasks ask for? I thought they just grab what they need regardless of what's set in app_config.
shuhui1990 wrote: Does anyone observe the same phenomenon?
I do not. In commissioning my Nvidia RTX 2080 on a Windows 10 host in October, I observed productivity improvement for 2X over 1X. Far more recently (yesterday) in commissioning an AMD RX 570 on a different Windows 10 host, I saw a clear performance improvement of 2X over 1X (about 8% throughput improvement, at a system power cost of only 2%, so a clear modest win overall). All my comments regard Einstein GPU GRP work on Windows, running an application which has not changed in many months.
I think your adverse results are not a consequence of some change in the behavior of recent data files, but the symptom of some fixable configuration problem in your system.
shuhui1990 wrote: When running 1x the GPU utilization is a constant 90%.
At 90% I would expect some improvement with concurrent tasks. I run 2x, and some people have reported slight gains with 3x and 4x, but the returns really start to diminish past 2x.
The cpu_usage in an app_config only changes how many tasks BOINC will allow to run. If you have it set to 7 on an 8-thread machine, then to run 2 tasks you would need 14 threads. That's not possible, so even with gpu_usage at 0.5 you're still limited to 1 concurrent task, because BOINC would be requesting another 7 CPU threads for it. In no way does it change the CPU usage of a GPU exe.
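To illustrate the scheduling arithmetic, here is a minimal app_config.xml sketch with made-up values (not anyone's actual file); only the app name hsgamma_FGRPB1G is taken from this thread:

<app_config>
  <app>
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <!-- 0.5 GPUs per task: BOINC will run 2 concurrent tasks per GPU -->
      <gpu_usage>0.5</gpu_usage>
      <!-- CPU threads BOINC budgets per GPU task; affects scheduling only,
           not how much CPU the task actually consumes -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

With these values an 8-thread machine can schedule both GPU tasks (2 × 1.0 = 2 threads budgeted); with cpu_usage set to 7, the second task would be blocked exactly as described above.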
mmonnin wrote: What is your GPU utilization with 1x task?
I tried Process Lasso but it didn't make a difference. What made a difference is disabling SLI. When SLI is enabled, the CPU time is a fraction of the run time.
archae86 wrote: I think your adverse results are not a consequence of some change in the behavior of recent data files, but the symptom of some fixable configuration problem in your system.
You're right. It's not about the data files. It's about SLI. I don't know how exactly SLI comes into play though.
mmonnin wrote: The cpu_usage in an app_config only changes how many tasks BOINC will allow to run.
Initially I thought I messed up the app_config. I used app_config to stop BOINC from going into panic mode where the CPU tasks pile up and it only allows one GPU task. But it turned out what changes the CPU usage is SLI.
After disabling SLI, my initial conclusion remains true. Running 3x with SLI off isn't as bad as running 3x with SLI on. But it's still worse than running 1x. See the updated thread for data.