Running multiple WUs reduces performance

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0
Topic 218278

I discovered that with recent data files, running multiple WUs concurrently actually reduces performance.

On 2/1, with my GTX 1080 Ti running 3 WUs, the elapsed time was 1268 s, i.e. 423 s per WU. I don't have records for running 1 WU or 2 WUs, but their per-WU times were definitely longer.

With recent data files the crunch time is significantly longer. However, when I switch to a single WU, the crunch time becomes much better. Here's the data.

Concurrency   Elapsed time (s)   Time/WU (s)
3             1668               556
2             1140               570
1             455                455

On my host with a GTX 1080 I observed the same behavior.

Concurrency   Elapsed time (s)   Time/WU (s)
3             2381               794
2             1680               840
1             680                680

However, on the GTX 980 Ti and the Radeon VII, running 3 WUs is still optimal. Here's the data for the Radeon VII.

Concurrency   Elapsed time (s)   Time/WU (s)
3             495                165
2             356                178
1             208                208
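
In throughput terms: the Radeon VII at 3x completes a WU every 165 s versus 208 s at 1x, about a 26% gain (208/165 ≈ 1.26), while the GTX 1080 Ti at 3x takes 556 s per WU versus 455 s at 1x, roughly 18% less throughput (455/556 ≈ 0.82).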

All of these GPUs have enough CPU resources.

Does anyone observe the same phenomenon?

 

Updated on 2/28:

I finally found the cause of the problem. SLI is the culprit.

When SLI is enabled, on my host with a 6700K and two GTX 1080 Tis running 2 CPU tasks and 6 GPU tasks, the CPU utilization of hsgamma_FGRPB1G is 4~5%. The CPU time is about one third of the run time.

When SLI is disabled, the CPU utilization of hsgamma_FGRPB1G is 11~12%. The CPU time is close to the run time. The GPUs don't have to wait for the CPU.

I don't know how SLI affects this.
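
If anyone wants to read off the same numbers, here is a rough sketch of how I sample the utilization with Python's psutil (a sketch only; the 'hsgamma' name match and the 5 s sampling window are my assumptions, adjust as needed):

import time
import psutil

# find the Einstein GPU support processes by name (substring match is an assumption)
procs = [p for p in psutil.process_iter(['name'])
         if p.info['name'] and 'hsgamma' in p.info['name']]

for p in procs:
    p.cpu_percent(None)  # first call primes the counter and returns 0.0

time.sleep(5)  # sampling window; adjust to taste

for p in procs:
    # per-process percent is relative to one core; divide by the logical
    # core count to get a whole-machine figure like Task Manager shows
    print(p.pid, round(p.cpu_percent(None) / psutil.cpu_count(), 1), '%')

Run it while the GPU tasks are crunching; with SLI on I see the low figures above, with SLI off the higher ones.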

 

Updated on 3/1:

After disabling SLI, my initial conclusion remains true: running 3x with SLI off isn't as bad as running 3x with SLI on, but it's still worse than running 1x.

On my host with a 6700K and a GTX 1080 Ti, with SLI on:

Task          Concurrency   Run time (s)   CPU time (s)   Time/WU (s)
LATeah1049M   1             452            440            452
LATeah1049M   3             1641           514            547

With SLI off:

Task          Concurrency   Run time (s)   CPU time (s)   Time/WU (s)
LATeah1049M   3             1494           1400           498
LATeah1049M   2             1090           1019           545
LATeah1049M   1             466            432            466

Others are welcome to post their findings.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

When I used to run Windows, I ran 3 per card, as the times were faster than with single work units.

There are a lot of factors that influence this: which cards, which motherboard, which CPU, RAM speed, OS.

The last thing you need to make sure of is that all the work units come from the same source. Since we know that some work units are faster than others, it doesn't make sense to compare one fast work unit to 2 slow + 1 fast on a GPU.

Linux, on the other hand, is the fastest no matter what you are running.

 

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

The data were taken within ten days. The data files range from LATeah1043L to LATeah1049M; they don't have much variation in complexity or crunch time. I am interested in others' results.

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3372836540
RAC: 2831264

What is your GPU utilization with a 1x task?

By "All of these GPUs have enough CPU resources." do you mean there is an open core for the GPU tasks are did you actually permanently set the CPU affinity for CPU and GPU tasks? It makes a difference. If I let windows handle the affinity with like 75%/2 open CPU threads the GPU utilization is worse than if I set the affinity with Process Lasso and keep all other CPU processes away from the GPU exe processes.

 

I mention this because the tasks on host 12699683 show how run time varies relative to CPU time when you're running 1 task or multiple.

These tasks have nearly identical CPU times even though the second one looks to have run at 3x:

Run time (s)   CPU time (s)
762            594
2,346          597

https://einsteinathome.org/task/827048827

https://einsteinathome.org/task/830422253

 

 

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

When running 1x, the GPU utilization is a constant 90%. When running 2x/3x, the utilization is 95%+.

I leave empty threads for GPU tasks. I don't think "CPU run time variation" is the problem; if the GPU doesn't get enough CPU support, the utilization drops.

https://einsteinathome.org/host/12699683/tasks/4/40?sort=desc&order=Run+time

When I ran the 3x tasks reported on Feb 27th on this page, I suspended all CPU tasks, but the CPU time was still much less than the run time. And the run time of 2220 s is still longer than 3 × 660 s when running 1x.

Or does the cpu_usage set in app_config affect not only how BOINC arranges tasks, but also how much CPU time GPU tasks ask for? I thought they just grab what they need no matter what is set in app_config.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7213324931
RAC: 965210

shuhui1990 wrote:
Does anyone observe the same phenomenon?

I do not. In commissioning my Nvidia RTX 2080 on a Windows 10 host in October, I observed a productivity improvement for 2X over 1X. Far more recently (yesterday), in commissioning an AMD RX 570 on a different Windows 10 host, I saw a clear performance improvement of 2X over 1X (about 8% throughput improvement at a system power cost of only 2%, so a clear, modest win overall). All my comments regard Einstein GPU GRP work on Windows, running an application which has not changed in many months.

I think your adverse results are not a consequence of some change in the behavior of recent data files, but the symptom of some fixable configuration problem in your system.

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3372836540
RAC: 2831264

shuhui1990 wrote:

When running 1x, the GPU utilization is a constant 90%. When running 2x/3x, the utilization is 95%+.

I leave empty threads for GPU tasks. I don't think "CPU run time variation" is the problem; if the GPU doesn't get enough CPU support, the utilization drops.

https://einsteinathome.org/host/12699683/tasks/4/40?sort=desc&order=Run+time

When I ran the 3x tasks reported on Feb 27th on this page, I suspended all CPU tasks, but the CPU time was still much less than the run time. And the run time of 2220 s is still longer than 3 × 660 s when running 1x.

Or does the cpu_usage set in app_config affect not only how BOINC arranges tasks, but also how much CPU time GPU tasks ask for? I thought they just grab what they need no matter what is set in app_config.

At 90% I would expect some improvement with concurrent tasks. I run 2x, and some people have reported slight gains with 3x and 4x, but the returns really start to diminish past 2x.

The cpu_usage in an app_config only changes how many tasks BOINC will allow to run. If you have it set to 7 on an 8-thread machine, then to run 2 tasks you would need 14 threads. That's not possible, so even at 0.5 GPUs you're still limited to 1 concurrent task, as BOINC would be requesting another 7 CPU threads. In no way does it change the CPU usage of a GPU exe.
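
For reference, an app_config.xml shaped like this is what people usually mean (the values are illustrative, not a recommendation; as said above, cpu_usage is just a scheduling budget and doesn't throttle the exe):

<app_config>
  <app>
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <!-- 0.33 of a GPU per task => BOINC runs 3 tasks per card -->
      <gpu_usage>0.33</gpu_usage>
      <!-- scheduling budget only; does not cap the exe's real CPU use -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>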

 

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

mmonnin wrote:

What is your GPU utilization with a 1x task?

By "All of these GPUs have enough CPU resources." do you mean there is an open core for the GPU tasks are did you actually permanently set the CPU affinity for CPU and GPU tasks? It makes a difference. If I let windows handle the affinity with like 75%/2 open CPU threads the GPU utilization is worse than if I set the affinity with Process Lasso and keep all other CPU processes away from the GPU exe processes.

 

I mention this because the tasks on host 12699683 show how run time varies relative to CPU time when you're running 1 task or multiple.

These tasks have nearly identical CPU times even though the second one looks to have run at 3x:

Run time (s)   CPU time (s)
762            594
2,346          597

https://einsteinathome.org/task/827048827

https://einsteinathome.org/task/830422253

 

 

I tried Process Lasso, but it didn't make a difference. What did make a difference was disabling SLI. When SLI is enabled, the CPU time is only a fraction of the run time.

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

archae86 wrote:
shuhui1990 wrote:
Does anyone observe the same phenomenon?

I do not. In commissioning my Nvidia RTX 2080 on a Windows 10 host in October, I observed a productivity improvement for 2X over 1X. Far more recently (yesterday), in commissioning an AMD RX 570 on a different Windows 10 host, I saw a clear performance improvement of 2X over 1X (about 8% throughput improvement at a system power cost of only 2%, so a clear, modest win overall). All my comments regard Einstein GPU GRP work on Windows, running an application which has not changed in many months.

I think your adverse results are not a consequence of some change in the behavior of recent data files, but the symptom of some fixable configuration problem in your system.

You're right. It's not about the data files; it's about SLI. I don't know exactly how SLI comes into play, though.

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

mmonnin wrote:
shuhui1990 wrote:

When running 1x, the GPU utilization is a constant 90%. When running 2x/3x, the utilization is 95%+.

I leave empty threads for GPU tasks. I don't think "CPU run time variation" is the problem; if the GPU doesn't get enough CPU support, the utilization drops.

https://einsteinathome.org/host/12699683/tasks/4/40?sort=desc&order=Run+time

When I ran the 3x tasks reported on Feb 27th on this page, I suspended all CPU tasks, but the CPU time was still much less than the run time. And the run time of 2220 s is still longer than 3 × 660 s when running 1x.

Or does the cpu_usage set in app_config affect not only how BOINC arranges tasks, but also how much CPU time GPU tasks ask for? I thought they just grab what they need no matter what is set in app_config.

At 90% I would expect some improvement with concurrent tasks. I run 2x, and some people have reported slight gains with 3x and 4x, but the returns really start to diminish past 2x.

The cpu_usage in an app_config only changes how many tasks BOINC will allow to run. If you have it set to 7 on an 8-thread machine, then to run 2 tasks you would need 14 threads. That's not possible, so even at 0.5 GPUs you're still limited to 1 concurrent task, as BOINC would be requesting another 7 CPU threads. In no way does it change the CPU usage of a GPU exe.

 

Initially I thought I had messed up the app_config. I use app_config to stop BOINC from going into panic mode, where the CPU tasks pile up and it allows only one GPU task. But it turned out that what changes the CPU usage is SLI.

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

After disabling SLI, my initial conclusion remains true: running 3x with SLI off isn't as bad as running 3x with SLI on, but it's still worse than running 1x. See the updates in the opening post for data.
