Odd behavior with dual graphics cards

Nick
Joined: 12 Oct 13
Posts: 27
Credit: 8949649
RAC: 0
Topic 207635

I run two Nvidia cards: a GTX 1050 Ti and a GTX 650 Ti. Each is set to run 2 instances, for a total of 4 instances running. Frequently only one card is running, and the status column under Tasks shows 2 tasks running on either device 0 or device 1. Other times 4 tasks are running, with Tasks showing 2 tasks on (device 0) and 2 on (device 1). I have set

<use_all_gpus>1</use_all_gpus>

in the cc_config.xml file in the BOINC directory and

<gpu_usage>0.5</gpu_usage>

in the app_config.xml in the einstein.phys.uwm.edu directory. Clearly it uses both GPUs about half the time, but why doesn't it use both GPUs all the time?

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7052844931
RAC: 1625743

Your host is running both CPU jobs (recently of the Continuous Gravitational Wave search Galactic Center Tuning lowFreq v1.01 (AVX) windows_x86_64 flavor) and GPU jobs (currently of the Gamma-ray pulsar binary search #1 on GPUs v1.20 (FGRPopencl1K-nvidia) windows_x86_64 flavor).

The project distributes the current GPU work of your type with a scheduler setting that reserves a full CPU core to support each GPU task. The behavior you describe resembles what I would expect if you had a 4-core CPU and your scheduler was sometimes scheduling four GPU tasks (leaving no room for any CPU tasks, which thus get older and staler and approach deadline trouble) and sometimes promoting some CPU tasks to displace some GPU work in order to ward off CPU task deadline trouble.

However, your host reports a 16-core CPU. Perhaps you have used the preference mechanism to restrict scheduling to use only "25% of the processors"?

I have hosts with two GPUs, and I noticed when I allowed the recent Continuous gravity jobs to run as my first CPU tasks in a while, that I got into a state in which two tasks would run on the higher rated GPU, none on the slower one, plus two CPU tasks.

If you wish to keep both GPUs busy, to allocate only four CPU cores to BOINC, and to run a mix of CPU and GPU work, you may wish to drop back to 1X instead of 2X running on the GPUs. That will only modestly reduce GPU output. I did this on all three of my primary hosts recently in response to my similar experience when I started running the new CPU work.

Alternatively, if you have in fact restricted BOINC to use only four of your 16 cores, you may wish to raise that to six. If my diagnosis is right, that should soon have you running 2 GPU tasks on each GPU, plus two pure CPU tasks.
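The core-budget reasoning in this post can be sketched in a few lines of Python. This is a simplified model, not BOINC's actual scheduler: it assumes each GPU task reserves one full CPU core (as described above for the current GPU work) and that whatever remains of the allowed cores goes to CPU tasks. The function name `cpu_task_slots` and its parameters are made up for illustration.

```python
import math

def cpu_task_slots(allowed_cores: int, gpu_tasks: int,
                   cores_per_gpu_task: float = 1.0) -> int:
    """Hypothetical helper: CPU-only tasks that fit next to the GPU tasks.

    Assumes every GPU task reserves `cores_per_gpu_task` CPU cores and the
    remaining allowed cores each run one CPU task.
    """
    reserved = gpu_tasks * cores_per_gpu_task
    return max(0, math.floor(allowed_cores - reserved))

print(cpu_task_slots(4, 4))   # 0: four GPU tasks use up a 4-core budget
print(cpu_task_slots(6, 4))   # 2: six cores leave room for two CPU tasks
print(cpu_task_slots(16, 4))  # 12: all 16 cores allowed
```

Under this model a 4-core budget has no room for CPU work while four GPU tasks run, so something has to give whenever CPU tasks need to make progress.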

The scheduler may still bang around somewhat, as the mix of CPU and GPU tasks causes it to "hunt" quite a lot. That is most easily managed by setting very low work queue request amounts. Try 0.1 day, for example.

Nick
Joined: 12 Oct 13
Posts: 27
Credit: 8949649
RAC: 0

archae86,

I have my local configuration preferences set to use at most 100% of the CPUs and 100% of the CPU time. Is this what you mean?

floyd
Joined: 12 Sep 11
Posts: 133
Credit: 186610495
RAC: 0

Nick,

you have set your work cache to 3 days, and while the GPU returns are in line with that, your CPU tasks take 10 days or so. That is close enough to the deadline for BOINC to regularly take measures to accelerate them, by withdrawing CPUs from GPU support.
If you haven't made changes lately that could cause this accumulation of tasks, this would mean that your CPU works much slower than expected. The first question is: is it working slower than it should, or (IMO more likely) does BOINC expect too much? If it's the latter, that's probably because of a speed discrepancy between GPU and CPU. It would help to speed up the CPU (hardly possible) or to slow down the GPU by running more tasks on it. If that's not enough, or not possible for some reason, I'd reduce the work cache to 1 day, or 1.5 at most, and wait until normal operation resumes. Maybe abort the oldest cached tasks to speed things up.

Nick
Joined: 12 Oct 13
Posts: 27
Credit: 8949649
RAC: 0

Floyd

I've reset my cache to 1 day. I'll see if that helps.

Thanks.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7052844931
RAC: 1625743

Nick,

How many BOINC tasks of which types does BOINC Manager show in running status? If you have adequate RAM capacity, I'd suppose you might have had 12 CPU tasks running when you had 4 GPU tasks, and 14 CPU tasks when you saw only 2 GPU tasks running. If so, then floyd's diagnosis and suggestion may apply. Given the settings you described, it seems unlikely that my first suggestion is useful.

I think you'll find in general that a 3-day queue request is enough to give you trouble when mixing Einstein CPU and GPU tasks. I have good hope you'll find 1 day to work better (after a little while).

On the other hand, the latest CPU work type here is pretty RAM-hungry. I don't know how the scheduler prioritizes task starting when it thinks you are running out of RAM. How much RAM does your system have? Various tools such as Process Explorer can tell you something about how much RAM the OS thinks is still available to it.

Nick
Joined: 12 Oct 13
Posts: 27
Credit: 8949649
RAC: 0

archae86,

I am only using 12 of 24 GB, so memory is not an issue.

Currently, I am running 16 CPU tasks, 2 GPU tasks on (Device 0) and 0 on (Device 1).

The CPU tasks are Gamma-ray pulsar binary search #1 1.05 (FGRPSSE).

The GPU tasks are Gamma-ray pulsar binary search #1 on GPUs 1.20 (FGRPopencl1K-nvidia).

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

What's the deadline on the CPU tasks, and how many do you have in your cache?

My guess is that he has a bunch of CPU tasks with deadlines coming up relatively soon. BOINC probably thinks he will not finish all of the CPU tasks in the allotted time and is giving them higher priority than the GPU ones.

Nick
Joined: 12 Oct 13
Posts: 27
Credit: 8949649
RAC: 0

The CPU tasks have deadlines from 5/18 (earliest) to 5/28 (latest). About 250 are cached.

The GPU tasks have deadlines from 5/22 (earliest) to 5/28 (latest). About the same number are cached.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

I think we have your answer to why it's doing that.

GPU work units take about 50 minutes each, and CPU work units about 11 hours 45 minutes each (round that up to 12 hours for easy computation).

It would take almost 8 days of running nonstop for your CPU to finish all of those CPU work units.
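That estimate can be checked with a quick calculation. This is a minimal sketch, assuming the roughly 250 cached CPU tasks and the 16 cores reported earlier in the thread, 12 hours per task, and every core running CPU tasks around the clock. The function name is just for illustration.

```python
def days_to_clear_queue(tasks: int, hours_per_task: float, cores: int) -> float:
    """Wall-clock days to finish `tasks` when `cores` tasks run in parallel."""
    total_core_hours = tasks * hours_per_task
    return total_core_hours / cores / 24

days = days_to_clear_queue(tasks=250, hours_per_task=12, cores=16)
print(f"{days:.1f} days")  # 7.8 days
```

If 4 of the 16 cores are reserved for GPU support, the same call with `cores=12` gives about 10.4 days.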

I think it believes you won't finish that amount of work in the time allotted, so it's shifting resources to try to get those work units done by their deadlines.

With a smaller cache you probably won't see this as much, since there is a greater chance of finishing by the deadlines.

My 2 cents.  

Nick
Joined: 12 Oct 13
Posts: 27
Credit: 8949649
RAC: 0

Zalster

A plausible hypothesis. I've shut off getting new work units and I'll let them work down to a reasonable number, then set the cache to 1 day and see what happens. 

Thanks to all for your help and ideas.

Nick

 
