Odd behavior with dual graphics cards

Nick

Joined: 12 Oct 13

Posts: 27

Credit: 8949649

RAC: 0

14 May 2017 15:50:23 UTC

Topic 207635

I run two Nvidia cards: one GTX 1050 Ti and one GTX 650 Ti. Each is set to run 2 instances, for a total of 4 instances. Frequently only one card is running, and the status column under Tasks shows 2 tasks running on either device 0 or device 1. Other times 4 tasks are running, with Tasks showing 2 on (device 0) and 2 on (device 1). I have set

<use_all_gpus>1</use_all_gpus>

in the cc_config.xml file in the BOINC directory and

<gpu_usage>0.5</gpu_usage>

in the app_config.xml in the einstein.phys.uwm.edu directory. Clearly it uses both GPUs about half the time, but why doesn't it use both GPUs all the time?

archae86

Joined: 6 Dec 05

Posts: 3161

Credit: 7263895107

RAC: 1568680

14 May 2017 16:09:49 UTC

Message 157873

Your host is running both CPU jobs (recently of the Continuous Gravitational Wave search Galactic Center Tuning lowFreq v1.01 (AVX) windows_x86_64 flavor) and GPU jobs (currently of the Gamma-ray pulsar binary search #1 on GPUs v1.20 (FGRPopencl1K-nvidia) windows_x86_64 flavor).

The project distributes the current GPU work of your type with a scheduler setting that reserves a full CPU core to support each GPU task. The behavior you describe resembles what I would expect if you had a 4-core CPU: your scheduler would sometimes schedule four GPU tasks (leaving no room for any CPU tasks, which thus get older and staler and approach deadline trouble) and sometimes promote some CPU tasks to displace some GPU work in order to ward off CPU task deadline trouble.

However, your host reports a 16-core CPU. Perhaps you have used the preference mechanism to restrict scheduling to use only "25% of the processors"?

I have hosts with two GPUs, and I noticed when I allowed the recent Continuous gravity jobs to run as my first CPU tasks in a while, that I got into a state in which two tasks would run on the higher rated GPU, none on the slower one, plus two CPU tasks.

If you wish to keep both GPUs busy while allocating only four CPU cores to BOINC and running a mix of CPU and GPU work, you may wish to drop back to 1X instead of 2X on the GPUs. That will only modestly reduce GPU output. I did this on all three of my primary hosts recently in response to my similar experience when I started running the new CPU work.

Alternatively, if you have in fact restricted BOINC to use only four of your 16 cores, you may wish to raise that to six, which on my diagnosis would soon have you running 2 GPU tasks on each GPU plus two pure CPU tasks.
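The core budget described above can be sketched as a toy calculation. This is only an illustration of the arithmetic, not BOINC's actual scheduling logic; the function name `cpu_task_slots` and its parameters are hypothetical, and the default of one reserved core per GPU task reflects how the current GPU work is described in this post.

```python
def cpu_task_slots(usable_cores: int, gpu_tasks: int,
                   cores_per_gpu_task: float = 1.0) -> int:
    """Return how many pure CPU tasks fit once each GPU task has
    reserved its share of CPU support (simplified model)."""
    reserved = gpu_tasks * cores_per_gpu_task
    # Never report a negative number of free cores.
    return max(0, int(usable_cores - reserved))


# Two GPUs at 2X means four GPU tasks competing for the usable cores.
for cores in (4, 6, 16):
    slots = cpu_task_slots(cores, gpu_tasks=4)
    print(f"{cores} usable cores: 4 GPU tasks leave room for {slots} CPU tasks")
```

Under this model, four usable cores leave no room for CPU tasks while four GPU tasks run, and six usable cores leave room for two.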

The scheduler may still bang around somewhat, as the mix of CPU and GPU tasks causes it to "hunt" quite a lot. That is most easily managed by setting very low work queue request amounts. Try 0.1 day, for example.

Nick

Joined: 12 Oct 13

Posts: 27

Credit: 8949649

RAC: 0

14 May 2017 16:30:23 UTC

Message 157875

archae86,

I have my local configuration preferences set to use at most 100% of the CPUs and 100% of the CPU time. Is this what you mean?

floyd

Joined: 12 Sep 11

Posts: 133

Credit: 186610495

RAC: 0

14 May 2017 17:11:27 UTC

Message 157876

Nick,

you have set your work cache to 3 days, and while the GPU returns are in line with that, your CPU tasks take 10 days or so. That is close enough to the deadline for BOINC to regularly take measures to accelerate them, by withdrawing CPUs from GPU support.
If you haven't made changes lately that could cause this accumulation of tasks, this would mean that your CPU works much slower than expected. First question is, is it working slower than it should be or (IMO more likely) does BOINC expect too much? If it's the latter, that's probably because of a speed discrepancy between GPU and CPU. It would help to speed up the CPU (hardly possible) or slow down the GPU by running more tasks. If that's not enough, or not possible for some reason, I'd reduce the work cache to 1 day or 1.5 at most and wait until normal operation resumes. Maybe abort the oldest cached tasks to speed things up.

Nick

Joined: 12 Oct 13

Posts: 27

Credit: 8949649

RAC: 0

14 May 2017 17:36:08 UTC

Message 157877

Floyd

I've reset my cache to 1 day. I'll see if that helps.

Thanks.

archae86

Joined: 6 Dec 05

Posts: 3161

Credit: 7263895107

RAC: 1568680

14 May 2017 18:46:48 UTC

Message 157881 in response to message 157875

Nick,

How many BOINC tasks of which types does BOINCMgr show in running status? If you have adequate RAM capacity, I'd suppose you might have had 12 CPU tasks running when you had 4 GPU tasks, and 14 CPU tasks when you saw only 2 GPU tasks running. If so, then Floyd's diagnosis and suggestion may apply. Given the settings you described, it seems unlikely that my first suggestion is useful.

I think you'll find in general that 3 days queue request is enough to give you trouble when mixing Einstein CPU and GPU tasks. I have good hope you'll find 1 day to work better (after a little while).

On the other hand, the latest CPU work type here is pretty RAM-hungry. I don't know how the scheduler prioritizes task starting when it thinks you are running out of RAM. How much RAM does your system have? Various tools such as Process Explorer can tell you something about how much RAM the OS thinks is still available to it.

Nick

Joined: 12 Oct 13

Posts: 27

Credit: 8949649

RAC: 0

14 May 2017 19:02:17 UTC

Message 157882

Archae86

I am only using 12 of 24 GB, so memory is not an issue.

Currently, I am running 16 CPU tasks, 2 GPU tasks on (Device 0) and 0 on (Device 1).

The CPU tasks are Gamma-ray pulsar binary search #1 1.05 (FGRPSSE)

The GPU tasks are Gamma-ray pulsar binary search #1 on GPUs 1.20 (FGRPopencl1K-nvidia)

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0

14 May 2017 19:09:51 UTC

Message 157883

What's the deadline on the CPU tasks, and how many do you have in your cache?

My guess is he has a bunch of CPU tasks with deadlines relatively soon. BOINC Manager is probably thinking he will not finish all of the CPU tasks in the allotted time and is giving higher priority to the CPU tasks over his GPU ones.

Nick

Joined: 12 Oct 13

Posts: 27

Credit: 8949649

RAC: 0

14 May 2017 22:14:03 UTC

Message 157888

The CPU tasks are 5/18 earliest deadline, 5/28 latest. About 250 cached.

GPU tasks are 5/22 earliest, 5/28 latest. About the same number cached.

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0

14 May 2017 22:32:47 UTC

Message 157890

I think we have your answer to why it's doing that.

Your GPU work units take about 50 minutes each, and each CPU work unit takes about 11 hours 45 minutes (round that to 12 hours for easy computation).

With about 250 CPU work units cached and 16 running at a time, it would take almost 8 days running nonstop for your CPU to finish all of them.

I think BOINC believes you won't finish the amount of work in the time allotted, so it's shifting resources to try to get those work units done by their deadlines.

With a smaller cache, you probably won't see this as much, since there is a greater chance of finishing by the deadlines.
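The estimate can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes the figures reported earlier in the thread (about 250 cached CPU tasks and 16 running at once) and the rounded 12 hours per task; the function name `days_to_clear` is made up for illustration.

```python
def days_to_clear(cached_tasks: int, hours_per_task: float,
                  concurrent_tasks: int) -> float:
    """Estimate the days of nonstop crunching needed to empty the
    cache, assuming every slot stays busy the whole time."""
    total_task_hours = cached_tasks * hours_per_task
    wall_clock_hours = total_task_hours / concurrent_tasks
    return wall_clock_hours / 24.0


# ~250 cached CPU tasks, ~12 hours each, 16 running in parallel.
print(round(days_to_clear(250, 12.0, 16), 1))  # prints 7.8
```

A shorter cache lowers `cached_tasks`, which shrinks this figure relative to the time remaining before the deadlines.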

My 2 cents.

Nick

Joined: 12 Oct 13

Posts: 27

Credit: 8949649

RAC: 0

14 May 2017 23:01:15 UTC

Message 157891

Zalster

A plausible hypothesis. I've shut off getting new work units and I'll let them work down to a reasonable number, then set the cache to 1 day and see what happens.

Thanks to all for your help and ideas.

Nick
