Second GPU suddenly going idle?

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0
Topic 219179

Hi all,

I got back into the Einstein game a few weeks ago and everything has been going fine: two GPUs in each of two computers and one GPU in a third, all plugging away just fine. As of yesterday, the second GPU in the first two rigs has gone idle and remains that way today. I have never seen this happen, and I did not change any settings, whether on my local computers, in BOINC, or at Einstein.

Anyone have any thoughts?

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

You need to provide more info to give anyone a chance to offer useful advice!
What GPUs are you running?
What settings are you using?
What searches are you allowing in your preferences?
Are you running more than one task per GPU?
Are you running other projects?
Does the event log tell you anything?

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

Seems to have sorted itself out overnight. Still strange; I haven't seen that before.

-Event logs didn't show anything abnormal other than one instance of not being able to contact the server.

-I'm running all searches except CPU-specific ones.

-3 tasks per GPU

-2080 Ti, 2070, 2060, and 1660 Ti are the GPUs.

-Running Rosetta@home as well, though Einstein seems to be clogging the CPU so Rosetta doesn't run much.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110029906088
RAC: 22437579

th3tricky wrote:
Seems to have sorted itself out overnight ...

I wouldn't count on that :-).  As Holmis says, unless you give a lot more detail about exactly what you see, people who don't have your vantage point are just guessing - including me.

You did give two further snippets of information that point in a certain direction so I'll hazard a guess.  I could be completely on the wrong track but only you can tell whether I am or not.  You mentioned:

th3tricky wrote:

... -3 tasks per GPU

... - Einstein seems to be clogging the CPU so Rosetta doesn't run much.

With 2 nVidia GPUs per host and 3 concurrent tasks per GPU, BOINC is forced to 'reserve' 6 CPU cores to 'support' the GPU tasks.  This is down to nVidia's implementation of OpenCL.  Each core is essentially doing very little other than 'spin waiting' so that there is instant response whenever a GPU task requires CPU support.  You shouldn't try to reduce that allocation.  If you did, your GPU crunch times would be heavily slowed.  The result of this is that you don't have very many CPU cores available to process Rosetta tasks.

Over time (the couple of weeks you initially mentioned) the lack of ability to process Rosetta tasks will probably cause BOINC to think that those CPU tasks are getting into deadline trouble.  If that happens, BOINC will go into 'panic mode' (less colourfully known as high priority mode) and guess what - BOINC decides to pause one of your GPUs to free up some cores to clear the Rosetta backlog.  With 3 extra cores to crunch, the backlog is reduced and panic mode is no longer needed so things seemingly return to normal - until it is needed again.

As I said, this is all just guesswork - it needs more information from you.  For example, when you noticed that a GPU was idle, did you also see 3 extra CPU tasks crunching?  If so, that would be a pointer to the above guess being on the right track.  Also, did that finish off and return a number of CPU tasks so that there are fewer now remaining?  If so, that would explain why panic mode (for the moment) is no longer needed.  In that case it's likely to return in the future unless you do something to stop it happening.

Don't think Einstein is to blame for "clogging the CPU" as you put it.  Einstein just supplies whatever tasks are requested by BOINC.  You need to change BOINC settings so BOINC does a better job of requesting what it can manage.  The first thing to do is make sure you don't have too large a work cache size.  How many days of work do you allow BOINC to fetch?  Until you get things running without ever going into panic mode, you should start with no more than say 1 day.

I took a look at your Windows 7 machine that lists RTX 2060 GPUs.  There are currently 750 GPU tasks in progress.  I looked at a page of tasks returned on July 7 and saw 2 distinct sets of run times, some around 1850s and some around 2150s; I'm guessing the slower times come from the 2nd GPU.  A back-of-the-envelope calculation, using a rough average of 2000s with 3 tasks on each of 2 GPUs (6 in total), says that you return a task about every 5.6 minutes, or about 260 tasks per day.  So your 750 in progress represents nearly 3 days of work (irrespective of what your settings may say), and I think it would be wise to reduce a little until you are sure BOINC doesn't go into panic mode.
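
If you want to redo that estimate with your own numbers, the arithmetic is simple enough to sketch in a few lines of Python (the run time, concurrency and in-progress count below are just the rough figures from above):

    avg_run_time_s = 2000    # rough average run time per task at 3X
    concurrent_tasks = 6     # 3 tasks on each of 2 GPUs

    # On average, one task finishes every (run time / concurrency) seconds.
    seconds_per_task = avg_run_time_s / concurrent_tasks  # ~333 s, about 5.6 minutes
    tasks_per_day = 86400 / seconds_per_task              # ~260 tasks per day
    days_in_progress = 750 / tasks_per_day                # ~2.9 days of work on hand

    print(f"{seconds_per_task / 60:.1f} min per task, "
          f"{tasks_per_day:.0f} tasks/day, "
          f"{days_in_progress:.1f} days in progress")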

A more important thing to do (in some ways) is to check if you really get much benefit from running 3X (3 concurrent tasks per GPU).  Have you tried running at 2X to see if you really do benefit from 3X?  I suspect you may not gain very much at all.  By choosing 2X, you would immediately free up an extra two threads so that it would be far less likely that Rosetta would fall behind and cause BOINC to panic.  If you do want to keep 3X, reducing your cache size should be effective in helping to prevent panic mode.

I hasten to emphasise that the above is speculation on the underlying causes of what you have seen.  I may be off the mark and only you can determine that.  Have a good think about what I've written and please ask questions if anything is not clear.  I'll be quite interested to hear if I'm on the right track at all, and what you decide to do if any of this helps you work out what the problem really was :-).

Cheers,
Gary.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

It could also be a matter of resource share and debt between projects.
If BOINC allocates more resources to Einstein, then Rosetta's debt will build up over time; when that debt is large enough, BOINC will switch resources over to Rosetta until the debt is lowered.
Going from 3X to 2X on the GPUs and freeing more CPU resources for Rosetta will affect this.
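
As a toy sketch only (not BOINC's actual algorithm - recent clients schedule on recent estimated credit rather than the old debt numbers, and the shares and usage fractions below are made up), the idea looks like this:

    # Toy model of per-project debt, NOT BOINC's real implementation.
    shares = {"Einstein": 0.5, "Rosetta": 0.5}  # configured resource shares
    usage  = {"Einstein": 0.9, "Rosetta": 0.1}  # fraction of CPU time actually delivered

    debt = {p: 0.0 for p in shares}
    for hour in range(48):                      # simulate two days, hour by hour
        for p in shares:
            debt[p] += shares[p] - usage[p]     # owed share minus delivered share

    # Rosetta's debt keeps growing; once it is large enough, the scheduler
    # shifts resources toward Rosetta until the debt comes back down.
    print(debt)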

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110029906088
RAC: 22437579

Holmis wrote:
It could also be a matter of resource share and debt between projects.

Yes, I guess so. I've only run a single project for quite a while so I don't get to experience the effects of inter-project resource shares and debts.

Since it's only Einstein using the GPUs, my gut feeling was that BOINC wouldn't disturb the full use of those GPUs unless there was a deadline risk to trigger it.  This is where extra information from the OP would be useful.  Were there some Rosetta tasks that were getting close to deadline when a GPU was seen to be idle?

Cheers,
Gary.

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

That is a good point. Normally, I'm running 6 Einstein tasks on CPU/GPU and then 6 Rosetta. I will have to watch what happens to tasks next time this happens, as I didn't look at the Rosetta deadlines.  The only other thing I can add is that a few days before this happened it looked like Einstein wanted to clear out Pulsar search #5 tasks and switched over to running just those, plus one other GPU task. If this was the case, I would think all GPUs would be called up to do the work! Looks like it was running about 15 of those on the first GPU, then the second GPU went idle for two days. May just be a coincidence. Strange thing is that I've run this configuration of tasks/projects for a few years and something like this hasn't happened.

I will try 2X tasks per GPU and see if that makes a difference. I use 3X per GPU because 1) the GPUs have the memory for it, and 2) it seems to be a good trade-off between the number of tasks completed at once and the time taken to complete them. Going to 4 per GPU seemed excessively long, even with something like a 2080 Ti.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110029906088
RAC: 22437579

th3tricky wrote:
... a few days before this happened it looked like Einstein wanted to clear out Pulsar search #5 tasks and switched over to running just those, plus one other GPU task.

That's more evidence that BOINC went into high priority mode.  Pulsar search #5 is CPU only.  It's looking for different things compared to the GPU search.  The tasks would take quite a while to run compared to GPU tasks.  What you describe is what BOINC would do if FGRP#5 tasks were in deadline trouble.  You should perhaps re-think the number of searches you support if you want to avoid this sort of unstable behaviour from time to time.

In your 2nd message you said, "I'm running all searches except CPU-specific ones", so where did those FGRP#5 tasks come from?

th3tricky wrote:
If this was the case, I would think all GPUs would be called up to do the work!

If you're suggesting that the GPUs should have been used to do CPU tasks, that's impossible.  They are different searches with different apps.  You would be much better served by disabling the Pulsar search #5 and just selecting the GPU search here.  Let your CPUs handle the work from Rosetta.

th3tricky wrote:
Looks like it was running about 15 of those on the first GPU ...

If by "15 of those" you are referring to "Pulsar search #5" tasks, then they were running on CPU cores, not the first GPU.

You have two different machines that show as having dual GPUs.  One has 4 cores (8 threads) and the other 6 cores (12 threads).  You mention 4 different models of GPU but don't state how they are paired off.  It's pretty much impossible to guess what might be going on if you don't give proper information.

th3tricky wrote:
Strange thing is that I've run this configuration of tasks/projects for a few years and something like this hasn't happened.

In your first message you said, "I got back into the Einstein game a few weeks ago ...", without saying how long the break was.  In any case, welcome back!  But please realise that things may be quite different now from how they were when you were last active here.  For starters, you have quite different GPUs now (that weren't around back then) and the GPU app is quite different from what was on offer a few years ago.  In previous GPU searches, there was much more incentive to run multiple concurrent tasks.  Now there is little gain, if any, in running more than X2.  You really need to do the experiments to find out.  In fact, the gain in running X2 compared to one task per GPU is itself nothing like it was in former times.

Having lots of GPU memory is not going to make running X3 or X4 give you better results.  The 2nd factor you mentioned is the important one.  You need to properly test X2, X3, X4 with a significant number of tasks at each setting so you get a decent average for the task crunch times.  If you don't do that carefully under current conditions, you could be just fooling yourself into thinking that what applied several years ago will work now.

One other point: you haven't indicated in previous messages the mechanism you use to control the number of concurrent tasks per GPU.  It could be the project preference setting referred to as the GPU utilization factor, or it could be the BOINC configuration file app_config.xml.  Which mechanism do you use to control the number of concurrent GPU tasks?
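
For reference, the app_config.xml route typically looks something like the sketch below for 2 concurrent tasks per GPU.  The app name shown is the one commonly used for the gamma-ray pulsar binary GPU search, but treat it as an assumption and verify the exact name in your own client_state.xml:

    <app_config>
      <app>
        <name>hsgamma_FGRPB1G</name>  <!-- assumed app name; verify in client_state.xml -->
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>  <!-- 0.5 GPUs per task = 2 concurrent tasks per GPU -->
          <cpu_usage>1.0</cpu_usage>  <!-- reserve one full CPU core per GPU task -->
        </gpu_versions>
      </app>
    </app_config>

The file lives in the project's data directory and is re-read after a client restart or via Options -> Read config files in BOINC Manager.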

Cheers,
Gary.

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

I assumed the #5 searches were on GPU because under project preferences I have selected "No" under the "accept CPU tasks" option. I have since unchecked the #5 search on the preferences page as well. 

I went over to Folding@home at the end of last year because of Einstein's issues with Nvidia's Turing GPUs, which were dumping every task given to them. Now it seems to be much improved and I'm excited to be back!

My computer layout goes as such: i7-8700K with a 2080 Ti and 1660 Ti. Second computer: i7-4790 with a 2070 and 2060. Third computer: Ryzen 1700X with a 1660.

As for the multiple tasks per GPU, I actually return 9 tasks per hour per GPU running 3 tasks each, and 8 per hour at 2 tasks per GPU. Not a whole lot, though it is something! Those numbers are also just what the 2080 Ti is doing.  I am also using the Einstein preferences page to adjust the utilization. That option is actually one reason I like this project, unlike SETI where you have to do the app_config.xml modifications.

Thanks for all the info, Gary. I certainly appreciate it.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110029906088
RAC: 22437579

th3tricky wrote:
... I have selected "No" under the "accept CPU tasks" option.

Are you talking about the "Resource Settings" section near the top of the page where it says "Use CPU:  Request CPU-only tasks from this project."?  If that was set to "No" you shouldn't get any CPU tasks at all (I imagine).  Mine is permanently set to "Yes" so that I can just tick the particular search box when I want to run CPU tasks.  Please note that if a search is for GPUs, there will be a "(GPU)" as part of the name for the search.  Gamma-ray pulsar search #5 has no extra label so it's CPU only.  That list does contain prior searches that are no longer active - eg Gamma-ray pulsar binary search #1.  There was a time when it was also run on CPUs but these days it's entirely for GPUs.

th3tricky wrote:
I have since unchecked the #5 search on the preferences page as well.

I think that's the best way to control what searches you get tasks for.

th3tricky wrote:
My computer layout goes as such: i7-8700K with a 2080 Ti and 1660 Ti. Second computer: i7-4790 with a 2070 and 2060. Third computer: Ryzen 1700X with a 1660.

Strange how BOINC doesn't always properly recognise the most powerful GPU.  The 1660Ti shows when it really should show the 2080Ti.

With that information, I had a (very brief) look at a few run times to see if I could pick the X3 and X2 tasks for the two different GPUs.  I could be wrong because I didn't look all that closely and certainly didn't seek out a proper representative batch in sufficient numbers to arrive at a proper calculated average.  So the numbers below are quite 'rough'.  This is the sort of thing you could do (more rigorously) by selecting batches of tasks (say 30-50) in each category and working out proper averages.


  GPU      Task Run Time (s)   Per Task Time (s)   Tasks per Day   Approx Daily Credit
 Model        X3       X2         X3       X2        X3     X2         X3       X2
======      =====    =====      =====    =====      ====   ====      ======   ======
2080Ti       1270      890        423      445       204    194       ~700K    ~670K
1660Ti       2950     1960        983      980        88     88       ~300K    ~300K

The per task times are just the estimated task run times divided by the concurrency.  Tasks per day is 86400 divided by the per task time.  Daily credit assumes all tasks will be valid and achieve 3465 credits.
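
As a quick sketch, here is the same arithmetic in Python (the run times are the rough seconds from the table and the credit figure is the 3465 mentioned above):

    run_times = {"2080Ti": {3: 1270, 2: 890},   # observed run times in seconds
                 "1660Ti": {3: 2950, 2: 1960}}
    CREDIT_PER_TASK = 3465

    for gpu, by_concurrency in run_times.items():
        for x, run_time in sorted(by_concurrency.items(), reverse=True):
            per_task = run_time / x        # run time divided by concurrency
            per_day = 86400 / per_task     # tasks per day
            credit = per_day * CREDIT_PER_TASK
            print(f"{gpu} X{x}: {per_task:.0f} s/task, "
                  f"{per_day:.0f} tasks/day, ~{credit / 1000:.0f}K credit/day")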

When I looked yesterday, there were 29 invalid tasks showing for 980 validated.  That's an invalid rate of close to 3% which seems a little high.  Other people report invalid rates around 1% or a little more - perhaps in the 1-2% range.  I'm wondering if the invalid rate does go up a bit if the task concurrency is higher so perhaps now that you are on X2 it might fall back a little.  The validator can be a bit 'picky' so everybody sees a few of these.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060184931
RAC: 1147353

Gary Roberts wrote:
When I looked yesterday, there were 29 invalid tasks showing for 980 validated.  That's an invalid rate of close to 3% which seems a little high.  Other people report invalid rates around 1% or a little more - perhaps in the 1-2% range. 

I ran a 2080 here for a number of weeks.  I noticed that the invalid rate was appreciably higher than for my other cards.  I'd be surprised if the 2080 Ti differed in that from the 2080.  I have no opinion what may be normal for the 1660.

While I agree with Gary that something like 1-2% seems more usual here for healthy systems, I'm not at all sure 3% warrants appreciable concern.
