Scheduling priority

robl
Joined: 2 Jan 13
Posts: 1,434
Credit: 861,882,818
RAC: 503,400
Topic 197314

The scheduling priority for E@H is showing -2.13 on one of my machines. What does this imply?

What happened recently: I did a software (rpm) upgrade which updated the NVIDIA drivers on this machine (Ubuntu 12.04 with an NVIDIA GTX 650 Ti). I also requested, through BOINC Manager, a buffer of one additional day. That caused a download of new GPU work for E@H and S@H, but then all GPU jobs reported computation errors for both projects. This made no sense. I then restarted the boinc-client on this machine and now find myself waiting on GPU work for both S@H and E@H. Things then got worse and I did a project reset on E@H and S@H. No work has come in, so I am thinking I am waiting on some timer to expire. This machine only processes GPU work for these two projects and CPU work for Rosetta.
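One way to see whether the client really is sitting out a timer is to look for the scheduler's reason lines in the event log. A minimal sketch, using a hypothetical log excerpt (on a live system the real entries are in BOINC Manager's Event Log or stdoutdae.txt, and the exact wording varies by client version):

```shell
# Hypothetical event-log excerpt, for illustration only.
cat > /tmp/boinc_sample.log <<'EOF'
[Einstein@Home] Project communication deferred for 24:00:00
[SETI@home] Not requesting tasks: don't need
EOF

# Count the lines that explain why no new work is being fetched.
grep -cE "deferred|Not requesting tasks" /tmp/boinc_sample.log
```

If a "communication deferred" line shows up with a large countdown, the client is in a backoff rather than the server being out of work.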

robl
Joined: 2 Jan 13
Posts: 1,434
Credit: 861,882,818
RAC: 503,400

Scheduling priority

Worse than I thought. The output of "dmesg" on Ubuntu gave the following errors:

NVRM: API mismatch: the client has the version 304.108, but
[ 29.202545] NVRM: this kernel module has the version 304.88. Please
[ 29.202545] NVRM: make sure that this kernel module and all NVIDIA driver
[ 29.202545] NVRM: components have the same version.

Somehow the install of the new NVIDIA rpms went sideways. I had to remove all NVIDIA drivers and do a fresh install. nvidia-settings is reporting the correct hardware etc., but I am still waiting on GPU work for both projects. The log states "Not requesting tasks: don't need".

I am not sure how this happened but will certainly pay attention the next time NVIDIA drivers are part of an rpm update/patch.
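The mismatch can be confirmed by comparing the userspace driver version against the loaded kernel module. A minimal sketch, with the two version strings from the dmesg output above hard-coded since the real values are machine-specific; on a live system you would read them from `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and `/proc/driver/nvidia/version` instead:

```shell
# Hard-coded versions taken from the dmesg output above (illustrative).
# On a real machine, substitute the values reported by nvidia-smi and
# /proc/driver/nvidia/version.
client_ver="304.108"
kernel_ver="304.88"

if [ "$client_ver" != "$kernel_ver" ]; then
  echo "API mismatch: client $client_ver vs kernel module $kernel_ver"
else
  echo "driver versions match"
fi
```

If the two strings differ, you get exactly the NVRM "API mismatch" error shown above, and a clean reinstall (or a reboot so the new kernel module is loaded) is needed.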

EDIT: GPU work for both E@H and S@H is now coming in. Don't want this to happen again.

earthbilly
Joined: 4 Apr 18
Posts: 21
Credit: 803,145,682
RAC: 292,237

I have had a similar

I have had a similar situation with computation errors stopping GPU computing and then waiting a while to get new work. Right now it's been hours waiting and no tasks yet for the GPUs, but it's Sunday. I am wondering if it is on the server side, or is it something with my computer causing the computation errors? If there is something I can do to help prevent this, I would like to know. The computer is running perfectly; it tests in the high 99% worldwide with PassMark. It is a shame to lose all that downloaded data and start over again. This is only occurring on one of my seven computers dedicated to Einstein. Knock on wood ;-) I have tried changing back to the prior version of the Nvidia driver and it still happened, so I reinstalled the very latest version and had the best two-day host total ever; now this again.

MSI X99A Raider mobo, Intel Core i7-6800K CPU, 32 GB DDR4 Crucial Ballistix Sport RAM, two Nvidia 1060 6GB Founders Edition GPUs. All just a few months old.

100% powered by SOLAR

Sunny regards,
earthbilly
Joined: 4 Apr 18
Posts: 21
Credit: 803,145,682
RAC: 292,237

OH! Finally got some GPU

OH! Finally got some GPU tasks! I'm still curious why this happens and whether we can prevent it, improve it, or repair it.

100% powered by SOLAR

Sunny regards,
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 4,835
Credit: 28,399,020,651
RAC: 35,964,019

earth_4 wrote:I have had a

earth_4 wrote:
I have had a similar situation with computation errors stopping GPU computing and then waiting a while to get new work.....

Unfortunately, it's not a similar situation as it's nothing like the previous problem of 4.5 years ago.  The error there was to do with an API version mismatch after a software upgrade.  All that means is that various driver and kernel components were built with different versions of build tools and as a consequence were incompatible with each other.  This is not what is happening in your case.

Even if you think it sounds similar, because it was so long ago, it's highly unlikely to be so.  System software and drivers, and science apps for that matter, will have changed a lot in that time.  At that time I think the search would have been one of the radio pulsar searches and not the current gamma-ray pulsar search.  In cases like this, you should just start a new thread and give more details about your particular compute errors.

earth_4 wrote:
I have tried changing back to the prior version of Nvidia driver and it still happened, so I reinstalled the very latest version and had the best two days host total ever, now this again.

Checking/reinstalling graphics drivers is always a good first step when lots of GPU crunching errors start occurring. If you seem to have fixed the problem and then after a period it comes back again, you have to start thinking about other causes.  A few days ago, I documented what seems to be a similar problem in this particular message and in a couple of further reports that followed.  Please have a read through those messages and see if what is reported there is what you are seeing on your machine.

Your problem isn't exactly the same but it may well be related.  In my various cases, the actual error message was to do with trying to allocate more GPU memory than a seemingly bogus and quite small allocation limit.  In your case the error message seems different.  I checked one of yours at random and this is what I got

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF923F04D8C read attempt to address 0xFFFFFFFF

which I interpret to mean that the app was trying to read a memory location that it wasn't allowed to.  Maybe this is also some sort of memory allocation error - I certainly don't know.

Returning to what you posted, when you say, "waiting a while to get new work", did you check in BOINC Manager (Advanced view) to see if you could see the reason for 'waiting'?  Did you look at the entry for Einstein on the projects tab to see if, under the 'Status' column, it said something about project communication being suspended for quite a large number of hours with a counter counting down each second until that full backoff time had been run down to zero?

If you get a large number of compute errors very quickly (and you have several hundred all showing zero run time - so that's pretty quick :-) ) the client can call a 24hr halt to proceedings - presumably to give you some time to notice, investigate and potentially fix whatever the error happens to be.   Sometimes, things like this can be a transient problem and if you notice the situation, you can try stopping and restarting BOINC to see if the issue clears itself.

Once you stop and restart BOINC, the backoff seems to be automatically canceled. If not, you can 'select' the Einstein project and then click 'Update' to override the backoff and force communication. I suggest you look for any repeat occurrence of your problem and see if normal crunching has resumed after a string of errors has instigated a backoff. If you can see that behaviour, it would seem to be very much like what I have observed.
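The general shape of that backoff can be sketched as a delay that doubles on each failure and saturates at the 24-hour cap described above. The constants here are purely illustrative, not taken from the client source:

```shell
# Illustrative only: a doubling retry delay capped at 24 hours.
delay=60
cap=$((24 * 3600))
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
  delay=$((delay * 2))
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
done
echo "delay after 12 failed attempts: ${delay}s"
```

Selecting the project and clicking 'Update' in the Manager (or running `boinccmd --project <URL> update` from a terminal) resets this and forces an immediate scheduler contact.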

I have commented on things I think might be causing this. It's only a guess and it could be a misplaced guess.  If you can document the details of what happens to you, it might allow the Devs to work out what the problem really is.  In my case, there has been another machine go into a 24hr backoff and then start normal crunching again, just by canceling the backoff.  If you can observe the same behaviour, with totally different hardware and drivers, it would really indicate an issue with the tasks or the application rather than OS or drivers.

Cheers,
Gary.
