procedural error on driver change out

Anonymous
Topic 197621

On the 25th of June I suspended the E@H project on one PC. I then installed a newer NVIDIA driver (not the latest driver available), restarted the machine and "resumed" E@H. All jobs generated computation errors for Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270), 25 of them. There have been no errors for these jobs since. Could a driver change-out cause this? If so, would the correct procedure be to complete all in-progress jobs and set "no new jobs" prior to a driver upgrade/change-out?

tbret
Joined: 12 Mar 05
Posts: 2115
Credit: 4863901961
RAC: 101445


Quote:
On the 25th of June I suspended the E@H project on one PC. I then installed a newer NVIDIA driver (not the latest driver available), restarted the machine and "resumed" E@H. All jobs generated computation errors for Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270), 25 of them. There have been no errors for these jobs since. Could a driver change-out cause this? If so, would the correct procedure be to complete all in-progress jobs and set "no new jobs" prior to a driver upgrade/change-out?

My guess is that somewhere in the process BOINC lost the fact that you had the type of hardware necessary to run that type of work and deleted it all.

I've had that happen, but it doesn't always happen, and the only specific times I can remember it happening were when I did something with the hardware (like swapping slots). I don't think I've ever had that happen with a driver update, even when I tell the machine to do a clean install.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117913771313
RAC: 34574184


Quote:
... Could/would a driver change out cause this.


I wouldn't have thought so. I would think that some other factor must have been involved.

As an indication of how resilient the BOINC system seems to be, consider this example. I've been upgrading lots of hosts that were running 32-bit Linux and BOINC, with both AMD and NVIDIA GPUs and a variety of driver versions. All I've been doing is:

  * Stop BOINC (still with an active work cache)
  * Install the 64 bit OS with KDE desktop (including latest drivers) 'over the top' (/home is a separate partition).
  * Install all updates from a fully updated local repository.
  * Install 64 bit BOINC 7.2.42 (from Berkeley) 'over the top' of whatever the previous version was (many were V6.x).
  * Dump copies of all 64 bit apps/libs/data in the project dirs of whatever projects were running on the particular machine.
  * Reboot and start BOINC.

In all cases, about 60 machines so far, BOINC has restarted, noted the platform change, noted the version change, properly detected the GPU where there was a GPU, and simply picked up from where it left off. I haven't lost a single task so far that I'm aware of. When a new work fetch was initiated, the 64 bit apps/libs/data were noted as 'already exists' so a whole bunch of downloads were avoided.

So, no, I don't think it's necessary to empty the cache before changing a driver.
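For anyone who would still rather play it safe, the "no new jobs" step asked about in the original question can be sketched with `boinccmd`, the command-line tool that ships with the BOINC client. This is only a rough sketch: the project URL is a placeholder (an assumption, use the one your own client shows), the function name is made up for illustration, and the script only builds and prints the commands rather than running them.

```python
import shlex

# Assumption: placeholder URL, replace with the project URL your client reports.
PROJECT_URL = "http://project.example/"


def driver_update_plan(project_url, boinccmd="boinccmd"):
    """Return (before, after) lists of boinccmd argument vectors.

    before: run these, then let the in-progress tasks finish.
    after:  run these once the new driver is installed and the machine rebooted.
    """
    before = [
        # Stop the project from fetching more work.
        [boinccmd, "--project", project_url, "nomorework"],
        # List the tasks still on board so you can see when the cache is empty.
        [boinccmd, "--get_tasks"],
    ]
    after = [
        # Check that nothing is stuck before asking for more work.
        [boinccmd, "--get_tasks"],
        # Allow work fetch again.
        [boinccmd, "--project", project_url, "allowmorework"],
    ]
    return before, after


if __name__ == "__main__":
    before, after = driver_update_plan(PROJECT_URL)
    for label, commands in (("before driver update", before),
                            ("after driver update", after)):
        print(label + ":")
        for cmd in commands:
            print("  " + " ".join(shlex.quote(arg) for arg in cmd))
```

The commands are printed instead of executed so they can be checked first; `boinccmd` also has to be run from somewhere it can authenticate to the client.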

Cheers,
Gary.

Gary Roberts


Quote:
My guess is that somewhere in the process BOINC lost the fact that you had the type of hardware necessary to run that type of work and deleted it all.


No, the work won't get deleted.

If BOINC starts with GPU tasks on board and fails to detect the GPU, it will list all the onboard tasks as "GPU missing ..." but it won't delete them. As soon as you rectify whatever was preventing the GPU from being detected, the tasks will all be available for crunching again.

Cheers,
Gary.

Anonymous


Quote:

No, the work won't get deleted.

If BOINC starts with GPU tasks on board and fails to detect the GPU, it will list all the onboard tasks as "GPU missing ..." but it won't delete them.


This was my experience on another NVIDIA node after a driver change-out. If there was a problem, WUs were flagged with "GPU missing". Usually for me a command-line restart of the boinc-client would "find the missing GPUs" and all would be fine. But not in this case.

Guess we will just accept what happened and move on.

mikey
Joined: 22 Jan 05
Posts: 12718
Credit: 1839121161
RAC: 3588


Quote:
Quote:

No, the work won't get deleted.

If BOINC starts with GPU tasks on board and fails to detect the GPU, it will list all the onboard tasks as "GPU missing ..." but it won't delete them.


This was my experience on another NVIDIA node after a driver change-out. If there was a problem, WUs were flagged with "GPU missing". Usually for me a command-line restart of the boinc-client would "find the missing GPUs" and all would be fine. But not in this case.

Guess we will just accept what happened and move on.

I had a problem with a cc_config.xml file at another project, with two different kinds of GPUs, and while I was figuring it out about 20 units were trashed. Not ALL units got trashed, just some of them. Once I got it figured out, though, the units that stayed but had 'GPU missing' eventually got crunched and new units got downloaded.

The problem in my case is that BOTH GPUs are labeled as device zero!! I had to exclude projects using the ATI and NVIDIA tags. When I had two AMD/ATI GPUs in the machine, one was device zero and the other was device one.
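The exact file isn't shown here, but as a rough sketch, assuming the "ATI and NVIDIA tags" are the `<type>` element of BOINC's `<exclude_gpu>` option in cc_config.xml, a small Python script could generate the exclusions like this. The project URLs and the function name are hypothetical placeholders, not the real setup. Excluding by type rather than by `<device_num>` is the point, since both cards report as device zero.

```python
import xml.etree.ElementTree as ET


def build_cc_config(exclusions):
    """Build cc_config.xml text from (project_url, gpu_type) pairs.

    gpu_type is the vendor tag to keep that project off, e.g. "NVIDIA" or "ATI".
    No <device_num> is written, so the exclusion applies to every GPU of that
    type, which sidesteps the two-cards-both-numbered-zero problem.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for project_url, gpu_type in exclusions:
        entry = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(entry, "url").text = project_url
        ET.SubElement(entry, "type").text = gpu_type
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical projects: A should only use the NVIDIA card, B only the ATI card.
    print(build_cc_config([
        ("http://project-a.example/", "ATI"),
        ("http://project-b.example/", "NVIDIA"),
    ]))
```

The printed text would go into cc_config.xml in the BOINC data directory; restarting the client afterwards is the safe way to make sure it is picked up.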

BOINC is a funny duck sometimes!!

tbret


Quote:

BOINC is a funny duck sometimes!!

I've been thinking about this because I've had the "GPU missing" message without the work being trashed, also. But I've also had it dump 250 highly coveted tasks at "that other project."

I'm trying to imagine what the difference in behavior was, and you just gave me a clue.

But, in no case was it something as simple as updating a driver.

mikey


Quote:
Quote:

BOINC is a funny duck sometimes!!

I've been thinking about this because I've had the "GPU missing" message without the work being trashed, also. But I've also had it dump 250 highly coveted tasks at "that other project."

I'm trying to imagine what the difference in behavior was, and you just gave me a clue.

But, in no case was it something as simple as updating a driver.

Nope, mine was NOT a driver upgrade either, just a configuration problem I worked through.
