procedural error on driver change out

Anonymous

30 Jun 2014 21:03:30 UTC

Topic 197621

On the 25th of June I suspended the E@H project on one PC. I then installed a newer NVIDIA driver (not the latest driver available), restarted the machine and "resumed" E@H. All of the Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270) jobs, 25 of them, generated computation errors. There have been no errors for these jobs since. Could a driver change-out cause this? If so, would the correct procedure be to complete all in-progress jobs and set "no new jobs" prior to a driver upgrade or change-out?

tbret

Joined: 12 Mar 05

Posts: 2115

Credit: 4872051851

RAC: 229840

procedural error on driver change out

1 Jul 2014 1:22:46 UTC

Message 122073

Quote:

On the 25th of June I suspended the E@H project on one PC. I then installed a newer NVIDIA driver (not the latest driver available), restarted the machine and "resumed" E@H. All of the Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270) jobs, 25 of them, generated computation errors. There have been no errors for these jobs since. Could a driver change-out cause this? If so, would the correct procedure be to complete all in-progress jobs and set "no new jobs" prior to a driver upgrade or change-out?

My guess is that somewhere in the process BOINC lost the fact that you had the type of hardware necessary to run that type of work and deleted it all.

I've had that happen, but it doesn't always happen, and the only specific times I can remember were when I did something with the hardware (like swapping slots). I don't think I've ever had it happen with a driver update, even when I tell the machine to do a clean install.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5877

Credit: 118658960499

RAC: 19070878

RE: ... Could/would a

1 Jul 2014 1:24:01 UTC

Message 122074

Quote:

... Could/would a driver change out cause this?

I wouldn't have thought so. I would think that some other factor must have been involved.

As an indication of how resilient the BOINC system seems to be, consider this example. I've been upgrading lots of hosts that were running 32-bit Linux and BOINC, with both AMD and NVIDIA GPUs and a variety of driver versions. All I've been doing is:

* Stop BOINC (still with an active work cache)
* Install the 64 bit OS with KDE desktop (including latest drivers) 'over the top' (/home is a separate partition).
* Install all updates from a fully updated local repository.
* Install 64 bit BOINC 7.2.42 (from Berkeley) 'over the top' of whatever the previous version was (many were V6.x).
* Dump copies of all 64 bit apps/libs/data in the project dirs of whatever projects were running on the particular machine.
* Reboot and start BOINC.

In all cases, about 60 machines so far, BOINC has restarted, noted the platform change, noted the version change, properly detected the GPU where there was a GPU, and simply picked up from where it left off. I haven't lost a single task so far that I'm aware of. When a new work fetch was initiated, the 64 bit apps/libs/data were noted as 'already exists' so a whole bunch of downloads were avoided.

So, no, I don't think it's necessary to empty the cache before changing a driver.

Cheers,
Gary.
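
For anyone who would still rather pause a project around a driver change, the steps asked about in the first post can be scripted. The following is only a minimal sketch, not a tested procedure. It assumes the `boinccmd` tool that ships with BOINC is on the PATH, and the project URL is a placeholder that has to be replaced with the URL the client is actually attached to. The function name `driver_change_commands` is made up for illustration. The script only builds and prints the command lines; it does not run anything.

```python
import shlex


def driver_change_commands(project_url, boinccmd="boinccmd"):
    """Build boinccmd command lines to run before and after a driver change.

    project_url is a placeholder supplied by the caller; it must match the
    URL the BOINC client is attached to. Nothing is executed here.
    """
    before = [
        # Stop the client from fetching new work for this project.
        [boinccmd, "--project", project_url, "nomorework"],
        # Suspend the project's tasks while the driver is being changed.
        [boinccmd, "--project", project_url, "suspend"],
    ]
    after = [
        # Let the suspended tasks run again.
        [boinccmd, "--project", project_url, "resume"],
        # Allow work fetch again.
        [boinccmd, "--project", project_url, "allowmorework"],
    ]
    return before, after


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    before, after = driver_change_commands("https://project.example/")
    print("Before the driver change:")
    for cmd in before:
        print("  " + shlex.join(cmd))
    print("After reboot and once the GPU is detected:")
    for cmd in after:
        print("  " + shlex.join(cmd))
```

To drain the cache instead of suspending it, only the `nomorework` line would be used, followed by waiting for the in-progress tasks to finish. As noted above, that should not normally be necessary for a driver change.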

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5877

Credit: 118658960499

RAC: 19070878

RE: My guess is that

1 Jul 2014 1:33:41 UTC

Message 122075 in response to message 122073

Quote:

My guess is that somewhere in the process BOINC lost the fact that you had the type of hardware necessary to run that type of work and deleted it all.

No, the work won't get deleted.

If BOINC starts with GPU tasks on board and fails to detect the GPU, it will list all the onboard tasks as "GPU missing ..." but it won't delete them. As soon as you fix whatever was preventing the GPU from being detected, the tasks will all be available for crunching again.

Cheers,
Gary.

Anonymous

RE: No, the work won't get

1 Jul 2014 1:47:37 UTC

Message 122076 in response to message 122075

Quote:

No, the work won't get deleted.

If BOINC starts with GPU tasks on board and fails to detect the GPU, it will list all the onboard tasks as "GPU missing ..." but it won't delete them.

This was my experience on another NVIDIA node after a driver change-out. If there was a problem, WUs were flagged with "GPU missing". Usually a command-line restart of the boinc-client would "find the missing GPUs" and all would be fine. But not in this case.

Guess we will just accept what happened and move on.

mikey

Joined: 22 Jan 05

Posts: 12829

Credit: 1883752578

RAC: 1100077

RE: RE: No, the work

1 Jul 2014 11:06:53 UTC

Message 122077 in response to message 122076

Quote:

Quote:
No, the work won't get deleted.

If BOINC starts with GPU tasks on board and fails to detect the GPU, it will list all the onboard tasks as "GPU missing ..." but it won't delete them.

This was my experience on another NVIDIA node after a driver change-out. If there was a problem, WUs were flagged with "GPU missing". Usually a command-line restart of the boinc-client would "find the missing GPUs" and all would be fine. But not in this case.

Guess we will just accept what happened and move on.

I had a problem with a cc_config.xml file at another project, on a machine with two different kinds of GPU, and while I was figuring it out about 20 units were trashed. Not ALL units got trashed, just some of them. Once I got it figured out, though, the units that stayed but had 'gpu missing' eventually got crunched and new units got downloaded.

The problem in my case is that BOTH GPUs are labeled as device zero!! I had to exclude projects using the ATI and NVIDIA tags. When I had two AMD/ATI GPUs in the machine, one was device zero and the other was device one.
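
A minimal sketch of what excluding by GPU type, rather than by device number, might look like. It assumes the `<exclude_gpu>` option that the BOINC client configuration documents for cc_config.xml, with a `<url>` element and an optional `<type>` element (NVIDIA or ATI here). The project URLs below are placeholders, not the projects from this post, and the helper name `build_cc_config` is made up for illustration. The script only produces the XML text; putting it in the BOINC data directory and having the client re-read its config files is left to the reader.

```python
import xml.etree.ElementTree as ET


def build_cc_config(exclusions):
    """Return cc_config.xml text with one <exclude_gpu> per (url, gpu_type).

    exclusions is a list of (project_url, gpu_type) pairs, where gpu_type
    is a vendor tag such as "NVIDIA" or "ATI". Excluding by type avoids
    relying on device numbers when both cards report as device 0.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for project_url, gpu_type in exclusions:
        exclude = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(exclude, "url").text = project_url
        ET.SubElement(exclude, "type").text = gpu_type
    # Pretty-print where available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical URLs: keep project A off the ATI card and
    # project B off the NVIDIA card.
    print(build_cc_config([
        ("https://project-a.example/", "ATI"),
        ("https://project-b.example/", "NVIDIA"),
    ]))
```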

BOINC is a funny duck sometimes!!

tbret

Joined: 12 Mar 05

Posts: 2115

Credit: 4872051851

RAC: 229840

RE: Boinc is funny duck

1 Jul 2014 18:48:44 UTC

Message 122078 in response to message 122077

Quote:

BOINC is a funny duck sometimes!!

I've been thinking about this because I've had the "GPU missing" message without the work being trashed, also. But I've also had it dump 250 highly coveted tasks at "that other project."

I'm trying to imagine what the difference in behavior was, and you just gave me a clue.

But, in no case was it something as simple as updating a driver.

mikey

Joined: 22 Jan 05

Posts: 12829

Credit: 1883752578

RAC: 1100077

RE: RE: Boinc is funny

2 Jul 2014 11:17:36 UTC

Message 122079 in response to message 122078

Quote:

Quote:

BOINC is a funny duck sometimes!!

I've been thinking about this because I've had the "GPU missing" message without the work being trashed, also. But I've also had it dump 250 highly coveted tasks at "that other project."

I'm trying to imagine what the difference in behavior was, and you just gave me a clue.

But, in no case was it something as simple as updating a driver.

Nope, mine was NOT a driver upgrade either, just a configuration problem I worked through.
