On the 25th of June I suspended the E@H project on one PC. I then installed a newer NVIDIA driver (not the latest driver available), restarted the machine and "resumed" E@H. All jobs then generated computation errors for Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270) -- 25 of them. There have been no errors for these jobs since. Could/would a driver changeout cause this? If so, would the correct procedure be to complete all in-progress jobs and set "no new jobs" prior to a driver upgrade/changeout?
procedural error on driver change out
My guess is that somewhere in the process BOINC lost the fact that you had the type of hardware necessary to run that type of work and deleted it all.
I've had that happen, but it doesn't always happen, and the only specific times I can remember were when I did something with the hardware (like swapping slots). I don't think I've ever had that happen with a driver update, even when I tell the machine to do a clean install.
RE: ... Could/would a
I wouldn't have thought so. I would think that some other factor must have been involved.
As an indication of how resilient the BOINC system seems to be, consider this example. I've been upgrading lots of hosts that were running 32 bit Linux and BOINC with both AMD and NVIDIA GPUs with a variety of driver versions. All I've been doing is:
* Install the 64 bit OS with KDE desktop (including latest drivers) 'over the top' (/home is a separate partition).
* Install all updates from a fully updated local repository.
* Install 64 bit BOINC 7.2.42 (from Berkeley) 'over the top' of whatever the previous version was (many were V6.x).
* Dump copies of all 64 bit apps/libs/data in the project dirs of whatever projects were running on the particular machine.
* Reboot and start BOINC.
In all cases, about 60 machines so far, BOINC has restarted, noted the platform change, noted the version change, properly detected the GPU where there was a GPU, and simply picked up from where it left off. I haven't lost a single task so far that I'm aware of. When a new work fetch was initiated, the 64 bit apps/libs/data were noted as 'already exists' so a whole bunch of downloads were avoided.
So, no, I don't think it's necessary to empty the cache before changing a driver.
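For anyone who would still rather drain the cache before a driver change, here is a minimal sketch of the "no new jobs" step done through `boinccmd` instead of the Manager. It assumes `boinccmd` is on the PATH and can talk to the local client, and that the project URL below matches the one the client is actually attached to (the URL shown is an assumption, check your own). The helper names `boinccmd_args` and `run` are made up for illustration.

```python
import subprocess

# Assumption: this must match the project URL exactly as your client
# lists it; replace it with the URL shown for your attached project.
PROJECT_URL = "http://einstein.phys.uwm.edu/"

# Per-project operations that boinccmd accepts after "--project URL".
ALLOWED_OPS = {"nomorework", "allowmorework", "suspend", "resume", "update"}


def boinccmd_args(operation, project_url=PROJECT_URL):
    """Build the boinccmd argument list for one per-project operation."""
    if operation not in ALLOWED_OPS:
        raise ValueError(f"unsupported operation: {operation}")
    return ["boinccmd", "--project", project_url, operation]


def run(operation, project_url=PROJECT_URL):
    """Run the command against the local client (needs boinccmd installed)."""
    return subprocess.run(boinccmd_args(operation, project_url), check=True)


if __name__ == "__main__":
    # Before the driver change: stop fetching work, let the cache run dry.
    print(" ".join(boinccmd_args("nomorework")))
    # After the driver change and a reboot: allow work fetch again.
    print(" ".join(boinccmd_args("allowmorework")))
```

As written, the script only prints the two commands; call `run("nomorework")` to actually send one to the client.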
Cheers,
Gary.
RE: My guess is that
No, the work won't get deleted.
If BOINC starts with GPU tasks on board and fails to detect the GPU, it will list all the onboard tasks as "GPU missing ..." but it won't delete them. As soon as you rectify whatever was causing the GPU to be not detected, the tasks will all be available for crunching again.
Cheers,
Gary.
RE: No, the work won't get
This was my experience on another NVIDIA node after a driver changeout. If there was a problem, WUs were flagged with "GPU missing". Usually for me a command line restart of the boinc-client would "find the missing GPUs" and all would be fine. But not in this case.
Guess we will just accept what happened and move on.
RE: RE: No, the work
I had a problem with a cc_config.xml file at another project, with two different kinds of GPUs, and during the process of figuring it out about 20 units were trashed. Not ALL units got trashed, just some of them. Once I got it figured out, though, the units that stayed but had 'GPU missing' eventually got crunched and new units got downloaded.
The problem in my case is that BOTH GPUs are labeled as device zero!! I had to exclude projects using the ATI and NVIDIA tags. When I had two AMD/ATI GPUs in the machine, one was device zero and the other was device one.
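To illustrate the kind of exclusion described above, here is a rough sketch that generates a cc_config.xml with `<exclude_gpu>` entries keyed on GPU type, so that two cards which both report as device 0 can still be told apart. The project URLs are placeholders, not the real ones from this setup, and the helper name `exclude_gpu_config` is made up for the example; the element names follow the BOINC client configuration format as I understand it.

```python
import xml.etree.ElementTree as ET


def exclude_gpu_config(exclusions):
    """Return cc_config.xml text with one <exclude_gpu> block per entry.

    exclusions: iterable of (project_url, gpu_type, device_num), where
    gpu_type is the vendor tag, e.g. "NVIDIA" or "ATI".
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for url, gpu_type, device_num in exclusions:
        entry = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(entry, "url").text = url
        ET.SubElement(entry, "device_num").text = str(device_num)
        ET.SubElement(entry, "type").text = gpu_type
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder URLs: keep project A off the ATI card and project B
    # off the NVIDIA card, even though both cards are device 0.
    print(exclude_gpu_config([
        ("http://project-a.example/", "ATI", 0),
        ("http://project-b.example/", "NVIDIA", 0),
    ]))
```

The output would go into cc_config.xml in the BOINC data directory, and the client has to re-read its config files or be restarted before the exclusions take effect.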
BOINC is a funny duck sometimes!!
RE: Boinc is funny duck
I've been thinking about this because I've had the "GPU missing" message without the work being trashed, also. But I've also had it dump 250 highly coveted tasks at "that other project."
I'm trying to imagine what the difference in behavior was, and you just gave me a clue.
But, in no case was it something as simple as updating a driver.
RE: RE: Boinc is funny
Nope, mine was NOT a driver upgrade either, just a configuration problem I worked through.