All "Gamma-ray pulsar binary search on GPU" units fail with computation fault

Richard Bertrand
Joined: 2 Apr 06
Posts: 8
Credit: 5833265
RAC: 0
Topic 214652

Recently (over the past week or so), all the Einstein "Gamma-ray pulsar binary search on GPU" units have been failing with the error "computation fault".

I have no idea where to start looking. There are no other signs that the nVidia GeForce GT650M card doesn't work correctly and I have fairly recently updated the nVidia driver (now running on 391.35).

Are there more problems known with these units?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110037737339
RAC: 22393873

The problem will most likely be with the driver update. You probably don't have the OpenCL libraries properly installed.  Can you give full details of how you updated the driver?  There are no known problems with the tasks.

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110037737339
RAC: 22393873

Richard Bertrand wrote:
... I have no idea where to start looking...

If you go to your account page on the website and click on the link to your computers, you should see this page.  There is a 'details' link to see the hardware details and a 'tasks' link to see all the tasks.  YOWIE!!!!  -- 97 compute errors!! :-).  There are no valid nVidia GPU tasks anywhere in your complete list so, until you work out what is causing the problem, you should disable receiving tasks for the FGRPB1G gamma-ray pulsar search rather than just continuing to trash the tasks.

If you scroll through the pages of compute errors and pick any one you like, just click on the task ID link for it to see what was returned to the project.  There could be a lot of stuff in there but you're bound to find the word 'ERROR' somewhere and the commentary associated with that will give you a clue.  Here is such an excerpt chosen at random and slightly edited to stop lines overflowing.  In the very first line, I truncated the list of very long output filenames (not related to the problem).  Later on in the snippet, I also truncated the name of the checkpoint file, which won't exist at the very start of crunching anyway, so that message is informational and not a sign of a problem.

I picked a task that failed early so there wasn't a great deal to peruse.  This seems to be the most common type of problem in your list of 97 anyway.

output files: 'LATeah0060L_948.0_0_0.0_5878420_1_0.out' '../../projects/einstein.......'
20:44:12 (3288): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
20:44:12 (3288): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [00000000033ABA80 , 00000000033ABEE0]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GT 650M" by: NVIDIA Corporation
Max allocation limit: 536870912
Global mem size: 2147483648
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0060L.dat
% Total amount of photon times: 30007
% Preparing toplist of length: 10
% Read 1255 binary points
read_checkpoint(): Couldn't open file 'LATeah0060L_948....cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1255
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 31 df1dot: 3.344368011e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
Error during OpenCL FFT (error: -36)
ERROR: gen_fft_execute() returned with error 0
20:44:25 (3288): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags: PRECISION
20:44:37 (3288): [normal]: done. calling boinc_finish(69).
20:44:37 (3288): called boinc_finish

Everything seems quite normal up to the line that says "% Filling array of photon pairs".  The next couple of lines identify that it's an FFT (Fast Fourier Transform) error related to a particular function (named gen_fft_execute()).  Someone who knows what this function does might be able to comment on why there was an error.
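As a rough illustration of the kind of scan described above, the following Python sketch pulls the error lines out of a task's stderr text and attaches the standard OpenCL name to the numeric code where one is present. The function name find_errors and the small lookup table are made up for this example and are not part of the Einstein@Home application. The table lists only a few status codes from the OpenCL headers, where -36 is CL_INVALID_COMMAND_QUEUE.

```python
import re

# A few status codes from the OpenCL headers (CL/cl.h); deliberately not a complete list.
OPENCL_ERRORS = {
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -36: "CL_INVALID_COMMAND_QUEUE",
}


def find_errors(stderr_text):
    """Return (line, opencl_name_or_None) for every line that mentions an error.

    Hypothetical helper for illustration only.
    """
    hits = []
    for line in stderr_text.splitlines():
        if "error" not in line.lower():
            continue
        # Only lines of the form "... (error: -36)" carry an OpenCL status code.
        match = re.search(r"\(error:\s*(-?\d+)\)", line)
        name = OPENCL_ERRORS.get(int(match.group(1))) if match else None
        hits.append((line.strip(), name))
    return hits


# Lines copied from the excerpt above.
SAMPLE = """\
% Filling array of photon pairs
Error during OpenCL FFT (error: -36)
ERROR: gen_fft_execute() returned with error 0
20:44:25 (3288): [CRITICAL]: ERROR: MAIN() returned with error '5'
"""

for line, name in find_errors(SAMPLE):
    print(line, "->", name or "no OpenCL code on this line")
```

The name by itself does not say why the command queue became invalid. It only narrows the failure down to the OpenCL layer rather than the task data.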

My guess is still that it's being caused by the change in driver somehow.  You could either wait for a Dev to comment or take the initiative to do a clean install of the full driver package from nVidia (not from Microsoft).  If you search through the forums, there are plenty of examples of problems after Microsoft updates.  I don't use Windows at all so I have no first hand experience with which to guide you.  I can recall lots of examples where knowledgeable people have set out the things to do to get a proper clean install.  And once you get a driver that's working properly, don't allow it to be updated by Microsoft.  If there's a problem that needs a driver update, get the new driver from nVidia.  This seems to be the message that those 'in the know' are giving.

 

Cheers,
Gary.

Richard Bertrand
Joined: 2 Apr 06
Posts: 8
Credit: 5833265
RAC: 0

Gary Roberts wrote:
Can you give full details of how you updated the driver?

I used the nVidia GeForce Experience program that came up with the new driver.

Now, I have had some difficulty installing new nVidia drivers because I am a normal user (not an administrator) on my computer. In the past, the Experience program or the installer couldn't handle such a situation in a decent way and I had to search for the driver installer program in some nVidia folder on my computer and install it myself.

This time also, it didn't start installing the first time I gave it a go, but the second time it did and it seemed to have installed the driver perfectly.

 

Richard Bertrand
Joined: 2 Apr 06
Posts: 8
Credit: 5833265
RAC: 0

Gary Roberts wrote:
... than just continuing to trash the tasks.

I don't exactly know what you mean by "trashing". I was under the impression that every unit is sent to multiple BOINC users and if there are no results from one or more users, the results from the remaining ones will be used. And if there are too few remaining, additional users are given the same task to get more results.

If that is so, not much should be lost if only my units are "trashed"?

On topic: I have installed the driver again (clean install from the downloaded 391.35 installer from nVidia). Then I rebooted the PC. After that, the first 3 gamma-ray pulsar units ended in error (again). As far as I can see, the card and driver seem to be fine.

The remark just after the error is "FPU status flags: PRECISION". Is there a switch/option in the driver I need to check or set to let it use the right computation precision? Or do I need to update the BOINC software (or the Einstein software)?

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110037737339
RAC: 22393873

Richard Bertrand wrote:
I don't exactly know what you mean by "trashing".

Sorry - poor choice of words. I was just observing that all these tasks ending in failure was something that had been happening for weeks.  It's always a bit unfortunate to see resources being wasted for such a long time.  I wasn't trying to be critical at all.  I was just trying to promote a message about the waste of resources, everybody's resources.

Quote:
I was under the impression that every unit is sent to multiple BOINC users and if there are no results from one or more users, the results from the remaining ones will be used. And if there are too few remaining, additional users are given the same task to get more results.

If it happened like that, it would be a really distressing waste of donated resources.  To allow a result to be confirmed and validated, ONLY two copies are distributed initially and no more are sent unless the initial tasks fail in some way, or are not returned within the deadline.  In fact these days, and for selected searches, there are strategies to send a high proportion of single tasks only, to computers known to be reliable, with occasional checks on continuing reliability.

Appropriate use of resources is quite a big deal.   For the FGRPB1G GPU search, there are just two copies with a 2-week deadline.  If both copies are returned and if they agree with each other, there will be NO additional copies.  Every task that fails or goes missing has to be replaced and this puts extra load on the project's resources as well as consuming additional volunteer resources.  This is why I try to encourage people to pay attention to what is happening on their machines.
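To make the cost of failed tasks concrete, here is a toy Python model of the two-copy scheme described above. It is only a sketch under simplifying assumptions: the function name copies_sent is invented for this example, the real scheduler's limits on total errors are ignored, and any two valid results are assumed to agree with each other.

```python
def copies_sent(outcomes, quorum=2):
    """Count how many copies of one workunit end up being sent out.

    outcomes: results of successive copies, in the order they come back;
              True  = a valid result was returned,
              False = the copy errored out or missed the deadline.
    Toy model only: each failure triggers exactly one replacement copy,
    and sending stops once `quorum` valid results are in.
    """
    sent = 0
    valid = 0
    for ok in outcomes:
        if valid >= quorum:
            break  # quorum reached, no further copies are needed
        sent += 1
        if ok:
            valid += 1
    return sent


print(copies_sent([True, True]))           # both initial copies succeed
print(copies_sent([False, True, True]))    # one failure forces one replacement
print(copies_sent([False, False, True, True]))
```

In this model a host that fails every task adds one extra copy, and one extra volunteer's GPU time, to every workunit it touches.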

Quote:
I have installed the driver again (clean install from the downloaded 391.35 installer from nVidia). Then I rebooted the PC. After that, the first 3 gamma-ray pulsar units ended in error (again). As far as I can see, the card and driver seem to be fine.

I don't know anything about Windows drivers.  I've used only Linux for many years.  All I know of Windows driver problems is what I read from others.  There have been posts about the need to be extremely careful in removing remnants of previous installs before doing a fresh install. I can remember that some people have posted fairly detailed instructions on how to best achieve a final working result so all I can suggest is that you try searching for such posts.  Perhaps one of the Windows users might be able to comment on whether it's likely to be a driver problem and, if so, provide a link to those instructions.

Quote:
The remark just after the error is "FPU status flags: PRECISION". Is there a switch/option in the driver I need to check or set to let it use the right computation precision? Or do I need to update the BOINC software (or the Einstein software)?

I don't think there's anything to check or update.  There is a "FPU status flags:" entry right at the end of the output for every task, good or bad.  My guess is that it's simply to report the condition (status) of the hardware floating point unit that is used in the second (followup) stage of crunching.  This stage uses double precision to re-evaluate the 'toplist' - the ten most likely candidate signals found in the main crunching stage.  Your tasks are failing right at the start of the main crunching stage so the status of the FPU is probably nothing to do with the actual problem.  Just my guess.  Like you, I'm an ordinary volunteer with no particular IT background.  Anything I know has been gleaned from personal observation or by reading the comments of others who do seem to know :-).

And as Mark Twain said, "It's not what you don't know that gets you into trouble.  It's what you know for sure that just ain't so!"

I'm sure I've managed to get myself into trouble many times before :-).

 

Cheers,
Gary.

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1564139881
RAC: 36826

The stderr output of the tasks in Gary's link shows error status -36, as discussed here:

https://devtalk.nvidia.com/default/topic/501409/cl_invalid_command_queue-error-on-clfinish-command-a-lot-of-operations-in-each-kernel-driver-crash/?offset=2

and "The network BIOS session limit was exceeded" - maybe one error is a consequence of the other?

However, as already suggested, getting the system clean once again with the Display Driver Uninstaller (DDU) tool, then installing the driver anew, rebooting, and monitoring the temperature on that laptop seems to be a good first step.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Here's the newest version of DDU (Display driver uninstaller):

https://www.wagnardsoft.com/forums/viewtopic.php?f=5&t=1069

I would try that in 'Normal' mode at first (not letting it boot into Safe mode in Windows 10), because in some cases choosing Safe mode has caused a boot loop, which could be a pain in the ass then and it would take some time to fix.

Richard Bertrand
Joined: 2 Apr 06
Posts: 8
Credit: 5833265
RAC: 0

Thank you, Richie, for providing the link to the DDU software. I have used it as per your instructions and it seems to have worked OK. The driver was completely removed up to the point that Windows only recognized a "basic video adapter".

After that, I installed the video driver again and that also seemed to go well.

However, my first gamma-ray calculation ended in error again. The calculations went well for more than an hour and a half (the projected calculation time for the task was one hour or so, but during calculations, this changed to more than 4 hours).

The error occurred after the 402nd (of 1255) photon-load event according to the task log.

As all other programs work fine (I tried a game, Google Earth and Seti is calculating constantly and without errors with OpenCL and Cuda calculation programs), I am wondering why only the Einstein OpenCL application has problems.

Solling2 said something about the temperature. I am used to having my computer use its TDP-budget to the fullest: the fan is almost always on. A few months ago I noticed that the fan speed is higher than I was used to. That is a signal that I need to clean the fan and cooling system.

The problem is that I need to take the whole laptop apart just to clean the fan. Asus has closed this laptop completely; I can only change disks and memory rather easily.

As I need to take it apart, I will probably also renew the cooling paste on the CPU and video card after 5 years of operating at rather high temperatures: at this moment each core and the nVidia card are constantly throttling, with the temperature switching between 80 and 100+ degrees Celsius.

The problem is that I need this computer every day for work and private stuff. It will cost me the better part of an evening (maybe more) to disassemble, clean and re-assemble the computer. I am not looking forward to this task....

Anyway, as the temperature sensors of the CPU and nVidia card seem to work OK and throttle the speed of these, I don't see why temperature would be an issue with the Einstein calculations?

 

 

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

About the temperatures... they seem to be high for sure. Modern hardware (GPUs and CPUs) has internal protection and will shut down if it becomes too hot. But there's more to it than that.

I have never run Seti so I can't compare with that, but my personal experience is that Nvidia GPUs like to run at max ~80C. When temperature problems arise at some practical heat level, normally you might start to see distorted lines on the screen, for example, IF you were using a graphical environment (desktop software, videos, web... or so).

But if you were doing nothing graphical and running just Boinc (in the background) then the screen could be okay while temperatures rise above ~80C, but the 'scientific accuracy' of running Cuda or OpenCL could already become bad. Then errors could show up on these Einstein tasks, even if your computer could still manage to fight the heat and avoid shutting down the GPU or CPU.

IF you want to regulate max temperatures to a much lower level than what the CPU and GPU will do on their own, TThrottle might be something to try out:

https://efmer.com/

I don't know how well it works with Windows 10, but I remember running it in the past and it was able to limit temps (CPU / GPU). That method might work if you leave your laptop running these tasks, for example when you go to sleep. It might not work so well if you run Boinc while you are also working on the laptop. There's probably a load of dust inside the machine (or the thermal paste has lost its effect), which seems to result in heat building up faster. So... if temperatures were artificially regulated then that laptop would also run with limited effective CPU/GPU computation power.
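For simple temperature logging while a task runs, a minimal Python sketch along the following lines could poll nvidia-smi, the command-line tool that ships with the nVidia driver. The helper names and the 80-degree limit (taken from the rule of thumb above) are assumptions for this example. Some mobile GPUs report the temperature as not available, in which case the parser simply returns nothing for that line.

```python
import subprocess


def parse_temperatures(output):
    """Parse the nvidia-smi query output below: one integer per GPU, one per line.

    Lines that are not plain numbers (for example "[N/A]") are skipped.
    """
    temps = []
    for line in output.splitlines():
        value = line.strip()
        if value.isdigit():
            temps.append(int(value))
    return temps


def read_gpu_temperatures():
    """Ask nvidia-smi for the current GPU temperature(s) in degrees Celsius."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def too_hot(temps, limit_c=80):
    """Return the readings above limit_c (80 is a rule of thumb, not a spec value)."""
    return [t for t in temps if t > limit_c]


# Example use on a machine with the nVidia driver installed:
#   hot = too_hot(read_gpu_temperatures())
#   if hot:
#       print("GPU above limit:", hot)
```

Calling read_gpu_temperatures() once a minute while a task runs, and noting the reading around the time a task fails, would show whether the errors line up with the temperature peaks.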

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Uhm... another thing. If you would like to test further whether the problem is related to certain Nvidia video drivers... here's something you could do:

1. Download: 397.40 Vulkan developer driver: https://developer.nvidia.com/vulkan-driver

2. Download some kind of usable software for registry cleaning (for example... Eusing Free Registry Cleaner or CCleaner Free ... from their own web sites).

3. Run once again the DDU in normal mode. After it has done what it does, reboot once more.

4. Run the registry cleaner software and just scan for registry problems and clean them (there is no need to wipe your web cookies etc. if you don't want to).

5. Reboot.

6. Install Nvidia 397.40 and choose Custom installation and 'Clean install'.

7. Reboot.

8. Run the registry cleaner software again and just scan for registry problems and clean them.

9. Reboot.

10. Run Boinc.

 

Now, if there are still problems with Einstein tasks after that, then I would say the cause is something other than the GPU drivers not being set up properly.
