Recently (the past week or so), all the Einstein "Gamma-ray pulsar binary search on GPU" units fail with the error "computation fault".
I have no idea where to start looking. There are no other signs that the nVidia GeForce GT650M card isn't working correctly, and I fairly recently updated the nVidia driver (now running 391.35).
Are there any known problems with these units?
Copyright © 2024 Einstein@Home. All rights reserved.
The problem will most likely be with the driver update. You probably don't have the OpenCL libraries properly installed. Can you give full details of how you updated the driver? There are no known problems with the tasks.
Cheers,
Gary.
Richard Bertrand wrote:... I
If you go to your account page on the website and click on the link to your computers, you should see this page. There is a 'details' link to see the hardware details and a 'tasks' link to see all the tasks. YOWIE!!!! -- 97 compute errors!! :-). There are no valid nVidia GPU tasks anywhere in your complete list so, until you work out what is causing the problem, you should disable receiving tasks for the FGRPB1G gamma-ray pulsar search rather than just continuing to trash the tasks.
If you scroll the pages of compute errors and pick any one you like, just click on the task ID link for it to see what was returned to the project. There could be a lot of stuff in there but you're bound to find the word 'ERROR' somewhere and the commentary associated with that will give you a clue. Here is such an excerpt chosen at random and slightly edited to stop lines overflowing. In the very first line, I truncated the list of very long output filenames (not related to the problem) and also (later on in the snippet) the name of the checkpoint file, which won't exist anyway at the very start of crunching, so that message is informational and not a sign of a problem.
I picked a task that failed early so there wasn't a great deal to peruse. This seems to be the most common type of problem in your list of 97 anyway.
Everything seems quite normal up to the line that says "% Filling array of photon pairs". The next couple of lines identify that it's an FFT (Fast Fourier Transform) error related to a particular function (named gen_fft_execute()). Someone who knows what this function does might be able to comment on why there was an error.
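If you save a copy of a task's stderr output to a text file, the "look for the word ERROR" step described above can be roughed out in a few lines of Python. This is only a sketch: the function name and the two lines of surrounding context are my own illustrative choices, not part of BOINC or the Einstein@Home application.

```python
def find_error_lines(stderr_text, context=2):
    """Return (line_number, snippet) pairs for lines mentioning 'error'.

    The match is case-insensitive. line_number is 1-based and snippet
    contains up to `context` lines before and after the matching line.
    """
    lines = stderr_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if "ERROR" in line.upper():
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            hits.append((i + 1, "\n".join(lines[start:end])))
    return hits
```

Paste the stderr text into a string (or read it from a file) and print each snippet; the lines just before the first hit are usually the most informative ones.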
My guess is still that it's being caused by the change in driver somehow. You could either wait for a Dev to comment or take the initiative to do a clean install of the full driver package from nVidia (not from Microsoft). If you search through the forums, there are plenty of examples of problems after Microsoft updates. I don't use Windows at all so I have no first hand experience with which to guide you. I can recall lots of examples where knowledgeable people have set out the things to do to get a proper clean install. And once you get a driver that's working properly, don't allow it to be updated by Microsoft. If there's a problem that needs a driver update, get the new driver from nVidia. This seems to be the message that those 'in the know' are giving.
Cheers,
Gary.
Gary Roberts wrote:Can you
I used the nVidia GeForce Experience program that came up with the new driver.
Now, I have had some difficulty installing new nVidia drivers because I am a normal user (not an administrator) on my computer. In the past, the Experience program or the installer couldn't handle this situation gracefully, and I had to search for the driver installer program in some nVidia folder on my computer and run it myself.
This time also, it didn't start installing the first time I gave it a go, but the second time it did and it seemed to have installed the driver perfectly.
Gary Roberts wrote:... than
I don't exactly know what you mean by "trashing". I was under the impression that every unit is sent to multiple BOINC users and that if there are no results from one or more users, the results from the remaining ones will be used. And if too few remain, additional users are given the same task to get more results.
If that is so, not much should be lost if only my units are "trashed"?
On topic: I have installed the driver again (a clean install from the 391.35 installer downloaded from nVidia). Then I rebooted the PC. After that, the first 3 gamma-ray pulsar units ended in error (again). As far as I can see, the card and driver seem to be fine.
The remark just after the error is "FPU status flags: PRECISION". Is there a switch/option in the driver I need to check or set to let it use the right computation precision? Or do I need to update the BOINC software (or the Einstein software)?
Richard Bertrand wrote:I
Sorry - poor choice of words. I was just observing that all these tasks ending in failure was something that had been happening for weeks. It's always a bit unfortunate to see resources being wasted for such a long time. I wasn't trying to be critical at all. I was just trying to promote a message about the waste of resources, everybody's resources.
If it happened like that, it would be a really distressing waste of donated resources. To allow a result to be confirmed and validated, ONLY two copies are distributed initially and no more are sent unless the initial tasks fail in some way, or are not returned within the deadline. In fact these days, and for selected searches, there are strategies to send a high proportion of single tasks only, to computers known to be reliable, with occasional checks on continuing reliability.
Appropriate use of resources is quite a big deal. For the FGRPB1G GPU search, there are just two copies with a 2-week deadline. If both copies are returned and if they agree with each other, there will be NO additional copies. Every task that fails or goes missing has to be replaced and this puts extra load on the project's resources as well as consuming additional volunteer resources. This is why I try to encourage people to pay attention to what is happening on their machines.
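To make the arithmetic concrete, here is a minimal sketch of the replication rule as described above: two copies go out, and a replacement is needed only for a copy that errors out or misses the deadline. The function name and the status labels are made up for illustration, this is not the project's actual scheduler code, and it ignores the case where two returned results disagree with each other.

```python
def replacements_needed(results, quorum=2):
    """Return how many extra copies of a work unit would have to be sent.

    `results` has one entry per copy already issued (labels are illustrative):
      "valid"   - returned and usable for validation
      "error"   - returned with a compute error
      "missed"  - not returned before the deadline
      "pending" - still out with a volunteer
    """
    # Only copies that are valid or may still come back count towards the quorum.
    usable = sum(1 for r in results if r in ("valid", "pending"))
    return max(0, quorum - usable)
```

With `["valid", "valid"]` nothing more is sent, while `["valid", "error"]` costs one extra task and `["error", "missed"]` costs two.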
I don't know anything about Windows drivers. I've used only Linux for many years. All I know of Windows driver problems is what I read from others. There have been posts about the need to be extremely careful in removing remnants of previous installs before doing a fresh install. I can remember that some people have posted fairly detailed instructions on how to best achieve a final working result so all I can suggest is that you try searching for such posts. Perhaps one of the Windows users might be able to comment on whether it's likely to be a driver problem and, if so, provide a link to those instructions.
I don't think there's anything to check or update. There is a "FPU status flags:" entry right at the end of the output for every task, good or bad. My guess is that it's simply to report the condition (status) of the hardware floating point unit that is used in the second (followup) stage of crunching. This stage uses double precision to re-evaluate the 'toplist' - the ten most likely candidate signals found in the main crunching stage. Your tasks are failing right at the start of the main crunching stage so the status of the FPU is probably nothing to do with the actual problem. Just my guess. Like you, I'm an ordinary volunteer with no particular IT background. Anything I know has been gleaned from personal observation or by reading the comments of others who do seem to know :-).
And as Mark Twain said, "It's not what you don't know that gets you into trouble. It's what you know for sure that just ain't so!"
I'm sure I've managed to get myself into trouble many times before :-).
Cheers,
Gary.
The stderr output of the tasks at Gary's link shows error status -36, as discussed here:
https://devtalk.nvidia.com/default/topic/501409/cl_invalid_command_queue-error-on-clfinish-command-a-lot-of-operations-in-each-kernel-driver-crash/?offset=2
and "The network BIOS session limit was exceeded" - maybe one error is a consequence of the other?
However, as already suggested, getting the system clean again with the Display Driver Uninstaller (DDU) tool, installing the driver anew, rebooting, and monitoring temperatures on that laptop seems to be a good first step.
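For reference, status -36 is the value the Khronos OpenCL headers assign to CL_INVALID_COMMAND_QUEUE, which matches the title of the thread linked above. A small lookup like the following can help when reading other status numbers in stderr output. It covers only a handful of codes, and the helper name is my own invention rather than anything from the Einstein@Home application.

```python
# A small subset of the status codes defined in the Khronos OpenCL header (cl.h).
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}


def describe_opencl_status(code):
    """Translate a numeric OpenCL status into its symbolic name, if known."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL status {code}")
```

Calling `describe_opencl_status(-36)` gives back the symbolic name seen in the linked discussion.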
Here's the newest version of DDU (Display Driver Uninstaller):
https://www.wagnardsoft.com/forums/viewtopic.php?f=5&t=1069
I would try that in 'Normal' mode at first (not letting it boot into Safe mode in Windows 10), because in some cases choosing Safe mode has caused a boot loop, which could be a pain in the ass then and it would take some time to fix.
Thank you, Richie, for providing the link to the DDU software. I have used it as per your instructions and it seems to have worked OK. The driver was completely removed up to the point that Windows only recognized a "basic video adapter".
After that, I installed the video driver again and that also seemed to go well.
However, my first gamma-ray calculation ended in error again. The calculations went well for more than an hour and a half (the projected calculation time for the task was one hour or so, but during calculations, this changed to more than 4 hours).
The error occurred after the 402nd (of 1255) photon-load event according to the task log.
As all other programs work fine (I tried a game and Google Earth, and Seti is calculating constantly and without errors with OpenCL and CUDA applications), I am wondering why only the Einstein OpenCL application has problems.
Solling2 said something about the temperature. I am used to having my computer use its TDP-budget to the fullest: the fan is almost always on. A few months ago I noticed that the fan speed is higher than I was used to. That is a signal that I need to clean the fan and cooling system.
The problem is that I need to take the whole laptop apart just to clean the fan. Asus has closed this laptop up completely; only the disks and memory can be changed fairly easily.
As I need to take it apart, I will probably also renew the thermal paste on the CPU and video card after 5 years of operating at rather high temperatures: at this moment each core and the nVidia card are constantly throttling, with the temperature switching between 80 and 100+ degrees Celsius.
The problem is that I need this computer every day for work and private stuff. It will cost me the better part of an evening (maybe more) to disassemble, clean and reassemble the computer. I am not looking forward to this task....
Anyway, as the temperature sensors of the CPU and nVidia card seem to work OK and throttle the speed of these, I don't see why temperature would be an issue with the Einstein calculations?
About the temperatures... they seem to be high for sure. Modern hardware (GPUs and CPUs) has internal protection and will shut down if it becomes too hot. But there's more to it than that.
I have never run Seti so I can't compare with that, but my personal experience is that Nvidia GPUs like to run at ~80C at most. When temperature problems arise, you would normally start to see distorted lines on the screen, for example, IF you were using a graphical environment (desktop software, videos, web... or so).
But if you were doing nothing graphical and running just Boinc (in the background), then the screen could look okay while temperatures rise above ~80C, but the 'scientific accuracy' of CUDA or OpenCL computations could already become bad. Then errors could show up on these Einstein tasks, even if your computer could still manage to fight the heat and avoid shutting down the GPU or CPU.
IF you want to limit max temperatures to a much lower level than the CPU and GPU would settle at on their own, TThrottle might be something to try out:
https://efmer.com/
I don't know how well it works with Windows 10, but I remember running it in the past and it was able to limit temps (CPU / GPU). That method might work if you leave your laptop running these tasks while you sleep, for example. It might not work so well if you run Boinc while you are also trying to work. There's probably a load of dust inside the machine (or the thermal paste has lost its effect), which seems to make heat build up faster. So... if temperatures were artificially regulated, then that laptop would also run with limited effective CPU/GPU computing power.
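If you'd rather just watch the GPU temperature while a task runs, the nvidia-smi tool that ships with the Nvidia driver can report it. Below is a rough Python sketch that polls it once and checks the reading against the ~80C figure mentioned above. It assumes nvidia-smi is on your PATH and that your card reports a temperature at all (some older mobile GPUs may not); the function names and the 80C limit are just illustrative.

```python
import subprocess


def parse_temperatures(output):
    """Parse nvidia-smi CSV output: one integer (degrees C) per GPU, one per line."""
    temps = []
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():  # skips blank lines and entries like "[Not Supported]"
            temps.append(int(line))
    return temps


def read_gpu_temperatures():
    """Ask nvidia-smi for the current GPU temperature(s) in degrees Celsius."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def too_hot(temps, limit=80):
    """True if any GPU reads above `limit` degrees C."""
    return any(t > limit for t in temps)
```

Running `too_hot(read_gpu_temperatures())` every minute or so while a task crunches would show whether the failures line up with the hottest moments.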
Uhm... another thing. If you would like to test further whether the problem is related to certain Nvidia video drivers... here's something you could do:
1. Download: 397.40 Vulkan developer driver: https://developer.nvidia.com/vulkan-driver
2. Download some kind of usable software for registry cleaning (for example... Eusing Free Registry Cleaner or CCleaner Free ... from their own web sites).
3. Run once again the DDU in normal mode. After it has done what it does, reboot once more.
4. Run the registry cleaner software and just scan for registry problems and clean them (no need to wipe your web cookies etc. if you don't want to).
5. Reboot.
6. Install Nvidia 397.40 and choose Custom installation and 'Clean install'.
7. Reboot.
8. Run the registry cleaner software again and just scan for registry problems and clean them.
9. Reboot.
10. Run Boinc.
Now, if there are still problems with Einstein tasks, then I would say the cause is something other than the GPU drivers not setting up properly.