Multiple Computing Errors with LAT computations under GPU

PulsarOperator
Joined: 29 Jun 20
Posts: 4
Credit: 22411437
RAC: 1685
Topic 223222

Hello all,

I occasionally (rarely, but it does happen) get computing errors while calculating LAT samples on an NVIDIA GeForce GTX GPU. Once one calculation fails, all subsequent LAT sample calculations immediately fail as well. Wouldn't it be better to stop all calculation when a fault has occurred and to request the user to reboot the PC? Sure, the better solution would be to recover the GPU but I assume this to be difficult.

Yours, PulsarOperator.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5849
Credit: 110013420402
RAC: 23386687

Hi PulsarOperator, Welcome to Einstein!

PulsarOperator wrote:
... Wouldn't it be better to stop all calculation when a fault has occurred and to request the user to reboot the PC? Sure, the better solution would be to recover the GPU but I assume this to be difficult

What if the problem has nothing to do with the GPU itself?

Several years ago, (pre-GPU days) I had a similar problem with the CPU version of the gamma-ray pulsar search.  Occasionally, a task would fail and immediately all other tasks would also fail.  The event log would declare that a data file that all tasks depended on had failed an MD5 checksum test.  Funny thing was, an independent test of the file showed the file was actually quite valid.  To cut a long story short, it turned out that a memory module in that machine had a single memory fault at a particular location that BOINC would occasionally hit when it was performing the MD5 checks of data files at the start of each new task.  Replacing that stick completely solved the problem.  At first, rebooting seemed to work but it was just kicking the can down the road until the problem re-occurred.
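For anyone who wants to repeat that kind of independent check, here is a minimal Python sketch. It only uses the standard hashlib module; the function names and any file path or checksum you feed it are hypothetical placeholders, not anything BOINC itself provides. It is an illustration of the idea, not how BOINC does its own check.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files do not need to fit in RAM.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's MD5 against an expected hex string (hypothetical input)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

If a file that reports a checksum failure in the event log passes a check like this on another machine, or passes on some runs and fails on others, that would hint at the hardware rather than the file.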

So, my guess is that you may have some sort of transient hardware issue.  It's only a rough guess, since you give no details that would allow anything better.  Your computers are hidden (default preference setting that you can adjust) so nobody but you can see any hardware details.

Have you looked in BOINC's event log to see what gets reported there?

Have you gone to the website and looked at one of your failed tasks?  If you click the Task ID link you can see exactly what gets reported about the task and it's possible that the information there will point to the actual cause of the problem.  Certainly, if your computers weren't hidden or if you provided a link to a failed task, others could very quickly check this for you.

Cheers,
Gary.

PulsarOperator

Hi Gary,

Thanks a lot for your fast response. Here are links to the reports of two tasks that failed during calculation yesterday and today.

https://einsteinathome.org/de/task/985901065

https://einsteinathome.org/de/task/985901083

I have some other tasks that failed immediately after starting, as a consequence of the tasks above. Here is one example:

https://einsteinathome.org/de/task/985895283

I can see that all reports begin with a network problem: "Netzwerkzugriff verweigert." in German ("Network access denied."). Could this cause the problem? I do have sporadic WiFi dropouts on my notebook that I have not been able to resolve so far. It seems to be a hardware issue, since it happens under both Windows 10 and Debian Linux on this machine.

Cheers, PulsarOperator

Gary Roberts

PulsarOperator wrote:
I can see that in all reports a network problem is reported in the beginning: "Netzwerkzugriff verweigert." in German. Could this cause the problem?

Probably quite unlikely.  BOINC deals with communications whilst the app just does its thing.  The app wouldn't care about a network issue.

What tends to happen is that the app produces some sort of error code specific to it which Windows then tries to interpret as if it were a Windows error code.  A classic example I used to see quite often was that tasks failed because, "the printer is out of paper" :-).

What you have to do is scroll through to the point in the stderr.txt output that the app sends back to the project and find the actual failure point, usually right near the end.  For the link you gave, here is what I found.



% Binary point 1212/1631
% Starting semicoherent search over f0 and f1.
% nf1dots: 41  df1dot: 2.512676418e-015  f1dot_start: -1e-013  f1dot_band: 1e-013
% Filling array of photon pairs
Malloc failed in prepare_ts_2_phase_diff_sorted (1125512)
ERROR: prepare_ts_2_phase_diff_sorted() returned with error 605903168
09:19:05 (2496): [CRITICAL]: ERROR: MAIN() returned with error '1'
FPU status flags:  PRECISION
09:19:17 (2496): [normal]: done. calling boinc_finish(65).
09:19:17 (2496): called boinc_finish

I'm not a programmer or even a hardware expert but to me the point of failure seems to be involved with the malloc() (memory allocation) function.  In other words, the app had requested extra memory to create an array of photon pairs and something went wrong.  This seems to point to a transient memory issue so that is where you probably should start.  You could try running some memory testing app.  I use memtest86 under Linux.
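If scrolling through long stderr.txt output by hand gets tedious, the search for the failure point could be sketched roughly like this in Python. The function names and the keyword list are my own assumptions, chosen to match the excerpt above; other apps may word their errors differently, so treat it as a starting point only.

```python
def find_failure_lines(stderr_text, keywords=("error", "failed", "critical")):
    """Return (line_number, line) pairs whose text contains any keyword.

    The keyword list is an assumption based on the excerpt in this thread.
    Matching is case-insensitive; line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(stderr_text.splitlines(), start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line.strip()))
    return hits


def first_failure(stderr_text):
    """Return the earliest suspicious line, or None if nothing matched."""
    hits = find_failure_lines(stderr_text)
    return hits[0] if hits else None
```

Run against the excerpt above, the earliest hit is the "Malloc failed" line, which is the one that matters; the later ERROR lines are just the consequences being reported on the way out.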

If your memory is removable, try unplugging the module(s), blowing the slot(s) with compressed air and reseating, perhaps a couple of times to ensure good contact.  I've even had success with cleaning the gold plated contacts as well.  Ultimately, swapping modules may be necessary to prove exactly where the problem lies.

Good luck with sorting it out!

Cheers,
Gary.
