Started getting computing errors

Andrei-Costin Babaua

Joined: 12 Apr 20

Posts: 3

Credit: 14551392

RAC: 0

27 Jan 2021 18:48:15 UTC

Topic 224623

Hello,

As the title says, I have recently started to get computation errors for one of my hosts:

Please take a look here

Have you got any ideas why this is happening, or is it just my old PC taking its last breath?

Thank you!

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4145

Credit: 49511725457

RAC: 35328477

27 Jan 2021 19:16:31 UTC

Message 182917

I can't say for sure, but it may be that your 1GB video card just doesn't have enough VRAM for the new tasks.

I'm blacklisted from the new tasks due to my architecture, but maybe someone else can confirm VRAM use on the new 3001L00 tasks to see if they might be using more than 1GB.

_________________________________________________________________________

archae86

Joined: 6 Dec 05

Posts: 3164

Credit: 7370961687

RAC: 2214503

27 Jan 2021 19:23:07 UTC

Message 182918

Your problem is associated with attempts to process the new flavor of GPU GRP tasks which have task names starting with LATeah3001. Your system was previously successful with task names starting with LATeah2049L.

There are several threads here on more than one forum triggered by the wide observation that all modern Nvidia cards (Volta, Turing, Ampere...) fail on this new series of tasks. Delivery of GRP GPU work to the "modern hosts" has been disabled. The threshold for "modern" as I have termed it is Compute Capability of 7.0 or higher.

However your GT 710 is not a "modern" card by the terms in use at all. Wikipedia lists it with a Compute Capability of 3.5 and as a member of the Kepler generation.

Meanwhile, my advice to you is to use your Project Preferences settings to opt out of GPU Gamma-Ray Pulsar work here at Einstein. You can monitor the forums to see whether some progress is made in issuing new applications or in ceasing the distribution of tasks which trigger this problem.

Thanks for your report.

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4145

Credit: 49511725457

RAC: 35328477

27 Jan 2021 19:44:57 UTC

Message 182921 in response to message 182917

I threw a 1060 6GB on my Ubuntu testbench, and during these 3001L00 tasks, they do use a bit more VRAM than the older files needed, at about 785MB. and with running the desktop environment on the same card, it's using about 1GB total.

but this is on linux. if the Windows app needs slightly more, or running the windows desktop needs slightly more than required for my linux desktop, I could see you bumping into that 1GB limit.

just spitballing though.
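For anyone who wants to check the VRAM figures on their own Nvidia host, here is a minimal sketch. It assumes `nvidia-smi` is installed and on the PATH; the function names are made up for illustration, and the numbers in the usage note are placeholders, not measurements from this thread.

```python
import subprocess


def parse_memory_csv(text):
    """Parse lines of 'used, total' (MiB), one line per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total VRAM in MiB for every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)


def headroom_mib(used, total):
    """Free VRAM left on the card, in MiB."""
    return total - used
```

Calling `query_gpu_memory()` while a task is running returns one `(used, total)` pair per GPU. A card that reports something like `(1001, 1024)` would have very little headroom left, which is the situation suspected here.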

_________________________________________________________________________

Andrei-Costin Babaua

Joined: 12 Apr 20

Posts: 3

Credit: 14551392

RAC: 0

27 Jan 2021 20:38:28 UTC

Message 182922

Hello,

Thank you all for your support!

One more question on my side: if I were bumping into the 1GB limit, wouldn't the tasks have different running times? I see that all of them fail after about 22,000 seconds. Looking at the reported stderr, it looks like every task completes the main analysis and then crashes at the very last step (at least that's how I interpreted it; please correct me if I'm wrong).

Thank you!

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4145

Credit: 49511725457

RAC: 35328477

27 Jan 2021 21:21:44 UTC

Message 182923

others have noticed that the final phase of computation (89.999%-100%) does seem to take longer on these new tasks. I think it's doing some recalculation in double precision during this time. the best clue is that you get error code -36 in your stderr.txt file, but a dev would have to decode what that error code refers to.
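On the generic meaning of that status: in the Khronos OpenCL headers, -36 is `CL_INVALID_COMMAND_QUEUE`. The name alone does not say why the queue became invalid, so a developer would still have to find the root cause in the application. A small sketch for pulling the status out of a stderr line and naming it follows; the table is only a partial excerpt of the standard codes, and the function name is made up for illustration.

```python
import re

# Partial excerpt of the standard OpenCL status codes (values from cl.h).
OPENCL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
    -61: "CL_INVALID_BUFFER_SIZE",
}

STATUS_RE = re.compile(r"status=(-?\d+)")


def decode_status(line):
    """Return (code, name) for a 'status=N' log line, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, OPENCL_STATUS_NAMES.get(code, "not in this excerpt")
```

Feeding it a `clFinish failed. status=-36` line from stderr.txt gives `(-36, "CL_INVALID_COMMAND_QUEUE")`.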

a quick check on that GPU model GT 710 1GB does reveal that it is capable of DP, and even though there were several slightly different models of GT 710 released, your system is at least self identifying it as having CC 3.5 (via your last sched request log).

it's really hard to say if this is just yet another issue with these new tasks manifesting in a new way due to marginally capable hardware, or a VRAM limit, or something else entirely. your GPU def doesn't like these tasks though.

if you feel like playing around with things, you could try newer or even older drivers to see if it makes any difference, but I don't have high hopes that you'd see different results.

_________________________________________________________________________

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119364752614

RAC: 25847594

27 Jan 2021 21:28:46 UTC

Message 182924

Andrei-Costin Babaua wrote:

... Have you got any ideas why this is happening ...

In addition to what the others have mentioned, I took a look at the stderr output returned to the project for one of your failed tasks and compared it to what I get for one of mine that doesn't have the problem. My GPU is old too, but is rather more capable than yours. It's an AMD HD7850 that I bought in 2013 and it still runs very well.

Firstly, here is the last checkpoint written on your failed task, right near the bottom of the log. Up to that point you can see all the previous checkpoints. The final number (in this case 920) is the total number of 'skypoints' processed to that point. You can see this number steadily increasing by about 27 for each checkpoint written. This is all perfectly normal.

% C 0 920

Following this last checkpoint the error message is

ERROR: /home/bema/source/fermilat/src/bridge_fft_clfft.c:1073: clFinish failed. status=-36

For comparison, my GPU's last checkpoint was

% C 0 937

which was then followed by a very normal output line when there isn't any problem.

FPU status flags: PRECISION

Because the total numbers of skypoints in the two examples are very similar, my guess is that your task had successfully completed processing the data and that it was the transition to the 'followup' stage, where the 'toplist' of candidate signals is recalculated in double precision, that caused the issue. This is quite different from what the modern nvidia GPUs are having an issue with.
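The comparison above can be repeated on any stderr output with a few lines of Python. This is only a sketch: it assumes checkpoint lines always look like `% C 0 920` with the skypoint count as the last number, as in the examples above. The function name is made up, and the first checkpoint line in the sample is illustrative rather than copied from a real log.

```python
import re

# Checkpoint lines look like "% C 0 920"; the last number is the
# running total of skypoints processed so far.
CHECKPOINT_RE = re.compile(r"^% C (\d+) (\d+)$")


def summarize_stderr(text):
    """Return (skypoints at last checkpoint, first ERROR line)."""
    last_skypoints = None
    first_error = None
    for raw in text.splitlines():
        line = raw.strip()
        match = CHECKPOINT_RE.match(line)
        if match:
            last_skypoints = int(match.group(2))
        elif line.startswith("ERROR:") and first_error is None:
            first_error = line
    return last_skypoints, first_error


SAMPLE = """\
% C 0 893
% C 0 920
ERROR: /home/bema/source/fermilat/src/bridge_fft_clfft.c:1073: clFinish failed. status=-36
"""
```

`summarize_stderr(SAMPLE)` returns 920 together with the `clFinish` error line. A high skypoint count followed directly by an error points at a failure after the main search, while a low count points at a failure early in the run.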

I know none of this helps you resolve your problem but you can take consolation from the fact that you have probably given the Devs something else to ponder while they try to sort out the more pressing modern nvidia GPU problem :-).

My guess is that if there isn't a quick resolution as to why LATeah3001L based tasks are having problems for some, there might be a fairly prompt reverting to the earlier LATeah2nnnn based tasks to buy some time in sorting it all out. If that happens, there should be a note (maybe Technical News) to alert you to try again.

I'm actually quite surprised that a GT 710 was able to do the tasks in the first place. I don't know how you had the patience to wait nearly 9 hours to see a result though :-).

Cheers,
Gary.

Andrei-Costin Babaua

Joined: 12 Apr 20

Posts: 3

Credit: 14551392

RAC: 0

27 Jan 2021 23:25:16 UTC

Message 182928

Thank you all again for the answers!

Ian&Steve C. wrote:

[...]you could try newer or even older drivers to see if it makes any difference, but I don't have high hopes that you'd see different results.

I have already tried that before posting here. Sadly, as you said, it didn't help at all.

Gary Roberts wrote:

[...]you have probably given the Devs something else to ponder while they try to sort out the more pressing modern nvidia GPU problem :-).

That wasn't my intention at all :D. All I wanted was to find out if the problem is my PC, which kind of is, because it's old.

Gary Roberts wrote:

[...]I don't know how you had the patience to wait nearly 9 hours to see a result though :-).

Given that I interact with that PC once, maybe twice a month (to start it up after a power outage, or for driver updates, but nothing more than that), long running times are not that big of a problem for me, as long as tasks are running normally. It sits nicely in a corner, forgotten by the world and aging silently. Well, as silent as an overheated CPU can be :).

Stefan Ledwina

Joined: 23 Oct 05

Posts: 17

Credit: 2624020368

RAC: 1286010

28 Jan 2021 8:41:58 UTC

Message 182940

I am also starting to get computation errors on my computer with a 1080 Ti - https://einsteinathome.org/de/host/12819241/tasks/6/0?page=18

As far as I can tell from looking through a few error messages, they all seem to fail with the same error.

Stderr output wrote:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 28 (0x1c, -228)</message>
<stderr_txt>
08:31:54 (293563): [normal]: This Einstein@home App was built at: Feb 15 2017 10:50:14
08:31:54 (293563): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia'.
08:31:54 (293563): [debug]: 1e+16 fp, 7.2e+09 fp/s, 1462114 s, 406h08m33s66
08:31:54 (293563): [normal]: % CPU usage: 1.000000, GPU usage: 0.330000
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3001L00.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 524.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.516443855e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3001L00_0532_13747548.dat --debug 0 --device 0 -o LATeah3001L00_532.0_0_0.0_13747548_0_0.out
output files: 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah3001L00_532.0_0_0.0_13747548_0_0' 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah3001L00_532.0_0_0.0_13747548_0_1'
08:31:54 (293563): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
08:31:54 (293563): [debug]: glibc version/release: 2.31/stable
08:31:54 (293563): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0x2db2120 , 0x2db0560]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GTX 1080 Ti" by: NVIDIA Corporation
Max allocation limit: 2929557504
Global mem size: 11718230016
OpenCL device has FP64 support
read_checkpoint(): Couldn't open file 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out.cpt': No such file or directory (2)
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% C 0 63
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 730567168
08:33:24 (293563): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out.cohfu' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out.cohfu' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out.cohfu' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out.cohfu' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out.cohfu' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out.cohfu' nicht möglich: Datei oder Verzeichnis nicht gefunden
mv: Aufruf von stat für 'LATeah3001L00_532.0_0_0.0_13747548_0_0.out.cohfu' nicht möglich: Datei oder Verzeichnis nicht gefunden
08:33:35 (293563): [normal]: done. calling boinc_finish(28).
08:33:35 (293563): called boinc_finish
Warning: Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).

</stderr_txt>
]]>

Maybe it can help the devs to figure out what's going wrong with the new set...
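The app prints the device memory figures in bytes, which are hard to read at a glance. Here is a small sketch that pulls the two memory lines out of a stderr dump and converts them; the function names are made up for illustration, and the sample is just the two lines from the log above.

```python
import re

FIELD_RE = re.compile(
    r"^(Max allocation limit|Global mem size):\s*(\d+)\s*$",
    re.MULTILINE,
)


def device_memory(text):
    """Map each reported memory field to its size in bytes."""
    return {name: int(value) for name, value in FIELD_RE.findall(text)}


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


SAMPLE = """\
Max allocation limit: 2929557504
Global mem size: 11718230016
"""
```

For these values the global memory comes to about 10.91 GiB, and the allocation limit is exactly one quarter of it. Going by those numbers alone, this card does not look short on VRAM the way a 1GB card might be.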

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119364752614

RAC: 25847594

28 Jan 2021 10:36:41 UTC

Message 182949 in response to message 182940

Stefan Ledwina wrote:

I am also starting to get computation errors on my computer with a 1080 Ti - https://einsteinathome.org/de/host/12819241/tasks/6/0?page=18

This is a different problem to the one being discussed here. I don't own any of these recent model nvidia GPUs but I imagine you will find your problem is the same as has been discussed in this different thread over the last few days. You could add your information there if you wish.

The Devs are already working on a solution.

Cheers,
Gary.
