Several computation errors

ntsuba
Joined: 2 Apr 20
Posts: 4
Credit: 8006397
RAC: 114
Topic 221674

Hi,

I'm a new user on this platform (I ran SETI@Home before, but that project has been hibernated) and I just started computing with my HP Envy 15-j100ns laptop (Intel i7-4702MQ processor and NVIDIA GeForce GT 750M GPU). However, I'm seeing a lot of computation errors.

There are a lot of errors with "Exit status: 3 (0x00000003) Unknown error code" that finish with "/home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/fftw-3.3.4/kernel/alloc.c:269: assertion failed". Examples: TASK 936156655, TASK 936831183, TASK 936831222, TASK 937698385, TASK 937715695, TASK 937718022, TASK 937792263.

There are also a lot of errors with "Exit status: 114 (0x00000072) Unknown error code". Examples: TASK 937151418, TASK 937442723, TASK 937442728, TASK 937467455, TASK 937472061, TASK 937483658, TASK 937486015, TASK 937490927, TASK 937641298, TASK 937663281, TASK 937663286, TASK 937671306, TASK 937676433, TASK 937681870, TASK 937685180, TASK 937689643, TASK 937698360, TASK 937700025, TASK 937700466, TASK 937705468, TASK 937705474, TASK 937715687, TASK 937726462, TASK 937730165, TASK 937730169, TASK 937740573, TASK 937747782, TASK 937747787, TASK 937751633, TASK 937754384, TASK 937782345, TASK 937800153, TASK 937802460, TASK 937804679, TASK 937826782, TASK 937830720.

I also have:

a task (TASK 936846478) with "Exit status: 53 (0x00000035) Unknown error code",

a task (TASK 937766392) with "Exit status: 60 (0x0000003C) Unknown error code",

two tasks (TASK 936871431, TASK 937710686) with "Exit status: 65 (0x00000041) Unknown error code",

a task (TASK 937703433) with "Exit status: 69 (0x00000045) Unknown error code",

a task (TASK 936166068) with "Exit status: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED",

a task (TASK 936632590) with "Exit status: 1024 (0x00000400) Unknown error code"

and a task (TASK 937745426) with "Exit status: 16484 (0x00004064) Unknown error code".

As error codes 3 and 114 are the most frequently repeated, I would like to troubleshoot those first.

I ran the "sfc /scannow" command on the system and it reported no errors.

How can I solve the error codes 3 and 114?

Many thanks!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109398470003
RAC: 35712508

ntsuba wrote:
How can I solve the error codes 3 and 114?

Hi ntsuba, Welcome to Einstein.

Unfortunately, a GT 750M is probably not capable of successfully crunching the gravitational wave GPU tasks.  You should go to your account page -> preferences -> project and find the tick boxes where the different searches for this project are listed.  Your best option would be to select only the gamma-ray pulsar GPU search (FGRPB1G).  Even that search may be quite slow for your particular GPU.

Some of your CPU cores could be used to run the GW CPU tasks but you should take care not to overload/overheat your laptop.

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

The GT 750M (4GB) has actually crunched ten GW GPU tasks successfully (with run times of about 3-4 hours, though), so it is able to deal with at least some of those GPU tasks. I think the underlying problem is the Intel GPU, which seems to have been trying to crunch all kinds of GPU tasks. It can't handle GW GPU tasks at all, and with pulsar tasks its success rate has been poor (but not zero percent).

I would suggest disabling that Intel GPU from crunching anything. I'm not sure what the easiest way to do that is. The chip can be disabled completely in the BIOS, but of course that's not an option if it's needed for video output.

Maybe an app_config.xml with plan_class tags in it could be a way to exclude the Intel GPU from crunching. The client configuration page linked below also documents an <exclude_gpu> option for cc_config.xml; there's a rough sketch of that after the links.

https://boinc.berkeley.edu/wiki/Client_configuration#Options

https://einsteinathome.org/apps.php?xml=1
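
For example, something along these lines in cc_config.xml (in the BOINC data directory, usually C:\ProgramData\BOINC on Windows) should stop the Intel GPU from being used for this project. This is only a sketch based on the exclude_gpu option described on that wiki page, not something I've tested on your machine, and the <url> must match the project URL exactly as BOINC Manager shows it (I've assumed https://einsteinathome.org/ here):

<cc_config>
  <options>
    <exclude_gpu>
      <url>https://einsteinathome.org/</url>
      <type>intel_gpu</type>
    </exclude_gpu>
  </options>
</cc_config>

After saving the file, re-read the config files from the BOINC Manager Options menu (or restart the client) so the exclusion takes effect. Since this laptop has both an NVIDIA and an Intel GPU, the <type> element is needed so that only the Intel chip is excluded.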

edit: And now I remembered that it can be excluded by de-selecting it in the Project Preferences:

https://einsteinathome.org/account/prefs/project

ntsuba
Joined: 2 Apr 20
Posts: 4
Credit: 8006397
RAC: 114

Hi Gary,

It seems that you are right. 39 of 51 errors (76.5%) come from "Gravitational Wave search O2 Multi-Directional v2.07 () windows_x86_64", which is run by the "Gravitational Wave search O2 Multi-Directional v2.07 (GWnew)" application. I can't see the GPU being used in any of these tasks, and I have no valid tasks of this type.

I unchecked the "Gravitational Wave search O2 Multi-Directional" application at Account -> Preferences -> Projects -> Applications.

I also updated the NVIDIA and Intel GPU drivers.


About the "Exit status: 3 (0x00000003) Unknown error code" which was executed with "Gamma-ray pulsar search #5 v1.08 (FGRPSSE)" I have 5 failed tasks (TASK 936831183TASK 936831222TASK 937698385TASK 937718022 and TASK 937792263) and 5 correct tasks (TASK 936828047TASK 936831206TASK 936831208TASK 936831209 and TASK 936831221). That is a 50% success rate, I think is not a good one. In these cases the GPU was not been used.

 


Regarding the Intel GPU integrated in the processor, I have the following:

Five tasks of type "Gamma-ray pulsar binary search #1 on GPUs v1.22 () windows_x86_64" have failed with the following error codes:

TASK 936846478 - 53 (0x00000035) Unknown error code

TASK 937766392 - 60 (0x0000003C) Unknown error code

TASK 936871431 and TASK 937710686 - 65 (0x00000041) Unknown error code

TASK 937703433 - 69 (0x00000045) Unknown error code

These five tasks were executed with "Gamma-ray pulsar binary search #1 on GPUs v1.22 (FGRPopencl-intel_gpu)".

Only two tasks (TASK 936871577 and TASK 936871578) completed successfully, and they were also executed with "Gamma-ray pulsar binary search #1 on GPUs v1.22 (FGRPopencl-intel_gpu)".

This is a very bad success rate, only 28.6%.


Regarding the other NVIDIA GPU errors, I have the following tasks:

TASK 936166068 - "Exit status: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED"

TASK 936632590 - "Exit status: 1024 (0x00000400) Unknown error code"

These tasks were executed with "Gravitational Wave search O2 Multi-Directional GPU v2.07 (GW-opencl-nvidia)".

And 10 tasks (TASK 936166040, TASK 936166046, TASK 936178820, TASK 936632526, TASK 936788793, TASK 936847577, TASK 936847621, TASK 936847625, TASK 936847629 and TASK 936166068) were executed successfully. That is an acceptable success rate, 83.33%.

Regards,

Nándor

ntsuba
Joined: 2 Apr 20
Posts: 4
Credit: 8006397
RAC: 114

Hi again,

The application "Gravitational Wave search O2 Multi-Directional v2.07 () windows_x86_64" has a 0% success rate on my Windows 10 x64 Home laptop which has the Intel i7-4702MQ processor.

On my openSUSE Tumbleweed VPS server there is no issue with it; see TASK 936960420 for example. There are 16 tasks with errors there, but all of them have the "198 (0x000000C6) EXIT_MEM_LIMIT_EXCEEDED" exit status, and that was my fault when configuring the memory usage.

In conclusion, it seems that the laptop has an issue executing "Gravitational Wave search O2 Multi-Directional" tasks.

Regards,

Nándor

ntsuba
Joined: 2 Apr 20
Posts: 4
Credit: 8006397
RAC: 114

Hi,

I'm back after four days of running BOINC on my laptop.

After deactivating the "Gravitational Wave search O2 Multi-Directional" application at Account -> Preferences -> Projects -> Applications, I had only one new error, on TASK 938557944. The exit status of this task was "-1 (0xFFFFFFFF) Unknown error code".

So many thanks for the suggestions; it seems that the computation effort is now worthwhile.

Thanks!

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109398470003
RAC: 35712508

ntsuba wrote:
... I had only one new error on TASK 938557944. The exit status of this task was "-1 (0xFFFFFFFF) Unknown error code".

I had a look at this task. The exit status you saw (near top of page) is mostly not that useful.  If you scroll down into the actual output of the app, you can see (for example) the following, which represents the actual start of crunching for the app.  The INFO message confirms we are `starting from scratch' - ie. no previous checkpoint to start from.

2020-04-08 21:25:32.6707 (3336) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0, total:271, sky:1/1, f1dot:1/271

0.% --- CG:2118404 FG:123892 f1dotmin_fg:-2.955938795097e-008 df1dot_fg:8.253006451613e-014 f2dotmin_fg:-6.651808823529e-019 df2dot_fg:2.660723529412e-020 f3dotmin_fg:0 df3dot_fg:1
....INFO: Major Windows version: 6
c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........putenv 'LAL_DEBUG_LEVEL=3'
2020-04-09 16:05:47.9425 (2892) [normal]: This program is published under the GNU General Public License, version 2
2020-04-09 16:05:47.9526 (2892) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2020-04-09 16:05:47.9606 (2892) [normal]: This Einstein@home App was built at: Dec 19 2019 12:14:49

There is also other information - my interpretation (I could be wrong) is that cpt:0 means checkpoint zero out of an expected total of 271, sky 1/1 means we only have one skypoint to analyse and f1dot:1/271 means that we are starting the first of 271 calculation loops.

Within each loop, there are sub-loops (represented by the lines of dots) with a point where a new checkpoint is written (c) at the end of each line.  So having started at 2020-04-08 21:25:32.6707 (listed right at the top) calculations were stopped for the day (BOINC shut down) just before a checkpoint was going to be written.  This means that the unfinished partial line represents a small loss that will need to be recalculated at the next startup.

The calculations restarted with the "putenv" continuation of the last row of dots. It's the next day, 2020-04-09 16:05:47.9425.  If you continue reading the stuff after what I quoted above, you will find

2020-04-09 16:11:22.9149 (2892) [debug]: Successfully read checkpoint:25

This just confirms that the crunching of the task was successfully resumed from the previously saved checkpoint.

After quite a lot more processing, you get to the actual error situation and the following is the best guide to the problem (I've left out the huge path string that precedes the actual filename of the routine in which the error occurred)

OpenCLutils.c:582): clEnqueueNDRangeKernel failed: CL_INVALID_COMMAND_QUEUE

I'm not a programmer so I've no clue about what was being done at line 582 in the code but, whatever it was, your GPU couldn't handle it and therefore crashed.  Since this same point would have been traversed many times in previous loops, something glitched with your GPU on this particular loop.  No doubt, the app developer would have some idea of the cause of "CL_INVALID_COMMAND_QUEUE" but I'd be surprised if it wasn't just characterised by "hardware glitch".

I decided to go through all this in such detail so that others experiencing similar failures can know how to investigate for themselves and perhaps realise that sometimes, the problem is just that their hardware isn't quite up to the demands being placed on it.

A 'once off' situation like this can probably be ignored, but if it's happening repeatedly with different tasks, it's time to check obvious things like GPU cooling, voltage stability, overall total loading, etc, and if there is no obvious hardware deficiency, give the hardware something a bit easier to crunch :-).

Cheers,
Gary.
