Several computation errors

EA4HAM
Joined: 2 Apr 20
Posts: 4
Credit: 8048670
RAC: 3363
Topic 221674

Hi,


I'm a new user on this platform (I ran SETI@Home before, but that project was hibernated) and I just started computing with my HP Envy 15-j100ns laptop (Intel i7-4702MQ processor and NVIDIA GeForce GT 750M GPU), but I observe that I have a lot of computation errors.


There are a lot of errors with "Exit status: 3 (0x00000003) Unknown error code" finishing with "/home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/fftw-3.3.4/kernel/alloc.c:269: assertion failed". Examples: TASK 936156655, TASK 936831183, TASK 936831222, TASK 937698385, TASK 937715695, TASK 937718022, TASK 937792263.


There are also a lot of errors with "Exit status: 114 (0x00000072) Unknown error code". Examples: TASK 937151418, TASK 937442723, TASK 937442728, TASK 937467455, TASK 937472061, TASK 937483658, TASK 937486015, TASK 937490927, TASK 937641298, TASK 937663281, TASK 937663286, TASK 937671306, TASK 937676433, TASK 937681870, TASK 937685180, TASK 937689643, TASK 937698360, TASK 937700025, TASK 937700466, TASK 937705468, TASK 937705474, TASK 937715687, TASK 937726462, TASK 937730165, TASK 937730169, TASK 937740573, TASK 937747782, TASK 937747787, TASK 937751633, TASK 937754384, TASK 937782345, TASK 937800153, TASK 937802460, TASK 937804679, TASK 937826782, TASK 937830720.


I also have

a task (TASK 936846478) with "Exit status: 53 (0x00000035) Unknown error code",

a task (TASK 937766392) with "Exit status: 60 (0x0000003C) Unknown error code",

two tasks (TASK 936871431, TASK 937710686) with "Exit status: 65 (0x00000041) Unknown error code",

a task (TASK 937703433) with "Exit status: 69 (0x00000045) Unknown error code",

a task (TASK 936166068) with "Exit status: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED",

a task (TASK 936632590) with "Exit status: 1024 (0x00000400) Unknown error code"

and a task (TASK 937745426) with "Exit status: 16484 (0x00004064) Unknown error code".

As error codes 3 and 114 are the most repeated, I would like to troubleshoot those ones first.

I ran the "sfc /scannow" command on the system and it found no errors.

How can I solve the error codes 3 and 114?

Many thanks!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117710315720
RAC: 35046727


ntsuba wrote:
How can I solve the error codes 3 and 114?

Hi ntsuba, Welcome to Einstein.

Unfortunately, a GT 750M is probably not capable of successfully crunching the gravitational wave GPU tasks.  You should go to your account page -> preferences -> project and find the tick boxes where the different searches for this project are listed.  Your best option would be to select only the gamma-ray pulsar GPU search (FGRPB1G).  Even that search may be quite slow for your particular GPU.

Some of your CPU cores could be used to run the GW CPU tasks but you should take care not to overload/overheat your laptop.

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0


The GT 750M (4GB) has actually crunched ten GW GPU tasks successfully (run times about 3-4 hours, though), so it is able to deal with at least some of those GPU tasks. I think the underlying problem is the Intel GPU, which seems to have been trying to crunch all kinds of GPU tasks. It can't handle GW GPU tasks at all, and with pulsar tasks its success rate has been poor (but not zero percent).

I would suggest disabling that Intel GPU from crunching anything. I'm not sure of the easiest way to do that. The chip can be disabled completely in the BIOS, but of course that's not an option if it's needed for video output.

Maybe an app_config.xml including plan_class tags could be a way to exclude the Intel GPU from crunching.

https://boinc.berkeley.edu/wiki/Client_configuration#Options
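For what it's worth, the Client_configuration page linked above also documents an &lt;exclude_gpu&gt; element in cc_config.xml that tells the client not to use a given GPU type for a given project. A minimal sketch (untested here; the project URL must match the one shown for Einstein@Home in your BOINC Manager) might look like:

```xml
<!-- cc_config.xml, placed in the BOINC data directory.
     Sketch only: check the project URL against your BOINC Manager. -->
<cc_config>
  <options>
    <exclude_gpu>
      <url>https://einsteinathome.org/</url>
      <!-- exclude only the Intel iGPU; the NVIDIA GPU keeps crunching -->
      <type>intel_gpu</type>
    </exclude_gpu>
  </options>
</cc_config>
```

After editing the file, restart BOINC or use "Options -> Read config files" so the client picks up the change.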

https://einsteinathome.org/apps.php?xml=1

edit: And now I remembered that it can be excluded by de-selecting it in the Project Preferences:

https://einsteinathome.org/account/prefs/project

EA4HAM
Joined: 2 Apr 20
Posts: 4
Credit: 8048670
RAC: 3363


Hi Gary,

It seems that you are right. 39 of 51 errors (76.5%) come from "Gravitational Wave search O2 Multi-Directional v2.07 () windows_x86_64", which is executed by the "Gravitational Wave search O2 Multi-Directional v2.07 (GWnew)" application. I can't see the GPU being used in any of these tasks. I have no valid tasks of this type.

I unchecked the "Gravitational Wave search O2 Multi-Directional" application at Account -> Preferences -> Projects -> Applications.

I also updated the NVIDIA and Intel GPU drivers.


About the "Exit status: 3 (0x00000003) Unknown error code", which occurred with tasks executed with "Gamma-ray pulsar search #5 v1.08 (FGRPSSE)", I have 5 failed tasks (TASK 936831183, TASK 936831222, TASK 937698385, TASK 937718022 and TASK 937792263) and 5 successful tasks (TASK 936828047, TASK 936831206, TASK 936831208, TASK 936831209 and TASK 936831221). That is a 50% success rate, which I don't think is a good one. In these cases the GPU was not used.



About the Intel GPU integrated in the processor, I have the following:

Five tasks of type "Gamma-ray pulsar binary search #1 on GPUs v1.22 () windows_x86_64" have failed with the following error codes:

TASK 936846478 - 53 (0x00000035) Unknown error code

TASK 937766392 - 60 (0x0000003C) Unknown error code

TASK 936871431 and TASK 937710686 - 65 (0x00000041) Unknown error code

TASK 937703433 - 69 (0x00000045) Unknown error code

These five tasks were executed with "Gamma-ray pulsar binary search #1 on GPUs v1.22 (FGRPopencl-intel_gpu)".

Only two tasks (TASK 936871577 and TASK 936871578) completed successfully, also executed with "Gamma-ray pulsar binary search #1 on GPUs v1.22 (FGRPopencl-intel_gpu)".

This is a very bad success rate, only 28.6%.


About the other NVIDIA GPU errors, I have the following tasks:

TASK 936166068 - "Exit status: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED"

TASK 936632590 - "Exit status: 1024 (0x00000400) Unknown error code"

These tasks were executed with "Gravitational Wave search O2 Multi-Directional GPU v2.07 (GW-opencl-nvidia)"

And 10 tasks (TASK 936166040, TASK 936166046, TASK 936178820, TASK 936632526, TASK 936788793, TASK 936847577, TASK 936847621, TASK 936847625, TASK 936847629 and TASK 936166068) were executed successfully. That is an acceptable success rate, 83.33%.
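As an aside, the success percentages quoted in this post can be double-checked with a few lines of Python:

```python
def success_rate(ok, failed):
    """Percentage of successful tasks, rounded to two decimals."""
    return round(100 * ok / (ok + failed), 2)

print(success_rate(5, 5))   # FGRPSSE CPU tasks: 50.0
print(success_rate(2, 5))   # Intel GPU pulsar tasks: 28.57 (~28.6%)
print(success_rate(10, 2))  # NVIDIA GW GPU tasks: 83.33
```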

Regards,

Nándor

EA4HAM
Joined: 2 Apr 20
Posts: 4
Credit: 8048670
RAC: 3363


Hi again,

The application "Gravitational Wave search O2 Multi-Directional v2.07 () windows_x86_64" has a 0% success rate on my Windows 10 x64 Home laptop, which has the Intel i7-4702MQ processor.

On my openSUSE Tumbleweed VPS server there is no issue with it; see for example TASK 936960420. There are 16 tasks with errors, all of them with the "198 (0x000000C6) EXIT_MEM_LIMIT_EXCEEDED" exit status, but that was my fault in configuring the memory usage.

In conclusion, it seems that the laptop has an issue executing "Gravitational Wave search O2 Multi-Directional" tasks.

Regards,

Nándor

EA4HAM
Joined: 2 Apr 20
Posts: 4
Credit: 8048670
RAC: 3363


Hi,

I'm back after 4 days executing BOINC on my laptop.

After deactivating the "Gravitational Wave search O2 Multi-Directional" application at Account -> Preferences -> Projects -> Applications, I had only one new error on TASK 938557944. The exit status of this task was "-1 (0xFFFFFFFF) Unknown error code".

So many thanks for the suggestions; it seems that the computation effort is now worthwhile.

Thanks!


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117710315720
RAC: 35046727


ntsuba wrote:
... I had only one new error on TASK 938557944. The exit status of this task was "-1 (0xFFFFFFFF) Unknown error code".

I had a look at this task. The exit status you saw (near the top of the page) is mostly not that useful.  If you scroll down into the actual output of the app, you can see (for example) the following, which represents the actual start of crunching.  The INFO message confirms we are 'starting from scratch' - ie. no previous checkpoint to start from.

2020-04-08 21:25:32.6707 (3336) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0, total:271, sky:1/1, f1dot:1/271

0.% --- CG:2118404 FG:123892 f1dotmin_fg:-2.955938795097e-008 df1dot_fg:8.253006451613e-014 f2dotmin_fg:-6.651808823529e-019 df2dot_fg:2.660723529412e-020 f3dotmin_fg:0 df3dot_fg:1
....INFO: Major Windows version: 6
c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........c
..........putenv 'LAL_DEBUG_LEVEL=3'
2020-04-09 16:05:47.9425 (2892) [normal]: This program is published under the GNU General Public License, version 2
2020-04-09 16:05:47.9526 (2892) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2020-04-09 16:05:47.9606 (2892) [normal]: This Einstein@home App was built at: Dec 19 2019 12:14:49

There is also other information - my interpretation (I could be wrong) is that cpt:0 means checkpoint zero out of an expected total of 271, sky 1/1 means we only have one skypoint to analyse and f1dot:1/271 means that we are starting the first of 271 calculation loops.

Within each loop, there are sub-loops (represented by the lines of dots) with a point where a new checkpoint is written (c) at the end of each line.  So having started at 2020-04-08 21:25:32.6707 (listed right at the top) calculations were stopped for the day (BOINC shut down) just before a checkpoint was going to be written.  This means that the unfinished partial line represents a small loss that will need to be recalculated at the next startup.

The calculations restarted with the "putenv" continuation of the last row of dots. It's the next day, 2020-04-09 16:05:47.9425.  If you continue reading the stuff after what I quoted above, you will find

2020-04-09 16:11:22.9149 (2892) [debug]: Successfully read checkpoint:25

This just confirms that the crunching of the task was successfully resumed from the previously saved checkpoint.

After quite a lot more processing, you get to the actual error situation and the following is the best guide to the problem (I've left out the huge path string that precedes the actual filename of the routine in which the error occurred)

OpenCLutils.c:582): clEnqueueNDRangeKernel failed: CL_INVALID_COMMAND_QUEUE

I'm not a programmer so I've no clue about what was being done at line 582 in the code but, whatever it was, your GPU couldn't handle it and therefore crashed.  Since this same point would have been traversed many times in previous loops, something glitched with your GPU on this particular loop.  No doubt, the app developer would have some idea of the cause of "CL_INVALID_COMMAND_QUEUE" but I'd be surprised if it wasn't just characterised by "hardware glitch".

I decided to go through all this in such detail so that others experiencing similar failures can know how to investigate for themselves and perhaps realise that sometimes, the problem is just that their hardware isn't quite up to the demands being placed on it.

A 'once off' situation like this can probably be ignored, but if it's happening repeatedly with different tasks, it's time to check obvious things like GPU cooling, voltage stability, overall total loading, etc, and if there is no obvious hardware deficiency, give the hardware something a bit easier to crunch :-).

Cheers,
Gary.
