Computation Error: Output files absent

Dr_Mabuse
Joined: 9 May 05
Posts: 11
Credit: 9333614
RAC: 446
Topic 223735

Hello experts

I got this kind of error again:

17.10.2020 14:25:32 | Einstein@Home | Computation for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1 finished
17.10.2020 14:25:32 | Einstein@Home | Output file h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1_0 for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1 absent
17.10.2020 14:25:32 | Einstein@Home | Output file h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1_1 for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1 absent
17.10.2020 14:25:32 | Einstein@Home | Output file h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1_2 for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1 absent
17.10.2020 14:25:45 | Einstein@Home | Starting task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2
17.10.2020 14:25:46 | Einstein@Home | Started upload of h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1_3
17.10.2020 14:25:48 | Einstein@Home | Finished upload of h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1_3
17.10.2020 14:25:55 |  | Suspending computation - CPU is busy
17.10.2020 14:26:05 |  | Resuming computation
17.10.2020 14:26:46 | Einstein@Home | Project requested delay of 60 seconds
17.10.2020 14:31:25 | Einstein@Home | Computation for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2 finished
17.10.2020 14:31:25 | Einstein@Home | Output file h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2_0 for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2 absent
17.10.2020 14:31:25 | Einstein@Home | Output file h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2_1 for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2 absent
17.10.2020 14:31:25 | Einstein@Home | Output file h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2_2 for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2 absent
17.10.2020 14:31:47 | Einstein@Home | Starting task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1591_2
17.10.2020 14:31:49 | Einstein@Home | Started upload of h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2_3
17.10.2020 14:31:51 | Einstein@Home | Finished upload of h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1592_2_3
17.10.2020 14:33:10 | Einstein@Home | Project requested delay of 60 seconds
17.10.2020 14:35:46 | Einstein@Home | Computation for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1591_2 finished
17.10.2020 14:35:46 | Einstein@Home | Output file h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1591_2_0 for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1591_2 absent
17.10.2020 14:35:46 | Einstein@Home | Output file h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1591_2_1 for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1591_2 absent
17.10.2020 14:35:46 | Einstein@Home | Output file h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1591_2_2 for task h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1591_2 absent

What can I do to prevent this error?

Please help

Thanks

Jochen


Harri Liljeroos
Joined: 10 Dec 05
Posts: 3610
Credit: 2902305583
RAC: 1039112


That is not the error itself; it is the result of the error. Because the task failed, its result files are absent. You have to look at the stderr output to find the actual error. It can be found here on the website, on your account page where your tasks are listed.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0


Hi!

It is because of this line: 'OpenCL Device used for Search/Recalc and/or semi coherent step: 'GeForce GTX 1050 (Platform: NVIDIA CUDA, global memory: 2048 MiB)''. All the tasks that failed had a so-called DF of 0.50. They required more memory than your 2GB GPU is able to provide. Some GW GPU tasks with a lower DF could run on a 2GB card, but a user isn't able to choose which tasks, by DF, the project server will send.

A fail-safe solution would be to run only "Gamma-ray pulsar binary search #1 on GPUs" with a 2GB card.

Paul Vleugels
Joined: 1 Mar 05
Posts: 12
Credit: 32007584
RAC: 679


Hello Harri,

For the last 2 weeks I've also been facing the same problem mentioned in this thread.

Many WUs were cancelled after about 2 minutes of CPU time. Based on your comment I had a look at the stderr output, but there are a lot of errors mentioned (like an unknown error with exit code 1024) and I have no clue what they mean or how to correct them. An insider or developer would surely understand, but I'm not one.

This is the stderr of one of the failing WUs. Can you please have a look?

Task 1015155880

Name: h1_0397.30_O2C02Cl4In0__O2MDFS2_Spotlight_397.80Hz_1937_0
Workunit ID: 493618356
Created: 7 Oct 2020 9:11:55 UTC
Sent: 12 Oct 2020 6:35:52 UTC
Report deadline: 19 Oct 2020 6:35:52 UTC
Received: 12 Oct 2020 15:38:52 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 1024 (0x00000400) Unknown error code
Computer: 12535157
Run time (sec): 107.04
CPU time (sec): 102.80
Peak working set size (MB): 373.27
Peak swap size (MB): 2075.45
Peak disk usage (MB): 0.02
Validate state: Invalid
Granted credit: 0
Application: Gravitational Wave search O2 Multi-Directional GPU v2.07 (GW-opencl-nvidia) windows_x86_64


Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 1024 (0x400)</message>
<stderr_txt>
putenv 'LAL_DEBUG_LEVEL=3'
2020-10-12 17:32:16.1058 (63356) [normal]: This program is published under the GNU General Public License, version 2
2020-10-12 17:32:16.1058 (63356) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2020-10-12 17:32:16.1138 (63356) [normal]: This Einstein@home App was built at: Dec 19 2019 12:14:49

2020-10-12 17:32:16.1159 (63356) [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_O2MDF_2.07_windows_x86_64__GW-opencl-nvidia.exe'.
Activated exception handling...
[DEBUG} GPU type: 1
[DEBUG} got GPU info from BOINC
[DEBUG} got VendorID 4318
2020-10-12 17:32:16.4763 (63356) [debug]: BSGL output files
2020-10-12 17:32:16.4883 (63356) [debug]: Flags: LAL_DEBUG, OPTIMIZE, HS_OPTIMIZATION, GC_SSE2_OPT, X64, SSE, SSE2, GNUC X86 GNUX86
2020-10-12 17:32:16.4883 (63356) [debug]: Set up communication with graphics process.

DEPRECATION WARNING: program has invoked obsolete function XLALGetVersionString(). Please see XLALVCSInfoString() for information about a replacement.
Code-version: %% LAL: 6.19.2.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)
%% LALPulsar: 1.17.1.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)
%% LALApps: 6.23.0.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)

2020-10-12 17:32:17.5658 (63356) [normal]: Reading input data ... 2020-10-12 17:33:32.9681 (63356) [normal]: Search FstatMethod used: 'ResampOpenCL'
2020-10-12 17:33:32.9681 (63356) [normal]: Recalc FstatMethod used: 'DemodSSE'
2020-10-12 17:33:32.9701 (63356) [normal]: OpenCL Device used for Search/Recalc and/or semi coherent step: 'GeForce GTX 950M (Platform: NVIDIA CUDA, global memory: 2048 MiB)'
2020-10-12 17:33:32.9701 (63356) [normal]: OpenCL version is used for the semi-coherent step!
2020-10-12 17:33:59.8248 (63356) [normal]: Number of segments: 12, total number of SFTs in segments: 10192
done.
% --- GPS reference time = 1177858472.0000 , GPS data mid time = 1177858472.0000
2020-10-12 17:33:59.8896 (63356) [normal]: dFreqStack = 2.251046e-007, df1dot = 5.685400e-013, df2dot = 4.020648e-019, df3dot = 0.000000e+000
% --- Setup, N = 12, T = 1296000 s, Tobs = 19750204 s, gammaRefine = 9, gamma2Refine = 23, gamma3Refine = 1

DEPRECATION WARNING: program has invoked obsolete function InitDopplerSkyScan(). Please see XLALInitDopplerSkyScan() for information about a replacement.
2020-10-12 17:33:59.9066 (63356) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0, total:26, sky:1/1, f1dot:1/26

0.% --- CG:2669916 FG:222119 f1dotmin_fg:-3.753135252444e-008 df1dot_fg:6.317111111111e-014 f2dotmin_fg:-1.922918608696e-019 df2dot_fg:1.748107826087e-020 f3dotmin_fg:0 df3dot_fg:1
XLAL Error - XLALComputeECLFFT_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:1248): Processing FFT failed: CL_MEM_OBJECT_ALLOCATION_FAILURE
XLAL Error - XLALComputeECLFFT_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:1248): Internal function call failed
XLAL Error - XLALComputeFaFb_Resamp_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:654): Check failed: (*fftfuncs->computefft_func) ( fftfuncs->fftplan, ws->TS_FFT, ((void *)0) ) == XLAL_SUCCESS
XLAL Error - XLALComputeFaFb_Resamp_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:654): Internal function call failed
XLAL Error - XLALComputeFstatResamp_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:441): Check failed: XLALComputeFaFb_Resamp_OpenCL ( resamp, ws, thisPoint, common->dFreq, numFreqBins, TimeSeriesX_SRC_a, TimeSeriesX_SRC_b ) == XLAL_SUCCESS
XLAL Error - XLALComputeFstatResamp_OpenCL (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Resamp_OpenCL.c:441): Internal function call failed
XLAL Error - XLALComputeFstat (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:875): Check failed: (input->method_funcs.compute_func) ( *Fstats, common, input->method_data ) == XLAL_SUCCESS
XLAL Error - XLALComputeFstat (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:875): Internal function call failed
MAIN: XLALComputeFstat() failed with errno=1024
2020-10-12 17:34:00.7417 (63356) [CRITICAL]: ERROR: MAIN() returned with error '1024'
2020-10-12 17:34:00.7438 (63356) [debug]: resultfile '../../projects/einstein.phys.uwm.edu/h1_0397.30_O2C02Cl4In0__O2MDFS2_Spotlight_397.80Hz_1937_0_0' (len 96), current config file: 0
Code-version: %% LAL: 6.19.2.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)
%% LALPulsar: 1.17.1.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)
%% LALApps: 6.23.0.1 (CLEAN 98bbe72a728eb25935e9195dafae691335dabf8c)

FPU status flags: COND_0 PRECISION
2020-10-12 17:34:00.7438 (63356) [debug]: worker done. return(1024) to caller
2020-10-12 17:34:00.7438 (63356) [normal]: done. calling boinc_finish(1024).
17:34:00 (63356): called boinc_finish

</stderr_txt>
]]>





Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0


Paul Vleugels wrote:

Many WUs were cancelled after about 2 minutes of CPU time. Based on your comment I had a look at the stderr output, but there are a lot of errors mentioned (like an unknown error with exit code 1024) and I have no clue what they mean or how to correct them.

Name: h1_0397.30_O2C02Cl4In0__O2MDFS2_Spotlight_397.80Hz_1937_0

Application: Gravitational Wave search O2 Multi-Directional GPU v2.07 (GW-opencl-nvidia) windows_x86_64

Stderr output

2020-10-12 17:33:32.9701 (63356) [normal]: OpenCL Device used for Search/Recalc and/or semi coherent step: 'GeForce GTX 950M (Platform: NVIDIA CUDA, global memory: 2048 MiB)'

Hi Paul !

Your computer has "NVIDIA GeForce GTX 950M (2048MB) driver: 388.00".

All the GW GPU tasks that failed had a DF of 0.50 (their names contained xxx.30 ... xxx.80 or xxx.35 ... xxx.85, and .80 - .30 or .85 - .35 = DF of 0.50). GW GPU tasks with such a high DF value require more than 2GB of GPU memory. That's why they crashed right at the beginning.
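The DF arithmetic above can be checked mechanically. Below is a minimal Python sketch, assuming every GW task name follows the pattern seen in this thread (`h1_<start>_..._<freq>Hz_...`). The helper name `delta_freq` is hypothetical and is not part of BOINC or the Einstein@Home apps.

```python
import re

# Assumed name layout, taken from the task names quoted in this thread, e.g.
# "h1_0397.30_O2C02Cl4In0__O2MDFS2_Spotlight_397.80Hz_1937_0".
TASK_NAME = re.compile(r"^h1_(\d+\.\d+)_.*_(\d+\.\d+)Hz_")


def delta_freq(task_name: str) -> float:
    """Return DF: the frequency before 'Hz' minus the frequency after 'h1_'."""
    match = TASK_NAME.match(task_name)
    if match is None:
        raise ValueError(f"unrecognised task name: {task_name}")
    start, end = float(match.group(1)), float(match.group(2))
    # Round to 2 decimals to hide floating-point noise in the subtraction.
    return round(end - start, 2)


if __name__ == "__main__":
    names = [
        "h1_0383.45_O2C02Cl4In0__O2MDFS2_Spotlight_383.95Hz_1585_1",
        "h1_0397.30_O2C02Cl4In0__O2MDFS2_Spotlight_397.80Hz_1937_0",
    ]
    for name in names:
        print(name, delta_freq(name))  # prints the name followed by 0.5
```

Both tasks quoted in this thread come out at DF 0.5, which matches the value that failed on the 2GB cards.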

Some GW GPU tasks with a lower DF could run on a 2GB card, but a user isn't able to choose which GW GPU tasks, by DF, the project server will send.

Currently, a fail-safe solution would be to run only "Gamma-ray pulsar binary search #1 on GPUs" with a 2GB GPU.

Also, the Nvidia GPU driver 388.00 on your computer is three years old. That could be a major problem for any kind of successful crunching on an Nvidia GPU. I would definitely suggest visiting the Nvidia driver site and upgrading that driver to the latest version, even if you didn't run BOINC.

Paul Vleugels
Joined: 1 Mar 05
Posts: 12
Credit: 32007584
RAC: 679


Hello Richie,

Thanks for your quick feedback !

Looking at my Speccy output, I can see that my Nvidia GeForce GTX 950M driver is currently '23.21.13.8800'. I wonder where and how you saw '388.00' (maybe your 388 is a substring within '23.21.13.8800'?).

Finding a driver update on the Nvidia site is not easy (I can't find the 950M at all), but I'll keep looking.

Second question: how and where can I set a preference to process only 'Gamma-ray pulsar binary search #1 on GPUs' WUs, as you suggest?

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 260
Credit: 6915341637
RAC: 20493800

DRIVER: https://www.nvidia

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 260
Credit: 6915341637
RAC: 20493800


Gamma Ray:

ACCOUNT > PREFERENCES > PROJECT

Tick the checkbox for Gamma Ray,

and don't forget to "Save" the change at the bottom.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109406101267
RAC: 35391326


Paul Vleugels wrote:
... I had a look at the stderr output, but there are a lot of errors mentioned (like an unknown error with exit code 1024) and I have no clue what they mean or how to correct them. An insider or developer would surely understand, but I'm not one.

None of the people likely to reply here are "insiders" - we're all just volunteers like yourself.

The trick is not to look at error codes but to look earlier in the output for keywords. In your case, the key information is in the line that says:

Processing FFT failed: CL_MEM_OBJECT_ALLOCATION_FAILURE
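Searching for keywords rather than exit codes can be sketched in a few lines of Python. This is a heuristic under one assumption: that the decisive message names an OpenCL error constant (the upper-case `CL_...` names defined by the OpenCL specification), as `CL_MEM_OBJECT_ALLOCATION_FAILURE` does here. The function name `first_opencl_error` is hypothetical.

```python
import re
from typing import Optional

# OpenCL error constants are upper-case names starting with "CL_",
# such as CL_MEM_OBJECT_ALLOCATION_FAILURE in the stderr above.
OPENCL_ERROR = re.compile(r"\bCL_[A-Z_]+\b")


def first_opencl_error(stderr_text: str) -> Optional[str]:
    """Return the first OpenCL error name in the stderr text, or None."""
    for line in stderr_text.splitlines():
        match = OPENCL_ERROR.search(line)
        if match:
            return match.group(0)
    return None


if __name__ == "__main__":
    excerpt = (
        "XLAL Error - XLALComputeECLFFT_OpenCL (ComputeFstat_Resamp_OpenCL.c:1248): "
        "Processing FFT failed: CL_MEM_OBJECT_ALLOCATION_FAILURE\n"
        "MAIN: XLALComputeFstat() failed with errno=1024\n"
    )
    print(first_opencl_error(excerpt))  # prints CL_MEM_OBJECT_ALLOCATION_FAILURE
```

Scanning from the top matters: the first error reported is usually the cause, while the later "Internal function call failed" lines and the final exit code 1024 just pass that failure up the call chain.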

The app uses an algorithm called a "Fast Fourier Transform" (FFT) to process a very large number of data points, which requires a lot of memory. The problem is that your GPU doesn't have enough memory to store the data, hence the memory allocation failure.

At earlier stages of this particular search, memory requirements were lower, so your card may just have had enough memory back then. As the run progresses, memory requirements are likely to increase even further, so even if some tasks are not failing now, the situation is likely to get worse with time. As a general comment for anyone with a 2GB card: you should avoid the GW search and select the gamma-ray pulsar search instead. 2GB will be fine there.

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0


Quote:

Paul Vleugels wrote:
I wonder where and how you saw '388.00'

From the "Coprocessors:" line on your host's information page: https://einsteinathome.org/host/12535157

That's what BOINC currently knows about it.

Quote:
(maybe your 388 is a substring within '23.21.13.8800'?)

Yes, that substring also tells you the version.
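The substring relationship can be written as a small conversion. This relies on a widely used rule of thumb, not something stated in this thread: the Nvidia version is the last five digits of the Windows driver version, with a decimal point after the third digit. The helper name `nvidia_version_from_windows` is hypothetical.

```python
def nvidia_version_from_windows(win_version: str) -> str:
    """Convert a Windows driver version (e.g. '23.21.13.8800') to Nvidia's form.

    Assumed rule of thumb: take the last five digits of the Windows version
    and put a decimal point after the third of them.
    """
    digits = "".join(win_version.split("."))
    if not digits.isdigit() or len(digits) < 5:
        raise ValueError(f"unexpected driver version: {win_version}")
    last_five = digits[-5:]
    return f"{last_five[:3]}.{last_five[3:]}"


if __name__ == "__main__":
    # Speccy reported 23.21.13.8800; BOINC reported 388.00 for the same driver.
    print(nvidia_version_from_windows("23.21.13.8800"))  # prints 388.00
```

Applied to the version reported by Speccy, this gives 388.00, the same value BOINC shows on the host page.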

Quote:
Finding a driver update on the Nvidia site is not easy (I can't find the 950M at all), but I'll keep looking.

San-Fernando-Valley already hit the target precisely. Still, here is an alternative page to start searching for Nvidia drivers if you ever need one: https://www.nvidia.com/Download/Find.aspx?lang=en-us

Product Type: GeForce , Product Series: GeForce 900M Series (Notebook) , Product: GeForce GTX 950M , Operating System: Windows 10 64-bit

I'm not sure whether that very old driver was already of the 'DCH' type, but you could try installing that type first.

Alternatively, you could first download and install GPU-Z. It shows "DCH" in the "Driver Version" field if the currently installed driver is a DCH version. If it doesn't show "DCH", you should choose the 'Standard' version.

Paul Vleugels
Joined: 1 Mar 05
Posts: 12
Credit: 32007584
RAC: 679


Thanks everybody for all your advice and tips. I'll work through them and get back to you soon. Enjoy your weekend!
