My GTX 980ti keeps crashing on All-Sky Gravitational Wave search on O3 1.04 (GW-opencl-nvidia)

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47031812642
RAC: 65087474

i would DDU your current drivers and reinstall completely fresh, to rule out a driver problem.



some of your errors seem consistent with driver issues, e.g. CL_INVALID_COMMAND_QUEUE. one of these instances happened within minutes of your CL_OUT_OF_HOST_MEMORY error, so the two are likely linked to the same driver crash event.



Quote:
XLAL Error - XLALOpenCLWait (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/OpenCLUtils.c:671): clFinish ( queue ) failed with OpenCL error: CL_INVALID_COMMAND_QUEUE
XLAL Error - XLALOpenCLWait (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/OpenCLUtils.c:671): Generic failure
XLAL Error - XLALGPUWait (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/GPUUtils.c:474): Check failed: gpuObj.wait(gpuObj.multiStreams) == XLAL_SUCCESS
XLAL Error - XLALGPUWait (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/GPUUtils.c:474): Internal function call failed: Generic failure
XLAL Error - XLALComputeFstatResamp_GPU (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/ComputeFstat_Resamp_GPU.c:401): Wait for {Fa(f_k), Fb(f_k)}
XLAL Error - XLALComputeFstatResamp_GPU (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/ComputeFstat_Resamp_GPU.c:401): Internal function call failed: Generic failure
XLAL Error - XLALComputeFstat (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/ComputeFstat.c:784): Check failed: (input->method_funcs.compute_func) ( *Fstats, common, input->method_data ) == XLAL_SUCCESS
XLAL Error - XLALComputeFstat (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/ComputeFstat.c:784): Internal function call failed: Generic failure
MAIN: XLALComputeFstat() failed with errno=1152
2023-09-09 12:33:58.1008 (14104) [CRITICAL]: ERROR: MAIN() returned with error '1152'


 



Quote:

XLAL Error - XLALOpenCLInit (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/OpenCLUtils.c:389): OpenCL Create Context failed with OpenCL error: CL_OUT_OF_HOST_MEMORY
XLAL Error - XLALOpenCLInit (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/OpenCLUtils.c:389): Generic failure
XLAL Error - XLALGPUInitByType (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/GPUUtils.c:648): Check failed: XLALOpenCLInit() == XLAL_SUCCESS
XLAL Error - XLALGPUInitByType (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/GPUUtils.c:648): Internal function call failed: Generic failure
XLAL Error - XLALGPUInit (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/GPUUtils.c:693): Check failed: XLALGPUInitByType ( type ) == XLAL_SUCCESS
XLAL Error - XLALGPUInit (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/GPUUtils/GPUUtils.c:693): Internal function call failed: Generic failure
XLAL Error - XLALCreateFstatInput (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/ComputeFstat.c:470): Failed to initialize GPU
XLAL Error - XLALCreateFstatInput (/home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalpulsar/lib/ComputeFstat.c:470): Internal function call failed: Generic failure
SetUpSFTs: XLALCreateFstatInput() failed with errno=1152
Error[1] 14: function SetUpSFTs, file /home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c, line 3164, $Id$
ABORT: XLAL function call failed

Level 0: $Id$
Function call `SetUpSFTs( &status, &usefulParams )' failed.
file /home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c, line 1148

Level 1: $Id$
Status code 14: XLAL function call failed
function SetUpSFTs, file /home/jenkins/workspace/workspace/EaH-GW-OpenCL-Testing/SLAVE/MinGW6.3/TARGET/windows-x64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c, line 3164
2023-09-09 12:35:28.1155 (12320) [CRITICAL]: BOINC_LAL_ErrHand(): xlalErrno = 1152
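
To make the timing link between the two quoted logs concrete, here is a minimal Python sketch that pulls the OpenCL error name and the [CRITICAL] timestamp out of each log and computes the gap between them. The two strings are shortened excerpts of the logs above, and the function name and regular expressions are my own assumptions for illustration, not part of BOINC or the Einstein@Home app.

```python
import re
from datetime import datetime

# Matches OpenCL status names such as CL_INVALID_COMMAND_QUEUE.
CL_ERROR = re.compile(r"OpenCL error: (CL_[A-Z_]+)")
# Matches the timestamp in front of a "(pid) [CRITICAL]" marker.
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \(\d+\) \[CRITICAL\]")

def summarize(log_text):
    """Return (first OpenCL error name, time of the [CRITICAL] line).

    Either element is None if it cannot be found in the text.
    """
    err = CL_ERROR.search(log_text)
    ts = STAMP.search(log_text)
    when = datetime.strptime(ts.group(1), "%Y-%m-%d %H:%M:%S") if ts else None
    return (err.group(1) if err else None, when)

# Shortened excerpts of the two quoted logs (middle lines left out).
log_a = ("clFinish ( queue ) failed with OpenCL error: CL_INVALID_COMMAND_QUEUE\n"
         "2023-09-09 12:33:58.1008 (14104) [CRITICAL]: ERROR: MAIN() returned with error '1152'")
log_b = ("OpenCL Create Context failed with OpenCL error: CL_OUT_OF_HOST_MEMORY\n"
         "2023-09-09 12:35:28.1155 (12320) [CRITICAL]: BOINC_LAL_ErrHand(): xlalErrno = 1152")

err_a, t_a = summarize(log_a)
err_b, t_b = summarize(log_b)
gap_seconds = (t_b - t_a).total_seconds()
print(err_a, err_b, gap_seconds)
```

With these two excerpts the gap comes out at 90 seconds (fractional seconds are ignored), which fits the idea that both failures belong to a single driver crash.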


 


_________________________________________________________________________

lohphat
Joined: 20 Feb 05
Posts: 29
Credit: 87338224
RAC: 24107

I'm about to leave town for a week so I'll triage the potential driver issue when I return.

Thanks for the detailed and accurate analysis of the crash logs to point me in a direction.

lohphat
Joined: 20 Feb 05
Posts: 29
Credit: 87338224
RAC: 24107

I just updated the nVidia driver to 537.34 and all tasks failed.

examples:

https://einsteinathome.org/workunit/752170493

https://einsteinathome.org/workunit/752170555

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47031812642
RAC: 65087474

did you DDU the old drivers before updating? if not, you should do that and try again. if you have some driver corruption, sometimes it can linger across driver installs and upgrades and issues will persist until you remove every last trace of the old drivers and start again. the best way to run DDU is to boot into safe mode, run it there, then reboot and install new drivers fresh.

it might also be an issue with the latest drivers. try an older version, like around 525.xx or so.
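
to confirm which driver version is actually loaded after a reinstall or rollback, something like the Python sketch below can help. it assumes nvidia-smi (shipped with the NVIDIA driver) is on the PATH; the helper names are made up for illustration.

```python
import subprocess

def parse_version(text):
    """Turn a string like '537.34' into (537, 34) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))

def installed_driver_version():
    """Ask nvidia-smi for the driver version reported for the first GPU.

    Assumes nvidia-smi is installed and on the PATH; raises if the call fails.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_version(out.splitlines()[0])

def is_newer_than(version, baseline):
    """True if 'version' is a later driver release than 'baseline'."""
    return parse_version(version) > parse_version(baseline)
```

calling installed_driver_version() on the affected host and comparing the result with parse_version("537.34") shows whether the install really changed the driver. is_newer_than("537.42", "537.34") returns True.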

_________________________________________________________________________

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10216113455
RAC: 22199004

... try 537.42 and be sure to tick the "perform a clean installation" checkbox ...

EDIT: ... you have to choose "custom installation" to see that checkbox !!!

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47031812642
RAC: 65087474

no that's not what I'm referring to. "clean installation" can still leave some remnants.

you really have to use DDU to get it all.

https://www.guru3d.com/download/display-driver-uninstaller-download/

_________________________________________________________________________

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10216113455
RAC: 22199004

Yes, I know, but thanks anyway.

BUT, I have never ever had a problem with moving to new Nvidia GPU driver versions.

Maybe just luck ....

Have you all a nice weekend ...

S-F-V

lohphat
Joined: 20 Feb 05
Posts: 29
Credit: 87338224
RAC: 24107

Just updated via a clean install to 537.42 and the first WU errored out, still being worked on by another host.  But all other WU are completing OK with no driver crash.

So don't know if it was the clean install which helped or the new driver version.

Error:
https://einsteinathome.org/workunit/752518219

Success:
https://einsteinathome.org/workunit/752518551
https://einsteinathome.org/workunit/752518220
https://einsteinathome.org/workunit/752518216

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10216113455
RAC: 22199004

lohphat wrote:

Just updated via a clean install to 537.42 and the first WU errored out, still being worked on by another host.  But all other WU are completing OK with no driver crash.

So don't know if it was the clean install which helped or the new driver version.

...

Sounds good.

Would be of interest to know which method you used for the clean install.

Nvidia or DDU ?

S-F-V

 

lohphat
Joined: 20 Feb 05
Posts: 29
Credit: 87338224
RAC: 24107

Just used the one built into the driver installer. It removed the old driver, then rebooted to complete the removal. The system restarted with the default video driver, and I then had to re-run the nVidia installer -- it didn't auto-continue after the restart.

Then I restarted BOINC and retrieved a new batch of WUs. Only the first errored out; all others completed and validated. The WU which errored out was successfully completed by another user's host.

https://einsteinathome.org/workunit/752518219

I suspected that BOINC and the All-Sky Gravitational Wave search on O3 v1.04 app were having resource contention when my screen saver (ElectricSheep Gold) or any app with GPU acceleration (Firefox) was in use -- that's when the driver would crash before.

But the WU which errored out did so after the screen saver was already on and running, and only when the monitor was turned off by my power plan.

But allowing the same thing to happen to the other WUs didn't trigger the same issue; all subsequent WUs completed under the same conditions.

So...don't know.

I'll report back if I get any other GPU apps assigned to me to compare, but so far I'm only getting All-Sky Gravitational Wave search on O3 v1.04 WUs.
