One task in a set of 4-5 ALWAYS has a computation error (i5-4590, B85M-E rev 1.02, 8GB Memtested RAM, GTX 970, Antec HCG-520 PSU)

eton975
Joined: 5 Mar 14
Posts: 5
Credit: 187262
RAC: 0
Topic 198059

I've noticed that one task in pretty much every set of 4 or 5 (usually one of these 4-5 being a GPU task) suddenly jumps forward in progress and then has a computation error. Here's the error output of one:

Quote:

7.4.36

(unknown error) - exit code -1 (0xffffffff)

2015-04-18 19:48:26.3296 (5236) [normal]: This program is published under the GNU General Public License, version 2
2015-04-18 19:48:26.3296 (5236) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2015-04-18 19:48:26.3296 (5236) [normal]: This Einstein@home App was built at: Mar 20 2015 10:34:04

2015-04-18 19:48:26.3296 (5236) [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S6BucketFU2UB_1.01_windows_intelx86__SSE2.exe'.
Activated exception handling...
2015-04-18 19:48:26.3296 (5236) [debug]: Flags: LAL_NDEBUG, OPTIMIZE, HS_OPTIMIZATION, GC_SSE2_OPT, i386, SSE, SSE2, GNUC X86 GNUX86
2015-04-18 19:48:26.3296 (5236) [debug]: Set up communication with graphics process.
command line: projects/einstein.phys.uwm.edu/einstein_S6BucketFU2UB_1.01_windows_intelx86__SSE2.exe @../../projects/einstein.phys.uwm.edu/S6BucketFU2UB_30379196.conf.gz --DataFiles1=..\..\projects\einstein.phys.uwm.edu\h1_0204.30_S6GC1;..\..\projects\einstein.phys.uwm.edu\l1_0204.30_S6GC1;..\..\projects\einstein.phys.uwm.edu\h1_0204.35_S6GC1;..\..\projects\einstein.phys.uwm.edu\l1_0204.35_S6GC1;..\..\projects\einstein.phys.uwm.edu\h1_0204.40_S6GC1;..\..\projects\einstein.phys.uwm.edu\l1_0204.40_S6GC1;..\..\projects\einstein.phys.uwm.edu\h1_0204.45_S6GC1;..\..\projects\einstein.phys.uwm.edu\l1_0204.45_S6GC1 --ephemE=../../projects/einstein.phys.uwm.edu/earth_09_11 --ephemS=../../projects/einstein.phys.uwm.edu/sun_09_11 --segmentList=../../projects/einstein.phys.uwm.edu/seglist-S6BucketFU2UB.dat -o ../../projects/einstein.phys.uwm.edu/h1_0204.30_S6GC1__S6BucketFU2UBb_30379196_0_0
Code-version: %% LAL: 6.12.0.1 (CLEAN 63b6fcfd194db92b458300b2e4d5a2eefb8c253b)
%% LALPulsar: 1.9.0.1 (CLEAN 63b6fcfd194db92b458300b2e4d5a2eefb8c253b)
%% LALApps: 6.14.0.1 (CLEAN 63b6fcfd194db92b458300b2e4d5a2eefb8c253b)

2015-04-18 19:48:26.4859 (5236) [normal]: FstatMethod used: 'DemodSSE'
2015-04-18 19:48:26.4859 (5236) [normal]: Reading input data ... 2015-04-18 19:48:50.7798 (5236) [normal]: Number of segments: 44, total number of SFTs in segments: 13143
done.
% --- GPS reference time = 960499913.5000 , GPS data mid time = 960541454.5000
2015-04-18 19:48:50.7954 (5236) [normal]: dFreqStack = 1.956840e-006, df1dot = 2.377608e-011, df2dot = 0.000000e+000, df3dot = 0.000000e+000
% --- Setup, N = 44, T = 503831 s, Tobs = 22160773 s, gammaRefine = 100, gamma2Refine = 6603, gamma3Refine = 1
2015-04-18 19:48:50.7954 (5236) [debug]: Successfully read checkpoint:54430
% --- Cpt:54430, total:52751, sky:3202/3103, f1dot:14/17

2015-04-18 19:48:50.7954 (5236) [normal]: Finished main analysis.
2015-04-18 19:48:50.7954 (5236) [normal]: Recalculating statistics for the final toplist...
XLAL Error - XLALComputeFaFb_SSE (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Demod_ComputeFaFb.c:148): Required frequency-bins [916290, 916305] not covered by SFT-interval [367821, 367968]
[Parameters: alpha:0, Dphi_alpha:9.162974e+005, Tsft:1.800000e+003, *Tdot_al:1.000093e+000]

XLAL Error - XLALComputeFaFb_SSE (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Demod_ComputeFaFb.c:148): Input domain error
XLAL Error - ComputeFstat_Demod (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Demod.c:212): Check failed: XLALComputeFaFb_SSE ( &FaX, &FbX, FstatAtoms_p, multiSFTs->data[X], thisPoint.fkdot, multiSSBTotal->data[X], multiAMcoef->data[X], Dterms) == XLAL_SUCCESS
XLAL Error - ComputeFstat_Demod (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat_Demod.c:212): Internal function call failed: Input domain error
XLAL Error - XLALComputeFstat (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:662): Check failed: ComputeFstat_Demod(*Fstats, common, input->demod) == XLAL_SUCCESS
XLAL Error - XLALComputeFstat (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:662): Internal function call failed: Input domain error
XLAL Error - XLALComputeExtraStatsSemiCoherent (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/RecalcToplistStats.c:195): XLALComputeFstat() failed with errno=1057
XLAL Error - XLALComputeExtraStatsSemiCoherent (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/RecalcToplistStats.c:195): Internal function call failed: Input domain error
XLAL Error - XLALComputeExtraStatsForToplist (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/RecalcToplistStats.c:107): Failed call to XLALComputeExtraStatsSemiCoherent().
XLAL Error - XLALComputeExtraStatsForToplist (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/RecalcToplistStats.c:107): Internal function call failed: Input domain error
XLAL Error - MAIN (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1671): XLALComputeExtraStatsForToplist() failed with xlalErrno = 1057.

XLAL Error - MAIN (/home/jenkins/workspace/workspace/EAH-GW-Release/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1671): Invalid pointer
2015-04-18 19:48:50.8111 (5236) [CRITICAL]: ERROR: MAIN() returned with error '-1'
FPU status flags: PRECISION
2015-04-18 19:48:50.8111 (5236) [normal]: done. calling boinc_finish(-1).
19:48:50 (5236): called boinc_finish


I heard that this indicates the input files have been corrupted somehow, whether through RAM problems, HDD corruption or a CPU miscalculation. However, I've reset the project and Memtested the RAM, which came back fine. I've also stress-tested the CPU with IntelBurnTest, Prime95 and the official Intel CPU Diagnostic, and they all came back fine.

I'm very unsure as to what's causing this. Anyone know?
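
Two numbers in the quoted log can be checked mechanically. The sketch below (Python, not part of the Einstein@Home application) pulls the checkpoint counters and the frequency-bin ranges out of the two relevant log lines and compares them. The function names are made up for illustration, and the idea that a checkpoint count should not exceed the total is an assumption about what those counters mean, not something the log confirms.

```python
import re

# Two lines copied from the stderr output quoted above
# (the long source path in the XLAL line is shortened to "...").
LOG = """\
% --- Cpt:54430, total:52751, sky:3202/3103, f1dot:14/17
XLAL Error - XLALComputeFaFb_SSE (...): Required frequency-bins [916290, 916305] not covered by SFT-interval [367821, 367968]
"""


def checkpoint_counters(text):
    """Return (cpt, total, sky_done, sky_total) from the '% --- Cpt:' line, or None."""
    m = re.search(r"Cpt:(\d+), total:(\d+), sky:(\d+)/(\d+)", text)
    return tuple(int(g) for g in m.groups()) if m else None


def bin_ranges(text):
    """Return ((required_lo, required_hi), (sft_lo, sft_hi)) from the XLAL error, or None."""
    m = re.search(
        r"frequency-bins \[(\d+), (\d+)\] not covered by SFT-interval \[(\d+), (\d+)\]",
        text,
    )
    if not m:
        return None
    req_lo, req_hi, sft_lo, sft_hi = (int(g) for g in m.groups())
    return (req_lo, req_hi), (sft_lo, sft_hi)


def covered(required, available):
    """True if the required bin range lies entirely inside the available range."""
    return available[0] <= required[0] and required[1] <= available[1]


cpt, total, sky_done, sky_total = checkpoint_counters(LOG)
required, available = bin_ranges(LOG)

# Assumption: a sane checkpoint would not be past the reported total.
print("checkpoint beyond total:", cpt > total)
print("sky index beyond sky total:", sky_done > sky_total)
print("required bins covered by loaded SFTs:", covered(required, available))
```

For this log the application asks for bins 916290 to 916305 while the loaded data covers only 367821 to 367968, and the restored checkpoint count (54430) is higher than the reported total (52751). Whether that points to a stale or damaged checkpoint is for the developers to say.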

Pooh Bear 27
Joined: 20 Mar 05
Posts: 1376
Credit: 20312671
RAC: 0

It could also be something interfering at the time. Antivirus programs seem to be some of the worst offenders. The best thing to do is to exclude the BOINC folder from the antivirus scans.

Also, do you run other background work? It looks like you have Hyperthreading turned off. On that machine, depending on what you are crunching, turning on Hyperthreading and using half of the cores might help, so that the GPU gets less contention when working.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110217013594
RAC: 27917960

Quote:
I'm very unsure as to what's causing this. Anyone know?


The Devs may be able to shed some light on this. The latest GW search has just recently started so it might be something related to that.

In the meantime, you could try deselecting the two GW searches in your project preferences and see if the Gamma Ray Pulsar search works correctly on your system. If you'd like a good Australian connection, you could also select the Binary Radio Pulsar search (BRP6) which uses data from the Parkes radio telescope. I'm running that combination (FGRP4 and BRP6 - Parkes PMPS XT) on many machines without problems.

Cheers,
Gary.

eton975
Joined: 5 Mar 14
Posts: 5
Credit: 187262
RAC: 0

Quote:

It could also be something interfering at the time. Antivirus programs seem to be some of the worst offenders. The best thing to do is to exclude the BOINC folder from the antivirus scans.

Also, do you run other background work? It looks like you have Hyperthreading turned off. On that machine, depending on what you are crunching, turning on Hyperthreading and using half of the cores might help, so that the GPU gets less contention when working.

I'll look into the possibility of AV interference.

My CPU is an i5, so its hyperthreading features were laser-cut at the Intel factory.

EDIT: OK, stopped the AV's (AVG Free) realtime/scanning ops in C:\Program Files\BOINC and C:\ProgramData\BOINC. Fingers crossed.

Quote:
Quote:
I'm very unsure as to what's causing this. Anyone know?

The Devs may be able to shed some light on this. The latest GW search has just recently started so it might be something related to that.

In the meantime, you could try deselecting the two GW searches in your project preferences and see if the Gamma Ray Pulsar search works correctly on your system. If you'd like a good Australian connection, you could also select the Binary Radio Pulsar search (BRP6) which uses data from the Parkes radio telescope. I'm running that combination (FGRP4 and BRP6 - Parkes PMPS XT) on many machines without problems.

Will also try that if the first measure isn't successful.

eton975
Joined: 5 Mar 14
Posts: 5
Credit: 187262
RAC: 0

Alright, update:

Whitelisting the folders in AV didn't help.

Disabling the S6 GW tasks did, however, and no errors happen on the pulsar searches.

Could it be a bug in BOINC/GW code, or could there be a bad part of my CPU that is only stressed by GW tasks?

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Quote:

Could it be a bug in BOINC/GW code, or could there be a bad part of my CPU that is only stressed by GW tasks?



I would guess it is more likely the second, although it could be any of a variety of system components, not just the CPU. The Windows System and Application event logs around the time a task fails may reveal some clues.

If the event logs reveal nothing and everything is up to date, I would try searching for some stress-testing utilities to test all components of your system.
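
One way to look at the event logs "around the time" of a failure without clicking through Event Viewer is the `wevtutil` command-line tool that ships with Windows. The sketch below only builds the command; it is Python for illustration, the helper names are hypothetical, and it has not been run against the poster's machine. It assumes you pass a timezone-aware time, because the task log shows local time while the event log filter expects UTC.

```python
from datetime import datetime, timedelta, timezone


def event_query(failure_time, minutes=5):
    """XPath filter for critical/error/warning events within +/- minutes of failure_time.

    failure_time must be timezone-aware; it is converted to UTC for the filter.
    """
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    start = (failure_time - timedelta(minutes=minutes)).astimezone(timezone.utc)
    end = (failure_time + timedelta(minutes=minutes)).astimezone(timezone.utc)
    return (
        "*[System[(Level=1 or Level=2 or Level=3) and "
        f"TimeCreated[@SystemTime>='{start.strftime(fmt)}' and "
        f"@SystemTime<='{end.strftime(fmt)}']]]"
    )


def wevtutil_command(log_name, failure_time, minutes=5):
    """Argument list for 'wevtutil qe' on one log (for example 'Application' or 'System')."""
    return ["wevtutil", "qe", log_name, "/q:" + event_query(failure_time, minutes), "/f:text"]


if __name__ == "__main__":
    # 19:48:50 is when the task in the first post called boinc_finish.
    # A naive datetime plus .astimezone() is interpreted in this PC's local time zone.
    failed_at = datetime(2015, 4, 18, 19, 48, 50).astimezone()
    for log in ("Application", "System"):
        print(wevtutil_command(log, failed_at))
```

The printed list can be handed to `subprocess.run(...)` on the affected PC, or the pieces can be typed into a command prompt by hand. An empty result for both logs would at least rule out anything Windows itself noticed in that ten-minute window.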

E@H does not have any that I am aware of, but another distributed GPU computing project, Folding@home (https://folding.stanford.edu/home/download-utilities/), has some which I used a very long time ago on a Linux laptop. I think stresscpu2 and memtestCL were the two I used when I had concerns.

I have looked at your tasks here: http://einsteinathome.org/account/58485/computers
If that list is correct, it seems you are not running any GPU tasks at the moment, only Gamma-ray pulsar search #4 v1.05 (FGRP4-SSE2).

You can, as you are probably aware, restrict E@H to CPU only tasks, then add Intel GPU tasks, then add GPU tasks once you are confident earlier tasks are stable.

Good luck.

eton975
Joined: 5 Mar 14
Posts: 5
Credit: 187262
RAC: 0

Not exactly sure what I should be looking for in Event Viewer. It's got a weird interface and everything is arcane. Any pointers?

I'm just frustrated because EVERY SINGLE other CPU test, even the one you linked, comes back totally fine. But ONLY this project has a problem where one task (out of 4 being worked on by the CPU) has a problem. Has anyone else reported anything like this? Any other possibilities?

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Quote:
Not exactly sure what I should be looking for in Event Viewer. It's got a weird interface and everything is arcane. Any pointers?

No, sorry, I gave up on Microsoft's interfaces a while back, so I googled...

http://www.dummies.com/how-to/content/how-to-review-events-in-windows-7-and-vista.html seems to explain how to use it. You don't list any logged event errors. Were there any?

Quote:

I'm just frustrated because EVERY SINGLE other CPU test, even the one you linked, comes back totally fine. But ONLY this project has a problem where one task (out of 4 being worked on by the CPU) has a problem.

As I mentioned, it may not be a CPU-only issue. You don't say which projects work successfully. Can you be specific about which projects run OK on this computer?

Do you have any other computers that are working OK, on which you can try E@H?

I suggested this earlier but you have not confirmed: are you running CPU tasks ONLY for E@H? As you know, this is set in your E@H preferences.

http://einstein.phys.uwm.edu/prefs.php?subset=project

Use CPU (enforced by version 6.10+): yes
Use ATI GPU (enforced by version 6.10+): no
Use NVIDIA GPU (enforced by version 6.10+): no
Use INTEL GPU (enforced by version 7.0.27+): no

Quote:

Has anyone else reported anything like this? Any other possibilities?

No, but you might also find some help here

http://boinc.berkeley.edu/wiki/BOINC_Help

Good luck.

Pollux_P3D
Joined: 8 Feb 11
Posts: 30
Credit: 212418648
RAC: 0

Hi eton975,

have you tested the hard disk with ScanDisk?

Please reduce the work buffer in BOINC Manager. There are tons of WUs that were aborted by you. 0.3 days should be enough.
Boincmanager wizard/settings/Use of the network/Minimum working buffer
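
The same buffer setting can also be applied locally through BOINC's `global_prefs_override.xml` file in the BOINC data directory, using the `work_buf_min_days` element. The sketch below only generates the XML text; the helper name is hypothetical, and the 0.3 value is simply the figure suggested in this post. Note that saving this as the override file would replace any other local overrides already stored there.

```python
import xml.etree.ElementTree as ET


def override_xml(min_days):
    """Return a minimal global_prefs_override.xml body that sets the minimum work buffer."""
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:g}"
    return ET.tostring(root, encoding="unicode")


# Print the fragment instead of writing it, so nothing on disk is touched.
print(override_xml(0.3))
```

After saving the text as `global_prefs_override.xml`, the client can be told to re-read it with `boinccmd --read_global_prefs_override` or by restarting BOINC.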

Critical: XLAL Error

Regards, Pollux

eton975
Joined: 5 Mar 14
Posts: 5
Credit: 187262
RAC: 0

Quote:
(AgentB) -snip-

OK, looked through the Event viewer logs in Applications and System. Most of the errors were about stupid **** like me resetting the PC with the power button or not being able to access the floppy drive. Didn't really see anything relating to BOINC. 'Critical errors' my ***.

Quote:

As I mentioned, it may not be a CPU-only issue. You don't say which projects work successfully. Can you be specific about which projects run OK on this computer?

Do you have any other computers that are working OK, on which you can try E@H?

I suggested this earlier but you have not confirmed: are you running CPU tasks ONLY for E@H? As you know, this is set in your E@H preferences.

Only doing E@H ATM, but everything except the Gravitational Wave searches is fine. I'm not running GPU tasks, just to try to isolate the issue. I have another PC that I can try E@H on. Will probably contact Intel and ASUS too.

Quote:

have you tested the hard disk with ScanDisk?

Please reduce the work buffer in BOINC Manager. There are tons of WUs that were aborted by you. 0.3 days should be enough.
Boincmanager wizard/settings/Use of the network/Minimum working buffer

I'll try that.

I already have a low work buffer (0.1 days). The reason you see so many aborted tasks is that I'm trying to hunt down the issue, and one GW task in a set of four always has an error.
