Critical: XLAL Error

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 500397891
RAC: 29233
Topic 196798

I have had some failing LV SSE WUs in the last 2 days; one of them is http://einsteinathome.org/task/343005413

The stderr output says:
2013-02-07 16:57:35.7508 (1520) [normal]: Recalculating statistics for the final toplist...
2013-02-07 16:57:35.7518 (1520) [CRITICAL]: Required frequency-bins [805088, 805103] not covered by SFT-interval [909995, 910486]
[Parameters: alpha:0, Dphi_alpha:8.050956e+005, Tsft:1.800000e+003, *Tdot_al:1.000073e+000]
XLAL Error - LocalXLALComputeFaFb (/home/jenkins/workspace/workspace/EAH-GW-S6LV1/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/FDS_isolated/OptimizedCFS/LocalComputeFstat.c:554): Input domain error

LocalXALComputeFaFb() failed
Error[1] 5: function LocalComputeFStat, file /home/jenkins/workspace/workspace/EAH-GW-S6LV1/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/FDS_isolated/OptimizedCFS/LocalComputeFstat.c, line 338, $Id$
ABORT: XLAL function call failed
XLALComputeExtraStatsSemiCoherent, line 360 : Failed call to LAL function ComputeFStat(). statusCode=5

XLAL Error - XLALComputeExtraStatsSemiCoherent (/home/jenkins/workspace/workspace/EAH-GW-S6LV1/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/LineVeto.c:361): Internal function call failed: Input domain error

Error in function XLALComputeExtraStatsForToplist, line 220 : Failed call to XLALComputeLineVetoSemiCoherent().

XLAL Error - XLALComputeExtraStatsForToplist (/home/jenkins/workspace/workspace/EAH-GW-S6LV1/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/LineVeto.c:221): Internal function call failed: Input domain error
XLAL Error - MAIN (/home/jenkins/workspace/workspace/EAH-GW-S6LV1/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1814): Check (XLAL_SUCCESS == XLALComputeExtraStatsForToplist ( semiCohToplist, "GCTtop", &stackMultiSFT, &stackMultiNoiseWeights, &stackMultiDetStates, &CFparams, refTimeGPS, uvar_SignalOnly, uvar_outputSingleSegStats )) failed
XLAL Error - MAIN (/home/jenkins/workspace/workspace/EAH-GW-S6LV1/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1814): XLALComputeExtraStatsForToplist() failed with xlalErrno = 1057.

XLAL Error - MAIN (/home/jenkins/workspace/workspace/EAH-GW-S6LV1/SLAVE/MINGW32/TARGET/windows-x86/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1814): Invalid pointer
2013-02-07 16:57:35.7608 (1520) [CRITICAL]: ERROR: MAIN() returned with error '-1'
FPU status flags: COND_0 PRECISION
2013-02-07 16:57:35.7618 (1520) [normal]: done. calling boinc_finish(-1).
16:57:35 (1520): called boinc_finish

They all show similar errors, and it happens on one machine only (Intel i3, Host ID 5394959). I tried some LHC WUs and they finish without error; I also have no trouble using Windows.
Does this point to a hardware problem or a software problem?

Nobody316
Joined: 14 Jan 13
Posts: 141
Credit: 2008126
RAC: 0

http://boincfaq.mundayweb.com/index.php?view=210&language=1

See if this helps. If I understand it right, I think it may be a data problem. I don't know for sure, but I hope the link helps.

PC setup MSI-970A-G46 AMD FX-8350 8 core OC'd 4.45GHz 16GB ram PC3-10700 Geforce GTX 650Ti Windows 7 x64 Einstein@Home

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 500397891
RAC: 29233

Quote:

http://boincfaq.mundayweb.com/index.php?view=210&language=1

See if this helps. If I understand it right, I think it may be a data problem. I don't know for sure, but I hope the link helps.

Thanks for the link; it looks like something similar happened earlier.
I hope for a response from BM or Bikeman. Anyway, I will run some tests on my hardware to find out whether there is faulty RAM or a disk problem. I'm a little concerned because it happens on one machine only and no one else is reporting these problems, which points to a problem on my side. On the other hand, a burn-in test did not show any problem.
I'll try another mix of projects and watch the machine carefully.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245217726
RAC: 12915

The workunit you pointed to was successfully completed by at least one other (Linux) host, so it can hardly be a systematic error in the application or in the workunit setup.

The error means that the frequency range the program tries to analyze is not contained in the data that was read from disk.
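
As a rough illustration of that check (not the actual LAL code), the two bin ranges from the log can be compared directly. The function names below are hypothetical, and the conversion to Hz assumes the usual SFT bin spacing of 1/Tsft, with Tsft = 1800 s taken from the log line:

```python
def bins_covered(required, available):
    """Return True if the closed bin range `required` lies inside `available`."""
    req_lo, req_hi = required
    avail_lo, avail_hi = available
    return avail_lo <= req_lo and req_hi <= avail_hi


def bin_to_hz(bin_index, tsft_seconds):
    """Convert a bin index to Hz, assuming a bin spacing of 1/Tsft."""
    return bin_index / tsft_seconds


# Values copied from the stderr output in the first post.
required = (805088, 805103)   # "Required frequency-bins"
available = (909995, 910486)  # "SFT-interval"
tsft = 1800.0                 # "Tsft:1.800000e+003"

# The required range lies entirely below the data that was loaded.
print(bins_covered(required, available))  # False
print(bin_to_hz(required[0], tsft), bin_to_hz(available[0], tsft))
```

Under that assumption the two ranges are tens of Hz apart, so the mismatch is not a rounding issue at the edge of the band.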

So there are essentially two possibilities: The data on disk is corrupted or can't be read (completely), or the program wrongly calculates the data it needs. If there is no problem in the application (version) itself, the latter is usually the result of some calculations going wrong in the floating point unit (FPU) of the CPU ("usually" here means that I have yet to see a single case where this wasn't the problem).

To get rid of possibly bad data I'd recommend resetting the project. In older BOINC Clients this won't delete and re-download the "sticky" files, so you'd better detach and re-attach there. Apart from that, try to monitor the CPU temperature.

Einstein@Home uses hand-coded assembler routines for its core computation; it might well be that of all BOINC projects E@H puts the most stress on the floating-point units (x87 FPU and SSE). I don't know what you used for burn-in, but most such procedures don't use the FPUs as much as E@H does.

BM

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 500397891
RAC: 29233

Quote:


To get rid of possibly bad data I'd recommend resetting the project. In older BOINC Clients this won't delete and re-download the "sticky" files, so you'd better detach and re-attach there. Apart from that, try to monitor the CPU temperature.

Einstein@Home uses hand-coded assembler routines for its core computation; it might well be that of all BOINC projects E@H puts the most stress on the floating-point units (x87 FPU and SSE). I don't know what you used for burn-in, but most such procedures don't use the FPUs as much as E@H does.

BM

BM,

many thanks for that info; perfect support from the devs, as usual.

I use SiSoft Sandra; the i3 ran the burn-in all day with no errors.
I did a reset (detached and re-attached) the last time I encountered problems; this did not help.
Your explanation about SSE would explain why Windows sees no problem there: SSE is rarely used there. Heat? It's my only system not running a CPU cooler with heat pipes, just the stock one.
It's easy to change this.

Alexander

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 500397891
RAC: 29233

If someone is interested, here is the result:
it was a mainboard error.
After reducing the system to the minimum required parts, the problem was still there. Swapping the power supply and RAM with another PC did not help either.
I bought a new CPU, an i3-3220.
Grrr .. the same problem.
I also did a restore of Windows. No change.
I started SiSoft Sandra without running any tests, just monitoring all sensors that can be monitored. The ICH chip got hot, more than 85 °C.

So I bought a new mainboard, one with 2 PCIe 3.0 x16 slots, both running at x8 when used together. And an HD7870XT Boost (Tahiti, 1536 shaders).
I have also moved my GTX 550 Ti to that system, enabled 3 CPUs and am running 2 WUs together on both GPUs.
With Win7-64, BM 7.0.52 and CCC 13.2 beta 5 the system seems to run stable, producing 6 GPU results per hour and running 2 CPU WUs as well.
GPU-Z and AMD Trixx both show a GPU usage of 1 - 3 %, so most likely an update is required to support that GPU.

https://dl.dropbox.com/u/50246791/cpu-gpu%20multiuse%20system.PNG

In this configuration I have the following crunching times:
nvidia 3518 sec
amd 1871 sec
grp 1.04 5566 sec
GW s6 ext 14890 sec
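
As a quick sanity check on the "6 GPU results per hour" figure, the two GPU times can be turned into a throughput estimate. This is only a sketch under an assumption: that the times above are per-task run times with two tasks running concurrently on each GPU. The function name is made up for illustration:

```python
def results_per_hour(seconds_per_task, concurrent_tasks):
    """Tasks finished per hour when `concurrent_tasks` run side by side."""
    return concurrent_tasks * 3600.0 / seconds_per_task


# Run times from the list above; two tasks at once per GPU is an assumption.
nvidia = results_per_hour(3518, 2)
amd = results_per_hour(1871, 2)

print(round(nvidia + amd, 1))  # 5.9
```

That comes out at roughly 2 results per hour on the NVIDIA card and just under 4 on the AMD card, which is consistent with about 6 per hour in total.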

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 689271061
RAC: 218138

Hi!

Thanks for letting us know about the cause of the problem. Isn't E@H a nice stress test ?? ;-)

Cheers
HB
