Gravitational Wave search O1 all-sky tuning (O1AS20-100T)

Jonathan Jeckell
Joined: 11 Nov 04
Posts: 112
Credit: 719,230,631
RAC: 632,123

My Ubuntu Linux box has barfed on 3 of the 5 in its queue too (still processing the remaining 2). These things happen as we work out the bugs, but I was honestly hoping to be one of the first to help contribute to this.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,210
Credit: 43,575,203,008
RAC: 44,375,753

I found this list of tasks on one of your hosts (ID=12151761). At the time I saw it there were 3 compute errors out of 8 total - 5 still in progress. The interesting thing is that the original 5 tasks are listed as V1.00 and the 3 latest are listed as V1.02.

You should consider aborting the last 2 of the original 5 (the V1.00 ones) as they seem doomed to failure anyway. You could try one of the new ones to see if it works. If you follow the WU links for each of the new ones, you can see what has happened previously for those. This might give you some idea of your prospects for ultimate success with any of them. There is one WU that has 2 'in progress' tasks, but one of those is V1.00 (on 32bit Linux) and there is already a failed V1.00 task, also on 32bit Linux. The quorum I'm talking about is this one, but take a look at all three - it's good practice with the tools to use when trying to work out what's going on.

Cheers,
Gary.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,125
Credit: 126,944,457
RAC: 12,907

FWIW : this host on this result went belly up after a one-byte file access error.

Quote:

2016-02-12 02:46:29.5682 (30867) [normal]: Reading input data ... ERROR: data gap or overlap at first bin of SFT#0 (GPS 1128211934.000000) expected bin 90359, bin 90360 read from file '../../projects/einstein.phys.uwm.edu/h1_0050.20_O1C01Cl1In1'
XLAL Error - XLALLoadSFTs (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:882): I/O error
XLAL Error - XLALLoadMultiSFTsFromView (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:1046): Failed to XLALLoadSFTs() for IFO X = 0

XLAL Error - XLALLoadMultiSFTsFromView (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:1046): Internal function call failed: I/O error
XLAL Error - XLALLoadMultiSFTs (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:1004): Check failed: ( multiSFTs = XLALLoadMultiSFTsFromView ( multiCatalogView, fMin, fMax )) != ((void *)0)
XLAL Error - XLALLoadMultiSFTs (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/SFTfileIO.c:1004): Internal function call failed: I/O error
XLAL Error - XLALCreateFstatInput (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:405): Check failed: ( multiSFTs = XLALLoadMultiSFTs(SFTcatalog, minFreqFull, maxFreqFull) ) != ((void *)0)
XLAL Error - XLALCreateFstatInput (/home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX32-COMPAT/TARGET/linux-x86/EinsteinAtHome/source/lalsuite/lalpulsar/src/ComputeFstat.c:405): Internal function call failed: I/O error


i.e. possibly the supplied data file and not the app itself.
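For anyone wanting to compare several failed tasks, here is a minimal sketch in Python for pulling the expected and read bin numbers out of that first error line. It assumes the message format is exactly as in the quoted log; the pattern and the function name `parse_bin_mismatch` are hypothetical and not part of the science app.

```python
import re

# Message text copied from the quoted log above (timestamp and PID prefix omitted).
LOG_LINE = (
    "ERROR: data gap or overlap at first bin of SFT#0 (GPS 1128211934.000000) "
    "expected bin 90359, bin 90360 read from file "
    "'../../projects/einstein.phys.uwm.edu/h1_0050.20_O1C01Cl1In1'"
)

# Hypothetical pattern, written only to match this one message format.
PATTERN = re.compile(
    r"SFT#(?P<sft>\d+) \(GPS (?P<gps>[\d.]+)\) "
    r"expected bin (?P<expected>\d+), bin (?P<read>\d+) read "
    r"from file '(?P<path>[^']+)'"
)


def parse_bin_mismatch(line):
    """Return the details of a bin mismatch message, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    expected = int(m.group("expected"))
    read = int(m.group("read"))
    return {
        "sft_index": int(m.group("sft")),
        "gps_start": float(m.group("gps")),
        "expected_bin": expected,
        "read_bin": read,
        "bin_offset": read - expected,  # positive: bin read is above the expected one
        "data_file": m.group("path").rsplit("/", 1)[-1],
    }


info = parse_bin_mismatch(LOG_LINE)
print(info)
```

For the quoted line, the bin read is one higher than the bin expected, and the data file involved is `h1_0050.20_O1C01Cl1In1`.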

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513,211,304
RAC: 0

I awoke to find this host with v1.02 tasks, four completed with no errors. (and CPU temps up 10C GPU temps down 15C - expected)

To my pleasant surprise - this task has a MAGIC QM wingman assigned.

robl
Joined: 2 Jan 13
Posts: 1,633
Credit: 1,102,663,402
RAC: 692,298

one "01" job completed at ~37+. Currently in a pending state. This is a V1.02 job. Running on a GTX 770 Linux machine.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1,304
Credit: 418,731,097
RAC: 100,248

Quote:

I awoke to find this host with v1.02 tasks, four completed with no errors. (and CPU temps up 10C GPU temps down 15C - expected)

To my pleasant surprise - this task has a MAGIC QM wingman assigned.

Sorry about that AgentB

That is the ONLY one of my 7 hosts not in my house and it got a "Not started by deadline - canceled"

THAT would not happen with the 6 hosts that have me staring at them 24/7

(no I never sleep)

So maybe you will have better luck in the future.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,210
Credit: 43,575,203,008
RAC: 44,375,753

Quote:
... That is the ONLY one of my 7 hosts not in my house and it got a "Not started by deadline - canceled"


What heresy!! Looks like the machine got turned off for the duration sometime after a whole bunch of O1AS tasks got downloaded!! :-). 22 canceled and 4 timed out - no response. I guess those 4 were in progress and couldn't be canceled. Shouldn't you just abort them to save the waste?

Look on the bright side - they were all V1.01 so I guess that would have been just a whole lot of wasted computing ;-). The machine has obviously now been turned back on since all those cancelled tasks have been reported and there is a new V1.02 replacement :-).

Now that V1.02 looks the goods (as long as they fix the estimates so that the DCF doesn't go absolutely bonkers) it might just be time to consider sticking my toe in the water :-).

Cheers,
Gary.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1,304
Credit: 418,731,097
RAC: 100,248

Well Gary, that is what happens when you install Boinc to do Einstein GPU tasks on your sister-in-law's laptop as payment for spending hours installing Windows 10 and all the updates for her.

And since they all automatically got set to run these new tasks, I had no way to turn that off on hers. I am surprised that hers got 27 of them while all 6 of my home hosts only got 4, and even then 3 of those went to my old 3-core, so it started running them on all 3 cores and turned off the other CPU tasks (vLHC).

WHY did Boinc decide to give her not very fast quad-core 27 tasks?

And NONE for my 8-core or any of the quad-core?

I changed hers from here (Location setting) not to get anymore but not much I can do about that one that was doing fine with the GPU tasks getting those CPU tasks all of a sudden.

I only have 3 finished so far (at home); they took over 79,000 seconds each and they are all pending so far.

She isn't a youngster, so a laptop with a new OS is not something she is an expert on (I guess I'm not a youngster either but I am a mad scientist)

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,210
Credit: 43,575,203,008
RAC: 44,375,753

Quote:
WHY did Boinc decide to give her not very fast quad-core 27 tasks?


Bad luck, I guess.

There are some previous BRP6 tasks crunched on the Intel GPU. The last was sent on Feb 05 and returned on Feb 07. There doesn't seem to be anything else until Feb 12. At that time the machine got 4 BRP6 followed by small batches of O1AST at roughly 1 minute intervals. Looks like the cache setting was large enough (and perhaps the estimate on the new tasks was small enough) to allow the whole 26 to be downloaded over a period of about 6 minutes. Perhaps the previous BRP6 done on the Intel GPU had been done faster than expected and had caused the DCF to be rather lower than appropriate for the GW tasks.

I'm guessing this just happened to correspond to the release of the V1.01 app together with tasks available for download and a low DCF to encourage lots of them. Just the luck of the draw. I've seen this sort of thing happen before so when there's a brand new app ready to be tested, I try to avoid the very first flush of tasks. When I want to join in, I make sure my cache setting is so low that I can't get more than a couple to start with. I have a dual core machine right now with about 2-3 hours to go on the last two FGRPB1. It's only asking for O1AST tasks now and it has a cache setting of 0.25 days. I reckon it will be ready to download and crunch as soon as they make some more available. It might even get a resend of one of the previous lot.
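As a rough back-of-envelope illustration of why a low DCF or a large cache pulls in so many tasks, here is a minimal sketch in Python. It is a simplified model and not BOINC's actual work-fetch algorithm: it assumes the client scales the server's raw runtime estimate by the DCF and keeps asking until every core has a full cache of estimated work. The function name and the 6 hour raw estimate are hypothetical; only the 0.25 day cache and the dual core come from the paragraph above.

```python
import math

SECONDS_PER_DAY = 86_400


def tasks_to_fill_cache(cache_days, n_cores, raw_estimate_s, dcf):
    """Rough count of tasks needed to fill the work cache.

    Assumes the client scales the server's raw runtime estimate by the
    duration correction factor (DCF) and keeps requesting work until
    every core has cache_days of estimated work queued.
    """
    scaled_estimate_s = raw_estimate_s * dcf
    buffer_s = cache_days * SECONDS_PER_DAY * n_cores
    return math.ceil(buffer_s / scaled_estimate_s)


# Cache setting and core count are from the post; the 6 h raw estimate
# and both DCF values are illustrative assumptions only.
normal_dcf = tasks_to_fill_cache(cache_days=0.25, n_cores=2, raw_estimate_s=6 * 3600, dcf=1.0)
low_dcf = tasks_to_fill_cache(cache_days=0.25, n_cores=2, raw_estimate_s=6 * 3600, dcf=0.25)
print(normal_dcf, low_dcf)
```

Under these assumed inputs the same small cache asks for 2 tasks at a DCF of 1.0 but 8 tasks at a DCF of 0.25, which is the direction of the effect described above.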

Quote:
I changed hers from here (Location setting) not to get anymore but not much I can do about that one that was doing fine with the GPU tasks getting those CPU tasks all of a sudden.


You should put it back on just BRP6 for the Intel GPU. It was doing very well on those. You really don't want to run alpha test stuff on a machine you can't directly access :-).

Quote:

... (I guess I'm not a youngster either but I am a mad scientist)


We're ALL mad scientists around here mate, even if we were something else in another life ;-). Take a look at our resident 'refugee from an otherwise highly esteemed profession'. He obviously knows more about black holes than about how to cure a pain in a black hole ... I rest my case :-).

Cheers,
Gary.

Christian Beer
Moderator
Joined: 9 Feb 05
Posts: 595
Credit: 96,904,763
RAC: 0

Another Update:

The 1.02 apps solve the missing result file problem (upload failure -161) and we already receive all of the result files. The validator is already running and we keep a lookout for any validation errors.

We will grant Credit to all those who suffered from the upload failure later this week.

There will be an update to 1.03 shortly that fixes some problems with checkpointing that we found.

I'm also going to generate more work after the apps are updated so your machines can keep busy.

We are aware that runtimes seem to be "off the scale". But this was a little bit expected, so we can tune the main search. The runtimes on a given host seem to be consistent. We don't know yet why some hosts take 6h and some 24h. I will dig into that when there are more successful results available to compile proper statistics.

If you find new problems with the 1.03 version please open a new thread in Problems and Bug Reports.
