Fermi LAT Gamma-ray pulsar search "FGRP2" - shorter tasks

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,938
Credit: 199,766,723
RAC: 47,546
Topic 197181

We found that, due to problems with our validator, we need to re-run ~200k old, shorter FGRP2 tasks. We are currently manually mixing these in among the new, longer ones. These are all from data files below LATeah0030, and the WU names end in "_RERUN".

BM


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,205
Credit: 43,113,378,263
RAC: 45,140,514


Quote:
... These are all from data files below LATeah0030 ...


Hi Bernd,

I notice we have now gone above the LATeah0030 you mentioned, so I guess there must be more needed than you originally thought. Can you please give an indication of the new upper limit for these "_RERUN" jobs?

As you may recall, we had a discussion about the file sizes of all the skygrids - particularly the very large ones at the time that LATeah0023 and LATeah0024 data was being crunched initially. Recently, there have been a lot of "_RERUN" tasks for these. Those started tapering off a little some days ago and I'm now seeing reruns for the range 0026 to 0029 as well. I thought that rerun tasks might be just about finished. However, this morning, I'm seeing reruns for 0031 and 0032, so I'm interested to find out how many more of these there are and where the new upper limit for LATeah00xx might be.

I should hasten to add that I'm in no way concerned by any of this. I'm also not in need of any "sticky" files mechanism as was mentioned in the previous discussion. These days, I'm using a rather involved shell script that ensures that all participating hosts always have a full complement of the files they are likely to need. This script automates the saving and deploying of all such files using rsync. It also allows for files to be deleted from the cache once I think they are finished with. It's just that I didn't know that files above 0030 were going to be 'needed' again :-). I'd kept them all from when those were first issued but hadn't been redeploying them. I've hastily added them back into the mix now :-).

I've added a few bells and whistles to this script which allow me to do a number of things from managing the cached files to recording useful stats (for me anyway) of exactly what is happening on each host that participates in the scheme. I find it very useful to have these hosts do all their downloading in my ISP's 'off-peak' period where I have a much larger allowance. Each morning I can review the log to see what new data files were downloaded, how many potential downloads were skipped because the files were already available, how many tasks each host requested, etc.

As an example, I've included below a copy of this morning's log in full. The 'off-peak' time starts at 1.00am local time, so you will notice that each host fills its work cache sometime after 1.00am. There are 20 hosts in this little group. Each host in turn syncs itself with the cache and then makes a request for new work. If any one host downloads a 'new' data file, it will be pushed back to the cache and so be immediately available to all subsequent hosts. That way, any such file is only ever downloaded once. Notice that host 192.168.0.116 downloaded the new file skygrid_LATeah0045U_0720.0.dat and that host 192.168.0.125 (2 hosts later) subsequently acquired this file from the cache and so was able to skip the download - as did quite a few other hosts later on as well. However, the thing that really caught my attention was host 192.168.0.162 downloading file skygrid_LATeah0032U_0304.0.dat. "I thought these were supposed to finish at 0030, what the @#$%$#@ is going on ...", sez he, muttering to himself ... :-).

[pre]New run to get more tasks for hosts listed below -- started at Sat Oct 26 01:13:04 EST 2013.
Options - cleanstr=, datastr=*LATeah*, Days=5.2, Pause=380, Hosts=tchosts, Verbose=yes
===========================================================================================
TIME IP ADDRESS NEW TASKS STATS OR FILES DOWNLOADED / SKIPPED
---- ---------- ---------------------------------------------
01:13:04: 192.168.0.112 --> 17 new tasks received made up of 5 FGRP2 and 12 BRP5 (PAS)
--> No new data files downloaded but 1 was skipped.
--> skipped skygrid_LATeah0045U_0688.0.dat

01:20:14: 192.168.0.116 --> 14 new tasks received made up of 4 FGRP2 and 10 BRP5 (PAS)
--> 1 new data file(s) downloaded and 1 skipped.
--> dloaded skygrid_LATeah0045U_0720.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat

01:27:25: 192.168.0.117 --> 16 new tasks received made up of 3 FGRP2 and 13 BRP5 (PAS)
--> No new data files downloaded but 1 was skipped.
--> skipped skygrid_LATeah0045U_0688.0.dat

01:34:37: 192.168.0.125 --> 20 new tasks received made up of 10 FGRP2 and 10 BRP5 (PAS)
--> 2 new data file(s) downloaded and 3 skipped.
--> dloaded skygrid_LATeah0044U_0336.0.dat
--> dloaded skygrid_LATeah0045U_0656.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped LATeah0044U.dat
--> skipped skygrid_LATeah0045U_0720.0.dat

01:41:49: 192.168.0.155 --> 22 new tasks received made up of 8 FGRP2 and 14 BRP5 (PAS)
--> No new data files downloaded but 5 were skipped.
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped skygrid_LATeah0045U_0720.0.dat
--> skipped skygrid_LATeah0024U_1136.0.dat
--> skipped skygrid_LATeah0045U_0656.0.dat
--> skipped skygrid_LATeah0045U_0592.0.dat

01:48:57: 192.168.0.162 --> 27 new tasks received made up of 4 FGRP2 and 23 BRP5 (PAS)
--> 1 new data file(s) downloaded and 5 skipped.
--> dloaded skygrid_LATeah0032U_0304.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped LATeah0032U.dat
--> skipped LATeah0029U.dat
--> skipped skygrid_LATeah0029U_0912.0.dat
--> skipped skygrid_LATeah0045U_0656.0.dat

01:56:17: 192.168.0.168 --> 8 new tasks received made up of 8 FGRP2 and 0 BRP5 (PAS)
--> 1 new data file(s) downloaded and 5 skipped.
--> dloaded skygrid_LATeah0032U_0624.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped skygrid_LATeah0030U_1392.0.dat
--> skipped skygrid_LATeah0030U_1360.0.dat
--> skipped LATeah0032U.dat
--> skipped skygrid_LATeah0030U_1424.0.dat

02:03:16: 192.168.0.183 --> 14 new tasks received made up of 5 FGRP2 and 9 BRP5 (PAS)
--> No new data files downloaded but 2 were skipped.
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped skygrid_LATeah0024U_1136.0.dat

02:10:22: 192.168.0.184 --> 13 new tasks received made up of 5 FGRP2 and 8 BRP5 (PAS)
--> No new data files downloaded but 3 were skipped.
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped skygrid_LATeah0024U_1040.0.dat
--> skipped skygrid_LATeah0045U_0656.0.dat

02:17:31: 192.168.0.186 --> 13 new tasks received made up of 3 FGRP2 and 10 BRP5 (PAS)
--> No new data files downloaded but 1 was skipped.
--> skipped skygrid_LATeah0045U_0688.0.dat

02:24:40: 192.168.0.187 --> 5 new tasks received made up of 5 FGRP2 and 0 BRP5 (PAS)
--> No new data files downloaded but 2 were skipped.
--> skipped skygrid_LATeah0045U_0656.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat

02:32:02: 192.168.0.188 --> 12 new tasks received made up of 4 FGRP2 and 8 BRP5 (PAS)
--> 2 new data file(s) downloaded and 2 skipped.
--> dloaded skygrid_LATeah0032U_0784.0.dat
--> dloaded skygrid_LATeah0032U_0720.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped LATeah0032U.dat

02:39:05: 192.168.0.189 --> 14 new tasks received made up of 5 FGRP2 and 9 BRP5 (PAS)
--> No new data files downloaded but 3 were skipped.
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped skygrid_LATeah0024U_0048.0.dat
--> skipped skygrid_LATeah0024U_1136.0.dat

02:46:14: 192.168.0.190 --> 35 new tasks received made up of 19 FGRP2 and 16 BRP5 (PAS)
--> No new data files downloaded but 5 were skipped.
--> skipped LATeah0044U.dat
--> skipped skygrid_LATeah0044U_0432.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped skygrid_LATeah0027U_1104.0.dat
--> skipped skygrid_LATeah0044U_0656.0.dat

02:53:17: 192.168.0.191 --> 16 new tasks received made up of 6 FGRP2 and 10 BRP5 (PAS)
--> No new data files downloaded but 3 were skipped.
--> skipped skygrid_LATeah0024U_1168.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped skygrid_LATeah0045U_0656.0.dat

03:00:22: 192.168.0.192 --> 11 new tasks received made up of 4 FGRP2 and 7 BRP5 (PAS)
--> 1 new data file(s) downloaded and 3 skipped.
--> dloaded skygrid_LATeah0032U_0400.0.dat
--> skipped LATeah0032U.dat
--> skipped skygrid_LATeah0045U_0624.0.dat
--> skipped skygrid_LATeah0045U_0656.0.dat

03:07:24: 192.168.0.193 --> 21 new tasks received made up of 10 FGRP2 and 11 BRP5 (PAS)
--> No new data files downloaded but 4 were skipped.
--> skipped skygrid_LATeah0045U_0720.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped LATeah0031U.dat
--> skipped skygrid_LATeah0031U_0816.0.dat

03:14:26: 192.168.0.196 --> 23 new tasks received made up of 8 FGRP2 and 15 BRP5 (PAS)
--> No new data files downloaded but 2 were skipped.
--> skipped skygrid_LATeah0045U_0656.0.dat
--> skipped skygrid_LATeah0045U_0688.0.dat

03:21:26: 192.168.0.197 --> 19 new tasks received made up of 8 FGRP2 and 11 BRP5 (PAS)
--> No new data files downloaded but 2 were skipped.
--> skipped skygrid_LATeah0045U_0688.0.dat
--> skipped skygrid_LATeah0045U_0656.0.dat

03:28:26: 192.168.0.199 --> 12 new tasks received made up of 4 FGRP2 and 8 BRP5 (PAS)
--> No new data files downloaded but 2 were skipped.
--> skipped skygrid_LATeah0045U_0656.0.dat
--> skipped skygrid_LATeah0045U_0720.0.dat

===========================================================================================

Total of 332 new tasks overall - 128 FGRP2 tasks and 204 BRP5 (PAS) - for 20 active hosts.
A total of 8 new data files were downloaded and 55 potential downloads were skipped.
Run finished for 20 active host(s) and 0 skipped host(s) on Sat Oct 26 03:35:26 EST 2013.
###########################################################################################[/pre]
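
The share-once behaviour described before the log can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual shell script (which does its syncing with rsync); the function name `simulate_run` and the example file lists are hypothetical.

```python
def simulate_run(cache, requests):
    """Process hosts in order so each file is fetched from the project at most once.

    cache    -- names of files already held in the shared cache
    requests -- list of (host, [files that host's new tasks need]) in run order
    Returns a list of (host, downloaded, skipped) tuples.
    """
    cache = set(cache)  # work on a copy so the caller's set is untouched
    report = []
    for host, needed in requests:
        downloaded, skipped = [], []
        for name in needed:
            if name in cache:
                # The host synced with the cache first, so no download is needed.
                skipped.append(name)
            else:
                # Missing file: fetch it once, then push it back to the cache
                # so every subsequent host can skip it.
                downloaded.append(name)
                cache.add(name)
        report.append((host, downloaded, skipped))
    return report


if __name__ == "__main__":
    # Hypothetical starting state, loosely modelled on two hosts from the log.
    start = {"skygrid_LATeah0045U_0688.0.dat"}
    run = [
        ("192.168.0.116", ["skygrid_LATeah0045U_0720.0.dat",
                           "skygrid_LATeah0045U_0688.0.dat"]),
        ("192.168.0.125", ["skygrid_LATeah0045U_0688.0.dat",
                           "skygrid_LATeah0045U_0720.0.dat"]),
    ]
    for host, dl, sk in simulate_run(start, run):
        print(host, "downloaded", len(dl), "skipped", len(sk))
```

Processing the hosts strictly one after another is what makes this work: a file that one host downloads is already in the cache by the time the next host syncs.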

One peculiar thing I've noticed is that some of the above hosts actually don't report fully all the skipped downloads. I haven't investigated it thoroughly but it seems that hosts running a V7 BOINC don't fully record in stdoutdae.txt (which is what I'm using to extract info about what gets downloaded and what gets skipped) the details of every file whose download was skipped. It doesn't affect anything and I'm not bothered by it. It just means that there were probably even more than the total of 55 skipped downloads reported above. Over the last couple of weeks I seem to be avoiding the download of between 50 and 100 of these files every day so I consider it well worth the effort of working out a plan of how to do it and then writing a script that actually worked correctly :-).
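
For anyone wanting to cross-check totals like these, a log in the format shown above can be tallied with a short Python sketch. The regular expressions are assumptions based only on the lines visible in this post (the real log may use different spacing), `tally_log` is a hypothetical name, and this reads the script's own log rather than stdoutdae.txt.

```python
import re
from collections import Counter

# Patterns inferred from the log excerpt in this post; treat them as assumptions.
TASKS_RE = re.compile(
    r"(\d+) new tasks received made up of (\d+) FGRP2 and (\d+) BRP5")
FILE_RE = re.compile(r"-->\s+(dloaded|skipped)\s+(\S+\.dat)")


def tally_log(lines):
    """Return (totals, files): overall counters plus per-file counts by action."""
    totals = Counter()
    files = {"dloaded": Counter(), "skipped": Counter()}
    for line in lines:
        match = TASKS_RE.search(line)
        if match:
            # One "new tasks received" line is written per host.
            totals["hosts"] += 1
            totals["tasks"] += int(match.group(1))
            totals["fgrp2"] += int(match.group(2))
            totals["brp5"] += int(match.group(3))
            continue
        match = FILE_RE.search(line)
        if match:
            action, name = match.groups()
            totals[action] += 1
            files[action][name] += 1
    return totals, files


if __name__ == "__main__":
    sample = [
        "01:20:14: 192.168.0.116 --> 14 new tasks received made up of 4 FGRP2 and 10 BRP5 (PAS)",
        "--> 1 new data file(s) downloaded and 1 skipped.",
        "--> dloaded skygrid_LATeah0045U_0720.0.dat",
        "--> skipped skygrid_LATeah0045U_0688.0.dat",
    ]
    totals, files = tally_log(sample)
    print(dict(totals))
    print(files["skipped"].most_common(3))
```

The per-file counters make it easy to see which files are skipped most often, i.e. which ones are saving the most downloads.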

Cheers,
Gary.

fadedrose
Joined: 6 Apr 13
Posts: 263
Credit: 316,405
RAC: 0


I just got one of those reruns (033U), and it's not what I'd call short. I just finished a similar one that came in with an estimate of over 50 hours and took about 40.

http://einsteinathome.org/task/407974315

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,205
Credit: 43,113,378,263
RAC: 45,140,514


Quote:
... it's not what I'd call short.


No, it's not ... and that's because of the change in task size (by a factor of 11) some time ago. Most of the reruns in the range that Bernd originally specified were the older, shorter tasks. I think the change in task size occurred around LATeah0026U.dat. So any reruns for tasks that were originally issued in the large format after that point are going to be large as well.

I'm guessing that the validator problems referred to in the first post have extended further than originally thought. That's OK - you win some and you lose some but that's life.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,938
Credit: 199,766,723
RAC: 47,546


Sorry, I noticed that my original posting was ambiguous.

There were indeed a couple of tasks that needed to be re-run that were of the newer, long type, but I didn't even find these worth a note as they should blend perfectly with the newly generated ones. So I meant to post a note only about the shorter tasks that were now mixed in between.

In any case, the last WUs that needed to be re-run were generated and sent out yesterday; apart from a few re-sends you shouldn't get any more _RERUNs.

We (mainly Holger, "his" student Colin Clark and Heinz-Bernd) are currently re-working the application. One of the goals is to get rid of the necessity for the skygrid files completely. We hope to release the new application together with new data from the Fermi satellite later this year in a new "run" (as a new "application" in BOINC terms), probably (not surprisingly) named "FGRP3".

BM


Sid
Joined: 17 Oct 10
Posts: 145
Credit: 466,073,677
RAC: 315,484


Quote:

We hope to release the new application together with new data from the Fermi satellite later this year in a new "run" (as a new "application" in BOINC terms), probably (not surprisingly) named "FGRP3".

BM

Will this new FGRP3 application use the GPU, or are we talking about a pure CPU application?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,938
Credit: 199,766,723
RAC: 47,546


The goal is to get the results of the OpenCL (GPU) version to match the results of the "gold standard" CPU version. It is not clear yet whether this can be achieved at the initial launch of FGRP3. Possibly we will launch FGRP3 as a CPU-only app and later extend it to GPUs (which is roughly what we planned for FGRP2, but failed to achieve).

BM


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,205
Credit: 43,113,378,263
RAC: 45,140,514


Quote:
In any case, the last tasks that needed to be re-run were generated and sent out yesterday; apart from a few re-sends you shouldn't get any more _RERUNs.


Thanks very much for taking the time to explain the situation and to give us some insight into future directions. It's much appreciated.

When I looked through this morning's download logs, I noticed a reduced number of new FGRP2 tasks - i.e. no large numbers of short duration _RERUNs anymore. In previous days there were lots of those. Virtually all of the new tasks were for LATeah0045U.dat, and these are new and not part of any previous issue. As predicted, I did get 2 _RERUNs which were 'resends' of previously issued RERUNs - they had a suffix of _2 or higher. I also got a quite lonely first time issue of a _RERUN - it had a _1 suffix.

My file share cache of LATeah0xxxU.dat files and all the associated skygrids stands at around 4GB (around 1K files) so I'm quite pleased that I'll be able to retire a lot of these and get the size of the cache down considerably as the chance of resends of _RERUNs diminishes. That will be very easy to do as the functionality is built into the shell script.
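
Picking out cache files to retire could look something like the following Python sketch. It assumes only the two file-name patterns visible in the log earlier in this thread (`LATeahNNNNU.dat` and `skygrid_LATeahNNNNU_*.dat`); the function names and the cut-off value are hypothetical, and the real script handles this through its own options.

```python
import re

# Matches e.g. "LATeah0044U.dat" and "skygrid_LATeah0045U_0688.0.dat".
# The pattern is inferred from names seen in the log, so treat it as an assumption.
NAME_RE = re.compile(r"^(?:skygrid_)?LATeah(\d{4})U(?:_[\d.]+)?\.dat$")


def data_file_number(name):
    """Return the LATeah number embedded in a file name, or None if no match."""
    match = NAME_RE.match(name)
    return int(match.group(1)) if match else None


def files_to_retire(names, max_number):
    """Return cached file names whose LATeah number is at or below max_number."""
    retire = []
    for name in names:
        number = data_file_number(name)
        if number is not None and number <= max_number:
            retire.append(name)
    return sorted(retire)


if __name__ == "__main__":
    cached = [
        "LATeah0044U.dat",
        "skygrid_LATeah0024U_1136.0.dat",
        "LATeah0029U.dat",
        "skygrid_LATeah0045U_0688.0.dat",
    ]
    # Hypothetical cut-off: retire everything up to and including LATeah0030.
    for name in files_to_retire(cached, 30):
        print("retire", name)
```

Files that do not match either pattern are simply left alone, which errs on the side of keeping anything unrecognised in the cache.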

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,938
Credit: 199,766,723
RAC: 47,546


FGRP3 is being tested over at Albert@Home.

BM


Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 1,839,208,624
RAC: 566,274


Quote:

FGRP3 is being tested over at Albert@Home.

BM

Is that ported to GPU?

-----

Alex
Joined: 1 Mar 05
Posts: 449
Credit: 338,845,946
RAC: 3,406


Quote:

FGRP3 is being tested over at Albert@Home.

BM

Could you please take a look over at Albert? There seems to be a problem with work distribution.

Alexander
