Trouble with Gamma-ray pulsar search #2 v0.01

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,065
Credit: 226,641,539
RAC: 28,547

I built and published a new Linux App (0.02) which should fix this problem for now. For a number of reasons this doesn't include David's fix, but instead was built with an old version of the BOINC API that is known to work.

BM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,570
Credit: 81,611,069,685
RAC: 67,273,960

I've downloaded the new app and edited the state file to 're-version' all existing tasks in the cache to 0.02. On restart, everything seems fine and the existing tasks have started crunching with the new app. Progress seems exactly the same as with the 0.01 version.

The next step is to try the upgrade to BOINC 7.0.42 again and see if the 'exit 0s' problem has been fixed - I'm sure it will be, but you never know until you try :-).

EDIT: Thanks very much Bernd!! BOINC is now 7.0.42, FGRP2 v0.02 is running happily with no sign of any 'exit 0s'. Everything looks great so far! I'll have to wait a bit to see if completed tasks validate - not expecting any problem, though :-). Now to try out the new app configuration feature!

EDIT2: One 0.02 task has now been validated. The only noteworthy thing that has happened is that one of the 2 BRP4 tasks that was crunching alongside the very first FGRP2_0.02 task decided to have an 'exit 0' moment. It happened soon after 7.0.42 was first launched and the three 'in progress' tasks had restarted - perhaps 3-5 minutes after restart. I had just increased the cache size a little and there were new tasks being downloaded. It's now more than an hour later and there have been no repeats of this single incident.

EDIT3: Perhaps I spoke too soon. It's now close to two hours on BOINC 7.0.42 and both in-flight BRP4 tasks (new tasks started about 15mins ago) have had a single 'exit 0' each, just a few minutes ago. They've restarted and are making progress without further incident (for the moment). I've not previously noticed any 'exit 0s' on this particular host when it was on 7.0.31 or earlier BOINCs. Does the BRP4 app need some 'protection' against 'exit 0s' as well?

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,570
Credit: 81,611,069,685
RAC: 67,273,960

Quote:
Does the BRP4 app need some 'protection' against 'exit 0s' as well?


To answer my own question - perhaps it does.

The host linked to in my previous message has been running for 2 days now on 7.0.42 with an app_config.xml file that sets the CPU usage per GPU task to 0.45 rather than the standard 0.5 supplied by the project. This means that when running GPU tasks 2x, one CPU core will not be automatically excluded from use. On this dual-core host, with the HD7770 card running 2x, I could now run 2 CPU tasks (FGRP2) to see what would happen.
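The arithmetic behind that change can be sketched in a few lines of Python. This is a deliberately simplified model, not BOINC's actual scheduling code: it assumes each GPU task reserves a fixed fraction of a core and that another single-core CPU task is started whenever the total reserved so far is still below the core count. The function name and parameters are hypothetical.

```python
def cpu_tasks_that_fit(n_cores, gpu_tasks, cpu_per_gpu_task):
    """Count the CPU tasks a simplified scheduler would start.

    Assumption (not taken from the BOINC source): each GPU task reserves
    cpu_per_gpu_task cores, and one more single-core CPU task is started
    whenever the total reserved so far is still below n_cores.
    """
    reserved = gpu_tasks * cpu_per_gpu_task
    cpu_tasks = 0
    while reserved < n_cores:
        reserved += 1.0
        cpu_tasks += 1
    return cpu_tasks


# Dual-core host running GPU tasks 2x
print(cpu_tasks_that_fit(2, 2, 0.5))   # project default -> 1
print(cpu_tasks_that_fit(2, 2, 0.45))  # app_config.xml value -> 2
```

Under this model, two GPU tasks at 0.5 reserve a whole core and leave room for one CPU task, while two at 0.45 reserve only 0.9 of a core, so a second CPU task still fits.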

I had previously done this experiment on NVIDIA GPUs, 550Ti, 650, and 650Ti. In each of those cases, the GPU crunch time running 2x was not affected by loading all CPU cores with a CPU task. The CPU task crunch time was increased a little with all cores loaded, but there was still a significant increase in output that justified running this way. This was particularly true for the 'Kepler' GPUs. It didn't seem to matter if the CPU was a dual, a quad, or a dual with HT (4 virtual cores). I was interested to see what happened in the AMD GPU case.

With the HD7770 both GPU and CPU tasks take longer to crunch. With two days of data to analyse, the interesting finding is that the extra CPU task running does compensate for the slower rates of both types of tasks and does seem to give a very slight overall gain, but really too small to worry about.

I intend to revert to the default settings. I could leave things as they are but I'm troubled by the continuing steady flow of occasional 'exit 0s' from the BRP4 app as reported previously, even though they do complete and validate. It would seem that I need to go back to 7.0.31 in order to cure that. Unless someone knows for sure, I guess I'll need to do a bit of research to see if there are any 'gotchas' (like there are with AP) in backing out of the app config mechanism :-).

Cheers,
Gary.

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

Quote:

I built and published a new Linux App (0.02) which should fix this problem for now. For a number of reasons this doesn't include David's fix, but instead was built with an old version of the BOINC API that is known to work.

BM

This definitely fixed it here, thanks. The old 0.01 app always failed within a few seconds; the new 0.02 app has been rock solid on Linux, on both 7.0.28 and 7.1-prerelease (probably 7.0.36/.37, as Claggy explained above).

Justin La Sotten
Joined: 9 Dec 05
Posts: 6
Credit: 2,930,440
RAC: 0

Quote:

I built and published a new Linux App (0.02) which should fix this problem for now. For a number of reasons this doesn't include David's fix, but instead was built with an old version of the BOINC API that is known to work.

BM

I just wanted to say thank you, the new app seems to be working great!

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2,609,841
RAC: 175

For some reason, JPLEPH.405 is being downloaded on one request, deleted on the next request, and then downloaded again later:

15/01/2013 10:49:04 | Einstein@Home | [sched_op] Starting scheduler request
15/01/2013 10:49:04 | Einstein@Home | Sending scheduler request: To fetch work.
15/01/2013 10:49:04 | Einstein@Home | Requesting new tasks for CPU
15/01/2013 10:49:04 | Einstein@Home | [sched_op] CPU work request: 5267.86 seconds; 0.00 devices
15/01/2013 10:49:04 | Einstein@Home | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
15/01/2013 10:49:07 | Einstein@Home | Scheduler request completed: got 1 new tasks
15/01/2013 10:49:07 | Einstein@Home | [sched_op] Server version 611
15/01/2013 10:49:07 | Einstein@Home | Project requested delay of 60 seconds
15/01/2013 10:49:07 | Einstein@Home | [sched_op] estimated total CPU task duration: 9956 seconds
15/01/2013 10:49:07 | Einstein@Home | [sched_op] estimated total NVIDIA task duration: 0 seconds
15/01/2013 10:49:07 | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
15/01/2013 10:49:07 | Einstein@Home | [sched_op] Reason: requested by project
15/01/2013 10:49:10 | Einstein@Home | Started download of hsgamma_FGRP2_0.01_windows_intelx86.exe
15/01/2013 10:49:10 | Einstein@Home | Started download of LATeah0010U.dat
15/01/2013 10:49:10 | Einstein@Home | Started download of skygrid_LATeah0010U_1232.0.dat
15/01/2013 10:49:10 | Einstein@Home | Started download of JPLEPH.405
15/01/2013 10:49:22 | Einstein@Home | Finished download of LATeah0010U.dat
15/01/2013 10:49:45 | Einstein@Home | Finished download of skygrid_LATeah0010U_1232.0.dat
15/01/2013 10:50:07 | Einstein@Home | Finished download of hsgamma_FGRP2_0.01_windows_intelx86.exe
15/01/2013 10:50:14 | Einstein@Home | Finished download of JPLEPH.405
15/01/2013 10:50:14 | Einstein@Home | Starting task LATeah0010U_1232.0_58760_0.0_1 using hsgamma_FGRP2 version 1 in slot 4
15/01/2013 10:51:04 | Einstein@Home | [sched_op] Starting scheduler request
15/01/2013 10:51:04 | Einstein@Home | Sending scheduler request: To fetch work.
15/01/2013 10:51:04 | Einstein@Home | Requesting new tasks for CPU
15/01/2013 10:51:04 | Einstein@Home | [sched_op] CPU work request: 6283.45 seconds; 0.00 devices
15/01/2013 10:51:04 | Einstein@Home | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
15/01/2013 10:51:10 | Einstein@Home | Scheduler request completed: got 1 new tasks
15/01/2013 10:51:10 | Einstein@Home | [sched_op] Server version 611
15/01/2013 10:51:10 | Einstein@Home | BOINC will delete file JPLEPH.405 (no longer needed)
15/01/2013 10:51:10 | Einstein@Home | Project requested delay of 60 seconds
15/01/2013 10:51:10 | Einstein@Home | [sched_op] estimated total CPU task duration: 9955 seconds
15/01/2013 10:51:10 | Einstein@Home | [sched_op] estimated total NVIDIA task duration: 0 seconds
15/01/2013 10:51:10 | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
15/01/2013 10:51:10 | Einstein@Home | [sched_op] Reason: requested by project

15/01/2013 20:54:58 | Einstein@Home | [sched_op] Starting scheduler request
15/01/2013 20:54:58 | Einstein@Home | Sending scheduler request: To fetch work.
15/01/2013 20:54:58 | Einstein@Home | Requesting new tasks for CPU
15/01/2013 20:54:58 | Einstein@Home | [sched_op] CPU work request: 44928.00 seconds; 2.00 devices
15/01/2013 20:54:58 | Einstein@Home | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
15/01/2013 20:55:02 | Einstein@Home | Scheduler request completed: got 5 new tasks
15/01/2013 20:55:02 | Einstein@Home | [sched_op] Server version 611
15/01/2013 20:55:02 | Einstein@Home | Project requested delay of 60 seconds
15/01/2013 20:55:02 | Einstein@Home | [sched_op] estimated total CPU task duration: 47874 seconds
15/01/2013 20:55:02 | Einstein@Home | [sched_op] estimated total NVIDIA task duration: 0 seconds
15/01/2013 20:55:02 | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
15/01/2013 20:55:02 | Einstein@Home | [sched_op] Reason: requested by project
15/01/2013 20:55:04 | Einstein@Home | Started download of LATeah0010U.dat
15/01/2013 20:55:04 | Einstein@Home | Started download of skygrid_LATeah0010U_1264.0.dat
15/01/2013 20:55:04 | Einstein@Home | Started download of JPLEPH.405
15/01/2013 20:55:04 | Einstein@Home | Started download of skygrid_LATeah0010U_1296.0.dat
15/01/2013 20:55:09 | Einstein@Home | Finished download of LATeah0010U.dat
15/01/2013 20:55:27 | Einstein@Home | Finished download of skygrid_LATeah0010U_1264.0.dat
15/01/2013 20:55:27 | Einstein@Home | Finished download of skygrid_LATeah0010U_1296.0.dat
15/01/2013 20:55:40 | Einstein@Home | Finished download of JPLEPH.405
15/01/2013 20:55:40 | Einstein@Home | Starting task LATeah0010U_1296.0_187860_0.0_0 using hsgamma_FGRP2 version 1 in slot 0
15/01/2013 20:55:40 | Einstein@Home | Starting task LATeah0010U_1296.0_188540_0.0_0 using hsgamma_FGRP2 version 1 in slot 1
15/01/2013 20:56:03 | Einstein@Home | [sched_op] Starting scheduler request
15/01/2013 20:56:03 | Einstein@Home | Sending scheduler request: To fetch work.
15/01/2013 20:56:03 | Einstein@Home | Requesting new tasks for CPU
15/01/2013 20:56:03 | Einstein@Home | [sched_op] CPU work request: 3346.82 seconds; 0.00 devices
15/01/2013 20:56:03 | Einstein@Home | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
15/01/2013 20:56:10 | Einstein@Home | Scheduler request completed: got 1 new tasks
15/01/2013 20:56:10 | Einstein@Home | [sched_op] Server version 611
15/01/2013 20:56:10 | Einstein@Home | BOINC will delete file JPLEPH.405 (no longer needed)
15/01/2013 20:56:10 | Einstein@Home | Project requested delay of 60 seconds
15/01/2013 20:56:10 | Einstein@Home | [sched_op] estimated total CPU task duration: 25847 seconds
15/01/2013 20:56:10 | Einstein@Home | [sched_op] estimated total NVIDIA task duration: 0 seconds
15/01/2013 20:56:10 | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
15/01/2013 20:56:10 | Einstein@Home | [sched_op] Reason: requested by project
15/01/2013 20:56:12 | Einstein@Home | Started download of einstein_S6BucketLVE_1.04_windows_intelx86__SSE2.exe

Any idea why? This is on my E8500 host with BOINC 7.0.44.

Claggy
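A pattern like the one in this log can be picked out mechanically. The following is a minimal sketch, assuming the event log has been saved as plain text with the same "Started download of ..." and "BOINC will delete file ..." messages shown above; the function name is hypothetical and the three sample lines are copied from the log in this post.

```python
import re
from collections import defaultdict

# Message fragments as they appear in the event log quoted above
DOWNLOAD = re.compile(r"Started download of (\S+)")
DELETE = re.compile(r"BOINC will delete file (\S+)")


def churned_files(log_lines):
    """Return {filename: count} for files downloaded again after a delete."""
    deleted = set()
    redownloads = defaultdict(int)
    for line in log_lines:
        match = DELETE.search(line)
        if match:
            deleted.add(match.group(1))
            continue
        match = DOWNLOAD.search(line)
        if match and match.group(1) in deleted:
            redownloads[match.group(1)] += 1
            deleted.discard(match.group(1))
    return dict(redownloads)


sample = [
    "15/01/2013 10:49:10 | Einstein@Home | Started download of JPLEPH.405",
    "15/01/2013 10:51:10 | Einstein@Home | BOINC will delete file JPLEPH.405 (no longer needed)",
    "15/01/2013 20:55:04 | Einstein@Home | Started download of JPLEPH.405",
]
print(churned_files(sample))  # {'JPLEPH.405': 1}
```

To run it over a whole saved log, pass the open file object in place of `sample`.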

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,570
Credit: 81,611,069,685
RAC: 67,273,960

Quote:

For some reason, JPLEPH.405 is being downloaded on one request, deleted on the next request, and then downloaded again later:

.....

Any idea why? This is on my E8500 host with BOINC 7.0.44.

Claggy


In all GW runs, there have been 'sun*' and 'earth*' ephemeris files which are required by all tasks and so are marked with the <sticky/> tag to prevent them from ever being deleted.

I'm guessing that the 'EPH' is short for ephemeris and that 'JPLEPH.405' should therefore also be a sticky file. I just had a look in a state file of mine and yes, the tag is there. However, that doesn't mean much as I have a vague recollection that I also saw this back at the start of FGRP1 and manually added the <sticky/> tag to prevent the problem and then promptly forgot about it. When I build a new machine, I don't install BOINC or attach to a project. I have a live USB external hard drive to install the OS, which is also a full repository of all the OS updates and a repository of various BOINC versions and all of the static files associated with a project and its various searches. This also includes state file templates which are preloaded with blocks for all the standard files that would otherwise have to be downloaded after attaching to a project. That certainly includes files like JPLEPH.405 and I would have made it sticky when I first added it to the template.

So I can't really tell for sure how it's being distributed these days, but if you check your state file and it's not sticky, we should certainly refer it to Bernd. With the hefty size of the FGRP2 run, that file is going to be required for quite a while.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3,070
Credit: 6,011,852,332
RAC: 2,503,249

Quote:
you check your state file and it's not sticky, we should certainly refer it to Bernd. With the hefty size of the FGRP2 run, that file is going to be required for quite a while.


Browsing my way through the initial section of one host's client_state.xml that has a myriad ...

I marked all occurrences of the sticky tag.

Many, many files have the sticky tag, including:

earth_09_11
sun_09_11
S6GC1_T60h_v1_Segments.seg
h1_0349.15_S6GC1 and many similar h1... files
l1_0349.15_S6GC1 and many similar l1... files

Many, many other files do not have the sticky tag, on my host, including JPLEPH.405

Here I am looking for <sticky/> after the status tag and before the download URL tag(s) in a file section.

In case this helps.
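The same check can be scripted rather than done by eye. Below is a minimal sketch using Python's standard XML parser. The sample document is a made-up fragment, not a real state file: the element name `file_info`, the child names `name` and `status`, and the status values are assumptions for illustration, so check them against your own client_state.xml before relying on the result.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a client_state.xml file.
# Element names and status values are assumptions for illustration only.
SAMPLE = """
<client_state>
  <file_info>
    <name>earth_09_11</name>
    <status>1</status>
    <sticky/>
  </file_info>
  <file_info>
    <name>JPLEPH.405</name>
    <status>1</status>
  </file_info>
</client_state>
"""


def non_sticky_files(xml_text, element="file_info"):
    """List the names of file sections that have no <sticky/> child."""
    root = ET.fromstring(xml_text)
    return [
        section.findtext("name")
        for section in root.iter(element)
        if section.find("sticky") is None
    ]


print(non_sticky_files(SAMPLE))  # ['JPLEPH.405']
```

The `element` parameter is there because the name of the file section may differ between client versions.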

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,356
Credit: 182,534,832
RAC: 658,909

Yup. NASA's Jet Propulsion Laboratory keeps track of who is where and when in the solar system and is thus crucial for the 'de-Dopplering' of received signals that we do at E@H, referring findings about signals to the 'barycentre' ( ~ center of mass ) of the solar system. The 'when' aspect is the time of reception of the signals by whichever instrument(s) and not when we are analysing.

Cheers, Mike.

( edit ) Thus my guess is that the '.405' extension acts as an index into a larger set of files, each referring to a particular period of time.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,570
Credit: 81,611,069,685
RAC: 67,273,960

Quote:

Many, many files have the sticky tag, including:

earth_09_11
sun_09_11
S6GC1_T60h_v1_Segments.seg
h1_0349.15_S6GC1 and many similar h1... files
l1_0349.15_S6GC1 and many similar l1... files


Yes, these all need to be kept at all times for locality scheduling to work efficiently. One file type that is not sticky that I tend to make sticky manually is the skygrid... file. I find it annoying if you have a couple of quite different data frequency ranges in play and you temporarily run out of tasks for one frequency, the skygrid file for that range will be deleted only to be re-downloaded the very next time you get more tasks. Because I 'pre-seed' hosts with a big range of large data files before allowing them to ask for work, I make the appropriate skygrid file sticky while I'm setting up the template. For the life of the S6LV1 run, my data downloads were a tiny fraction of what they would have been otherwise.

I think this is what is happening to Claggy. He gets some FGRP2 tasks which are completed before his host requests any more. So the JPLEPH.405 file (now confirmed not to be sticky, thank you) gets deleted and re-downloaded the next time his host scores some new FGRP2 tasks. Like sun* and earth* files in the GW runs, JPLEPH.405 really needs to be sticky for the life of the FGRP2 run.

Cheers,
Gary.
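The delete and re-download cycle described above can be illustrated with a toy model. This is only a sketch of the reasoning in this post, not of the BOINC client itself: it assumes a non-sticky file is deleted as soon as no task on the host needs it and is fetched again when the next such task arrives. The function name and the task history are made up for illustration.

```python
def count_downloads(task_counts, sticky):
    """Count downloads of one shared file over successive scheduler contacts.

    task_counts[i] is the number of tasks on the host that need the file
    after contact i. Toy model: a non-sticky file is deleted as soon as no
    task needs it; a sticky file is never deleted.
    """
    present = False
    downloads = 0
    for needed_by in task_counts:
        if needed_by > 0 and not present:
            downloads += 1
            present = True
        elif needed_by == 0 and present and not sticky:
            present = False
    return downloads


# Hypothetical history: the host runs dry of FGRP2 tasks twice
history = [1, 0, 2, 0, 1]
print(count_downloads(history, sticky=False))  # 3
print(count_downloads(history, sticky=True))   # 1
```

In this model the file is fetched once per dry spell unless it is sticky, which matches the pattern in Claggy's log.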
