Trouble with Gamma-ray pulsar search #2 v0.01

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4305
Credit: 248728131
RAC: 31876

I built and published a new Linux app (0.02) which should fix this problem for now. For a number of reasons it doesn't include David's fix; instead it was built with an old version of the BOINC API that is known to work.

BM


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 115759005986
RAC: 35110193

I've downloaded the new app and edited the state file to 're-version' all existing tasks in the cache to 0.02. On restart, everything seems fine and the existing tasks have started crunching with the new app. Progress seems exactly the same as with the 0.01 version.

The next step is to try the upgrade to BOINC 7.0.42 again and see if the 'exit 0s' problem has been fixed - I'm sure it will be, but you never know until you try :-).

EDIT: Thanks very much Bernd!! BOINC is now 7.0.42, FGRP2 v0.02 is running happily with no sign of any 'exit 0s'. Everything looks great so far! I'll have to wait a bit to see if completed tasks validate - not expecting any problem, though :-). Now to try out the new app configuration feature!

EDIT2: One 0.02 task has now been validated. The only noteworthy thing that has happened is that one of the 2 BRP4 tasks that was crunching alongside the very first FGRP2_0.02 task decided to have an 'exit 0' moment. It happened soon after 7.0.42 was first launched and the three 'in progress' tasks had restarted - perhaps 3-5 minutes after restart. I had just increased the cache size a little and there were new tasks being downloaded. It's now more than an hour later and there have been no repeats of this single incident.

EDIT3: Perhaps I spoke too soon. It's now close to two hours on BOINC 7.0.42 and both in-flight BRP4 tasks (new tasks started about 15mins ago) have had a single 'exit 0' each, just a few minutes ago. They've restarted and are making progress without further incident (for the moment). I've not previously noticed any 'exit 0s' on this particular host when it was on 7.0.31 or earlier BOINCs. Does the BRP4 app need some 'protection' against 'exit 0s' as well?

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 115759005986
RAC: 35110193

Quote:
Does the BRP4 app need some 'protection' against 'exit 0s' as well?


To answer my own question - perhaps it does.

The host linked to in my previous message has been running for two days now on 7.0.42 with an app_config.xml file that sets the CPU usage per GPU task to 0.45 rather than the standard 0.5 supplied by the project. This means that when running GPU tasks 2x, one CPU core will not be automatically excluded from use. On a dual-core host with the HD7770 running 2x, I could now run two CPU tasks (FGRP2) to see what would happen.
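For anyone wanting to try the same thing, a minimal app_config.xml along these lines would do it. This is a sketch of the setup described above, not Gary's actual file; the app name einsteinbinary_BRP4 is my assumption, so check the real app name in your own state file or task properties before using it:

```xml
<!-- Run two GPU tasks per card (gpu_usage 0.5) while budgeting only
     0.45 of a CPU core per GPU task instead of the default 0.5, so
     2 x 0.45 = 0.9 cores and no whole core gets excluded from CPU work. -->
<app_config>
  <app>
    <name>einsteinbinary_BRP4</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>0.45</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

The file goes in the project directory and is read when the client starts or when you re-read config files.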

I had previously done this experiment on NVIDIA GPUs, 550Ti, 650, and 650Ti. In each of those cases, the GPU crunch time running 2x was not affected by loading all CPU cores with a CPU task. The CPU task crunch time was increased a little with all cores loaded, but there was still a significant increase in output that justified running this way. This was particularly true for the 'Kepler' GPUs. It didn't seem to matter if the CPU was a dual, a quad, or a dual with HT (4 virtual cores). I was interested to see what happened in the AMD GPU case.

With the HD7770 both GPU and CPU tasks take longer to crunch. With two days of data to analyse, the interesting finding is that the extra CPU task running does compensate for the slower rates of both types of tasks and does seem to give a very slight overall gain, but really too small to worry about.

I intend to revert to the default settings. I could leave things as they are but I'm troubled by the continuing steady flow of occasional 'exit 0s' from the BRP4 app as reported previously, even though they do complete and validate. It would seem that I need to go back to 7.0.31 in order to cure that. Unless someone knows for sure, I guess I'll need to do a bit of research to see if there are any 'gotchas' (like there are with AP) in backing out of the app config mechanism :-).

Cheers,
Gary.

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169699457
RAC: 0

Quote:

I built and published a new Linux app (0.02) which should fix this problem for now. For a number of reasons it doesn't include David's fix; instead it was built with an old version of the BOINC API that is known to work.

BM

This definitely fixed it here, thanks. The old 0.01 app always failed within a few seconds; the new 0.02 app has been rock solid on Linux, both on 7.0.28 and the 7.1 prerelease (probably 7.0.36/.37, as Claggy explained above).

Justin La Sotten
Joined: 9 Dec 05
Posts: 6
Credit: 2930440
RAC: 0

Quote:

I built and published a new Linux app (0.02) which should fix this problem for now. For a number of reasons it doesn't include David's fix; instead it was built with an old version of the BOINC API that is known to work.

BM

I just wanted to say thank you, the new app seems to be working great!

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2699403
RAC: 0

For some reason I'm getting JPLEPH.405 downloaded on one request, then deleted on the next request, then downloaded again later:

15/01/2013 10:49:04 | Einstein@Home | [sched_op] Starting scheduler request
15/01/2013 10:49:04 | Einstein@Home | Sending scheduler request: To fetch work.
15/01/2013 10:49:04 | Einstein@Home | Requesting new tasks for CPU
15/01/2013 10:49:04 | Einstein@Home | [sched_op] CPU work request: 5267.86 seconds; 0.00 devices
15/01/2013 10:49:04 | Einstein@Home | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
15/01/2013 10:49:07 | Einstein@Home | Scheduler request completed: got 1 new tasks
15/01/2013 10:49:07 | Einstein@Home | [sched_op] Server version 611
15/01/2013 10:49:07 | Einstein@Home | Project requested delay of 60 seconds
15/01/2013 10:49:07 | Einstein@Home | [sched_op] estimated total CPU task duration: 9956 seconds
15/01/2013 10:49:07 | Einstein@Home | [sched_op] estimated total NVIDIA task duration: 0 seconds
15/01/2013 10:49:07 | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
15/01/2013 10:49:07 | Einstein@Home | [sched_op] Reason: requested by project
15/01/2013 10:49:10 | Einstein@Home | Started download of hsgamma_FGRP2_0.01_windows_intelx86.exe
15/01/2013 10:49:10 | Einstein@Home | Started download of LATeah0010U.dat
15/01/2013 10:49:10 | Einstein@Home | Started download of skygrid_LATeah0010U_1232.0.dat
15/01/2013 10:49:10 | Einstein@Home | Started download of JPLEPH.405
15/01/2013 10:49:22 | Einstein@Home | Finished download of LATeah0010U.dat
15/01/2013 10:49:45 | Einstein@Home | Finished download of skygrid_LATeah0010U_1232.0.dat
15/01/2013 10:50:07 | Einstein@Home | Finished download of hsgamma_FGRP2_0.01_windows_intelx86.exe
15/01/2013 10:50:14 | Einstein@Home | Finished download of JPLEPH.405
15/01/2013 10:50:14 | Einstein@Home | Starting task LATeah0010U_1232.0_58760_0.0_1 using hsgamma_FGRP2 version 1 in slot 4
15/01/2013 10:51:04 | Einstein@Home | [sched_op] Starting scheduler request
15/01/2013 10:51:04 | Einstein@Home | Sending scheduler request: To fetch work.
15/01/2013 10:51:04 | Einstein@Home | Requesting new tasks for CPU
15/01/2013 10:51:04 | Einstein@Home | [sched_op] CPU work request: 6283.45 seconds; 0.00 devices
15/01/2013 10:51:04 | Einstein@Home | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
15/01/2013 10:51:10 | Einstein@Home | Scheduler request completed: got 1 new tasks
15/01/2013 10:51:10 | Einstein@Home | [sched_op] Server version 611
15/01/2013 10:51:10 | Einstein@Home | BOINC will delete file JPLEPH.405 (no longer needed)
15/01/2013 10:51:10 | Einstein@Home | Project requested delay of 60 seconds
15/01/2013 10:51:10 | Einstein@Home | [sched_op] estimated total CPU task duration: 9955 seconds
15/01/2013 10:51:10 | Einstein@Home | [sched_op] estimated total NVIDIA task duration: 0 seconds
15/01/2013 10:51:10 | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
15/01/2013 10:51:10 | Einstein@Home | [sched_op] Reason: requested by project

15/01/2013 20:54:58 | Einstein@Home | [sched_op] Starting scheduler request
15/01/2013 20:54:58 | Einstein@Home | Sending scheduler request: To fetch work.
15/01/2013 20:54:58 | Einstein@Home | Requesting new tasks for CPU
15/01/2013 20:54:58 | Einstein@Home | [sched_op] CPU work request: 44928.00 seconds; 2.00 devices
15/01/2013 20:54:58 | Einstein@Home | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
15/01/2013 20:55:02 | Einstein@Home | Scheduler request completed: got 5 new tasks
15/01/2013 20:55:02 | Einstein@Home | [sched_op] Server version 611
15/01/2013 20:55:02 | Einstein@Home | Project requested delay of 60 seconds
15/01/2013 20:55:02 | Einstein@Home | [sched_op] estimated total CPU task duration: 47874 seconds
15/01/2013 20:55:02 | Einstein@Home | [sched_op] estimated total NVIDIA task duration: 0 seconds
15/01/2013 20:55:02 | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
15/01/2013 20:55:02 | Einstein@Home | [sched_op] Reason: requested by project
15/01/2013 20:55:04 | Einstein@Home | Started download of LATeah0010U.dat
15/01/2013 20:55:04 | Einstein@Home | Started download of skygrid_LATeah0010U_1264.0.dat
15/01/2013 20:55:04 | Einstein@Home | Started download of JPLEPH.405
15/01/2013 20:55:04 | Einstein@Home | Started download of skygrid_LATeah0010U_1296.0.dat
15/01/2013 20:55:09 | Einstein@Home | Finished download of LATeah0010U.dat
15/01/2013 20:55:27 | Einstein@Home | Finished download of skygrid_LATeah0010U_1264.0.dat
15/01/2013 20:55:27 | Einstein@Home | Finished download of skygrid_LATeah0010U_1296.0.dat
15/01/2013 20:55:40 | Einstein@Home | Finished download of JPLEPH.405
15/01/2013 20:55:40 | Einstein@Home | Starting task LATeah0010U_1296.0_187860_0.0_0 using hsgamma_FGRP2 version 1 in slot 0
15/01/2013 20:55:40 | Einstein@Home | Starting task LATeah0010U_1296.0_188540_0.0_0 using hsgamma_FGRP2 version 1 in slot 1
15/01/2013 20:56:03 | Einstein@Home | [sched_op] Starting scheduler request
15/01/2013 20:56:03 | Einstein@Home | Sending scheduler request: To fetch work.
15/01/2013 20:56:03 | Einstein@Home | Requesting new tasks for CPU
15/01/2013 20:56:03 | Einstein@Home | [sched_op] CPU work request: 3346.82 seconds; 0.00 devices
15/01/2013 20:56:03 | Einstein@Home | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
15/01/2013 20:56:10 | Einstein@Home | Scheduler request completed: got 1 new tasks
15/01/2013 20:56:10 | Einstein@Home | [sched_op] Server version 611
15/01/2013 20:56:10 | Einstein@Home | BOINC will delete file JPLEPH.405 (no longer needed)
15/01/2013 20:56:10 | Einstein@Home | Project requested delay of 60 seconds
15/01/2013 20:56:10 | Einstein@Home | [sched_op] estimated total CPU task duration: 25847 seconds
15/01/2013 20:56:10 | Einstein@Home | [sched_op] estimated total NVIDIA task duration: 0 seconds
15/01/2013 20:56:10 | Einstein@Home | [sched_op] Deferring communication for 1 min 0 sec
15/01/2013 20:56:10 | Einstein@Home | [sched_op] Reason: requested by project
15/01/2013 20:56:12 | Einstein@Home | Started download of einstein_S6BucketLVE_1.04_windows_intelx86__SSE2.exe

Any idea why? This is on my E8500 host with BOINC 7.0.44.

Claggy

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 115759005986
RAC: 35110193

Quote:

For some reason I'm getting JPLEPH.405 downloaded on one request, then deleted on the next request, then downloaded again later:

.....

Any idea why? This is on my E8500 host with BOINC 7.0.44.

Claggy


In all GW runs, there have been 'sun*' and 'earth*' ephemeris files which are required by all tasks and so are marked with the sticky tag to prevent them from ever being deleted.

I'm guessing that the 'EPH' is short for ephemeris and that 'JPLEPH.405' should therefore also be a sticky file. I just had a look in a state file of mine and yes, it is marked sticky there. However, that doesn't mean much, as I have a vague recollection that I saw this back at the start of FGRP1, manually added the sticky tag to prevent the problem, and then promptly forgot about it. When I build a new machine, I don't install BOINC or attach to a project. I have a live USB external hard drive to install the OS, which is also a full repository of all the OS updates, various BOINC versions, and all of the static files associated with a project and its various searches. This also includes state file templates which are preloaded with file entries for all the standard files that would otherwise have to be downloaded after attaching to a project. That certainly includes files like JPLEPH.405, and I would have made it sticky when I first added it to the template.

So I can't really tell for sure how it's being distributed these days, but if you check your state file and it's not sticky, we should certainly refer it to Bernd. With the hefty size of the FGRP2 run, that file is going to be required for quite a while.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3156
Credit: 7176254931
RAC: 747364

Quote:
you check your state file and it's not sticky, we should certainly refer it to Bernd. With the hefty size of the FGRP2 run, that file is going to be required for quite a while.


Browsing my way through the initial section of one host's client_state.xml, which has a myriad of file entries, I marked all occurrences of the sticky tag.

Many, many files have the sticky tag, including:

earth_09_11
sun_09_11
S6GC1_T60h_v1_Segments.seg
h1_0349.15_S6GC1 and many similar h1... files
l1_0349.15_S6GC1 and many similar l1... files

Many, many other files on my host do not have the sticky tag, including JPLEPH.405.

Here I am looking for the sticky tag, which appears after the status tag and before the download URL tag(s) in a file section.
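To show the shape being described, a file section with the tag present looks something like this. The nbytes value and the URL are made up for illustration, not copied from a real state file; only the position of the sticky element matters:

```xml
<!-- Illustrative file entry from client_state.xml. The sticky element,
     sitting after status and before the download URL, is what keeps the
     client from deleting the file. Stop the client and take a backup
     before adding it by hand. -->
<file_info>
    <name>JPLEPH.405</name>
    <nbytes>10000000.000000</nbytes>
    <status>1</status>
    <sticky/>
    <url>http://example.invalid/download/JPLEPH.405</url>
</file_info>
```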

In case this helps.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6578
Credit: 306737274
RAC: 197544

Yup. NASA's Jet Propulsion Laboratory keeps track of what is where, and when, in the solar system, and that ephemeris is thus crucial for the 'de-Dopplering' of received signals that we do at E@H, referring findings about signals to the 'barycentre' (~ centre of mass) of the solar system. The 'when' aspect is the time of reception of the signals by whichever instrument(s), not when we are analysing.

Cheers, Mike.

( edit ) Thus my guess is that the '.405' extension acts as an index into a larger set of files, each referring to a particular period of time.
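To get a feel for why that barycentring matters, here is a back-of-envelope sketch using only standard physical constants; nothing in it is taken from the E@H pipeline, and the function name is my own:

```python
# Rough magnitude of the Doppler smearing that solar-system ephemerides
# like JPLEPH.405 let the search remove. Earth's orbital motion shifts
# a received frequency by up to roughly v/c in fractional terms.
C_KM_S = 299_792.458      # speed of light, km/s
V_EARTH_KM_S = 29.78      # mean orbital speed of Earth, km/s

def max_orbital_doppler(f_hz: float) -> float:
    """Worst-case frequency shift (Hz) from Earth's orbital motion
    for a signal received at f_hz."""
    return f_hz * (V_EARTH_KM_S / C_KM_S)

# A 1 kHz signal can drift by about 0.099 Hz over the orbit -- far wider
# than the narrow frequency bins of a long coherent search, which is why
# arrival times must be referred to the solar-system barycentre.
shift = max_orbital_doppler(1000.0)
```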

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 115759005986
RAC: 35110193

Quote:

Many, many files have the sticky tag, including:

earth_09_11
sun_09_11
S6GC1_T60h_v1_Segments.seg
h1_0349.15_S6GC1 and many similar h1... files
l1_0349.15_S6GC1 and many similar l1... files


Yes, these all need to be kept at all times for locality scheduling to work efficiently. One file type that is not sticky, and that I tend to make sticky manually, is the skygrid... file. I find it annoying that if you have a couple of quite different data frequency ranges in play and temporarily run out of tasks for one frequency, the skygrid file for that range will be deleted, only to be re-downloaded the very next time you get more tasks. Because I 'pre-seed' hosts with a big range of large data files before allowing them to ask for work, I make the appropriate skygrid file sticky while I'm setting up the template. For the life of the S6LV1 run, my data downloads were a tiny fraction of what they would have been otherwise.

I think this is what is happening to Claggy. He gets some FGRP2 tasks which are completed before his host requests any more. So the JPLEPH.405 file (now confirmed not to be sticky, thank you) gets deleted and re-downloaded the next time his host scores some new FGRP2 tasks. Like sun* and earth* files in the GW runs, JPLEPH.405 really needs to be sticky for the life of the FGRP2 run.
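As a rough illustration of the manual edit being discussed, something like the following could add the sticky tag to matching file blocks in a state file. This is a hedged sketch, not an official BOINC tool; the function name and the prefix-matching approach are my own invention, and you should stop the client and keep a backup before touching the state file:

```python
import re

def make_sticky(state_xml: str, name_prefix: str) -> str:
    """Insert <sticky/> after </status> in each <file_info> block whose
    <name> starts with name_prefix, skipping blocks already sticky."""
    def fix(block: str) -> str:
        if "<sticky/>" in block or f"<name>{name_prefix}" not in block:
            return block
        # Place the tag right after the status element, matching where
        # it appears in blocks the project marks sticky itself.
        return block.replace("</status>", "</status>\n    <sticky/>", 1)

    return re.sub(r"<file_info>.*?</file_info>",
                  lambda m: fix(m.group(0)),
                  state_xml, flags=re.DOTALL)
```

Running it over the text of client_state.xml with a prefix like "skygrid" (or "JPLEPH") and writing the result back, with the client stopped, mirrors the template edit described above.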

Cheers,
Gary.
