The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

cecht
Joined: 7 Mar 18
Posts: 1534
Credit: 2907442106
RAC: 2159353

Matt White wrote:
Since I do not use a custom app config file, and my preferences in the project section are set to run any and all tasks (other than CPU apps where GPU work is available), I'm assuming (we all know how dangerous that can be) that the project is sending me what it considers to be most helpful. And right now, that is all GW work.

That simple approach makes a lot of sense. Occam's razor to the rescue!

Matt White wrote:
That being said, and in my humble opinion, nothing is unimportant, since there are enough volunteers out there that all of the bases should be covered. Participating with E@H and BOINC projects as a whole is very much a team sport, and, sooner or later, every piece of data must be analyzed, and every number crunched, in order to make the big "discovery". So even if that "honor" is only shared by a few computers, separating the wheat from the chaff is a collective effort, and since we all share in the effort, we all share in the discovery.

'Tis true and well said.

Matt White wrote:
Just my two cents. :)

Thanks!

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2213994718
RAC: 405058

Gary Roberts wrote:

The purpose of this message is to document some tests with the GW V1.07 app that I've performed over the last couple of days.  For the SI based hosts (1st group) I chose the machine with a HD 7850 GPU that I've been using to test the most recent amdgpu driver with OpenCL from AMDGPU-PRO.  I've run the new app in two separate tests using the 2016 OS and fglrx driver for one test and the recent OS with OpenCL from AMDGPU-PRO for the 2nd test.  In both these tests, tasks were able to start crunching (at a respectable rate initially) but soon degenerated markedly to a crawl.  I've described what I saw happening in this message.

The upshot of both those tests was that after reaching around 30% completed in around 30 mins, the progress counter reset to 0% and then progressed for many hours at an extremely slow rate, so that the projected finishing time was many days in the future and most certainly longer than an average CPU core would have taken to crunch the task.  I let things run for about 10 hours in one test and about 5 hours in the other.  In both cases the % completed was still in single digit territory.  There is a mechanism in place to fail tasks that are taking too long and I wasn't too keen to get a "time limit exceeded" failure message, so I aborted the task in each case.

I see the same problems with 1st gen AMD GCN GPUs and the GW GPU apps. It has been this way since the previous GW GPU app, as I described back in July: https://einsteinathome.org/content/gravitational-wave-search-o2-all-sky-search-o2as20-500?page=7#comment-172306
(that message and a few after it in the same thread)
All tested versions (1.05, 1.06, 1.07) have the same problem.

One important note - I have checked the logs and I do not see any actual "progress reset" (or rollback); it is only a BOINC progress counter glitch, probably due to the EXTREMELY slow work of this app on 1st gen GCN. I allowed a few tasks to run until they hit the max time wall before aborting the rest, just 1 task per GPU with all default settings. Each task managed to compute only about 15-20% after a full day of work (while the expected rate would be about 7-10 tasks per day for this hardware - an HD 7870).

Here are a few examples:
https://einsteinathome.org/task/875575267
https://einsteinathome.org/task/875575305
https://einsteinathome.org/task/875587683
https://einsteinathome.org/task/875588216

If you read through these logs, there are a total of 13129 data points to analyze in each WU (691 sky positions with 19 dots each).
And the logs show that the app runs EXTREMELY slowly from the very beginning: only about 2 dots per minute - each "c" symbol is a checkpoint written every ~2 mins (so the projected run time is > 100 hours).
For an RX 570, by comparison, it runs at about 140 dots per min: https://einsteinathome.org/task/875674784
That is a ~70x difference in app speed, while the hardware difference is < 2x for this pair of GPUs (RX 570 vs HD 7870).
Just a HUGE slowdown, for an unknown reason, right from the beginning.
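
To make that projection explicit, here is a quick back-of-the-envelope check (a minimal Python sketch; the dot counts and per-minute rates are the ones quoted from the logs above):

# Projected run times from the checkpoint "dot" rates quoted above.
TOTAL_DOTS = 13129  # data points per WU: 691 sky positions x 19 dots each

for gpu, dots_per_min in [("HD 7870", 2.0), ("RX 570", 140.0)]:
    hours = TOTAL_DOTS / dots_per_min / 60.0
    print(f"{gpu}: ~{hours:.0f} h projected")
# HD 7870: ~109 h projected
# RX 570: ~2 h projected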

In the BOINC GUI I see the same situation as you: decent progress in the 1st hour, up to a few dozen %, and then a reset back to <1%. So I think the first part is just BOINC's incorrect approximation of the progress of an abnormally slow app, and the second part is the real progress reported by the E@H GW app.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117653706096
RAC: 35185853

Bernd Machenschalk wrote:
.... I'm currently looking into the ~72.5% of the cases where the results are "further off".

Thank you very much for providing the stats about validation.  I'm sure all volunteers are pleased to know what's going on and that every effort is being made to solve the issue.

Bernd Machenschalk wrote:
My suggestion would be, while we are still struggling with validation, not to run more than one task on a single GPU.

Good luck with sorting this out!  I'll set up a couple of hosts with different CPU architectures crunching single GPU tasks on Polaris GPUs.  I'm keen to understand if older CPUs are contributing to the problem so I can make a decision about retiring older equipment.

Cheers,
Gary.

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

archae86 wrote:

I predict they will be found invalid.

Consider: by the rules of the beta, even though you can't see your first quorum partner, it had to be a CPU job.  Both your task and that one most likely passed the basic sanity checks, but when your results were compared they were insufficiently similar to declare both valid.  So a tie-breaker task was sent out to a CPU host.  A betting person would think your initial CPU quorum partner is more likely to agree with the tie-breaker than you are.

Good insight, I'll let you know how things pan out.

Clear skies,
Matt
Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

I should have checked first. The 5 tasks marked inconclusive have either validated or been removed from the queue. No new error or invalid tasks in the count.

Clear skies,
Matt
archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7224444931
RAC: 1021318

Matt White wrote:
I should have checked first. The 5 tasks marked inconclusive have either validated or been removed from the queue. No new error or invalid tasks in the count.

I've been a bit perplexed about when and how one can see the inconclusives.  My current belief is that a method which works is to filter for the GW application, but take the "All" option, then sort by CPU time.  One can then use in-page find for "incon" on each page of results in the plausible CPU time range.

Using this method, for your host 12785296 I spot 3 inconclusives.

For your host 12785591 I spot 11 inconclusives (I did not count one that probably ran CPU-only based on total CPU time).

One might expect them to be visible if one imposes the "pending" filter, but they are not.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117653706096
RAC: 35185853

Mad_Max wrote:
I see the same problems with 1st gen AMD GCN GPUs and the GW GPU apps. It has been this way since the previous GW GPU app, as I described back in July: https://einsteinathome.org/content/gravitational-wave-search-o2-all-sky-search-o2as20-500?page=7#comment-172306

I clearly remember your July messages.  At the time I wasn't running GW tasks on older GPUs, so I had no direct experience.  Now I have, and I fully understand exactly the points that you make :-).

Mad_Max wrote:
One important note - I have checked the logs and I do not see any actual "progress reset" (or rollback); it is only a BOINC progress counter glitch, probably due to the EXTREMELY slow work of this app on 1st gen GCN. I allowed a few tasks to run until they hit the max time wall before aborting the rest, just 1 task per GPU with all default settings. Each task managed to compute only about 15-20% after a full day of work (while the expected rate would be about 7-10 tasks per day for this hardware - an HD 7870).

Interesting that you mention checking logs, because I had exactly the same idea to follow the data as it accumulated in stderr.txt in the slot directory of a crunching task.  I ended up capturing 7 snapshots, the final two of them covering the transition from what must have been BOINC's 'simulated' progress to some sort of 'real' progress.

My understanding had been that the 'simulated' progress only lasted until a checkpoint was written.  With your comments, and with what the series of snapshots of stderr.txt showed, the simulated progress must go on for a lot longer than just until the first checkpoint appears.  I actually confirmed that the very first 'c' character must be the first checkpoint by using the 'properties' button in BOINC Manager to show that a checkpoint existed at that stage.

Here is the very first snapshot of stderr.txt.  It was taken just after the progress counter had started moving.  Nothing moved for about the first minute so I waited until the counter was ticking over.  You can already see a 'c' character, which I interpreted as a checkpoint.

2019-08-20 13:17:18.5603 (2859) [normal]: Reading input data ...
2019-08-20 13:17:22.3439 (2859) [normal]: Search FstatMethod used: 'ResampOpenCL'
2019-08-20 13:17:22.3440 (2859) [normal]: Recalc FstatMethod used: 'DemodSSE'
2019-08-20 13:17:22.3440 (2859) [normal]: OpenCL Device used for Search/Recalc and/or semi coherent step: 'Pitcairn (Platform: AMD Accelerated Parallel Processing, global memory: 1981 MiB)'
2019-08-20 13:17:22.3440 (2859) [normal]: OpenCL version is used for the semi-coherent step!
2019-08-20 13:17:25.8323 (2859) [normal]: Number of segments: 64, total number of SFTs in segments: 10190
done.
% --- GPS reference time = 1177858472.0000 , GPS data mid time = 1177858472.0000
2019-08-20 13:17:25.8558 (2859) [normal]: dFreqStack = 3.340013e-06, df1dot = 1.637397e-10, df2dot = 0.000000e+00, df3dot = 0.000000e+00
% --- Setup, N = 64, T = 216000 s, Tobs = 19750204 s, gammaRefine = 500, gamma2Refine = 28226, gamma3Refine = 1

DEPRECATION WARNING: program has invoked obsolete function InitDopplerSkyScan(). Please see XLALInitDopplerSkyScan() for information about a replacement.
2019-08-20 13:17:27.5677 (2859) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0, total:13110, sky:1/690, f1dot:1/19

0.% --- CG:989248 FG:14971 f1dotmin_fg:-2.724189077486e-09 df1dot_fg:3.268256487026e-13 f2dotmin_fg:0 df2dot_fg:0 f3dotmin_fg:0 df3dot_fg:1
...c
....

I continued recording snapshots every couple of minutes.  Here is the record of dots and checkpoints from the sixth snapshot, when the task progress was still moving at a good speed.  This was the final snapshot before the progress reset to zero - probably around 25 mins after startup.

2019-08-20 13:17:27.5677 (2859) [normal]: INFO: No checkpoint checkpoint.cpt found - starting from scratch
% --- Cpt:0, total:13110, sky:1/690, f1dot:1/19

0.% --- CG:989248 FG:14971 f1dotmin_fg:-2.724189077486e-09 df1dot_fg:3.268256487026e-13 f2dotmin_fg:0 df2dot_fg:0 f3dotmin_fg:0 df3dot_fg:1
...c
.....c
.....c
.....c

1.....c
.....c
.....c
....
2.c
.....c
.....c
.....c
...
3..c
.....c
.....c
.....c
..
4...c
.....c
.....c
.....c
.
5....c
.....c
.....c
.....c

6.....c
.....c
.....c
....
7.

And just to show there was no evidence of anything strange happening in stderr.txt, here is the relevant bit that had been added by the 7th snapshot, which was taken as soon as I'd seen the progress reset to zero and then start moving very slowly.

6.....c
.....c
.....c
....
7.c
.....c
.....c
.....c
...
8..c
....c
..

There's nothing unusual that would show or explain why the progress reset, so it's a bit of a mystery what is really going on to cause it.

Mad_Max wrote:
While in the  BOINC GUI i see same situation as you: decent progress in 1st hour up to few dozen % and then reset back to <1%. So i think first part is it just an incorrect BOINC approximation of a progress of abnormally slow app and second part is a real progress reports from the E@H GW app.

It would be interesting to know why BOINC waited so long to stop using simulated progress and start using some sort of app-based 'real' progress.  There seem to be lots of prior checkpoints from which BOINC could have calculated a much better estimate of the fraction done at that point in time.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117653706096
RAC: 35185853

archae86 wrote:
Matt White wrote:
I should have checked first. The 5 tasks marked inconclusive have either validated or been removed from the queue. No new error or invalid tasks in the count.

I've been a bit perplexed about when and how one can see the inconclusives. 

....

I think you made an earlier post about inconclusives (and perhaps invalids) 'disappearing' or something along those lines - I haven't gone back to find exactly how you described it.  From various snippets of information that come out from time to time, there can be 'manual intervention' of various sorts going on behind the scenes.  With the relaxing of validation 'rules', it seems likely that Work Units might be sent back through the validator and have their 'state' changed as a result.  An inconclusive or an invalid might suddenly become valid :-)

Cheers,
Gary.

Mad_Max
Joined: 2 Jan 10
Posts: 154
Credit: 2213994718
RAC: 405058

Gary Roberts wrote:

There's nothing unusual that would show or explain why the progress reset, so it's a bit of a mystery what is really going on to cause it.

It would be interesting to know why BOINC waited so long to stop using simulated progress and start using some sort of app-based 'real' progress.  There seem to be lots of prior checkpoints from which BOINC could have calculated a much better estimate of the fraction done at that point in time.

I have done some more "data digging" today (I finally have some spare time!) and figured it out.
I monitored the status file of one of these "extremely slow" WUs (the "boinc_task_state.xml" file in the "slot" folder of the running WU), and it appears it is not a BOINC issue but a GW app issue. The app simply does not report any progress to BOINC at the beginning, even after it has already written a few checkpoints. Progress reports are not tied to checkpoints here.
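
For anyone who wants to reproduce this kind of monitoring, here is a minimal Python sketch (the slot path is hypothetical - point it at whatever slot directory your running WU occupies):

# Poll <fraction_done> in a running task's boinc_task_state.xml.
# The path below is an example - adjust it for your BOINC data dir and slot.
import re
import time

STATE = "/var/lib/boinc-client/slots/0/boinc_task_state.xml"

while True:
    with open(STATE) as f:
        m = re.search(r"<fraction_done>([^<]+)</fraction_done>", f.read())
    print(time.strftime("%H:%M:%S"), m.group(1) if m else "(no report yet)")
    time.sleep(120)  # roughly the checkpoint cadence seen in the logs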

The <fraction_done></fraction_done> parameter in boinc_task_state.xml (used to report the progress % from the app to BOINC) remains at zero for about half an hour before the very first update. So the only thing BOINC can do in such a situation is "simulate" progress by using the elapsed time and its calculated run time estimates.
After the very first progress report, it then resets its own "simulated" progress to the real progress % received from the app. That is the moment when we see the reset from a few dozen % back to <1%.
I think this can be overridden by using the <fraction_done_exact/> option in the app_config.xml file for the GW app, although I have not tested it yet - see the sketch below.
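
If anyone wants to try it, a minimal app_config.xml sketch might look like the following (untested, as said; the file goes in the Einstein@Home project directory, and the app name here is my assumption for this search - check the <name> entries in client_state.xml for the exact one):

<app_config>
  <app>
    <!-- app name is an assumption - verify it in client_state.xml -->
    <name>einstein_O2AS20-500</name>
    <!-- trust the app's own fraction-done reports instead of
         BOINC's time-based simulated progress -->
    <fraction_done_exact/>
  </app>
</app_config>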

I think a normally running GW GPU app behaves the same way, but at normal computation speed it is hard to notice, because the period with no progress reports from the app to BOINC lasts only about 1 min of real time instead of ~half an hour or more, and because the "simulated" progress is, in the normal case, close enough to the real progress. So at the moment of switching from "simulated" to "real" there is only a very small adjustment, which is hard to notice. Only when the app is slowed down a few dozen times by the unknown problem on older GPUs does such a minor adjustment become a huge swing in the progress counter.
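
To illustrate the mechanism in code, here is a simplified Python model of the behaviour described above (my own sketch, not the actual BOINC client source):

# Simplified model: with no report from the app, the client simulates
# progress from elapsed vs. estimated time; once the app reports, the
# real value wins - hence the sudden drop on very slow tasks.
def displayed_progress(app_fraction, elapsed_s, estimated_runtime_s):
    if app_fraction > 0.0:
        return app_fraction  # real progress reported by the app
    return min(elapsed_s / estimated_runtime_s, 0.99)  # simulated

# Slow HD 7870 task, ~30 min in: the time-based estimate says ~25% done ...
print(displayed_progress(0.0, 1800, 7200))    # 0.25 (simulated)
# ... then the first real report arrives and the counter "resets".
print(displayed_progress(0.004, 1800, 7200))  # 0.004 (real)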

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117653706096
RAC: 35185853

Mad_Max wrote:
I have done some more "data digging" today (I finally have some spare time!) and figured it out ...

Well done!!  And thanks very much for the detailed explanation.  I'm sure you're absolutely right!!

Everything you point out makes perfect sense.  It's always very satisfying to find out why these strange things happen so I'm very grateful to you for your persistence in tracking down the cause.  I'll certainly be interested in anything you work out about the use of <fraction_done_exact/> if you do go ahead and test that option.

I'm setting up a couple of hosts with a range of different generation CPUs and with Polaris GPUs to provide some data about whether or not older CPUs play any role in the plague of invalid results coming from more modern GPUs.  For the moment, I'm not intending to try anything further on SI GPUs.


Cheers,
Gary.
