The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245209538
RAC: 13147

We have two "file upload handlers" running (for different file sizes). Occasionally these hang, and we still don't know why. We are still investigating, but with rather low priority. For the time being we are restarting them automatically every 6h. During each restart the file upload handlers are offline for 5-10 mins, so we don't want to do this too often.

BM

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433268991
RAC: 597716

My invalids continue on a GTX1060. Is the object of this exercise to develop the app or to do more GW science?

Mad_Max
Joined: 2 Jan 10
Posts: 153
Credit: 2139842826
RAC: 224382

Gary Roberts wrote:

Well done!! and thanks very much for the detailed explanation.  I'm sure you're absolutely right!!

Everything you point out makes perfect sense.  It's always very satisfying to find out why these strange things happen so I'm very grateful to you for your persistence in tracking down the cause.  I'll certainly be interested in anything you work out about the use of <fraction_done_exact/> if you do go ahead and test that option.

I ran this test. And yes: GPU GW tasks that work normally (on newer GPUs) show the same "progress reset" behavior seen on older, improperly working GPUs.
The only difference is that the reset comes at 3-7% progress, depending on GPU and CPU speed, whereas on abnormally slow older GCN GPUs the "simulated progress" can reach 50-70% before it resets to 0% when the app makes its first progress report to BOINC.

But it turned out that the <fraction_done_exact/> option in app_config does not change this. BOINC still uses and shows "simulated" progress, including "progress resets", if it does not have actual progress info from the app.

It looks like this option only affects the estimate of the task's remaining run time. Without it, BOINC uses a general estimate based on the "size of the task", the hardware benchmark and the project DCF, minus the already elapsed time (like Size/HW_speed*DCF - elapsed time).

With this option on, it uses a simple strict formula for running tasks: (elapsed time) / (fraction done), ignoring DCF, benchmarks and "task size". (Although it still uses them for other tasks waiting in the work queue.)

But it does not affect the progress bar: it always shows "simulated progress" if the app does not report actual progress for any reason.
A rather weird decision by the BOINC programmers...
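
To make the difference between the two estimates concrete, here is a minimal Python sketch of the arithmetic described above. It is not BOINC code: the function names, parameter names and sample numbers are all made up for illustration. It also assumes that (elapsed time) / (fraction done) is the projected total run time, so the remaining time is that value minus the elapsed time, and that the static estimate is clamped at zero.

```python
def remaining_static(task_size_flops, host_flops_per_s, dcf, elapsed_s):
    """Estimate without <fraction_done_exact/>: Size / HW_speed * DCF - elapsed time."""
    projected_total_s = task_size_flops / host_flops_per_s * dcf
    # Assumption: never report a negative remaining time.
    return max(0.0, projected_total_s - elapsed_s)


def remaining_exact(elapsed_s, fraction_done):
    """Estimate with <fraction_done_exact/>: based only on reported progress."""
    if fraction_done <= 0:
        raise ValueError("no progress reported yet, estimate is undefined")
    # Assumption: elapsed / fraction_done is the projected total run time.
    projected_total_s = elapsed_s / fraction_done
    return projected_total_s - elapsed_s


if __name__ == "__main__":
    # Hypothetical task: 1.44e15 FLOPs on a 1e12 FLOP/s device, DCF 1.25,
    # 300 s elapsed, app reports 25% done.
    print(remaining_static(1.44e15, 1e12, 1.25, 300.0))  # 1500.0
    print(remaining_exact(300.0, 0.25))                   # 900.0
```

With these made-up numbers the static estimate says 1500 s remain while the progress-based one says 900 s, which is the kind of gap the option removes for a running task. Neither function has anything to do with the progress bar itself.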

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433268991
RAC: 597716

It looks as though the horrid rate of invalid results on my GTX1060  running 1X has been ameliorated. I'll let it run that way for a day or so and if it remains good I'll try 2X. 

Keith Myers
Joined: 11 Feb 11
Posts: 4751
Credit: 17676289227
RAC: 5777914

When the last of my GW gpu tasks got into high priority mode due to deadlines, and tasks started running on two cards simultaneously, my crunch times stretched out enormously.  Where I would normally run in 1400-1600 seconds, task times stretched out to as high as 20,000 seconds.

 

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Keith Myers wrote:
When the last of my GW gpu tasks got into high priority mode due to deadlines, and tasks started running on two cards simultaneously, my crunch times stretched out enormously.  Where I would normally run in 1400-1600 seconds, task times stretched out to as high as 20,000 seconds.

 

That is weird. I have a dual machine doing nothing but GW GPU tasks, and the tasks run normally. Maybe the high priority mode has something to do with it?

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

I just got my first valid on an RX 570 (Win7 64-bit) after about a dozen invalids.  It was validated against a Linux machine.

https://einsteinathome.org/workunit/418963472

Is it a fluke or is Bernd tweaking the validator?

 

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

Jim1348 wrote:

I just got my first valid on an RX 570 (Win7 64-bit) after about a dozen invalids.  It was validated against a Linux machine.

https://einsteinathome.org/workunit/418963472

Is it a fluke or is Bernd tweaking the validator?

 

Since this is a beta task, I would speculate that quite a bit of tweaking is going on in the background.

Clear skies,
Matt
Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

However, a significant change in the application (most likely any change) would be indicated by a revision number change.

Clear skies,
Matt
Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Jim1348 wrote:
I just got my first valid on an RX 570 (Win7 64-bit) after about a dozen invalids.  It was validated against a Linux machine.

I now have five validated, and no more invalids.  That is encouraging.  But the real point of interest is that they have been validated against three Linux machines, a FirePro D500 running under Darwin, and a Titan X running under Windows 10.

If it can do that, it can do anything.  I think Bernd has nailed it, at least for this card.  I hope it holds true for the others too.
