Observations on FGRBP1 1.18 for Windows

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024924931
RAC: 1806140
Topic 204689

A few hours ago, in the Technical News forum, Bernd announced that version 1.18 has been released with beta status.  His announcement, and many subsequent observations posted in that thread, document an appreciable speedup compared to version 1.17 on a wide variety of hosts.  Very roughly, a typical speedup might be 25%.

I have four hosts currently running Einstein GPU work, three of which have dual GPUs.  I suspended in-progress 1.17 work in order to get an early look at 1.18 behavior on all seven GPU cards (Nvidia 750Ti, 970, 1050, 1060, and 1070).  On all but one, everything looks fine, with successful completions in appreciably shorter elapsed times.

As it happens, the single-GPU host, which has had occasional trouble in recent weeks but had run fine for days, ran OK for about 20 minutes with a single 1.18 task.  Very shortly after I added a second, it failed.  Further, it failed in such a way that subsequent 1.18 and 1.17 tasks also failed very quickly; in other words, the system had somehow gotten into a lethal condition.  A full power-down reboot cleared it, and the system resumed apparently normal processing.  But shortly after I got brave and again allowed two 1.18 tasks, it failed again.  Now I can't do further testing, as the project denies the host new work until tomorrow on the grounds that the daily quota of 12 tasks has been exceeded.  I understand that failures lower the task limit, but the system has only 27 error returns reported against it today, which I would not have expected to put me out of business.

Anyway, I can't experiment until the end of the task day (which I think is midnight UTC).  In the short term I intend to forbid beta tasks, to see whether 1.17 still works, and then to restrict the host to a single task at a time, in case running two tasks is part of the issue here.

I imagine the tasks that actually ran for a while and then failed may have the most useful failure symptoms.  Here is the end of the stderr log for a few of them:

Case 1:

% nf1dots: 31  df1dot: 3.344368011e-015  f1dot_start: -1e-013  f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:923: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7343459
07:45:01 (4468): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
07:45:13 (4468): [normal]: done. calling boinc_finish(28).
07:45:13 (4468): called boinc_finish

Case 2:

% Filling array of photon pairs
Error during OpenCL FFT (error: -36)
ERROR: gen_fft_execute() returned with error 7343372
07:45:01 (4212): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags:  PRECISION
07:45:13 (4212): [normal]: done. calling boinc_finish(69).
07:45:13 (4212): called boinc_finish
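
For anyone trying to read these excerpts, the numeric OpenCL status codes can be checked against the constants in the standard OpenCL headers; -36, for instance, is CL_INVALID_COMMAND_QUEUE.  Here is a minimal lookup sketch in Python covering only a handful of common codes (a small subset, not the full list):

# A few OpenCL status codes as defined in the standard cl.h headers, useful
# for decoding stderr excerpts like the ones above. Small subset only.
OPENCL_STATUS = {
    0:   "CL_SUCCESS",
    -4:  "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5:  "CL_OUT_OF_RESOURCES",
    -6:  "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

for code in (-36, -5):
    print(f"status {code}: {OPENCL_STATUS.get(code, 'not in this subset')}")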

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Very interesting.  On your GTX 970, I see that the run times vary by a factor of 2, both for 1.18 and 1.17.  Do you know what might account for that (maybe running two WUs at once)?  I haven't seen that much variation for my GTX 750 Ti's, which are on Win7 64-bit.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024924931
RAC: 1806140

Jim1348 wrote:
On your GTX 970, I see that the run times vary by a factor of 2, both for 1.18 and 1.17.

Well, they don't, actually.  The issue is that the host (like two of my other three) has two dissimilar GPUs installed.  In that case BOINC reports only one of them upstream, generally the one with the higher CUDA capability if there is a difference.  Often one can figure out which GPU actually ran a given WU by looking around in the stderr file for the string "GTX".  However, the stderr files for the current Gamma-ray Pulsar search often (always?) lack this traditional entry.

Anyway, on that specific host the slow units ran on the GTX 750Ti, and the fast ones on the GTX 970.  Within model, the elapsed time variation from unit to unit is much, much less.
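
For anyone wanting to sort results by card in the same way, here is a minimal Python sketch of that kind of stderr search.  It assumes you have saved each task's stderr output to text files in a local directory (the directory name is a placeholder), and it simply reports any "GTX <number>" string it finds; as noted above, the current FGRP stderr files may contain no such string at all.

import re
from pathlib import Path

# Directory of saved stderr outputs -- a placeholder, adjust as needed.
STDERR_DIR = Path("saved_stderr")

# Matches strings like "GTX 970" or "GTX 750 Ti" anywhere in the log text.
GPU_PATTERN = re.compile(r"GTX\s*\d+\s*(?:Ti)?", re.IGNORECASE)

for logfile in sorted(STDERR_DIR.glob("*.txt")):
    text = logfile.read_text(errors="replace")
    models = sorted({m.group(0).strip() for m in GPU_PATTERN.finditer(text)})
    print(f"{logfile.name}: {', '.join(models) if models else 'no GTX string found'}")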

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

archae86 wrote:
Anyway, on that specific host the slow units ran on the GTX 750Ti, and the fast ones on the GTX 970.  Within model, the elapsed time variation from unit to unit is much, much less.

OK, that is actually more interesting to me, since I am considering switching work from the GTX 750 Ti's to either a GTX 960 or 970, and that shows the comparison directly.  Thanks.

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0

On my HD7970 the "Validate Error" rate is way up with 1.18.

So far it's 10 validate errors versus 21 valids.  With 1.17 the ratio of validate errors to validated tasks was much better.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109411474485
RAC: 34897138

archae86 wrote:
... the slow units ran on the GTX 750Ti ...

What improvement are you seeing on the 750Ti?  Is it equivalent to what you get on the 970?

Thanks.

Cheers,
Gary.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

I've got 0 validate errors on my GTX970 for both 1.17 and 1.18.

A more optimized app will, or at least should, put more stress on the hardware, so it might be a good time to check the running conditions and perhaps adjust clocks or other running parameters.  Even an occasional validate error might "invalidate" an overclock, as the wasted time can outweigh the gain from the overclock.
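
As a rough illustration of that trade-off, here is a minimal back-of-the-envelope sketch; all the numbers in it are made-up assumptions, not measurements from this thread.  The point is simply that effective output is roughly the overclocked speed multiplied by the fraction of tasks that survive validation, so even a modest error rate can wipe out a small overclock.

# Back-of-the-envelope model of the overclock trade-off described above.
# All numbers are illustrative assumptions, not measurements.
baseline_tasks_per_hour = 2.0    # throughput at stock clocks, all tasks valid
overclock_speedup = 1.05         # 5% faster per task when overclocked
error_rate_overclocked = 0.08    # fraction of overclocked tasks failing validation

# Errored tasks still consume GPU time but produce nothing useful,
# so only the surviving fraction counts toward valid output.
stock_valid_rate = baseline_tasks_per_hour
oc_valid_rate = baseline_tasks_per_hour * overclock_speedup * (1 - error_rate_overclocked)

print(f"stock:       {stock_valid_rate:.3f} valid tasks/hour")
print(f"overclocked: {oc_valid_rate:.3f} valid tasks/hour")
print("overclock pays off" if oc_valid_rate > stock_valid_rate else "overclock is a net loss")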

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109411474485
RAC: 34897138

Logforme wrote:

On my HD7970 the "Validate Error" rate is way up with 1.18.

So far it's 10 validate errors vs 21 valids ...

After this little scare, I've quickly checked a host that would have done the most 1.18s so far.  It has dual HD7850s.  It has 11 validated 1.18s, a lot more pending, and zero invalids.  So far so good, fingers crossed :-).

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024924931
RAC: 1806140

Gary Roberts wrote:
What improvement are you seeing on the 750Ti?  Is it equivalent to what you get on the 970?

Caveats: the sample sizes are not terribly large, and I clipped off a handful of outliers, believing them likely the result of mixed running (as to either multiplicity or application version), but without checking that this was true.  On the plus side, these are formal averages, not eyeball estimates.

The elapsed time under 1.18, expressed as a fraction of the 1.17 elapsed time, improved considerably more for my 970:

GTX 970: 0.533

GTX 750Ti: 0.642

Mind you, 0.642 is nothing to sneeze at, but I had not realized until you asked the question just how drastic the 970 improvement was.  I've not done the computation for my Pascal cards yet.
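
For anyone wanting to reproduce this kind of comparison from their own task lists, here is a minimal Python sketch of the calculation described above.  The elapsed-time lists are placeholders rather than real results, and the crude outlier clipping (dropping the extreme values at each end) is only one of many reasonable choices.

# Minimal sketch: average 1.18 elapsed time as a fraction of the 1.17 average,
# with a crude clip of the extreme values. All times below are placeholders.
def trimmed_mean(times, n_clip=1):
    s = sorted(times)
    trimmed = s[n_clip:len(s) - n_clip] if len(s) > 2 * n_clip else s
    return sum(trimmed) / len(trimmed)

elapsed_117 = [1790, 1798, 1801, 1805, 1812, 2400]   # seconds, hypothetical 1.17 runs
elapsed_118 = [948, 955, 959, 962, 970, 1500]        # seconds, hypothetical 1.18 runs

ratio = trimmed_mean(elapsed_118) / trimmed_mean(elapsed_117)
print(f"1.18 elapsed time as a fraction of 1.17: {ratio:.3f}")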

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

I have two GTX 960s each fed by a core of an i7-4790 running under Ubuntu 16.10.  Looking at the completion times (not all validated, but none invalid):

1.17 -> 2450 seconds

1.18 -> 1525 seconds

So the ratio is 0.622; very nice.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109411474485
RAC: 34897138

archae86 wrote:

The elapsed time under 1.18, expressed as a fraction of the 1.17 elapsed time, improved considerably more for my 970:

GTX 970: 0.533

GTX 750Ti: 0.642

Mind you, 0.642 is nothing to sneeze at, but I had not realized until you asked the question just how drastic the 970 improvement was.

Thanks very much for that!

The 970 value is pretty much in line with what Holmis reported in the Technical News thread.  I guess 970 owners in general will be highly delighted :-).

I have a couple of 750Tis, so it will be good to see them improve on the rather poor performance they gave with the previous version.  I might actually fire up a GTX650 and see if it gets any benefit.

Cheers,
Gary.
