Pascal again available, Turing may be coming soon

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2955646517
RAC: 718921

Well done everyone - I think things are falling into place. Keith's property page (reproduced here to save flipping to another tab):

That very clearly shows no checkpointing, and the std_err.txt shows no activity after

% Filling array of photon pairs

I'll head downstairs later and do the same trick of pulling the stderr out of the slot directory of a running task, to see what should happen next.
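
For anyone who wants to repeat that trick programmatically: the running task's output accumulates in stderr.txt inside each slot directory under the BOINC data directory. Here's a minimal C++17 sketch (build with g++ -std=c++17) - note the /var/lib/boinc-client path is the Debian/Ubuntu default and is an assumption; adjust for your install:

#include <deque>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// Print the last few lines of stderr.txt from every active slot directory.
// NOTE: /var/lib/boinc-client is the Debian/Ubuntu default data directory;
// other installs keep their slots elsewhere.
int main() {
    const fs::path slots = "/var/lib/boinc-client/slots";
    for (const auto& slot : fs::directory_iterator(slots)) {
        const fs::path f = slot.path() / "stderr.txt";
        if (!fs::exists(f)) continue;
        std::ifstream in(f);
        std::deque<std::string> tail;
        for (std::string line; std::getline(in, line); ) {
            tail.push_back(line);
            if (tail.size() > 10) tail.pop_front();  // keep only the last 10 lines
        }
        std::cout << "=== " << f.string() << " ===\n";
        for (const auto& l : tail) std::cout << l << '\n';
    }
}

A healthy task keeps appending "% Binary point ..." progress lines; a stalled one stops dead after "% Filling array of photon pairs".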

I think we have clear evidence of a stalled task making zero progress after many hours - disguised by the BOINC client. That's what Gary calls 'some sort of simulated progress' and I call pseudo-progress: it's a real mechanism, designed to avoid frightening users by giving them false confidence that something useful is happening - as Keith has experienced in this case.

I think this situation is actually worse for the project than the previous/Windows behaviour of immediate crashing. It's a huge waste of powerful resources.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2955646517
RAC: 718921

Here's a working property page for comparison:

Timings are consistent - running two-up on a GTX 970

The std_err continues:

% Starting semicoherent search over f0 and f1.
% nf1dots: 38 df1dot: 2.71528666e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
.
.
<snip repetitions>
.
.
% Binary point 2/1018
% Starting semicoherent search over f0 and f1.
% nf1dots: 38 df1dot: 2.71528666e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
.
.
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2955646517
RAC: 718921

And here's the pseudo-progress culprit:

https://github.com/BOINC/boinc/blob/master/client/app.cpp#L702
https://github.com/BOINC/boinc/commit/506c3b6e419e5b524ad72fcd16e363cd0e296814
https://github.com/BOINC/boinc/commit/61d6d9a20ab70389b50416eebed8cb04deecc43f

// if the app hasn't reported fraction done or reported > 1,
// and a minute has elapsed, estimate fraction done in a
// way that constantly increases and approaches 1.
//

That's designed - quite reasonably - to cope with badly-written apps which don't provide progress information. But it also catches well-written apps (like here) which normally DO provide the information, and where the failure to do so is important debugging information.
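
For illustration, the kind of estimate that comment describes can be sketched in a few lines. This is not the exact arithmetic in app.cpp (follow the links above for that) - just a toy version of "strictly increasing, asymptotically approaching 1":

#include <cmath>

// Toy version of the client's fallback progress estimate -- NOT the exact
// BOINC code (see client/app.cpp linked above).  Given elapsed wall-clock
// time and the client's estimate of total duration, return a fraction that
// always grows but never reaches 1, no matter how long the task stalls.
double pseudo_fraction_done(double elapsed_sec, double estimated_total_sec) {
    if (elapsed_sec < 60.0) return 0.0;        // the one-minute grace period
    double x = elapsed_sec / estimated_total_sec;
    return 1.0 - std::exp(-x);                 // monotonic, bounded below 1
}

That's exactly why the progress bar kept creeping upwards overnight while the app itself had written nothing to stderr for hours.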

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7219894931
RAC: 951511

There might be some information content in comparing the Task Properties snapshot posted by Richard Haselgrove with the task representation posted at Keith's aborted Turing highpay task. As the 101 other aborted tasks on Keith's list all show zero CPU and run time, I currently suppose they were strangled in the cradle by Keith without ever running, and that this one is "The one". The Task Name also matches what Richard posted.

However, CPU time and run time do not. Instead of more than eight hours, those show as 26.19 seconds of run time and 25.98 seconds of CPU time. For comparison, the comparable errored-out task entry for a highpay task I ran on my 2080 Turing card under Windows 10 is visible here: my self-terminated highpay task. That one lists 21 seconds of run time and 7.56 seconds of CPU time.

Broadly speaking, both the Windows and the Linux tasks hit some critical point after a bit over 20 elapsed seconds, though there is clear evidence of abnormal behavior well before that in the Windows case (GPU usage, clock rate, and temperature all indicate the GPU stopped doing anything much within less than three seconds, starting around elapsed time seven seconds). But the Windows case actually terminates at that critical point, while the Linux case seems to settle into a static state which neither logs checkpoints nor accumulates reportable CPU or elapsed time (at least when the matter is terminated with a GUI abort by the user).

Comparing the stderr outputs: the Windows case has some early extra lines that appear to have been added after failure; these are missing in the Linux case, which never went through any orderly internal program termination. Moving on, both cases show similar initial steps, ending with three lines looking like this:


% Starting semicoherent search over f0 and f1.
% nf1dots: 38  df1dot: 2.71528666e-15  f1dot_start: -1e-13  f1dot_band: 1e-13
% Filling array of photon pairs

Those appear to be the last normal progress lines, with the next lines in the Windows (self-terminated) case reading the familiar


ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -956081888
18:26:53 (17832): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
18:27:04 (17832): [normal]: done. calling boinc_finish(28).

While the next lines in the Linux user-aborted case read 


Warning:  Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).

Keith, did you happen to notice whether such indicators as clock rate, GPU temperature, and GPU usage suggested your 2080 was doing "something" during the long hours?

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2955646517
RAC: 718921

archae86 wrote:

The Task Name also matches what Richard posted. 

However, CPU time and run time do not. Instead of more than eight hours, those show as 26.19 seconds of run time and 25.98 seconds of CPU time. For comparison, the comparable errored-out task entry for a highpay task I ran on my 2080 Turing card under Windows 10 is visible here: my self-terminated highpay task. That one lists 21 seconds of run time and 7.56 seconds of CPU time.

That's a slightly off-topic (secondary) observation, but one we should come back to later once we've got these tasks running on the RTX cards.

In theory, BOINC should be measuring both the CPU time and the elapsed time for the child application, and should report those measurements accurately back to the server. The other clues in the reported task result

Sent:  6 Dec 2018 15:22:33 GMT
Received:  7 Dec 2018 1:44:45 GMT

are consistent with the screenshot and Keith's description: the reported run times are not. It seems as if the reported times have been rewound, possibly by the user abort action? From the BOINC point of view, I think we should investigate that as a potential bug. I think I know how I'd approach that under Windows, and I might experiment over the weekend. But I have no idea what corroborative timing tools might be appropriate under Linux.
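
One corroborative check under Linux, while a task is still running, would be to read the child process's accumulated CPU time straight out of /proc and compare it with what the client later reports. A minimal sketch - the PID would have to be found by hand (e.g. via ps), and this covers only CPU time, not elapsed time:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unistd.h>   // sysconf

// Read cumulative CPU time (user + system) for a process from
// /proc/<pid>/stat.  Fields 14 (utime) and 15 (stime) are in clock ticks.
double proc_cpu_seconds(int pid) {
    std::ifstream f("/proc/" + std::to_string(pid) + "/stat");
    std::string line;
    if (!std::getline(f, line)) return -1.0;
    // Field 2 (comm) is parenthesised and may contain spaces, so parse
    // from the last ')' onwards; the next token is field 3 (state).
    std::istringstream rest(line.substr(line.rfind(')') + 2));
    long utime = 0, stime = 0;
    std::string tok;
    for (int field = 3; rest >> tok; ++field) {
        if (field == 14) utime = std::stol(tok);
        if (field == 15) { stime = std::stol(tok); break; }
    }
    return double(utime + stime) / sysconf(_SC_CLK_TCK);
}

int main(int argc, char** argv) {
    if (argc > 1) std::cout << proc_cpu_seconds(std::stoi(argv[1])) << " s\n";
}

If /proc showed eight-plus hours of accumulated CPU time while the eventual server record says 26 seconds, that would pin the rewind firmly on the abort/reporting path rather than on the measurement itself.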

Keith Myers
Joined: 11 Feb 11
Posts: 4963
Credit: 18708895173
RAC: 6321596

Quote:
Keith, did you happen to notice whether such indicators as clock rate, GPU temperature, and GPU usage suggested your 2080 was doing "something" during the long hours?

Yes, the clocks stayed up on the card, as did the GPU utilization. The temps stayed up too. So it burned 9 hours of power for naught.
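
For future incidents it might be worth logging those indicators rather than watching them live. A minimal NVML polling sketch - assuming the NVIDIA driver's NVML library is available and that the 2080 is device 0 (both assumptions; build with g++ poll.cpp -lnvidia-ml):

#include <cstdio>
#include <nvml.h>
#include <unistd.h>

// Poll SM clock, temperature and utilization once per second, so a stalled
// task (clocks and usage high, zero progress) leaves a paper trail.
// Assumes the GPU of interest is NVML device index 0.
int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;
    for (int i = 0; i < 3600; ++i) {           // log for up to an hour
        unsigned int clock = 0, temp = 0;
        nvmlUtilization_t util = {};
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &clock);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        nvmlDeviceGetUtilizationRates(dev, &util);
        std::printf("sm=%u MHz  temp=%u C  gpu=%u%%  mem=%u%%\n",
                    clock, temp, util.gpu, util.memory);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}

Nine hours of high clocks and high utilization alongside a silent stderr would have flagged the stall long before a manual abort.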

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117515836936
RAC: 35400441

The tasks based on the 0104X data file are now history. I've put some information about the very latest file in this thread. The new tasks still seem to be relatively fast running - based on a single observation.

Peter, if you are willing, it should be quite interesting to see if these behave any differently for you.

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117515836936
RAC: 35400441

Richard Haselgrove wrote:
... but one we should come back to later once we've got these tasks running on the RTX cards.

Richard,
Your optimism is very reassuring :-).  Thank you so much for spending time on this.  It's great to have your input.

When I first saw the change in data file, my immediate thought was that the Devs must have decided to go back to the slow, boring, safe and non-controversial standard type of task. I couldn't resist trying one out and was intrigued to see that it seems to be a third variety of these tasks, closer to the problematic type than to the usual type.

I sure hope these will run for Peter (and any other RTX owners trying to contribute).

 

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7219894931
RAC: 951511

Gary Roberts wrote:
Peter, if you are willing, it should be quite interesting to see if these behave any differently for you.

Willing, eager, bumped up my fetch request to get one, suspended intervening tasks, and tried one on my 2080.  It failed promptly, seemingly with the same syndrome as the previous flavor of high-pay tasks.

1. GPU usage dropped back down within about three seconds of engaging.

2. Total elapsed time before the task errored out was logged as 21 seconds.

3. In the task list as displayed by BOINCTasks, it shows in the status columns "Reported: Computation error (28,)"

4. You can review the stderr for a failing task at first failing task.

The primary in-sequence error lines look familiar:

ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -956081888
18:47:40 (11592): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117515836936
RAC: 35400441

archae86 wrote:
...  It failed promptly, seemingly with the same syndrome as the previous flavor of high-pay tasks.

What a nuisance! Thanks for trying. I found three error results in the list for your host - two for 2001L tasks and one for a 0104X. Maybe it took a bit of time for the results to make it into the database where they could be seen.

I was intrigued to see that you got quite a few tasks recently - a surprising number of the previous low-pay resends as well as the new 2001L tasks. Looking down the list of those 2001Ls, there were a *lot* of resends as well. A few people must be having some sort of problem with those at the moment for resends to show up in such numbers right at the start. I don't recall seeing lots of resends so soon after the start of a new file.

I looked at the stderr output for one of the 2001Ls. I'm not a programmer, so it doesn't help me understand the problem. It seems pretty similar to what you reported for the 0104X tasks. It would be nice if someone from the project would give some guidance on the whole matter.

 

Cheers,
Gary.
