Well done everyone - I think things are falling into place. Keith's property page (to save flipping to another tab)
That very clearly shows no checkpointing, and the std_err.txt shows no activity after
% Filling array of photon pairs
I'll head downstairs later and do the same trick of pulling the stderr out of the slot directory of a running task, to see what should happen next.
I think we have clear evidence of a stalled task making zero progress after many hours - disguised by the BOINC client. That's what Gary calls 'some sort of simulated progress' and I call pseudo-progress: it's a real thing, designed to avoid frightening the users (it gives them false confidence that something useful is happening - as Keith has experienced in this case).
I think this situation is actually worse for the project than the previous/Windows behaviour of immediate crashing. It's a huge waste of powerful resources.
Here's a working property page for comparison: timings are consistent, running two-up on a GTX 970, and the std_err continues.
And here's the pseudo-progress culprit:
https://github.com/BOINC/boinc/blob/master/client/app.cpp#L702
https://github.com/BOINC/boinc/commit/506c3b6e419e5b524ad72fcd16e363cd0e296814
https://github.com/BOINC/boinc/commit/61d6d9a20ab70389b50416eebed8cb04deecc43f
// if the app hasn't reported fraction done or reported > 1,
// and a minute has elapsed, estimate fraction done in a
// way that constantly increases and approaches 1.
//
That's designed - quite reasonably - to cope with badly-written apps which don't provide progress information. But it also catches well-written apps (like this one) which normally DO provide the information, and where the failure to do so is important debugging information.
There might be some information to be gained by comparing the Task Properties snapshot posted by Richard Haselgrove with the task record for Keith's aborted Turing highpay task. As the 101 other aborted tasks on Keith's list all show zero CPU and run time, I currently suppose they were strangled in the cradle by Keith without ever running, and that this one is "The one". The Task Name also matches what Richard posted.
However, CPU time and run time do not. Instead of more than eight hours, those show as 26.19 seconds of run time and 25.98 seconds of CPU time. For comparison, the comparable errored-out task entry for a highpay task I ran on my 2080 Turing card under Windows 10 is visible here: my self-terminated highpay task. That one lists 21 seconds of run time and 7.56 seconds of CPU time.
Broadly speaking, both the Windows and the Linux tasks hit some critical point a bit over 20 elapsed seconds in, though there is clear evidence of abnormal behavior well before that in the Windows case (GPU usage, clock rate, and temperature all indicate the GPU stopped doing anything much in less than three seconds, starting around elapsed time seven seconds). But the Windows case actually terminates at that critical point, while the Linux case seems to settle into a static state which neither logs checkpoints nor accumulates reportable CPU or elapsed time (at least when the matter is terminated with a GUI abort by the user).
Comparing the stderr output: the Windows case has some early extra lines that appear to have been added after failure; these are missing in the Linux case, which never went through an orderly internal termination. Moving on, both cases show similar initial steps, ending with three lines looking like this:
<pre>% Starting semicoherent search over f0 and f1.
% nf1dots: 38 df1dot: 2.71528666e-15 f1dot_start: -1e-13 f1dot_band: 1e-13
% Filling array of photon pairs</pre>
Those appear to be the last normal progress lines, with the next lines in the Windows (self-terminated) case reading the familiar
<pre>ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -956081888
18:26:53 (17832): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
18:27:04 (17832): [normal]: done. calling boinc_finish(28).</pre>
While the next lines in the Linux user-aborted case read
<pre>Warning: Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).</pre>
Keith, did you happen to notice whether such indicators as clock rate, GPU temperature, and GPU usage suggested your 2080 was doing "something" during the long hours?
archae86 wrote: However, CPU time and run time do not. Instead of more than eight hours, those show as 26.19 seconds of run time, and 25.98 seconds of CPU time. For comparison, the comparable errored out task entry for a highpay task I ran on my 2080 Turing card under Windows 10 is visible here my self-terminated highpay task. That one lists 21 seconds of run time and 7.56 seconds of CPU time.
That's a slightly off-topic (secondary) observation, but one we should come back to later once we've got these tasks running on the RTX cards.
In theory, BOINC should be measuring both the CPU time and the elapsed time for the child application, and should report those measurements accurately back to the server. The other clues in the reported task result
Sent: 6 Dec 2018 15:22:33 GMT
Received: 7 Dec 2018 1:44:45 GMT
are consistent with the screenshot and Keith's description: the reported run times are not. It seems as if the reported times have been rewound, possibly by the user abort action? From the BOINC point of view, I think we should investigate that as a potential bug. I think I know how I'd approach that under Windows, and I might experiment over the weekend. But I have no idea what corroborative timing tools might be appropriate under Linux.
archae86 wrote: Keith, did you happen to notice whether such indicators as clock rate, GPU temperature, and GPU usage suggested your 2080 was doing "something" during the long hours?
Yes, the clocks stayed up on the card, as did the GPU utilization. The temps stayed up also. So it burned 9 hours of power for naught.
The tasks based on the 0104X data file are now history. I've put some information about the very latest file in this thread. The new tasks still seem to be relatively fast running - based on a single observation.
Peter, if you are willing, it should be quite interesting to see if these behave any differently for you.
Richard Haselgrove wrote: ... but one we should come back to later once we've got these tasks running on the RTX cards.
Richard,
Your optimism is very reassuring :-). Thank you so much for spending time on this. It's great to have your input.
When I first saw the change in data file, my immediate thought was that the Devs must have decided to go back to the slow, boring, safe and non-controversial standard type of task. I couldn't resist trying one out and was intrigued to see that it seems to be a third variety of these tasks, closer to the problematic type than to the usual type.
I sure hope these will run for Peter (and any other RTX owners trying to contribute).
Gary Roberts wrote: Peter, if you are willing, it should be quite interesting to see if these behave any differently for you.
Willing, eager, bumped up my fetch request to get one, suspended intervening tasks, and tried one on my 2080. It failed promptly, seemingly with the same syndrome as the previous flavor of high-pay tasks.
1. GPU usage dropped back down within about three seconds of engaging.
2. Total elapsed time before the task errored out was logged as 21 seconds.
3. In the task list as displayed by BOINCTasks, the status column shows "Reported: Computation error (28,)".
4. You can review the stderr for a failing task at first failing task.
The primary in-sequence error lines look familiar:
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -956081888
18:47:40 (11592): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
archae86 wrote: ... It failed promptly, seemingly with the same syndrome as the previous flavor of high-pay tasks.
What a nuisance! Thanks for trying. I found 3 error results in the list for your host - 2 for 2001L tasks and one for a 0104X. Maybe it took a bit of time for the results to make it into the database where they could be seen.
I was intrigued to see that you got quite a few tasks recently - a surprising number of the previous low-pay resends as well as the new 2001L tasks. As I looked down the list of those 2001Ls, there were a *lot* of resends as well. There must be a few people having some sort of problem with those at the moment for resends to be in such numbers at the very beginning. I don't recall seeing lots of resends so soon after the start of a new file.
I looked at the stderr output for one of the 2001Ls. I'm not a programmer so it doesn't help me understand the problem. Seems pretty similar to what you reported for the 0104X tasks. It would be nice if someone from the project would give some guidance about the whole matter.