The application
Gravitational Wave search O2 Multi-Directional GPU v2.09 (GW-opencl-ati-Beta) windows_x86_64,
running on host https://einsteinathome.org/host/12839586, often fails with these remarks among the output:
DEPRECATION WARNING: program has invoked obsolete function XLALGetVersionString(). Please see XLALVCSInfoString() for information about a replacement.
DEPRECATION WARNING: program has invoked obsolete function InitDopplerSkyScan(). Please see XLALInitDopplerSkyScan() for information about a replacement.
Harmless. Everyone has those entries in their stderr.txt outputs for completed tasks.
Dirk Broer wrote: GPU v2.09
That looks like a laptop with a Radeon Vega 8 integrated graphics chip.
Have you noticed how high the maximum temperatures get while the computer is crunching a GW GPU task? Excessive heat could potentially cause problems and make tasks fail at some point.
Have you been doing anything else with the computer when the tasks failed, or has it been left purely for crunching? It's a modern computer, but some sort of heavy load at the same time as crunching on the integrated GPU chip could cause tasks to fail.
I would also make sure it's running the latest chipset drivers (or AMD drivers) that a user is able to install on it.
Dirk Broer wrote: The
Thanks for providing the link to your host. Also, thanks for looking into the stderr output to try to diagnose the problem yourself.
As Keith mentions, the following is a standard warning message advising the authors of the science app that later versions of some library routines are available. Warnings are just that - warnings. They are for information purposes and are not normally the cause of errors/failures. Obviously the app authors would be fully aware of these warnings and have deliberately chosen not to use the later versions - for some reason not known to us lesser mortals :-).
So, ignoring the warnings, just continue reading the output to find the real error message. In the example I looked at, I found the following:-
which was followed by a 'call stack' which appears to be a list of functions in play at the time of the error. After that call stack, you get:-
Followed by a whole bunch of Windows runtime debugger output - useless to anyone other than possibly the app developer or people working on drivers, etc.
If you google the 'unhandled exception' message you get things like:-
so that doesn't really tell you much, other than that, if an exception handler had been written into the app, the problem might have been recoverable (ie handled) rather than simply crashing the app.
I looked at what sort of hardware you have - a Ryzen 5 3400G with Radeon Vega graphics and BOINC detects that GPU as "AMD AMD 15D8:C8 (13349MB)" - so I guess some sort of APU with access to a bunch of system RAM. My guess is that perhaps the load caused by crunching is close to the limit of what the hardware/cooling system can cope with. There were a couple of tasks that succeeded but most failed. Are you sure the cooling system is up to the heat load from crunching?
The other possibility is to do with driver issues. I tend to steer clear of using APUs for crunching as I've seen lots of comments in the past about driver bugs causing problems with those sorts of devices. Are you using the latest drivers for your hardware?
One final point. If you look at all the errors, they all failed after the same amount of run time - ~10,100 secs. Since the 2 completed tasks took much less than that, it looks like crunching had completely stopped on the failed tasks at an earlier stage, and the 'standard task time limit until the plug is pulled' mechanism was then invoked to stop each task. Perhaps BOINC attempting to stop a stalled task then became the 'exception' event. This is just speculation on my part, but it seems really weird that all failed tasks have the same run time. That suggests a driver issue caused crunching to stall at that earlier stage.
Cheers,
Gary.
The real problem occurs much earlier than that:
Exit status: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED
BOINC deliberately throws an exception, and thus triggers the stack trace, when a task over-runs, in case a programming bug has got it stuck in an endless loop. But with only a single computer affected, running a mature application, that's unlikely.
The application is 'GW-opencl-ati-Beta'. I haven't been following closely, but has anyone else reported problems during the beta test? Might still be worth getting a programmer to take a look. But the proximate cause sounds like a deliberate programming escape-hatch:
Richard Haselgrove wrote: The
Hi Richard, Thanks for chiming in and pointing out something I completely overlooked :-).
I was so fixated on explaining the lack of importance of the warnings that I went straight to the detailed log of stderr messages below the warnings without even looking at the summary at the top. As I don't run Windows, the copious debugger output is a complete mystery to me and I didn't even know that BOINC implemented the time limit by 'throwing an exception' - whatever that really means or implies :-).
As I was composing my reply, the elapsed time of all the failed tasks being constant did convince me that the plug was being pulled by BOINC due to an exit time limit being exceeded but I was more fixated on looking at the number of dots and the number of 'c' checkpoint indicators that a couple of different tasks showed.
For example, the task I linked had a total of 10 rows of completed dots terminated with the 'c' checkpoint indicator. That task had a CPU time component of 157 sec. The log showed the task as having 37 checkpoints. The 10 'c' indicators for 157 sec allows a rough guess of around 16 secs per checkpoint.
I looked at several others and noted much the same behaviour. There seemed to be a correlation between the number of 'c' checkpoint indicators and the amount of CPU time. There were two validated tasks that have since disappeared. I remember them as having CPU times around 700 seconds (if I remember correctly) which seemed also to correlate - ie. 37 checkpoints in 700 secs gives 19 secs per checkpoint - the same sort of ballpark. All this suggested that 'normal' crunching did exist for a certain period for each of the error tasks.
To me, that seemed to indicate that the problem occurred at some point where the GPU locked up or stalled for some reason. In other words a driver/hardware issue rather than a problem or bug in the app. The fact that 2 tasks had completed and validated also supported that the app wasn't the issue.
The 'announcement' for the beta app is here. It was to deal with a situation where if a GPU task was suspended, there would be an entry in the stderr output that "resources were not freed". I was concerned that this might be related to tasks failing due to lack of memory. A couple of posts later in that same thread, I provided more background info about the reasons for the modified app, if you're interested. As I understand it, it was a small change and no further issue for (or impact on) crunching performance.
Cheers,
Gary.
Gary Roberts wrote: As I
Well, the critical point to grasp on to is that there's one starting point for the problems, and everything else is a side effect. You'll be aware of the people who fixate on 'output file missing', without realising that is just a side effect of some earlier problem.
In this case, the task stopped - BOINC stopped the task - because the time limit was exceeded. End of. You'll also be familiar with Linux doing that via 'process got signal n', or suchlike. The Windows equivalent in this case is a 'breakpoint' - a programmer's device to stop the program in its tracks, but preserve the entrails for inspection. There might be something useful in there, but to be honest it's unlikely.
I also saw those, but I noticed something different. Here is the trace from the most recent task:
...................................................................................INFO: Major Windows version: 6 c ....................................................................................c ....................................................................................c ....................................................................................c ....................................................................................c ....................................................................................c ...................-------------------
Most of the "I'm alive" markers are periods, but towards the end they change to dashes. Why? Is that trying to tell us something?
Quote: Most of the "I'm alive"
That's interesting. Assuming the periods are the BOINC "heartbeat", where in the code does the heartbeat change from dots to dashes?
Richard can probably find the line of code that does that.
No, it'll be an Einstein heartbeat. stderr.txt is written by the science app, not by BOINC. Gary will probably know which developer to ask to take a look.
Richard Haselgrove wrote: Most
No, I don't believe so :-).
These "markers" are a two-character string (a period followed by a space) with no trailing newline. I believe each one indicates an individual calculation loop being completed. When it's time for a checkpoint to be written, the marker becomes a 'c' followed by a newline instead.
When the science app stalls, whatever then intervenes just starts its output hard up against the last period/space pair that was printed before the stall. My guess is that the space separated row of dashes is just intended as a 'separator line' for the new output that immediately follows in response to the problem. This pattern is repeated later, following the block that starts with "Call stack:". There is a further identical string of dashes that precedes the "Unhandled Exception Detected..." message.
Cheers,
Gary.