Computation Error -- Output File Absent

bluestang
Joined: 13 Apr 15
Posts: 34
Credit: 2492970228
RAC: 1602
Topic 216857

Ever since my Vega 64 started running the 1025L and 1031L data sets I've been getting way too many errors that I haven't gotten before.  This is from the BOINC Manager event log for 1 of the WUs...

 

11/7/2018 3:17:51 PM | Einstein@Home | Computation for task LATeah1031L_172.0_0_0.0_11723628_0 finished
11/7/2018 3:17:51 PM | Einstein@Home | Output file LATeah1031L_172.0_0_0.0_11723628_0_0 for task LATeah1031L_172.0_0_0.0_11723628_0 absent
11/7/2018 3:17:51 PM | Einstein@Home | Output file LATeah1031L_172.0_0_0.0_11723628_0_1 for task LATeah1031L_172.0_0_0.0_11723628_0 absent

 

Anyone else having any issues with these newer WUs?

Here is the host... https://einsteinathome.org/host/12668367

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

I'm running a Vega 56 and I don't see computation errors, just the occasional invalid task.

Do you overclock?
Any recent upgrades of the graphics driver?
Have you checked the temperature of the card and other components?
Are you confident that the power supply is up to the task?

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2774525024
RAC: 844115

'Output file absent' is a symptom, not a cause. It crashed - there will be a reason in the std_err or the logs. Please try to track back to the original cause.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110033133390
RAC: 22401876

bluestang wrote:
Ever since my Vega 64 started running the 1025L and 1031L data sets I've been getting way too many errors that I haven't gotten before....

I don't have any Vega GPUs but I do have a lot of different Polaris cards - RX 460 through RX 580.  I'm not seeing any real difference in unexpected/unusual compute errors between the different data types.

The event log information probably isn't the best place to get a handle on what is causing a problem.  The "output file absent" stuff is just BOINC saying that when the science app finished (in this case with a compute error) there weren't any of the usual result outputs to gather up and send back to the project.  Hardly surprising since the app had crashed.  What you need to do is click on the task ID link for a failed task on the website to see what the science app itself reported in the extra information stream, the std_err.txt output.  If you scroll to the end of that error output, you can see what the app was doing at the point of failure. For one of yours, this gave things like

% Binary point 505/1631
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
Error in computing index of fft input array, i:-1027760820 pair:281377
ERROR: prepare_ts_2_phase_diff_sorted() returned with error 18935032
22:46:35 (1480): [CRITICAL]: ERROR: MAIN() returned with error '1'
FPU status flags: PRECISION
22:46:47 (1480): [normal]: done. calling boinc_finish(65).
22:46:47 (1480): called boinc_finish

</stderr_txt>

I'm not a programmer, just a volunteer like you, so I have no idea exactly why, after processing 504 of the 1631 binary points to be analysed (apparently quite successfully), the process should fail on binary point 505.  So whilst we really don't get much further without input from the author of the app, we can probably make a few educated guesses.
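One such guess, for what it's worth: an array index that comes out as a huge negative number (the i:-1027760820 above) is the classic signature of 32-bit integer arithmetic wrapping around - either the app hit a rare corner case in its index calculation, or corrupted intermediate data (the kind of thing overstressed hardware can produce) fed an impossible value into it.  The numbers and variable names in this little stand-alone C sketch are completely made up and have nothing to do with the app's actual source; it only shows the wrap-around effect itself:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical values, purely for illustration - not from the app. */
    int32_t pair   = 281377;   /* a photon-pair counter like the one in the stderr */
    int32_t stride = 8192;     /* an invented per-pair offset into the FFT input   */

    /* 281377 * 8192 = 2305040384, which exceeds INT32_MAX (2147483647).
       Truncated back to 32 bits, the value wraps around and comes out negative. */
    int32_t bad_index  = (int32_t)((int64_t)pair * stride);

    /* Keeping the arithmetic in 64 bits preserves the true value. */
    int64_t good_index = (int64_t)pair * stride;

    printf("32-bit index: %d\n",   bad_index);               /* negative */
    printf("64-bit index: %lld\n", (long long)good_index);   /* 2305040384 */
    return 0;
}

Whether it's a genuine overflow in the app or garbage data pushing a calculation past 32 bits, the result looks the same to the app: an index that can't possibly be valid, and the error exit you see above.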

Crunching pushes the hardware quite hard and can easily expose a whole range of issues, from bugs in the app itself to problems with immature drivers or flaws in the hardware/firmware that have not yet been identified and worked around with driver updates.  There are also potential issues with heat/voltage/frequency that might push something over the edge.  My experience over the years has been that many issues just like this can be mitigated by 'backing off' a little in what you are trying to extract from the hardware, whilst making sure that heat removal and power quality are optimal.

There are two quite different categories of task type for the FGRPB1G search, but this is not all that new - data files have been alternating between the two distinct types for most of this year.  The latest transition happened right at the end of October.  The previous series (files like LATeah0104[TUVW].dat) changed to files like 1025L, 1031L and the current 1032L, and tasks based on these take significantly longer to crunch than ones for the 0104? series.  The new Turing series GPUs seem to be having the reverse problem to yours: they fail on the faster running previous series but seem to be OK on the current slower running series.

In looking at the details for your host, I notice that BOINC sees [5] GPUs.  Whilst it says 'Vega 64', they could be different cards, but I'm wondering why BOINC says [5].  Do you really have 5 GPUs?  And what about other hosts that claim 10 or 20 GPU instances?  Are you running some sort of modified BOINC version or modified configuration that is inflating the GPU count?  Could any of that be why tasks are failing?

 

Cheers,
Gary.

bluestang
Joined: 13 Apr 15
Posts: 34
Credit: 2492970228
RAC: 1602

Temps are fine.  Power supply is fine.  Drivers unchanged.  stderr file shows issues with connecting/uploading to Einstein when Computation Errors hit.

I changed from running 3 concurrent tasks to 2 for now and it seems that helped.  I'll let it run like that until I drain my cache and see what a new batch of downloads brings.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110033133390
RAC: 22401876

bluestang wrote:
...stderr file shows issues with connecting/uploading to Einstein when Computation Errors hit.

Perhaps you mean the BOINC event log.  The stderr file just shows messages from the app itself, and the app doesn't talk directly to the project.  Because the database is so large, congestion/slow response is quite common, but it would likely be just coincidence if the science app threw a comp error at the very point when poor project response to BOINC requests was causing BOINC to back off and report comms issues.

bluestang wrote:
I changed from running 3 concurrent task to 2 for now and it seems that helped.

That's actually a good idea.  I'd forgotten about it, but last year when I first put some RX 580s into service, I tested both 2x and 3x and found a performance increase (fairly marginal) for 3x over 2x.  Some time later (earlier this year), when I was doing some kernel and amdgpu driver module upgrades, I noticed that after the update (still running 3x) one of the three tasks ran quite a bit slower than the other two, so one in every three crunch times was noticeably longer.  I didn't see tasks failing, just this instability in crunch times.

This could be 'cured' by running 2x.  Not only were the crunch times consistent again, but 2x was now performing as well as, or even a little 'better' than, the previously stable 3x.  I saw similar behaviour across different hosts and different AMD GPUs, so I changed any that were running 3x back to 2x and everything has been fine ever since.  Hopefully you will see your problem solved as well.
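For reference, one common way to set the concurrency is an app_config.xml in the Einstein project folder inside BOINC's data directory (the project's GPU utilization factor preference does the same job).  The app name below is what I believe the FGRPB1G GPU app is called - check it against the names in your own client_state.xml before relying on it.  A 2x setup would look something like this:

<app_config>
   <app>
      <name>hsgamma_FGRPB1G</name>
      <gpu_versions>
         <!-- 0.5 of a GPU per task = 2 tasks per GPU; 0.33 would give 3x -->
         <gpu_usage>0.5</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>

After saving the file, tell the client to re-read config files from BOINC Manager (or just restart BOINC) for it to take effect.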

 

Cheers,
Gary.
