Computation finished,... Output file absent GPU OpenCL tasks LATeah0010L

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117879124814
RAC: 34729615

DiablosOffens wrote:

I've got a similar problem. Some project apps always fail to find the output file at 100% completion. Here are the last few lines from task 671324524:

% Following up candidate number: 10
% Refining in S
% Following-up in P
% Writing follow-up output file.
FPU status flags:  PRECISION
00:11:13 (18052): [normal]: done. calling boinc_finish(0).
00:11:13 (18052): called boinc_finish

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>LATeah0037L_1132.0_0_0.0_5794335_1_0</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>LATeah0037L_1132.0_0_0.0_5794335_1_1</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>

Your problem is a bit different to the one that started this thread but seems pretty similar to one I reported yesterday here.  My problem is with test tasks for a brand new search that will perhaps be starting in full on Monday.  There are a very small number of these at the moment and I happened to get four of them.  All mine have failed.

The error message is remarkably similar to what you are reporting.  As of the time of writing, the server status page (for the new FGRP5 run) shows 98 tasks sent, 60 still in progress and 38 failed.  There are no successfully returned tasks.  I only checked two of my own and a couple of the quorum partners and the failure mechanism seems the same.  I imagine someone will be looking at this first thing on Monday morning.

In my case (and I presume in the others as well) this is nothing to do with symlinks to slots directories on a separate drive.  It will be interesting to know what the problem is.  I guess it will also be something to do with moving the raw results from the slot directory to the project directory prior to uploading.  It's only affecting the new search and not the FGRPB1 and FGRPB1G existing searches.

Are you really sure your issue is due to the slots directories being on another drive?  Have you tried (temporarily) disabling the symlink and creating a local slots directory to see if the tasks then complete successfully?

Cheers,
Gary.

DiablosOffens
Joined: 14 Jul 05
Posts: 2
Credit: 1368780
RAC: 0

Ok, I recreated the slots directory next to the projects directory and received a new task (#671775425) for the same app.

This task just finished and was successfully uploaded. Here is the output of the BOINC client:

13.08.2017 18:14:54 | Einstein@Home | Computation for task LATeah0037L_1172.0_0_0.0_5455485_1 finished
13.08.2017 18:14:57 | Einstein@Home | Started upload of LATeah0037L_1172.0_0_0.0_5455485_1_0
13.08.2017 18:14:57 | Einstein@Home | Started upload of LATeah0037L_1172.0_0_0.0_5455485_1_1
13.08.2017 18:14:59 | Einstein@Home | Finished upload of LATeah0037L_1172.0_0_0.0_5455485_1_0
13.08.2017 18:14:59 | Einstein@Home | Finished upload of LATeah0037L_1172.0_0_0.0_5455485_1_1

And the output logged by the server:

% Following up candidate number: 10
% Refining in S
% Following-up in P
% C 11 1265
% Writing follow-up output file.
FPU status flags:  PRECISION
18:14:51 (1240): [normal]: done. calling boinc_finish(0).
18:14:51 (1240): called boinc_finish

</stderr_txt>
]]>

So at least on my side, it was exactly the same problem which I discovered before.

But I can only guess what the problem is on your side; there is too little information about your setup or other circumstances. As a hint: you're right, it has nothing to do with symlinks on their own. Rather, it's the fact that the move operation has to move the file across volume boundaries. The same thing would happen if you could explicitly specify the paths used for the projects and slots directories and put them on different volumes. The same issue could also occur if the user account running the project app doesn't have full access to both directories and is missing the special permissions (default read/write isn't enough on Windows) that are needed for move operations.
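Roughly what has to happen is something like this (a Python sketch of the general mechanism only, not the actual app's code; `move_result` and its arguments are hypothetical names):

```python
import errno
import os
import shutil

def move_result(src, dst):
    """Move a finished result file from the slot directory (src) to the
    project directory (dst).  A plain rename is atomic but only works
    within one volume; across volume boundaries it fails with EXDEV and
    the mover must fall back to copy-then-delete, which also requires
    delete permission on the source, not just default read/write."""
    try:
        os.rename(src, dst)          # fast path: same volume, atomic
    except OSError as err:
        if err.errno != errno.EXDEV:
            raise                    # e.g. a genuine permissions problem
        shutil.copy2(src, dst)       # cross-volume fallback: copy ...
        os.remove(src)               # ... then delete the original
```

If the app only ever attempts the plain rename, a cross-volume setup (or missing delete permission for the fallback) would leave the output file behind in the slot directory, matching the "not found" upload errors above.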

Gordon Haverland
Joined: 28 Oct 16
Posts: 20
Credit: 428489605
RAC: 0

I too had a bunch of tasks end prematurely, all saying that 2 output files are absent from each run.  I had just connected to Einstein@Home, so these were initial jobs (8 GPU jobs).  The CPU job is waiting to start.  Having this machine (12566250) do GPU tasks for E@H is new.  This is a dual core AMD64 running Devuan Ascii/Ceres with Mesa providing OpenCL.  The GPU is a HD5450 (a tiny GPU).

Looking at the jobs at E@H, the stderr output is always empty (or rather, I checked 5 of the 8, and they were empty).  I went looking in the "slots/" directory for output, and I see slots 0, 1 and 2 with SetiAtHome and WorldCommunityGrid jobs; I don't see any E@H jobs in a slot.

I wouldn't be surprised if I have something missing.  clinfo seems to think things are okay.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117879124814
RAC: 34729615

Gordon Haverland wrote:
I too had a bunch of tasks end prematurely, all saying that 2 output files are absent from each run.

This is not the real problem.  I think this is just the consequence of a quite separate sequence of events.  I occasionally see this after restarting a machine where something (usually to do with the GPU) has crashed.  I've just seen this exact message on a machine that had stopped crunching overnight.

Very often, the machine hasn't crashed but the graphics have.  I know this because if I use the magic key sequence <alt>+<sysreq>+R E I S U B (if you don't know about this, just google REISUB) the machine will do exactly what it is being told to do - ask processes to terminate, kill anything not responding, close open files, sync the disks, unmount file systems, shut down and reboot.  If I use this procedure, there never seems to be a problem on restart.  If the kernel itself (not just the graphics) has crashed, these keys do nothing and a hard reset will be needed.  On restarting after a hard reset, when BOINC is launched, occasionally one of the running GPU tasks will exit with a computation error (output files absent message).  This is what happened to me just now.

I interpret the message this way - someone please correct me if you know better.  BOINC has decided that crunching cannot proceed (for whatever reason) on this particular task so the task exits - perhaps in such a way as to allow the next step of copying/renaming any output files to be attempted.  Of course, if the machine previously crashed in such a way that any output was not saved, the output files won't exist and can't be copied/renamed - hence the message you see.  I regard the message as information only - it's a symptom, not a cause.

Gordon Haverland wrote:
This is a dual core AMD64 running Devuan Ascii/Ceres with Mesa providing OpenCL.  The GPU is a HD5450 (a tiny GPU).

I'm guessing that Mesa/LLVM is the problem, combined with the fact that the GPU may not have the capability to properly run the Einstein GPU app.  I've seen other examples of problems posted here that seem to be related to Mesa/LLVM.

Gordon Haverland wrote:
Looking at the jobs at E@H, the stderr output is always empty (or, I checked 5 of 8, and they were empty).

I went to your tasks list and clicked on the task ID for just one of your failed tasks.  The stderr output certainly isn't empty.  Here is an excerpt.

Using OpenCL platform provided by: Mesa
Using OpenCL device "AMD CEDAR (DRM 2.48.0 / 4.9.0-3-amd64, LLVM 4.0.1)" by: AMD
Max allocation limit: 751619276
Global mem size: 1073741824
OpenCL compiling FAILED! : -11 . Error message: input.cl:7:26: error: unsupported OpenCL extension 'cl_khr_fp64' - ignoring
input.cl:10:30: error: unknown type name 'double2'; did you mean 'double'?
input.cl:10:30: error: use of type 'double' requires cl_khr_fp64 extension to be enabled

This is NOT what causes your problem.  This is a small test compile checking for double precision support - which your card doesn't have, that's all.  It simply means that the final followup stage of processing the ten most likely candidate signals cannot be done on this GPU, so it will be transferred back to be done on the CPU.  The only adverse effect is that the followup stage will take longer than it normally would if it could be done on the GPU.

I believe your real problem is what comes next in the stderr output.

LLVM ERROR: Cannot select: 0x2170830: i32,ch = AtomicCmpSwap<Volatile LDST4[%1405(addrspace=1)]> 0x14600d0, 0x220d5c8, 0x15ed7b8, 0x21f3750
0x220d5c8: i32,ch = CopyFromReg 0x14600d0, Register:i32 %vreg200
0x15ed9c0: i32 = Register %vreg200
0x15ed7b8: i32,ch = CopyFromReg 0x14600d0, Register:i32 %vreg202
0x220d768: i32 = Register %vreg202
0x21f3750: i32 = bitcast 0x21713f8
0x21713f8: f32 = fadd 0x220d700, 0x1592de8
0x220d700: f32,ch = CopyFromReg 0x14600d0, Register:f32 %vreg194
0x220bf38: f32 = Register %vreg194
0x1592de8: f32 = bitcast 0x15ed7b8
0x15ed7b8: i32,ch = CopyFromReg 0x14600d0, Register:i32 %vreg202
0x220d768: i32 = Register %vreg202
In function: kernel_ts_2_phase_diff_sorted

I have absolutely no idea what all that means but maybe someone else will :-).

It's very commendable that you're prepared to support crunching at Einstein even if it is likely to be painfully slow because of the old, unsuitable hardware.  If you want to run GPU tasks, the old CPU isn't the impediment; it's the GPU.  A relatively cheap, modern GPU like a GTX1050 or an RX 460 is quite miserly on power and produces great output.  I have great success with RX 460s in almost 10 year old dual or quad core boxes.  An RX 460 in an E6300 Pentium dual core host, crunching 2x on the GPU and a single CPU task, uses less than 150W from the wall and gives a RAC >250K.  The main potential problem is that you need to research whether your distro can provide a properly working driver.

Cheers,
Gary.

Gordon Haverland
Joined: 28 Oct 16
Posts: 20
Credit: 428489605
RAC: 0

Okay, so I was looking in the wrong place for info.  Thanks.

I am getting tired of catalyst/fglrx, and wanted to try Mesa.  I am considering getting the RX550 low profile card to replace this low profile HD5450.  But it may be that the computer in question doesn't have a big enough power supply for that.

The machine I use for a desktop has a dual CPU in it and an R7-250.  The R7-250 is supposed to be in a machine with an A10-7860K APU in it, but I never could get crimson (catalyst for dual GPU) to work.  But I have an RX460 and an 8250e CPU to put in that "desktop" machine when it gets upgraded to Devuan.  The A10 is sitting in an ATX tower, but the intention is to move it into a mini-ITX case (with the R7-250).  That would leave the ATX tower available for a new Ryzen 1600X and dual RX560s to move in, which would leave the only old GPU in my server, an HD6450.

At some point, I should get comfortable converting these Debian/Jessie machines to Devuan.  :-)

Gordon Haverland
Joined: 28 Oct 16
Posts: 20
Credit: 428489605
RAC: 0

Thinking something might be missing, I installed the va, vdpau and vulkan mesa drivers.  I also went into the einstein project directory and ran ldd against any file that wasn't an image.  Most of the time, these files weren't executables, so nothing useful was found.  The few executables present all had the libraries they were supposed to.  But none of those files looked to me like the GPU program(s).

Gordon Haverland
Joined: 28 Oct 16
Posts: 20
Credit: 428489605
RAC: 0

I think it was at the SETI website, but there was a note about needing multiarch installed, and that running ldd against the downloaded binaries could show you that, while you had some specific library present on your system, what you needed was the version for a different architecture.  Mostly this was for amd64 systems needing the 32-bit (i386) versions of libraries.

But if BOINC cleans things up after an error running a job, how does one find the executable in order to run ldd against it?
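One way around this: BOINC cleans out the slots/ directory, but the downloaded app binaries persist in the projects/ directory between tasks, so you can scan there for ELF files and run ldd against each hit.  A minimal sketch, assuming a Linux host (`find_elf_binaries` is a hypothetical helper, not a BOINC tool):

```python
import os

def find_elf_binaries(root):
    """Walk a directory tree (e.g. a BOINC projects/ dir) and return
    every file that starts with the ELF magic number -- these are the
    candidates to feed to 'ldd' when checking for missing or
    wrong-architecture libraries."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    is_elf = f.read(4) == b"\x7fELF"
            except OSError:
                continue                 # unreadable file: skip it
            if is_elf:
                hits.append(path)
    return hits

# Then run, for example:  ldd <path>  on each returned path.
```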
