Computation finished,... Output file absent GPU OpenCL tasks LATeah0010L

Love, joy, peace, patience, kindness, generosity, faithfulness, gentleness, + self-control are the fruit of the Holy Spirit . God is love, Jesus proves it. Dios es amor. Chirst has died, He is risen, He will come again.  LPa H
Joined: 16 Nov 07
Posts: 3
Credit: 7545952
RAC: 2269
Topic 204065

I hope this isn't a duplicated topic...I searched & found no match.

Lately I get no CUDA WUs, only OpenCL ones, and they all crash.  I've run GPU tasks before (months ago): BRPS152 cuda 32.

I recently updated my NVIDIA drivers, so I should be able to run OpenCL tasks.

2016-12-26 11:45:17 AM | | Starting BOINC client version 7.6.22 for windows_x86_64
2016-12-26 11:45:17 AM | | log flags: sched_ops, task
2016-12-26 11:45:17 AM | | Libraries: libcurl/7.45.0 OpenSSL/1.0.2d zlib/1.2.8
2016-12-26 11:45:17 AM | | Data directory: C:\ProgramData\BOINC\B7 0
2016-12-26 11:45:18 AM | | CUDA: NVIDIA GPU 0: GeForce GT 640 (driver version 376.33, CUDA version 8.0, compute capability 3.5, 1024MB, 115MB available, 803 GFLOPS peak)

2016-12-26 11:45:18 AM | | OpenCL: NVIDIA GPU 0: GeForce GT 640 (driver version 376.33, device version OpenCL 1.2 CUDA, 1024MB, 115MB available, 803 GFLOPS peak)
2016-12-26 11:45:18 AM | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz [Family 6 Model 42 Stepping 7]
2016-12-26 11:45:18 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 sse4_2 popcnt aes syscall nx lm avx vmx smx tm2 pbe
2016-12-26 11:45:18 AM | | OS: Microsoft Windows 10: Core x64 Edition, (10.00.14393.00)
2016-12-26 11:45:18 AM | | Memory: 7.98 GB physical, 24.12 GB virtual
2016-12-26 11:45:18 AM | | Disk: 918.83 GB total, 806.43 GB free
2016-12-26 11:45:18 AM | | Local time is UTC -5 hours
2016-12-26 11:45:18 AM | | VirtualBox version: 5.0.12  

However, the Einstein OpenCL WUs always finish 1) abruptly (no error status) but 2) with 'Output file... absent'


2016-12-26 11:35:03 AM | Einstein@Home | Resetting project
2016-12-26 11:35:30 AM | Einstein@Home | work fetch resumed by user
2016-12-26 11:35:32 AM | Einstein@Home | update requested by user
2016-12-26 11:35:34 AM | Einstein@Home | Master file download succeeded
2016-12-26 11:35:39 AM | Einstein@Home | Sending scheduler request: Requested by user.
2016-12-26 11:35:39 AM | Einstein@Home | Requesting new tasks for NVIDIA GPU
2016-12-26 11:35:41 AM | Einstein@Home | Scheduler request completed: got 1 new tasks
2016-12-26 11:36:11 AM | Einstein@Home | Starting task LATeah0010L_820.0_0_0.0_2685700_0
2016-12-26 11:36:32 AM | Einstein@Home | Computation for task LATeah0010L_820.0_0_0.0_2685700_0 finished
2016-12-26 11:36:32 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_2685700_0_0 for task LATeah0010L_820.0_0_0.0_2685700_0 absent
2016-12-26 11:36:32 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_2685700_0_1 for task LATeah0010L_820.0_0_0.0_2685700_0 absent
2016-12-26 11:36:49 AM | Einstein@Home | update requested by user
2016-12-26 11:36:52 AM | Einstein@Home | Sending scheduler request: Requested by user.
2016-12-26 11:36:52 AM | Einstein@Home | Reporting 1 completed tasks
2016-12-26 11:36:52 AM | Einstein@Home | Requesting new tasks for NVIDIA GPU
2016-12-26 11:36:54 AM | Einstein@Home | Scheduler request completed: got 1 new tasks
2016-12-26 11:36:58 AM | Einstein@Home | Starting task LATeah0010L_820.0_0_0.0_2615420_0
2016-12-26 11:37:19 AM | Einstein@Home | Computation for task LATeah0010L_820.0_0_0.0_2615420_0 finished
2016-12-26 11:37:19 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_2615420_0_0 for task LATeah0010L_820.0_0_0.0_2615420_0 absent
2016-12-26 11:37:19 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_2615420_0_1 for task LATeah0010L_820.0_0_0.0_2615420_0 absent
2016-12-26 11:37:19 AM | Einstein@Home | work fetch suspended by user
2016-12-26 11:37:29 AM | Einstein@Home | update requested by user
2016-12-26 11:37:30 AM | Einstein@Home | Sending scheduler request: Requested by user.
2016-12-26 11:37:30 AM | Einstein@Home | Reporting 1 completed tasks
2016-12-26 11:37:30 AM | Einstein@Home | Not requesting tasks: "no new tasks" requested via Manager
2016-12-26 11:37:31 AM | Einstein@Home | Scheduler request completed
2016-12-26 11:37:41 AM | Einstein@Home | Resetting project
2016-12-26 11:38:08 AM | Einstein@Home | Resetting project
2016-12-26 11:38:46 AM | Einstein@Home | work fetch resumed by user
2016-12-26 11:38:47 AM | Einstein@Home | update requested by user
2016-12-26 11:38:48 AM | Einstein@Home | Master file download succeeded
2016-12-26 11:38:53 AM | Einstein@Home | Sending scheduler request: Requested by user.
2016-12-26 11:38:53 AM | Einstein@Home | Requesting new tasks for NVIDIA GPU
2016-12-26 11:38:55 AM | Einstein@Home | Scheduler request completed: got 1 new tasks
2016-12-26 11:38:59 AM | Einstein@Home | work fetch suspended by user
2016-12-26 11:39:20 AM | Einstein@Home | Starting task LATeah0010L_820.0_0_0.0_6755665_0
2016-12-26 11:39:41 AM | Einstein@Home | Computation for task LATeah0010L_820.0_0_0.0_6755665_0 finished
2016-12-26 11:39:41 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_6755665_0_0 for task LATeah0010L_820.0_0_0.0_6755665_0 absent
2016-12-26 11:39:41 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_6755665_0_1 for task LATeah0010L_820.0_0_0.0_6755665_0 absent  

What can I do?  Please help!

LLP, PhD PE

I think therefore I THINK I am
I think but this is not the origin of my existence, it is not the source of my being
I think but my thinking only proves my existence in my own thoughts not to anyone else
God is Love (Jesus proves it) therefore we are

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

While waiting for better advice, you could try updating BOINC to the newer version 7.6.33:

https://boinc.berkeley.edu/download_all.php

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7229838194
RAC: 1155113

You will probably find more enlightenment for this type of problem in reviewing the stderr returned by the task than by observing the message log.

For example this can be found for one of your tasks here.

Here are the lines from that task's stderr that seem most specific to your failure:

<message> The remote adapter is not compatible. (0x3c) - exit code 60 (0x3c) </message>

and later

% Filling array of photon pairs
Error in OpenCL context: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_WRITE_BUFFER on GeForce GT 640 (Device 0).

Error during OpenCL bloc_info host->device transfer - qsort (error: -4)
Error in OpenCL context: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on GeForce GT 640 (Device 0).

ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:867: Clear fft_vec failed. status=-4
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 0
11:50:36 (19684): [CRITICAL]: ERROR: MAIN() returned with error '-4'
FPU status flags: PRECISION
11:50:48 (19684): [normal]: done. calling boinc_finish(60).

Sadly I'm not the one to diagnose what is wrong, but perhaps someone else will come along who can.  One possibility is that a GT 640 just cannot run these tasks.
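As a rough aid for reading stderr like the excerpt above: the value -4 that appears as "status=-4" and "(error: -4)" is the number the standard OpenCL headers assign to CL_MEM_OBJECT_ALLOCATION_FAILURE, which matches the text of the error lines. The sketch below maps a few such codes to their names. The function name `opencl_codes` and the regular expression are made up for illustration and are not part of the Einstein@Home app or BOINC.

```python
import re

# Subset of OpenCL status codes as defined in the Khronos cl.h header.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Matches the two spellings seen in the excerpt: "status=-4" and "(error: -4)".
STATUS_RE = re.compile(r"(?:status=|error:\s*)(-?\d+)")


def opencl_codes(stderr_text):
    """Return the distinct status codes found in stderr, mapped to their names."""
    found = {}
    for match in STATUS_RE.finditer(stderr_text):
        code = int(match.group(1))
        found[code] = OPENCL_ERRORS.get(code, "unknown")
    return found


# Two lines copied from the stderr excerpt quoted in this post.
sample = """\
Error during OpenCL bloc_info host->device transfer - qsort (error: -4)
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:867: Clear fft_vec failed. status=-4
"""
print(opencl_codes(sample))
```

An allocation failure of this kind is at least consistent with the GPU running short of memory, although the code alone does not say which allocation failed or why.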

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

archae86 wrote:
One possibility is that a GT 640 just cannot run these tasks.

It appears that the GT 640 here has only 1GB of memory.  Currently this is insufficient, but changes will come soon which may help.

See https://einsteinathome.org/content/no-new-work-recxeived-2-weeks#comment-153370
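For anyone wanting to check their own card, the memory figures can be read straight off the BOINC startup line quoted in the first post. This is only a sketch: the helper name `gpu_memory` and the regular expression are assumptions, and exactly how BOINC determines the "available" figure is not established in this thread.

```python
import re

# Memory fields as they appear in a BOINC startup line, e.g.
# "... 1024MB, 115MB available, 803 GFLOPS peak)".
MEM_RE = re.compile(r"(\d+)MB, (\d+)MB available")


def gpu_memory(log_line):
    """Return (total_mb, available_mb) parsed from a startup line, or None."""
    match = MEM_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The OpenCL detection line from the original poster's log.
line = ("OpenCL: NVIDIA GPU 0: GeForce GT 640 (driver version 376.33, "
        "device version OpenCL 1.2 CUDA, 1024MB, 115MB available, 803 GFLOPS peak)")
total, free = gpu_memory(line)
print(f"{total} MB total, {free} MB reported available")
```

On that host the log reports only 115MB of the 1024MB as available at startup, which fits the picture of a 1GB card having too little room for these tasks.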

Bill
Joined: 2 Jun 17
Posts: 38
Credit: 329025120
RAC: 176566

I'm assuming I am having similar results but with slightly different hardware.  Here is my stderr for one WU:

 

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073740940 (0xc0000374)
</message>
<stderr_txt>
08:37:17 (9196): [normal]: This Einstein@home App was built at: Feb 15 2017 09:23:49

08:37:17 (9196): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
08:37:17 (9196): [debug]: 1.1e+016 fp, 3.7e+009 fp/s, 2804377 s, 778h59m37s39
08:37:17 (9196): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0035L.dat --alpha 4.42281478648 --delta -0.0345027837249 --skyRadius 2.152570e-06 --ldiBins 15 --f0start 868.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 3.344368011e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah0035L_0876_1847360.dat --debug 1 --device 0 -o LATeah0035L_876.0_0_0.0_1847360_0_0.out
output files: 'LATeah0035L_876.0_0_0.0_1847360_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0035L_876.0_0_0.0_1847360_0_0' 'LATeah0035L_876.0_0_0.0_1847360_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0035L_876.0_0_0.0_1847360_0_1'
08:37:17 (9196): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
08:37:17 (9196): [debug]: Set up communication with graphics process.

</stderr_txt>
]]>

 

I have an NVIDIA NVS 5400M (1GB) with the latest drivers, and an Intel HD Graphics 4000.  Just checking to see if I am doing something else wrong, or if I need to wait for a patch.

 

Bear with me, I'm a noob with BOINC; I'm not sure which information is pertinent, but I can get it if needed.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117779745296
RAC: 34783579

Hi Bill,
Welcome to the Einstein project!

One of the Devs may be able to give you a better answer but I'll try to suggest what might be going on.  Here is the last part of the message you provided, just as things go pear shaped.

Bill_73 wrote:
08:37:17 (9196): [debug]: Set up communication with graphics process.

I looked at a few task IDs from your list and at a quick glance they all seem to fail at this point.  For most, when you look at the tasks list, there is a very small run time (elapsed time) and essentially zero CPU time.  There is one, near the top of the list, that has significant amounts of time recorded before failure.  In other words, crunching is possible with your setup, and it's probably not the driver at fault.

If you click the task ID for the task that made some progress, you can get a bit of an idea of some of the stages that crunching goes through.  At the very start of that task's output, you can see a <message> ... </message> block containing the error message.  If a task finishes without error, that block won't be there.  All the other lines are what you normally see, including the following excerpt:

15:25:26 (12704): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0000000000357930 , 0000000000357810]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "NVS 5400M" by: NVIDIA Corporation
Max allocation limit: 268435456
Global mem size: 1073741824
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0035L.dat
% Total amount of photon times: 30007
% Preparing toplist of length: 10
% Read 1255 binary points
read_checkpoint(): Couldn't open file 'LATeah0035L_836.0_0_0.0_1021570_0_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1255
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 31 df1dot: 3.344368011e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
.
.
.

 This gives the stuff you are supposed to see immediately following "communication with graphics process" where your tasks are failing.  The message about "no such file or directory" is quite normal because at the very start of crunching of any task, there is no saved checkpoint (see later) to restart from.

If you continue looking through that output you will see increasing numbers of "binary points" being completed.  There are 1255 to do all up so you can see how many out of 1255 have been done as you scroll through.  At regular intervals, you will see ones with an extra line like the following

.
.
% C 0 6
% Binary point 7/1255
% Starting semicoherent search over f0 and f1.
% nf1dots: 31 df1dot: 3.344368011e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
.
.

The extra line is the one "% C 0 6"  which means that at the completion of binary point 6 out of 1255, a 'checkpoint' was written to disk.  A checkpoint is created when the state of a task is saved at regular intervals so that if BOINC is stopped and restarted at any point, the crunching of a task can resume from the last saved checkpoint rather than having to restart from the beginning.
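If scrolling through a long stderr gets tedious, the markers described above can be counted with a short script. This is a minimal sketch, not a project tool: the function name `summarize` is hypothetical, and reading the first number in "% C 0 6" as the sky point index is an assumption based on the "% checkpoint read: skypoint 0 binarypoint 48" line format.

```python
import re

BINARY_POINT_RE = re.compile(r"^% Binary point (\d+)/(\d+)")
# Assumption: "% C <sky point> <binary point>" marks a checkpoint write.
CHECKPOINT_RE = re.compile(r"^% C (\d+) (\d+)")
RESUME_RE = re.compile(r"^% checkpoint read: skypoint (\d+) binarypoint (\d+)")


def summarize(stderr_text):
    """Summarize progress, checkpoint and resume markers in a task's stderr."""
    summary = {"last_started": None, "total": None,
               "last_checkpoint": None, "resumes": 0}
    for raw in stderr_text.splitlines():
        line = raw.strip()
        match = BINARY_POINT_RE.match(line)
        if match:
            summary["last_started"] = int(match.group(1))
            summary["total"] = int(match.group(2))
            continue
        match = CHECKPOINT_RE.match(line)
        if match:
            summary["last_checkpoint"] = int(match.group(2))
            continue
        if RESUME_RE.match(line):
            summary["resumes"] += 1
    return summary


# Illustrative sample assembled from the line formats quoted in this thread;
# it is not a real task's output.
sample = """\
% Binary point 6/1255
% Filling array of photon pairs
% C 0 6
% Binary point 7/1255
% checkpoint read: skypoint 0 binarypoint 6
% Binary point 7/1255
"""
print(summarize(sample))
```

A large "resumes" count relative to the number of binary points completed would point to the task being stopped and restarted often.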

You will see that behaviour in action if you continue scrolling through the output until you get to binary point 52/1255.  Before 52 was finished, crunching was stopped and then restarted for some reason.  You will see the normal startup messages, including a successful "communication with graphics process" and you will also see "% checkpoint read: skypoint 0 binarypoint 48" where the state of crunching was loaded from the last saved checkpoint.

One thing that puzzles me is the number of times the crunching of this task was resumed from checkpoints.  Do you have BOINC settings that suspend crunching if the user is active, or something like that?  Eventually, after binary point 167/1255 (where a checkpoint was saved) crunching was stopped again and, on attempting to resume, the task failed at the problematic "communication with graphics process" stage.

Maybe one of the Devs can give you more info about what is going on at this point.  These tasks do require almost all of the 1GB memory your GPU has.  If there are other processes (your normal work on that machine) using GPU memory maybe it's a problem to do with how these processes interact with crunching.  Maybe you could test this by seeing if a task will run without error on its own with no other competing processes.

I hope the above is of some use in helping you to sort out what is going on.

 

Cheers,
Gary.

Bill
Joined: 2 Jun 17
Posts: 38
Credit: 329025120
RAC: 176566

Gary,

Thanks for the help!  I'm not sure I get everything you've explained, but I'll read through it a few more times and see if I have any questions.

In the meantime, I think I can fill you in on a few of your questions.  I am working on the computer while running Einstein/BOINC in the background.  Occasionally, I may snooze the GPU and/or CPU if I am running a program that needs more resources, or if I can't deal with the lag.

Yes, I noticed that one WU happened to crunch numbers for a period of time, and when I checked on it later in the morning it had a computation error, as did the rest of the GPU WUs that I had in the queue.

The Intel GPU is used for the laptop display.  The Nvidia GPU is used for an external monitor.  So, when I am processing with the GPUs, both displays are typically on.  I think the WUs we are talking about were processed while both displays were on, so I doubt the computation error occurred when I activated the display run by the Nvidia GPU (for example).

This brings up another problem that I have with the GPUs.  If I have my laptop in the docking station (and connected to the external monitor), the GPUs are not identified when I start up BOINC.  However, if I have BOINC closed, open up the Nvidia control panel, and disable the Nvidia GPU from displaying to the external monitor, then BOINC will recognize the GPUs when I start it up.  After BOINC starts, then I turn on the external display, and life moves on.  I am also running Seti@home, and it crunches a lot of WUs on both GPUs with no problems.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117779745296
RAC: 34783579

Hi Bill,

I don't own a laptop so have no experience with the intricacies of using one for GPU crunching.  However I am familiar with the lag you get when crunching on older Nvidia GPUs.  I have some GTX650s and a couple of 750Tis.  The 750Tis are not too bad but the 650s make a machine pretty much unusable for anything else when crunching.  Apart from the lag, their crunching performance is quite poor.

Previous GPU searches had both OpenCL and CUDA versions of the search app and older Nvidia GPUs performed very well using CUDA.  There is work going on (at lower priority) to develop a CUDA version of the current OpenCL app.  I believe the code has been ported but there are performance issues that need to be addressed before any release of an app can happen.  There are more important issues taking priority so there's no indication of how long it might take.

Hopefully, one of the Devs might be able to give some idea about why the tasks are failing with your setup.  Perhaps other volunteers might have experienced the issues you mention with BOINC recognising the Nvidia GPU.  I'm sorry that I don't have any suggestions to give you.

 

Cheers,
Gary.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 188569457
RAC: 170522

Hi,

I searched a bit for the error code of your tasks (0xc0000374), which is a Microsoft error code for heap corruption. Looking through other forum reports, it seems that some application on the computer produces this heap corruption and the Einstein@Home app suffers from it. In one case it was Logitech Gaming Software that produced the heap corruption but didn't crash itself.
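The two numbers in the task's message block, -1073740940 and 0xc0000374, are the same 32-bit value shown signed and unsigned. A small sketch of the conversion follows; the helper name `to_ntstatus` and the two-entry lookup table are illustrative only.

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned NTSTATUS value."""
    return exit_code & 0xFFFFFFFF


# Small illustrative table of Windows NTSTATUS names, not a complete list.
KNOWN_STATUS = {
    0xC0000374: "STATUS_HEAP_CORRUPTION",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}

code = to_ntstatus(-1073740940)
print(hex(code), KNOWN_STATUS.get(code, "unknown"))
```

BOINC labels the exit as "(unknown error)" because the value is not one of its own error codes; looking it up as a Windows status value is what identifies it as heap corruption.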

I would suggest thinking about which program you updated or installed before you first noticed the error. Or boot in Clean Mode and see if the app starts. It could also help to remove the Nvidia driver (using a driver cleaner tool) and install a fresh copy, because it seems the app can't enumerate the OpenCL devices, which is done by querying the driver (but that doesn't mean the driver is the source of the heap corruption).

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Late to this party.  I can say for a fact that the OpenCL app doesn't like to be paused while crunching a work unit.  It's not so noticeable on the 900 series, but definitely on the 10x0 series.  The resulting driver crash causes computation errors, and all work units after the crash will error.  The only fix I have found is to suspend all non-crunching work units and allow all currently crunching ones to finish before pausing or exiting BOINC.  Failure to do so will result in a crash.  I don't know why the 900 series is immune to this, it just is.  I run pure crunchers, nothing else on these machines.  I'm pretty sure it's the OpenCL app: I can run SETI OpenCL without problems, so I think it's the level of refinement that is causing it.  Don't get me wrong, I love the work you have put into it, but there are some caveats that need to be stated so people are aware.  I've talked with my group about it and made them aware never to cold turkey their machines.

 

edit ...

 

I run multiple work units at the same time, as do my teammates; it's the most efficient use of our GPUs.

DiablosOffens
Joined: 14 Jul 05
Posts: 2
Credit: 1368780
RAC: 0

I've got a similar problem. Some project apps always fail to find the output file at 100% completion. Here are the last few lines from task 671324524:

% Following up candidate number: 10
% Refining in S
% Following-up in P
% Writing follow-up output file.
FPU status flags:  PRECISION
00:11:13 (18052): [normal]: done. calling boinc_finish(0).
00:11:13 (18052): called boinc_finish

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>LATeah0037L_1132.0_0_0.0_5794335_1_0</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>LATeah0037L_1132.0_0_0.0_5794335_1_1</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>

I debugged the problem to the point where boinc_finish was called. It tries to move the file back to the project directory, but my slot dir is a symbolic link to another drive. So, the move operation fails because they forgot to set a special flag to move files from one drive to another drive. I created a pull request for this in the past and they merged it into the master branch:

https://github.com/BOINC/boinc/pull/1449

But the problem with these BOINC project apps is that they statically link a relatively outdated BOINC-API library where this fix wasn't present.

So now, I have a request to the project owners: Please recompile your project apps with the newest BOINC-API library!
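To illustrate the failure mode described above: a plain rename cannot cross drives, so a move has to fall back to copy and delete. The sketch below is a Python analogue and not the BOINC C code. The function name `move_output` is hypothetical, and the post does not name the flag involved; on Windows the usual mechanism for this is the MOVEFILE_COPY_ALLOWED flag of MoveFileEx, but that it is the flag used in the pull request is an assumption.

```python
import errno
import os
import shutil
import tempfile

ERROR_NOT_SAME_DEVICE = 17  # Win32 error code for a cross-volume rename


def move_output(src, dst):
    """Move src to dst, falling back to copy + delete across drives."""
    try:
        # Fast path: a rename, which only works within one volume.
        os.replace(src, dst)
    except OSError as exc:
        cross_device = (exc.errno == errno.EXDEV or
                        getattr(exc, "winerror", None) == ERROR_NOT_SAME_DEVICE)
        if not cross_device:
            raise
        # Slow path: copy the data to the other volume, then remove the source.
        shutil.copy2(src, dst)
        os.remove(src)


# Demo within one temporary directory (same volume, so the fast path is taken).
workdir = tempfile.mkdtemp()
slot_file = os.path.join(workdir, "result.out")
project_file = os.path.join(workdir, "result_uploaded")
with open(slot_file, "w") as handle:
    handle.write("toplist\n")
move_output(slot_file, project_file)
print(os.path.exists(project_file))
```

Without the fallback branch, a slot directory on another drive would leave the output file where it was written, and the client would then report it as absent or "not found" at upload time.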
