I hope this isn't a duplicate topic... I searched and found no match.
Lately I get no CUDA WUs, only OpenCL, and they all crash. I've run GPU tasks before (months ago): BRPS152 CUDA 32.
I recently updated my NVidia drivers, so I should be able to run OpenCL tasks.
2016-12-26 11:45:17 AM | | Starting BOINC client version 7.6.22 for windows_x86_64
2016-12-26 11:45:17 AM | | log flags: sched_ops, task
2016-12-26 11:45:17 AM | | Libraries: libcurl/7.45.0 OpenSSL/1.0.2d zlib/1.2.8
2016-12-26 11:45:17 AM | | Data directory: C:\ProgramData\BOINC\B7 0
2016-12-26 11:45:18 AM | | CUDA: NVIDIA GPU 0: GeForce GT 640 (driver version 376.33, CUDA version 8.0, compute capability 3.5, 1024MB, 115MB available, 803 GFLOPS peak)
2016-12-26 11:45:18 AM | | OpenCL: NVIDIA GPU 0: GeForce GT 640 (driver version 376.33, device version OpenCL 1.2 CUDA, 1024MB, 115MB available, 803 GFLOPS peak)
2016-12-26 11:45:18 AM | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz [Family 6 Model 42 Stepping 7]
2016-12-26 11:45:18 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 sse4_2 popcnt aes syscall nx lm avx vmx smx tm2 pbe
2016-12-26 11:45:18 AM | | OS: Microsoft Windows 10: Core x64 Edition, (10.00.14393.00)
2016-12-26 11:45:18 AM | | Memory: 7.98 GB physical, 24.12 GB virtual
2016-12-26 11:45:18 AM | | Disk: 918.83 GB total, 806.43 GB free
2016-12-26 11:45:18 AM | | Local time is UTC -5 hours
2016-12-26 11:45:18 AM | | VirtualBox version: 5.0.12
However, the Einstein OpenCL WUs always finish 1) abruptly (with no error status) and 2) with 'Output file... absent':
2016-12-26 11:35:03 AM | Einstein@Home | Resetting project
2016-12-26 11:35:30 AM | Einstein@Home | work fetch resumed by user
2016-12-26 11:35:32 AM | Einstein@Home | update requested by user
2016-12-26 11:35:34 AM | Einstein@Home | Master file download succeeded
2016-12-26 11:35:39 AM | Einstein@Home | Sending scheduler request: Requested by user.
2016-12-26 11:35:39 AM | Einstein@Home | Requesting new tasks for NVIDIA GPU
2016-12-26 11:35:41 AM | Einstein@Home | Scheduler request completed: got 1 new tasks
2016-12-26 11:36:11 AM | Einstein@Home | Starting task LATeah0010L_820.0_0_0.0_2685700_0
2016-12-26 11:36:32 AM | Einstein@Home | Computation for task LATeah0010L_820.0_0_0.0_2685700_0 finished
2016-12-26 11:36:32 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_2685700_0_0 for task LATeah0010L_820.0_0_0.0_2685700_0 absent
2016-12-26 11:36:32 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_2685700_0_1 for task LATeah0010L_820.0_0_0.0_2685700_0 absent
2016-12-26 11:36:49 AM | Einstein@Home | update requested by user
2016-12-26 11:36:52 AM | Einstein@Home | Sending scheduler request: Requested by user.
2016-12-26 11:36:52 AM | Einstein@Home | Reporting 1 completed tasks
2016-12-26 11:36:52 AM | Einstein@Home | Requesting new tasks for NVIDIA GPU
2016-12-26 11:36:54 AM | Einstein@Home | Scheduler request completed: got 1 new tasks
2016-12-26 11:36:58 AM | Einstein@Home | Starting task LATeah0010L_820.0_0_0.0_2615420_0
2016-12-26 11:37:19 AM | Einstein@Home | Computation for task LATeah0010L_820.0_0_0.0_2615420_0 finished
2016-12-26 11:37:19 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_2615420_0_0 for task LATeah0010L_820.0_0_0.0_2615420_0 absent
2016-12-26 11:37:19 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_2615420_0_1 for task LATeah0010L_820.0_0_0.0_2615420_0 absent
2016-12-26 11:37:19 AM | Einstein@Home | work fetch suspended by user
2016-12-26 11:37:29 AM | Einstein@Home | update requested by user
2016-12-26 11:37:30 AM | Einstein@Home | Sending scheduler request: Requested by user.
2016-12-26 11:37:30 AM | Einstein@Home | Reporting 1 completed tasks
2016-12-26 11:37:30 AM | Einstein@Home | Not requesting tasks: "no new tasks" requested via Manager
2016-12-26 11:37:31 AM | Einstein@Home | Scheduler request completed
2016-12-26 11:37:41 AM | Einstein@Home | Resetting project
2016-12-26 11:38:08 AM | Einstein@Home | Resetting project
2016-12-26 11:38:46 AM | Einstein@Home | work fetch resumed by user
2016-12-26 11:38:47 AM | Einstein@Home | update requested by user
2016-12-26 11:38:48 AM | Einstein@Home | Master file download succeeded
2016-12-26 11:38:53 AM | Einstein@Home | Sending scheduler request: Requested by user.
2016-12-26 11:38:53 AM | Einstein@Home | Requesting new tasks for NVIDIA GPU
2016-12-26 11:38:55 AM | Einstein@Home | Scheduler request completed: got 1 new tasks
2016-12-26 11:38:59 AM | Einstein@Home | work fetch suspended by user
2016-12-26 11:39:20 AM | Einstein@Home | Starting task LATeah0010L_820.0_0_0.0_6755665_0
2016-12-26 11:39:41 AM | Einstein@Home | Computation for task LATeah0010L_820.0_0_0.0_6755665_0 finished
2016-12-26 11:39:41 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_6755665_0_0 for task LATeah0010L_820.0_0_0.0_6755665_0 absent
2016-12-26 11:39:41 AM | Einstein@Home | Output file LATeah0010L_820.0_0_0.0_6755665_0_1 for task LATeah0010L_820.0_0_0.0_6755665_0 absent
What can I do? Please help!
LLP, PhD PE
I think therefore I THINK I am
I think but this is not the origin of my existence, it is not the source of my being
I think but my thinking only proves my existence in my own thoughts not to anyone else
God is Love (Jesus proves it) therefore we are
While waiting for better advice, you could try updating BOINC to a newer version, 7.6.33:
https://boinc.berkeley.edu/download_all.php
You will probably find more enlightenment for this type of problem in reviewing the stderr returned by the task than by observing the message log.
For example this can be found for one of your tasks here.
I'm copying here, from that task's stderr, the lines that seem most relevant to your failure:
<message> The remote adapter is not compatible. (0x3c) - exit code 60 (0x3c) </message>
and later
Error during OpenCL bloc_info host->device transfer - qsort (error: -4)
Error in OpenCL context: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on GeForce GT 640 (Device 0).
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:867: Clear fft_vec failed. status=-4
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 0
11:50:36 (19684): [CRITICAL]: ERROR: MAIN() returned with error '-4'
FPU status flags: PRECISION
11:50:48 (19684): [normal]: done. calling boinc_finish(60).
Sadly I'm not the one to diagnose what is wrong, but perhaps someone else will come along who can. One possibility is that a GT 640 just cannot run these tasks.
It appears that the GT 640 here has only 1GB of memory. Currently this is insufficient, but changes will come soon which may help.
See https://einsteinathome.org/content/no-new-work-recxeived-2-weeks#comment-153370
I'm assuming I am having similar results but with slightly different hardware. Here is my stderr for one WU:
08:37:17 (9196): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
08:37:17 (9196): [debug]: 1.1e+016 fp, 3.7e+009 fp/s, 2804377 s, 778h59m37s39
08:37:17 (9196): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0035L.dat --alpha 4.42281478648 --delta -0.0345027837249 --skyRadius 2.152570e-06 --ldiBins 15 --f0start 868.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 3.344368011e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah0035L_0876_1847360.dat --debug 1 --device 0 -o LATeah0035L_876.0_0_0.0_1847360_0_0.out
output files: 'LATeah0035L_876.0_0_0.0_1847360_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0035L_876.0_0_0.0_1847360_0_0' 'LATeah0035L_876.0_0_0.0_1847360_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0035L_876.0_0_0.0_1847360_0_1'
08:37:17 (9196): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
08:37:17 (9196): [debug]: Set up communication with graphics process.
Hi Bill,
Welcome to the Einstein project!
One of the Devs may be able to give you a better answer, but I'll try to suggest what might be going on. Here is the last part of the message you provided, just as things go pear-shaped.
I looked at a few task IDs from your list and at a quick glance they all seem to fail at this point. For most, when you look at the tasks list, there is a very small run time (elapsed time) and essentially zero CPU time. There is one, near the top of the list, that has significant amounts of time recorded before failure. In other words, crunching is possible with your setup, and it's probably not the driver at fault.
If you click the task ID for that task that made some progress, you can get a bit of an idea of some of the stages that crunching goes through. At the very start of that task's output, you can see a <message> ... </message> block containing the error message. If a task finishes without error, that block won't be there. All the other lines are what you normally see, including the following excerpt.
This gives the stuff you are supposed to see immediately following "communication with graphics process" where your tasks are failing. The message about "no such file or directory" is quite normal because at the very start of crunching of any task, there is no saved checkpoint (see later) to restart from.
If you continue looking through that output you will see increasing numbers of "binary points" being completed. There are 1255 to do in total, so you can see how many out of 1255 have been done as you scroll through. At regular intervals, you will see ones with an extra line like the following.
The extra line is the one "% C 0 6" which means that at the completion of binary point 6 out of 1255, a 'checkpoint' was written to disk. A checkpoint is created when the state of a task is saved at regular intervals so that if BOINC is stopped and restarted at any point, the crunching of a task can resume from the last saved checkpoint rather than having to restart from the beginning.
You will see that behaviour in action if you continue scrolling through the output until you get to binary point 52/1255. Before 52 was finished, crunching was stopped and then restarted for some reason. You will see the normal startup messages, including a successful "communication with graphics process" and you will also see "% checkpoint read: skypoint 0 binarypoint 48" where the state of crunching was loaded from the last saved checkpoint.
One thing that puzzles me is the number of times the crunching of this task was resumed from checkpoints. Do you have BOINC settings that suspend crunching if the user is active, or something like that? Eventually, after binary point 167/1255 (where a checkpoint was saved) crunching was stopped again and, on attempting to resume, the task failed at the problematic "communication with graphics process" stage.
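The checkpoint/resume pattern described above can be sketched in a few lines. This is a minimal, self-contained Python illustration, not the actual Einstein@Home code (which is C and uses the BOINC API); the file name, the save interval of 6, and the function names are all illustrative:

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"   # illustrative file name
TOTAL_POINTS = 1255              # binary points per task, per the stderr above
CHECKPOINT_EVERY = 6             # illustrative save interval

def load_checkpoint():
    """Resume from the last saved state, or start from binary point 0."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["binarypoint"]
    return 0  # no checkpoint yet: normal at the very start of a task

def save_checkpoint(point):
    """Write the state atomically so a crash mid-write can't corrupt it."""
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump({"binarypoint": point}, f)
    os.replace(tmp, CHECKPOINT)  # atomic rename over the old checkpoint

def crunch(stop_after=None):
    """Process binary points, checkpointing periodically.

    stop_after simulates BOINC suspending the task mid-run; a later
    call picks up from the last checkpoint, not from zero.
    """
    start = load_checkpoint()
    for point in range(start, TOTAL_POINTS):
        # ... one binary point of real work would happen here ...
        if (point + 1) % CHECKPOINT_EVERY == 0:
            save_checkpoint(point + 1)   # like the "% C 0 6" line in the stderr
        if stop_after is not None and point + 1 >= stop_after:
            return point + 1             # task suspended before finishing
    return TOTAL_POINTS
```

Stopping a run before point 52 and restarting it resumes from the last multiple of 6, which mirrors the "% checkpoint read: skypoint 0 binarypoint 48" line Gary points out.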
Maybe one of the Devs can give you more info about what is going on at this point. These tasks do require almost all of the 1GB memory your GPU has. If there are other processes (your normal work on that machine) using GPU memory maybe it's a problem to do with how these processes interact with crunching. Maybe you could test this by seeing if a task will run without error on its own with no other competing processes.
I hope the above is of some use in helping you to sort out what is going on.
Cheers,
Gary.
Gary,
Thanks for the help! I'm not sure I get everything you've explained, but I'll read through it a few more times and see if I have any questions.
In the meantime, I think I can fill you in on a few of your questions. I am working on the computer while running Einstein/BOINC in the background. Occasionally, I may snooze the GPU and/or CPU if I am running a program that needs more resources, or if I can't deal with the lag.
Yes, I noticed that one WU happened to crunch numbers for a period of time, but when I checked on it later in the morning it had a computation error, as did the rest of the GPU WUs in my queue.
The Intel GPU is used for the laptop display. The Nvidia GPU is used for an external monitor. So, when I am processing with the GPUs, both displays are typically on. I think the WUs we are talking about were processed while both displays were on, so I doubt the computation errors occurred when I activated the display run by the Nvidia GPU (for example).
This brings up another problem that I have with the GPUs. If I have my laptop in the docking station (and connected to the external monitor), the GPUs are not identified when I start up BOINC. However, if I have BOINC closed, open up the Nvidia control panel, and disable the Nvidia GPU from displaying to the external monitor, then BOINC will recognize the GPUs when I start it up. After BOINC starts, then I turn on the external display, and life moves on. I am also running Seti@home, and it crunches a lot of WUs on both GPUs with no problems.
Hi Bill,
I don't own a laptop so have no experience with the intricacies of using one for GPU crunching. However I am familiar with the lag you get when crunching on older Nvidia GPUs. I have some GTX650s and a couple of 750Tis. The 750Tis are not too bad but the 650s make a machine pretty much unusable for anything else when crunching. Apart from the lag, their crunching performance is quite poor.
Previous GPU searches had both OpenCL and CUDA versions of the search app and older Nvidia GPUs performed very well using CUDA. There is work going on (at lower priority) to develop a CUDA version of the current OpenCL app. I believe the code has been ported but there are performance issues that need to be addressed before any release of an app can happen. There are more important issues taking priority so there's no indication of how long it might take.
Hopefully, one of the Devs might be able to give some idea about why the tasks are failing with your setup. Perhaps other volunteers might have experienced the issues you mention with BOINC recognising the Nvidia GPU. I'm sorry that I don't have any suggestions to give you.
Cheers,
Gary.
Hi,
I searched a bit for the error code of your tasks (0xc0000374), which is a Microsoft error code for heap corruption. Looking through other forum reports, it seems that some application on the computer produces this heap corruption and the Einstein@Home app suffers from it. In one case it was Logitech Gaming Software that produced the heap corruption but didn't crash itself.
I would suggest thinking about what program you updated or installed just before you started seeing the error. Or do a clean boot and see if the app starts. It could also help to remove the Nvidia driver (using a driver cleaner tool) and install a fresh copy, because it seems the app can't enumerate the OpenCL devices, which is done by querying the driver (though that doesn't mean the driver is the source of the heap corruption).
Late to this party. I can say for a fact that the OpenCL app doesn't like to be paused while crunching a work unit. It's not so noticeable on the 900 series, but it definitely is on the 10x0 series. The resulting driver crash causes computation errors, and every work unit after the crash will error out. The only fix I have found is to suspend all non-crunching work units and let the ones currently crunching finish before pausing or exiting BOINC. Failure to do so will result in a crash. I don't know why the 900 series is immune from this, it just is. I run pure crunchers, nothing else on these machines. I'm pretty sure it's the OpenCL app; I can run SETI OpenCL without problems, so I think it's the level of refinement that is causing it. Don't get me wrong, I love the work you have put into it, but there are some caveats that need to be spelled out so people are aware. I've talked with my group about it and made them aware never to cold-turkey their machines.
Edit: I run multiple work units at the same time, as do my teammates. It's the most efficient use of our GPUs.
I've got a similar problem. Some project apps always fail to find the output file at 100% completion. Here are the last few lines from task 671324524:
I debugged the problem to the point where boinc_finish was called. It tries to move the output file back to the project directory, but my slot dir is a symbolic link to another drive, so the move operation fails because they forgot to set the special flag needed to move files from one drive to another. I created a pull request for this a while back and it was merged into the master branch:
https://github.com/BOINC/boinc/pull/1449
But the problem with these BOINC project apps is that they statically link a relatively outdated BOINC API library in which this fix isn't present.
So now, I have a request to the project owners: Please recompile your project apps with the newest BOINC-API library!
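The failure class described above is easy to reproduce outside BOINC: a plain rename fails when source and destination are on different filesystems, and the fix is a copy-then-delete fallback (on Windows, the MOVEFILE_COPY_ALLOWED flag to MoveFileEx enables the equivalent behavior). Here is a minimal Python sketch of that fallback; the function name is illustrative and this is not the actual BOINC API code:

```python
import errno
import os
import shutil

def move_output_file(src, dst):
    """Move src to dst, surviving a cross-device boundary.

    A bare os.rename() raises EXDEV when src and dst live on different
    filesystems (e.g. a slot dir symlinked to another drive), which
    mirrors the Windows move failure described above.
    """
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        # Fall back to copy + delete across the filesystem boundary,
        # roughly what MOVEFILE_COPY_ALLOWED permits on Windows.
        shutil.copy2(src, dst)
        os.remove(src)
```

In the Python standard library, shutil.move already implements this fallback; the sketch just makes the EXDEV branch explicit.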