Single GTX 670 plus a pair of GTX 650 Ti; the host has 1 valid result out of 46. Could the wrong checkpoint be handed off after a task switch back?
Looking at the stderr output for each errored task shows that only the GTX 670 is receiving (and failing) tasks:
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GTX 670" by: NVIDIA Corporation
Max allocation limit: 536870912
Global mem size: 2147483648
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0138L.dat
% Total amount of photon times: 30007
% Preparing toplist of length: 10
% Read 1255 binary points
read_checkpoint(): Couldn't open file 'LATeah0138L_1204.0_0_0.0_23414535_2_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1255
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 31 df1dot: 3.344368011e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:1176: clFinish failed. status=-36
error in opencl_qsort
00:00:17 (7136): [CRITICAL]: ERROR: MAIN() returned with error '1'
FPU status flags: PRECISION
The error is the "clFinish failed. status=-36" line near the end; it occurs early, at the point where the application allocates GPU memory. OpenCL error -36 is CL_INVALID_COMMAND_QUEUE, which often shows up after a Windows update breaks OpenCL. @richie's point 4 below should fix that.
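For anyone who wants to decode these status numbers themselves, here is a rough sketch. The table holds only a handful of codes from the Khronos CL/cl.h header, and the helper name explain_status is made up for this post; it is not part of the Einstein@Home app.

```python
import re

# A few OpenCL status codes as defined in the Khronos CL/cl.h header.
# This is deliberately incomplete; extend it from the header as needed.
CL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

def explain_status(line):
    """Hypothetical helper: pull 'status=N' out of a stderr line and name it."""
    match = re.search(r"status=(-?\d+)", line)
    if match is None:
        return None  # the line carries no status field
    code = int(match.group(1))
    return CL_ERRORS.get(code, f"unknown OpenCL status {code}")

# The failing line from the stderr output quoted in this thread.
line = "ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:1176: clFinish failed. status=-36"
print(explain_status(line))
```

The name only says which call failed and how; it does not by itself tell you whether the driver, the OS, or the memory limit is to blame.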
I don't know exactly how much memory the 1.20 app needs, but it has been suggested that about 1 GB is required, so you may also be getting close to that limit.
I don't know why one task worked before; perhaps the amount of GPU memory that can be allocated has changed, or a driver or OS update has reduced it.
For comparison, on my RX 480, which has 8 GB of GPU memory, the same app's stderr output shows:
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "Ellesmere" by: Advanced Micro Devices, Inc.
Max allocation limit: 4244635648
Global mem size: 5970087936
OpenCL device has FP64 support
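To put the two excerpts side by side, here is a small sketch that reads the "Max allocation limit" and "Global mem size" lines and works out the ratio. The function name mem_summary and the two sample strings are mine (the numbers are copied from the outputs above), and I am assuming both values are in bytes.

```python
import re

MIB = 1024 * 1024

def mem_summary(stderr_text):
    """Hypothetical helper: return (max_alloc_mib, global_mib, fraction)."""
    max_alloc = int(re.search(r"Max allocation limit:\s*(\d+)", stderr_text).group(1))
    global_mem = int(re.search(r"Global mem size:\s*(\d+)", stderr_text).group(1))
    return max_alloc / MIB, global_mem / MIB, max_alloc / global_mem

# Values copied from the two stderr excerpts in this thread (assumed bytes).
gtx670 = "Max allocation limit: 536870912\nGlobal mem size: 2147483648\n"
rx480 = "Max allocation limit: 4244635648\nGlobal mem size: 5970087936\n"

for name, text in (("GTX 670", gtx670), ("RX 480", rx480)):
    alloc, total, frac = mem_summary(text)
    print(f"{name}: {alloc:.0f} MiB of {total:.0f} MiB ({frac:.0%})")
```

For these two excerpts that works out to 512 MiB of 2048 MiB (25%) on the GTX 670 and roughly 4048 MiB of 5694 MiB (71%) on the RX 480.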
I'd say it is very unlikely. I have one system with three 750 Ti GPUs, two on x16 and one on x8.
The error rate is almost 0,
Did you load the Nvidia drivers three separate times, or at least twice? With different GPUs that is often necessary to keep the PC from getting confused. One unit I looked at showed an OpenCL error near the bottom of the page.
It looks like a power problem. Maybe your PSU has insufficient power or, as these GTX 650 Ti cards probably do not have additional power connectors, the motherboard is unable to deliver the required power to the PCIe slots.
Edit: I made a mistake; the 650 Ti should have power connectors (the 650 without Ti doesn't).
What's the exact brand and model of each GPU... and PSU... and motherboard?
What procedure did you use to install those cards? Did you put them on the board all at once and then install the driver, or did you boot up the computer after adding each card?
I would try this:
1. Remove those 650 Ti's so that only the GTX 670 is installed on the board.
2. Use Display Driver Uninstaller (Wagnard DDU) to remove all Nvidia drivers, in Safe Mode.
3. Use a registry cleaner to remove any remnants of old configurations. CCleaner Free, for example, is free software for that purpose.
4. Install the new Nvidia driver, http://www.nvidia.com/download/driverResults.aspx/123219/en-us and reboot.
5. After the reboot, shut down the computer and install one 650 Ti. Power up and let the computer boot. Reboot once, then shut down the computer.
6. Install the second 650 Ti. Power up and let the computer boot. Reboot once.
7. Download GPU-Z, install it and see if it displays information for all three GPUs properly.
8. Update the BOINC client to version 7.8.2.
The memory available to OpenCL applications is not the same on Nvidia and ATI GPUs.
ATI GPUs get to use somewhere between 50% and 70% of the RAM on the card.
Nvidia, I believe, is only allocated 25%.
Some memory is also used by the system, so always subtract a little more for the OS.
Same problem here on two different GPUs.
Yesterday every WU stopped with an error. Then I changed the config so that only one GPU works on Einstein and the other one works on another project. That runs fine.
Later yesterday I tested it again and everything ran fine with both GPUs on Einstein.
Today I started the same PC, which I had shut down yesterday while everything was working perfectly, and now I only get errors again... every WU stops with a calculation error, until the only one left is the one that was started yesterday on the primary GPU, and that one runs perfectly...
So I had to fetch new WUs... which failed again every time on the other GPU. I then restarted BOINC on that PC and... now it works again on both GPUs! It's ridiculous if even the smallest system fart can cause errors for the application!
The GPUs are a 780 Ti and a 660 Ti...
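For reference, one way to keep a single GPU away from Einstein is an exclude_gpu entry in BOINC's cc_config.xml. The sketch below just generates such a file. The project URL, the device number 1, and the function name build_cc_config are assumptions for illustration; the URL has to match the one your client is attached with, and the device numbers are the ones BOINC prints in its startup messages.

```python
import xml.etree.ElementTree as ET

def build_cc_config(project_url, device_num, gpu_type="NVIDIA"):
    """Hypothetical helper: build a cc_config.xml with one <exclude_gpu> entry."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    excl = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(excl, "url").text = project_url
    ET.SubElement(excl, "device_num").text = str(device_num)
    ET.SubElement(excl, "type").text = gpu_type
    if hasattr(ET, "indent"):  # pretty-print on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")

# Assumed values: adjust the URL and device number to your own host.
xml_text = build_cc_config("http://einstein.phys.uwm.edu/", 1)
print(xml_text)
```

The resulting text would go into cc_config.xml in the BOINC data directory, and the client should pick it up on restart or when told to re-read its config files. This only keeps the project off that card; it does not explain why the card fails.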
I would guess your system is getting confused between the two different GPUs, and the order in which they start crunching could make all the difference between valid and invalid workunits. Personally I would put one here and one someplace else to work around the problem, but if that isn't in your plans you might have to find a way to suspend one GPU on startup and then enable it after the other GPU is already crunching. I have no clue how to do that though.
No, that can't be it, because the primary GPU keeps working on its WU and the secondary card crashes every WU at around 0.5%. It crashes WU after WU while the primary keeps working fine at the same time...
You can restart BOINC multiple times... no change.
You can reboot the whole PC and, if you're lucky, on the first try it runs normally on the secondary as well, and that doesn't change hours later. If you're unlucky you have to try multiple reboots.
The system is a BOINC-only machine without other applications.