Validation Error

Juergen Kozok
Joined: 21 Feb 05
Posts: 7
Credit: 3032676
RAC: 1145
Topic 205679

I have been seeing the above error for a few weeks now. Once in a while a calculation gets through without it, but the others report it after a lot of CPU time, just when they seem to be done. I have stopped execution, as this is a waste of time. Any idea what's happening and how to correct it?

mac
Joined: 12 Aug 06
Posts: 6
Credit: 1431583
RAC: 0
Juergen Kozok
Joined: 21 Feb 05
Posts: 7
Credit: 3032676
RAC: 1145

According to this program, the card looks like a regular low-end GT 610. Is there any reason it is not working here? It certainly works with other projects, e.g. SETI.

Graphics Processor

GPU Name: GF119
GPU Variant: GF119-300-A1
Architecture: Fermi
Process Size: 40 nm
Transistors: 292 million
Die Size: 79 mm²

Graphics Card

Released: Apr 2nd, 2012 (reference), May 14th, 2012 (this card)
Production Status: Active
Bus Interface: PCIe 2.0 x16
MSI Part #: N610GT-MD2GD3/LP

Clock Speeds

GPU Clock: 810 MHz
Shader Clock: 1620 MHz
Memory Clock: 898 MHz, 1796 MHz effective (reference); 500 MHz, 1000 MHz effective (this card, -44%)

Memory

Memory Size: 1024 MB (reference), 2048 MB (this card)
Memory Type: DDR3
Memory Bus: 64 bit
Bandwidth: 14.37 GB/s (reference), 8.00 GB/s (this card)

Render Config

Shading Units: 48
TMUs: 8
ROPs: 4
SM Count: 1
Pixel Rate: 1.620 GPixel/s
Texture Rate: 6.48 GTexel/s
Floating-point performance: 155.52 GFLOPS

Board Design

Slot Width: Single-slot
Length: 5.7 inches / 145 mm (reference), 5.67 inches / 144 mm (this card)
TDP: 29 W
Outputs: 1x DVI, 1x HDMI, 1x VGA
Power Connectors: None
Board Number: P1310

Graphics Features

DirectX: 11.0
OpenGL: 4.5
OpenCL: 1.1
CUDA: 2.1
Shader Model: 5.0
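
The derived figures in the spec sheet above can be reproduced from the raw ones. This is a minimal sketch that assumes the usual conventions (two floating-point operations per shading unit per shader clock, one texel per TMU per GPU clock, and bandwidth as bus width times effective memory clock); the variable names are made up for illustration.

```python
# Raw figures copied from the spec sheet above.
shading_units = 48
tmus = 8
gpu_clock_mhz = 810
shader_clock_mhz = 1620
bus_width_bits = 64

# Assumption: one fused multiply-add (2 FLOPs) per shading unit per shader clock.
gflops = shading_units * shader_clock_mhz * 2 / 1000

# Assumption: one texel per TMU per GPU clock.
gtexel_per_s = tmus * gpu_clock_mhz / 1000


def bandwidth_gb_per_s(effective_mhz):
    """Bytes per transfer times transfers per second, in GB/s."""
    return bus_width_bits / 8 * effective_mhz / 1000


print(f"{gflops:.2f} GFLOPS")                      # 155.52 GFLOPS
print(f"{gtexel_per_s:.2f} GTexel/s")              # 6.48 GTexel/s
print(f"{bandwidth_gb_per_s(1796):.2f} GB/s")      # 14.37 GB/s
print(f"{bandwidth_gb_per_s(1000):.2f} GB/s")      # 8.00 GB/s
```

The two bandwidth values correspond to the two effective memory clocks listed (1796 MHz and 1000 MHz), so the slower DDR3 on this board costs roughly 44% of the memory bandwidth.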
Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Validate errors are almost always caused by hardware running at, or beyond, the edge of what it is capable of.

Start by checking the temperatures of the card and the CPU, and clean the heat sinks if necessary.
If the card is overclocked, reset it to stock clocks or underclock it.
If any other part of the computer is overclocked, do the same: reset it to stock or clock it below stock specs.
Check the condition of the power supply; good, stable power is important for stable operation.
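
For the temperature and clock checks above, one option on NVIDIA cards is the `nvidia-smi` tool that ships with the driver. The following is only a sketch: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_gpu_status` and `read_gpu_status` are made up here. Older or low-end cards may not report every field, so unsupported values are mapped to `None`.

```python
import subprocess

# Fields requested from nvidia-smi: temperature in C, SM and memory clocks in MHz.
QUERY = "temperature.gpu,clocks.sm,clocks.mem"


def parse_gpu_status(csv_line):
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output."""
    names = QUERY.split(",")
    fields = [field.strip() for field in csv_line.split(",")]
    status = {}
    for name, value in zip(names, fields):
        # Fields the card cannot report come back as text such as "[N/A]".
        status[name] = int(value) if value.isdigit() else None
    return status


def read_gpu_status():
    """Query the first GPU; needs an NVIDIA driver with nvidia-smi installed."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_status(result.stdout.splitlines()[0])


# Usage on a machine with an NVIDIA driver:
# print(read_gpu_status())
```

Calling `read_gpu_status()` every few seconds while a task runs would show whether the temperature climbs or the clocks differ from the stock values.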

mac
Joined: 12 Aug 06
Posts: 6
Credit: 1431583
RAC: 0

It could also be a driver installation problem.

Run DDU (Display Driver Uninstaller) in safe mode, clean out the NVIDIA driver, and then install the latest driver again.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410587843
RAC: 35037759

Juergen Kozok wrote:
I am facing now since a few weeks the above error.

I looked at your current tasks list and they don't show as validate errors or invalid results but rather as computation errors.  It struck me as unusual that all the ones still showing in the database failed at pretty much the same elapsed time.  So I decided to click a task ID link (I chose the oldest one showing) to see what was actually returned to the project.  Here is a snippet of the very last bit of what was returned (with a few tweaks for readability) with the point of error highlighted.

===== Start of log excerpt =====

% Binary point 1255/1255
% Starting semicoherent search over f0 and f1.
% nf1dots: 31 df1dot: 3.344368011e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
% C 1 0
% Time spent on semicoherent stage: 39252.7849s
% Writing semicoherent output file.
% Following up candidate number: 1
% Refining in S
% Following-up in P
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:1048: clFinish failed. status=-36
20:31:47 (10956): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
20:31:53 (10956): [normal]: done. calling boinc_finish(28).
20:31:53 (10956): called boinc_finish

</stderr_txt>

===== End of log excerpt =====

The GPU has processed the very last 'binary point' (1255 out of 1255) and so the 'follow-up' stage has started.  This is where the top 10 most likely candidates will be examined in detail.  You can see there is an immediate error reported as status=-36.  One of the Devs will have to comment on what that means and what (if anything) can be done about it.
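
As a side note on the number itself: the value after `status=` is an OpenCL status code, and the names are fixed in the Khronos `CL/cl.h` header. A small lookup like the sketch below (only a subset of the codes; `status_name` is a made-up helper) turns the number into its symbolic name. The name alone does not say why the command queue became invalid.

```python
# Subset of the status codes defined in the Khronos OpenCL header CL/cl.h.
OPENCL_STATUS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}


def status_name(code):
    """Return the symbolic name for an OpenCL status code, if known."""
    return OPENCL_STATUS.get(code, f"unknown OpenCL status {code}")


print(status_name(-36))  # CL_INVALID_COMMAND_QUEUE
```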

I decided to look at a second task ID (I chose the 2nd oldest) and this time I saw something a bit different.

===== Start of log excerpt =====

% Binary point 1255/1255
% Starting semicoherent search over f0 and f1.
% nf1dots: 31 df1dot: 3.344368011e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
% Time spent on semicoherent stage: 1372.6947s
% Writing semicoherent output file.
% Following up candidate number: 1
% Refining in S
% Following-up in P
% C 2 1256
% Following up candidate number: 2
% Refining in S
% Following-up in P
% C 3 1257
% Following up candidate number: 3
% Refining in S
% Following-up in P
09:40:40 (2192): [normal]: This Einstein@home App was built at: Feb 15 2017 09:23:49

09:40:40 (2192): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
09:40:40 (2192): [debug]: 1.1e+016 fp, 3.7e+009 fp/s, 2805161 s, 779h12m41s38
09:40:40 (2192): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe
--inputfile ../../projects/einstein.phys.uwm.edu/LATeah0012L.dat --alpha 4.42281478648 --delta -0.0345027837249
--skyRadius 2.152570e-06 --ldiBins 15 --f0start 1116.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 3.344368011e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah0012L_1124_22467010.dat --debug 1 --device 0 -o LATeah0012L_1124.0_0_0.0_22467010_1_0.out
output files: 'LATeah0012L_1124.0_0_0.0_22467010_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0012L_1124.0_0_0.0_22467010_1_0' 'LATeah0012L_1124.0_0_0.0_22467010_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0012L_1124.0_0_0.0_22467010_1_1'
09:40:40 (2192): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
09:40:40 (2192): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0000000000147900 , 00000000001473B0]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce GT 610" by: NVIDIA Corporation
Max allocation limit: 536870912
Global mem size: 2147483648
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0012L.dat
% Total amount of photon times: 30007
% Preparing toplist of length: 10
% Read 1255 binary points
% checkpoint read: skypoint 3 binarypoint 1257
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Time spent on semicoherent stage: 0.0000s
% Writing semicoherent output file.

% Following up candidate number: 3
% Refining in S
% Following-up in P
Error during OpenCL host->device transfer read coh_followup_list(error: -5)
09:40:54 (2192): [CRITICAL]: ERROR: MAIN() returned with error '1'
FPU status flags: PRECISION
Error in OpenCL context: CL_OUT_OF_RESOURCES error executing CL_COMMAND_READ_BUFFER on GeForce GT 610 (Device 0).

09:40:59 (2192): [normal]: done. calling boinc_finish(65).
09:40:59 (2192): called boinc_finish

</stderr_txt>

===== End of log excerpt =====

In this second example, the followup processing had actually started and the 3rd candidate was being examined when BOINC appears to have been restarted (for whatever reason).  I've highlighted the line that indicates the restart.  During the restart messages, there is a line (I've highlighted it also) that says the card is double precision capable.  I was surprised to see that so I had a look at what was listed in Wikipedia.  If you scroll down to the GT 610, you will see it listed as 'Unknown' for DP capability.
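
Spotting a restart in a long stderr dump by eye is tedious, so here is a rough sketch of how the same reading could be automated. The patterns are taken from lines that appear in the excerpt above; the function name `scan_stderr` and the shortened application name in the sample are made up for illustration.

```python
import re

# Patterns based on lines seen in the log excerpt; first match per line wins.
PATTERNS = {
    "restart": re.compile(r"Start of BOINC application"),
    "checkpoint": re.compile(r"% checkpoint read: skypoint (\d+) binarypoint (\d+)"),
    "error": re.compile(r"\berror\b", re.IGNORECASE),
    "exit": re.compile(r"calling boinc_finish\((\d+)\)"),
}


def scan_stderr(text):
    """Return (line number, kind, line) for every interesting log line."""
    events = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((lineno, kind, line.strip()))
                break
    return events


# Shortened sample; 'app.exe' stands in for the real application path.
sample = """% Following-up in P
09:40:40 (2192): [normal]: Start of BOINC application 'app.exe'.
% checkpoint read: skypoint 3 binarypoint 1257
Error during OpenCL host->device transfer read coh_followup_list(error: -5)
09:40:59 (2192): [normal]: done. calling boinc_finish(65).
"""

events = scan_stderr(sample)
for lineno, kind, line in events:
    print(lineno, kind, line)
```

A "restart" event followed by a "checkpoint" event and then an "error" event is the sequence described in the paragraph above.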

I suspect that perhaps it doesn't have DP and that may be why it crashes if the follow-up stage is attempted on the GPU.  The Devs will really have to sort this one out, particularly as the processing of the candidates had started successfully before the restart and then immediately failed after the restart.  Something is a bit weird with that.  Also the error code is quite different compared to the first example.
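
Whether the driver itself advertises double precision can be checked independently of the application's message. In OpenCL, FP64 support is announced through the `cl_khr_fp64` extension (older AMD drivers used `cl_amd_fp64`), and the extension string is printed by tools such as `clinfo`. The sketch below only tests such a string; the function name and the sample strings are made up and are not output from the GT 610.

```python
def has_fp64(extensions):
    """True if an OpenCL device extension string advertises double precision."""
    names = extensions.split()
    # cl_khr_fp64 is the Khronos extension; cl_amd_fp64 is an older AMD-specific one.
    return "cl_khr_fp64" in names or "cl_amd_fp64" in names


# Illustrative extension strings, not taken from any real device in this thread.
print(has_fp64("cl_khr_byte_addressable_store cl_khr_fp64 cl_khr_icd"))  # True
print(has_fp64("cl_khr_byte_addressable_store cl_khr_icd"))              # False
```

If the extension is listed, the driver claims FP64 support, which would match the log line; it says nothing about how fast or how reliable double precision is on that card.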

Cheers,
Gary.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

This is the oddly described "Printer out of paper" error (status=-36). I have seen a couple of folks with these now.

See also https://einsteinathome.org/content/new-user-most-gpu-tasks-failing-status-36-clinvalidcommandqueue
