Linux CUDA validation errors

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0
Topic 195648

I plugged in a CUDA card a few days ago and have been getting a lot of validation errors since. Yesterday 11 WUs were invalid and only 3 valid. :(
The screensaver and everything else is off. The system is openSUSE 11.3, 64-bit. Invalid tasks.

[pre]==============NVSMI LOG==============

Timestamp : Sat Feb 12 10:38:06 2011

Driver Version : 260.19.36

GPU 0:
Product Name : GeForce GTX 460
PCI Device/Vendor ID : e2210de
PCI Location ID : 0:4:0
Board Serial : 629154929
Display : Connected
Temperature : 46 C
Fan Speed : 40%
Utilization
GPU : 52%
Memory : 14%[/pre]

Any ideas what could be the reason?

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

Validate errors are usually server-side problems; nothing is wrong on your machine(s).

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

So there is hope that this problem will be solved soon?
I'd really appreciate it. :)

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2694028
RAC: 0

All your CUDA tasks are getting reported straight after the result has been uploaded,
which is too soon after upload. Try to find a way of getting BOINC to report them later,
or set NNT (No New Tasks) until you have a batch of them to report.

Claggy

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686127309
RAC: 577205

Hi!

There are two ways that results can end up as being invalid:

1) after being sent to the server, they go straight to invalid (validation error). This means that the server rejects the result just by looking at the individual file (e.g. the file is corrupted as a whole or some values in the file fail a basic sanity range check).

2) the result survives the initial sanity check but fails to agree with the result of another cruncher ("wingman"). This is the "inconclusive validation" scenario. Once one or more additional results come in and the quorum of two agreeing results is finally reached, all other non-matching results become "invalid".

This second case can happen because of cross-platform validation problems (CPU calculates a bit differently than GPU), see this discussion http://einsteinathome.org/node/195567&nowrap=true#109649.
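To make the two scenarios concrete, here is a minimal Python sketch of a quorum-style validator. It is only an illustration of the logic described above, not the project's actual validator: the `Result` type, the range limits, the tolerance and all function names are hypothetical.

```python
from dataclasses import dataclass
from math import isclose, isfinite


@dataclass
class Result:
    host: str
    values: list  # hypothetical: the numbers a task reports


def passes_sanity_check(result, lo=0.0, hi=1e6):
    """Scenario 1: judge a result on its own (range limits are made up)."""
    return bool(result.values) and all(
        isfinite(v) and lo <= v <= hi for v in result.values
    )


def results_agree(a, b, rel_tol=1e-5):
    """Scenario 2: two results match if all values agree within a tolerance."""
    return len(a.values) == len(b.values) and all(
        isclose(x, y, rel_tol=rel_tol) for x, y in zip(a.values, b.values)
    )


def validate(results, quorum=2):
    """Map each host to 'valid', 'invalid' or 'inconclusive'."""
    status = {}
    candidates = []
    for r in results:
        if passes_sanity_check(r):
            candidates.append(r)
        else:
            status[r.host] = "invalid"  # rejected without any comparison

    # Look for a group of mutually agreeing results that reaches the quorum.
    for ref in candidates:
        group = {r.host for r in candidates if results_agree(ref, r)}
        if len(group) >= quorum:
            for r in candidates:
                status[r.host] = "valid" if r.host in group else "invalid"
            return status

    # No quorum yet: everything that passed the sanity check stays open.
    for r in candidates:
        status[r.host] = "inconclusive"
    return status
```

With only a GPU result and one disagreeing CPU result, both stay "inconclusive"; once a second CPU result arrives that matches the first, the GPU result turns "invalid". A result containing NaN never gets that far and is "invalid" immediately.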

But I think your results were failing as in the first scenario? That would indicate that the results are a complete mess.

I had a similar string of "bad" results recently, and after a reboot of the machine, all was back to normal with just a few results failing validation as in scenario 2).

I have no idea why it was ok after the reboot, or even whether the reboot had anything to do with it. It could be that there was a sequence of tasks that are somehow more sensitive to cross-validation problems than others, and the reboot just coincided with the end of these results.

So my advice would be to
-reboot,
-watch the temperature of the GPU (if it's always like the one in your thread starting message, it's ok of course. Actually I think it's surprisingly low for a 50% loaded card and fan at 40%??)
-maybe upgrade the driver. NVIDIA's new 270.* Linux driver fixes a certain problem that kept the E@H app from yielding the CPU to other tasks during GPU computations, and Oliver plans to release a beta app that would no longer need one full CPU core. So you will want to go to the 270 driver anyway, I guess.
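For watching the temperature, a small Python sketch like the following could pull the readings out of a log in the format posted at the top of the thread. It assumes the log text has already been saved or pasted into a string; the function names and the 80 C limit are arbitrary choices for illustration, not a recommendation from NVIDIA or the project.

```python
import re

# Excerpt in the same layout as the NVSMI log in the first post.
SAMPLE_LOG = """\
GPU 0:
Product Name : GeForce GTX 460
Temperature : 46 C
Fan Speed : 40%
"""


def gpu_temperatures(log_text):
    """Return every 'Temperature : N C' reading found, in degrees Celsius."""
    pattern = re.compile(r"^\s*Temperature\s*:\s*(\d+)\s*C\s*$", re.MULTILINE)
    return [int(m.group(1)) for m in pattern.finditer(log_text)]


def too_hot(log_text, limit_c=80):
    """True if any reading exceeds limit_c (80 is an assumed threshold)."""
    return any(t > limit_c for t in gpu_temperatures(log_text))


if __name__ == "__main__":
    print(gpu_temperatures(SAMPLE_LOG), too_hot(SAMPLE_LOG))
```

Running this over a log captured every few minutes while a task is crunching would show whether the card stays near the idle-looking value or climbs under load.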

Happy crunching
HB

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Thanks for the help @you all.

The temperature is always in the range of 47° to 48° C and the box doesn't run 24/7. It's a Gigabyte card with two fans, factory-overclocked a bit by Gigabyte. I will try the new driver and see if it helps. My results are always invalid as soon as they arrive. I never saw an 'inconclusive' result, only the wingman's ones when mine was invalid. At the moment I'm waiting for a wingman that is running Linux too, to see what happens. Today I got 3 valid and 1 invalid result so far, the rest is pending. So there is hope they will not all error out. ;)

I deleted 'return_results_immediately' in my cc_config.xml but the first result reported was invalid.
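A quick way to double-check what the config file currently says is sketched below in Python. It assumes the setting in question is the client option that BOINC documents as `report_results_immediately` inside the `<options>` element of cc_config.xml (the spelling in the post differs slightly); the function name is made up for this example.

```python
import xml.etree.ElementTree as ET


def reports_immediately(cc_config_xml):
    """Return True if <report_results_immediately> is present and set to 1.

    Assumes the usual layout: <cc_config><options>...</options></cc_config>.
    """
    root = ET.fromstring(cc_config_xml)
    node = root.find("./options/report_results_immediately")
    return node is not None and (node.text or "").strip() == "1"


if __name__ == "__main__":
    example = """
    <cc_config>
      <options>
        <report_results_immediately>1</report_results_immediately>
      </options>
    </cc_config>
    """
    print(reports_immediately(example))
```

If the element is absent, or set to 0, the function returns False, which corresponds to the client batching its reports as usual.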

In a few days I will check what happens when I boot Win 7. This might tell us something about the Linux driver.

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2694028
RAC: 0

Quote:

Thanks for the help @you all.

The temperature is always in the range of 47° to 48° C and the box doesn't run 24/7. It's a Gigabyte card with two fans, factory-overclocked a bit by Gigabyte. I will try the new driver and see if it helps. My results are always invalid as soon as they arrive. I never saw an 'inconclusive' result, only the wingman's ones when mine was invalid. At the moment I'm waiting for a wingman that is running Linux too, to see what happens. Today I got 3 valid and 1 invalid result so far, the rest is pending. So there is hope they will not all error out. ;)

I deleted 'return_results_immediately' in my cc_config.xml but the first result reported was invalid.

In a few days I will check what happens when I boot Win 7. This might tell us something about the Linux driver.

Did you do a 'Read config file' after deleting 'return_results_immediately' from your cc_config.xml?

Claggy

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Sure :)
I restarted BOINC.

3 invalid so far. 3 valid, rest pending.

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

I tested CUDA with Win 7 today and haven't had any invalid results so far. Over the last few days the success rate with Linux was really frustrating. So two facts arise:

1) My card is ok.
2) The Linux app is buggy.

I will keep on testing with Win 7 over the next few days. I started running 2 tasks in parallel today without any problems. Linux is my main system, so I might be forced to run the Win app in a VM like some years ago.

Saenger
Joined: 15 Feb 05
Posts: 403
Credit: 33009522
RAC: 0

Quote:

So two facts arise:

1) My card is ok.
2) The Linux app is buggy.


I can't confirm that. I'm running Ubuntu 10.04 with the 270.18 drivers, without any major problems besides the usual validation problems between CUDA and non-CUDA for very few WUs.

Greetings from Sänger

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

When I start Linux again to finish my tasks, I will give that 270.xx beta driver a chance. At the moment I get invalid results when the wingmen are SSE2 and Windows CUDA hosts, so there is almost no chance of getting valid results.
