Invalid Tasks

[CSF] Vagelis Giannadakis
Joined: 18 Mar 05
Posts: 8
Credit: 3760202
RAC: 0
Topic 196949

Hi all,

I'm an old-timer and a newbie at the same time on E@H: I came back after a loooong time to crunch on my NVIDIA GT 440, after WCG stopped serving GPU WUs.

Just started yesterday, but I am seeing a trend I don't like at all: my tasks run and complete fine, but they are then marked invalid. Here is the list of invalid tasks.

I know it's too soon to draw any conclusions with so few WUs completed, but as I said, I don't like the trend. I suspect the card, but it was working like a charm for WCG!

Perhaps it is the temps. I have a fairly good setup, with fans blowing both in and out of the case, a decent CPU cooler and the card's (GIGABYTE) stock active cooler. Here are my temps:

vagelis@vgserver:~$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +63.0°C  (high = +83.0°C, crit = +99.0°C)
Core 1:       +62.0°C  (high = +83.0°C, crit = +99.0°C)
Core 2:       +61.0°C  (high = +83.0°C, crit = +99.0°C)
Core 3:       +63.0°C  (high = +83.0°C, crit = +99.0°C)

atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage: +1.22 V (min = +0.80 V, max = +1.60 V)
+3.3V Voltage: +3.39 V (min = +2.97 V, max = +3.63 V)
+5V Voltage: +5.14 V (min = +4.50 V, max = +5.50 V)
+12V Voltage: +12.26 V (min = +10.20 V, max = +13.80 V)
CPU Fan Speed: 2280 RPM (min = 600 RPM)
Chassis1 Fan Speed: 1504 RPM (min = 600 RPM)
Chassis2 Fan Speed: 0 RPM (min = 600 RPM)
Power Fan Speed: 610 RPM (min = 0 RPM)
CPU Temperature: +63.5°C (high = +45.0°C, crit = +45.5°C)
MB Temperature: +35.0°C (high = +45.0°C, crit = +46.0°C)

vagelis@vgserver:~$ nvidia-smi -q -d TEMPERATURE

==============NVSMI LOG==============

Timestamp : Fri May 10 00:17:44 2013
Driver Version : 304.88

Attached GPUs : 1
GPU 0000:01:00.0
Temperature
Gpu : 64 C

It's right after midnight now that I'm writing these lines, so ambient temp has dropped a few degrees. When it's warmer, they tend to be about 70 for the CPU, slightly lower for the GPU (maybe 67-69) and 37-39 for the MB.
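To keep an eye on these readings over time, a small script can pull the per-core temperatures out of the `sensors` output and report how far each core is below its "high" limit. This is only a minimal sketch: the function name `core_headroom` is made up for illustration, and the regular expression assumes the exact line layout shown in the output above.

```python
import re

# Matches lines like:
# "Core 0:       +63.0°C  (high = +83.0°C, crit = +99.0°C)"
# (assumes the coretemp layout shown in the post above)
CORE_LINE = re.compile(
    r"^(Core \d+):\s+\+?([\d.]+)°C\s+\(high = \+?([\d.]+)°C")

def core_headroom(sensors_output):
    """Map each core label to (temperature, degrees below its 'high' limit)."""
    result = {}
    for line in sensors_output.splitlines():
        match = CORE_LINE.match(line.strip())
        if match:
            temp = float(match.group(2))
            high = float(match.group(3))
            result[match.group(1)] = (temp, high - temp)
    return result
```

Feeding it the text captured from `sensors` gives one entry per core, so a shrinking margin on a warm day is easy to spot.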

I am running on Ubuntu 12.04 with BOINC and NVIDIA drivers from the repos (7.0.27 and 304.88 respectively). Here are the machine's details.

Do you think my card is bad? Or perhaps I'm just piling up all my bad results right from the start and things will get smooth later?

Thanks and regards,
Vagelis

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Invalid Tasks

Quote:
Here is the list of invalid tasks.


Only you can see that list, we can't. But we can see it here, using hostid instead of userid.

Quote:
Do you think my card is bad? Or perhaps I'm just piling up all my bad results right from the start and things will get smooth later?


No, I think you're sitting on a bad GPU. Though it could be something easy, such as dust build-up or a badly seated card. Take it out, dust it off, put it back in and seat it correctly. While it's out, check the capacitors: none should be bulging, leaking or burnt. Check the power cords as well.

If you have the possibility, try it in another slot.
Or if you have a spare GPU, try that one.

[CSF] Vagelis Giannadakis
Joined: 18 Mar 05
Posts: 8
Credit: 3760202
RAC: 0

Hi Jord,

Thanks for your response! I checked my tasks this morning and the trend continues: all my completed tasks are marked invalid.

I am wondering if there is some indication of whatever might be wrong in the tasks' stdout / stderr outputs. Something that I could compare for a given WU between my task that is found invalid and another that is found valid.
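One way to do such a comparison is to save the stderr text of the invalid task and of a validated task for the same WU, then diff them after removing the parts that always differ. The sketch below is only an illustration: `diff_logs` and `normalize` are hypothetical names, and the prefix pattern assumes log lines of the form `[HH:MM:SS][pid]...` like the ones the application prints.

```python
import difflib
import re

# Leading "[HH:MM:SS][pid]" prefix; timestamps and process IDs always
# differ between hosts, so they are stripped before comparing.
PREFIX = re.compile(r"^\[\d{2}:\d{2}:\d{2}\]\[\d+\]")

def normalize(log_text):
    """Split a log into lines with the time/pid prefix removed."""
    return [PREFIX.sub("", line).rstrip() for line in log_text.splitlines()]

def diff_logs(invalid_log, valid_log):
    """Return unified-diff lines between two task stderr logs."""
    return list(difflib.unified_diff(
        normalize(invalid_log), normalize(valid_log),
        fromfile="invalid_task", tofile="valid_task", lineterm=""))
```

Lines starting with `-` come from the invalid task and lines starting with `+` from the valid one. Differences in driver or CUDA versions are worth a look, while differences in device names are expected between hosts.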

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169699457
RAC: 0

Can't see anything obvious in your scheduler logs, and the runtime for the jobs is in the expected range. I do recall I had a few problems around boinc-7.0.27, so one possibility would be to upgrade to the current 7.0.65 release.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6568
Credit: 292951432
RAC: 94275

Quote:
I am wondering if there is some indication of whatever might be wrong in the tasks' stdout / stderr outputs. Something that I could compare for a given WU between my task that is found invalid and another that is found valid.


Excellent idea, in fact that's one reason why such logs exist. I'll have a look when I get home. :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

[CSF] Vagelis Giannadakis
Joined: 18 Mar 05
Posts: 8
Credit: 3760202
RAC: 0

Thank you Neil and Mike for your responses! Mike, it would be sooo nice if you managed to take a look at those logs! Thank you in advance!

In my attempts to resolve this, I tried detaching / reattaching the project, resetting the project and even rebooting my PC. The detach / reattach caused a new computer ID to be generated, 7225470. I went ahead and merged the two computer entries.

All this, I am afraid, to no avail! A new WU was completed and also found invalid!

Now, I know my poor little GT 440 wouldn't make the slightest difference to the progress of the project if it actually did produce valid results, but it is SOOO depressing to see your WUs flagged invalid! :(

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Quote:


I am running on Ubuntu 12.04 with BOINC and NVIDIA drivers from the repos (7.0.27 and 304.88 respectively). Here are the machine's details.

I can't see if you are still running those versions, but I find that running NVIDIA drivers from the repos gives me problems (Ubuntu 10.04). If I'm not paying attention at update times, my current 310.14, which has served me well, gets overwritten with 304.xx, which produces either invalid results or errors.

Nvidia would be the place to go to get later versions.

I notice 319.17 is the latest, so I may give that a try and report back.

HTH

[CSF] Vagelis Giannadakis
Joined: 18 Mar 05
Posts: 8
Credit: 3760202
RAC: 0

Maybe I have found something. I downloaded and installed the latest driver from the Nvidia site (319.17). Nouveau did force me to reboot my server one extra time, which takes a while with the RAID and other stuff I have on it, but I managed to load the latest driver and launch X successfully.

I then fired up BOINC and the manager to see whether the E@H WU would start and it did. Looking through the BOINC logs, I noticed this:

NVIDIA GPU 0: GeForce GT 440 (driver version unknown, CUDA version 5.50, compute capability 2.1, 134214656MB, 134214625MB available, 319 GFLOPS peak)

With the older driver, this was like so:

NVIDIA GPU 0: GeForce GT 440 (driver version unknown, CUDA version 5.0, compute capability 2.1, 134214656MB, 134214626MB available, 319 GFLOPS peak)

Notice the CUDA version difference, was 5.0, now is 5.50.

I then looked at the WUs logs:

[23:15:05][2456][INFO ] Using CUDA device #0 "GeForce GT 440" (96 CUDA cores / 318.72 GFLOPS)
[23:15:05][2456][INFO ] Version of installed CUDA driver: 5050
[23:15:05][2456][INFO ] Version of CUDA driver API used: 3020

So I am wondering, could it be that E@H requires CUDA 5.50 and didn't work correctly with 5.0??
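For reading those two numbers: the CUDA runtime and driver report versions as a single integer, 1000 × major + 10 × minor. Under that encoding, 5050 is CUDA 5.5 (what BOINC prints as 5.50), and 3020 suggests the application itself uses the CUDA 3.2 API. A minimal sketch of the decoding, with `decode_cuda_version` being a made-up helper name:

```python
def decode_cuda_version(value):
    """Split a CUDA version integer (1000*major + 10*minor) into (major, minor)."""
    major, rest = divmod(value, 1000)
    minor = rest // 10
    return major, minor

installed = decode_cuda_version(5050)  # "Version of installed CUDA driver"
api_used = decode_cuda_version(3020)   # "Version of CUDA driver API used"
```

If that reading is right, both the old driver (CUDA 5.0) and the new one (CUDA 5.5) are well above the 3.2 API level the application reports using.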

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Quote:


I then fired up BOINC and the manager to see whether the E@H WU would start and it did. Looking through the BOINC logs, I noticed this:

NVIDIA GPU 0: GeForce GT 440 (driver version unknown, CUDA version 5.50, compute capability 2.1, 134214656MB, 134214625MB available, 319 GFLOPS peak)

With the older driver, this was like so:

NVIDIA GPU 0: GeForce GT 440 (driver version unknown, CUDA version 5.0, compute capability 2.1, 134214656MB, 134214626MB available, 319 GFLOPS peak)

Notice the CUDA version difference, was 5.0, now is 5.50.

I then looked at the WUs logs:

[23:15:05][2456][INFO ] Using CUDA device #0 "GeForce GT 440" (96 CUDA cores / 318.72 GFLOPS)
[23:15:05][2456][INFO ] Version of installed CUDA driver: 5050
[23:15:05][2456][INFO ] Version of CUDA driver API used: 3020

So I am wondering, could it be that E@H requires CUDA 5.50 and didn't work correctly with 5.0??

BRP4 requirements suggest 5.0 is ok.

I would let it run and see what happens. Any in-progress WUs may still error, so I would abort any that were in progress during the upgrade.

134214656MB - that is large for the GPU! I think something is not reporting the memory size of the GPU correctly - and i seem to recall it is a known bug.

And my GTX 460s on 319.17 have crunched a couple of WUs OK. No obvious performance gains, although it's hard to tell at first glance.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Quote:
134214656MB - that is large for the GPU! I think something is not reporting the memory size of the GPU correctly - and i seem to recall it is a known bug.


It is a known bug and fixed in 7.0.65, but last I checked 7.0.65 isn't available yet in Ubuntu repos.

[CSF] Vagelis Giannadakis
Joined: 18 Mar 05
Posts: 8
Credit: 3760202
RAC: 0

Quote:
Quote:
134214656MB - that is large for the GPU! I think something is not reporting the memory size of the GPU correctly - and i seem to recall it is a known bug.

It is a known bug and fixed in 7.0.65, but last I checked 7.0.65 isn't available yet in Ubuntu repos.


I will give the latest BOINC version a try.
