large number of invalids on Perseus Arm

Anonymous
Topic 197620

Noticed several invalids over the last 9 days involving Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270) WUs. This is on a single machine running the NVIDIA driver.

[EDIT - had to repost the following after having "fat fingered" the original]

All of the jobs look like the following:
Name PB0023_01651_94_0
Workunit 193286317
Created 29 Jun 2014 0:44:56 UTC
Sent 29 Jun 2014 2:15:12 UTC
Received 30 Jun 2014 16:05:25 UTC
Server state Over
Outcome Validate error (2:00000010)

Client state Done
Exit status 0 (0x0)
Computer ID 10487486
Report deadline 13 Jul 2014 2:15:12 UTC
Run time 15,118.71
CPU time 4,332.58
Validate state Invalid
Claimed credit 50.28
Granted credit 0.00
application version Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270)
Stderr output

7.0.27

[05:47:26][19183][INFO ] Application startup - thank you for supporting Einstein@Home!
[05:47:26][19183][INFO ] Starting data processing...
[05:47:26][19183][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 507 MB (1541 MB free / 2048 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[05:47:26][19183][INFO ] Using CUDA device #0 "GeForce GTX 760" (0 CUDA cores / 0.00 GFLOPS)
[05:47:26][19183][INFO ] Version of installed CUDA driver: 6000
[05:47:26][19183][INFO ] Version of CUDA driver API used: 3020
[05:47:26][19183][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[05:47:26][19183][INFO ] Header contents:
------> Original WAPP file: ./PB0023_01651_DM188.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 53361.511633250782
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 62356.6772995
------> DEC (J2000): 105922.841
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4314495
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 188 cm^-3 pc
------> Scale factor: 1.82017
[05:47:27][19183][INFO ] Seed for random number generator is 1066181982.
[05:47:27][19183][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-08
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[05:47:27][19183][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 629 MB (1419 MB free / 2048 MB total) -> Used by this application (assuming a single GPU task): 122 MB
[05:48:26][19183][INFO ] Checkpoint committed!
[05:49:26][19183][INFO ] Checkpoint committed!
o
o
o
[05:50:26][19183][INFO ] Checkpoint committed!
[05:51:26][19183][INFO ] Checkpoint committed!
[05:52:26][19183][INFO ] Checkpoint committed!

[07:53:26][19183][INFO ] Statistics: count dirty SumSpec pages 2459 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1100505
[07:53:26][19183][INFO ] Data processing finished successfully!
[07:53:26][19183][INFO ] Starting data processing...
[07:53:26][19183][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 508 MB (1540 MB free / 2048 MB total) -> Used by this application (assuming a single GPU task): 1 MB
[07:53:26][19183][INFO ] Using CUDA device #0 "GeForce GTX 760" (0 CUDA cores / 0.00 GFLOPS)
[07:53:26][19183][INFO ] Version of installed CUDA driver: 6000
[07:53:26][19183][INFO ] Version of CUDA driver API used: 3020
[07:53:27][19183][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[07:53:27][19183][INFO ] Header contents:
------> Original WAPP file: ./PB0023_01651_DM190.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 53361.511633209018
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 62356.6772995
------> DEC (J2000): 105922.841
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4314495
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 190 cm^-3 pc
------> Scale factor: 1.82045
[07:53:27][19183][INFO ] Seed for random number generator is 1076873195.
[07:53:28][19183][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-08
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[07:53:28][19183][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 630 MB (1418 MB free / 2048 MB total) -> Used by this application (assuming a single GPU task): 123 MB
[07:53:45][19183][INFO ] Checkpoint committed!
[07:54:45][19183][INFO ] Checkpoint committed!
[07:55:45][19183][INFO ] Checkpoint committed!
o
o
o
[09:58:03][19183][INFO ] Checkpoint committed!
[09:59:03][19183][INFO ] Checkpoint committed!
[09:59:22][19183][INFO ] Statistics: count dirty SumSpec pages 2591 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1100505
[09:59:22][19183][INFO ] Data processing finished successfully!
09:59:22 (19183): called boinc_finish


Did this WU complete?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117922067940
RAC: 34544143

large number of invalids on Perseus Arm

Quote:

Noticed several invalids over the last 9 days involving Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270) WUs. This is on a single machine running the NVIDIA driver.

....

Did this WU complete?


Yes it did but I guess you mean is there anything unusual in the std_err.txt output? I'm certainly no expert, but the content you have supplied looks pretty normal to me. I've read it through and nothing immediately jumps out as being abnormal. I'm presuming the

o
o
o


bits were inserted by you to represent large numbers of identical lines you have omitted.

The BOINC client sees a normally completed task but the validator is apparently seeing rubbish - hence the validate error you mention as the outcome. It's possible there may have been a problem with the data but it would seem more likely to be a problem with your hardware. The first thing I would check would be thermal conditions, overclocking, etc. Since some tasks are found to be valid, it seems something is operating 'close to the edge' and is sometimes falling over, but not enough to bring the whole show crashing down :-).

The host in question is running an old alpha version of BOINC (7.0.27) which is most likely responsible for your GPU showing as '134215679MB'. Probably a good idea to upgrade to 7.2.42 to see if that makes any difference to the validate errors - I wouldn't think so but you never know :-).

Cheers,
Gary.

Anonymous

RE: RE: Did this WU

Quote:

Quote:
Did this WU complete?

Yes it did but I guess you mean is there anything unusual in the std_err.txt output? I'm certainly no expert, but the content you have supplied looks pretty normal to me. I've read it through and nothing immediately jumps out as being abnormal. I'm presuming the

o
o
o

bits were inserted by you to represent large numbers of identical lines you have omitted.

Yes, the "o"s are as you describe. I was questioning these entries:
Server state Over
Outcome Validate error (2:00000010)

Quote:

The BOINC client sees a normally completed task but the validator is apparently seeing rubbish - hence the validate error you mention as the outcome. It's possible there may have been a problem with the data but it would seem more likely to be a problem with your hardware. The first thing I would check would be thermal conditions, overclocking, etc. Since some tasks are found to be valid, it seems something is operating 'close to the edge' and is sometimes falling over, but not enough to bring the whole show crashing down :-).

The host in question is running an old alpha version of BOINC (7.0.27) which is most likely responsible for your GPU showing as '134215679MB'. Probably a good idea to upgrade to 7.2.42 to see if that makes any difference to the validate errors - I wouldn't think so but you never know :-).

The temp on this node's GPU has been around 66C, which I'm thinking should be acceptable. I am not doing any overclocking. I can certainly upgrade the BOINC version and will pay more attention to it to see if there is something on the "edge".
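For keeping an eye on the card while tasks are running, nvidia-smi's query mode is one option - just a sketch, and the exact field names depend on the driver version:

nvidia-smi --query-gpu=temperature.gpu,fan.speed,utilization.gpu --format=csv -l 5

That prints temperature, fan speed and GPU utilisation every 5 seconds until you Ctrl-C it.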

Most curious.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117922067940
RAC: 34544143

RE: ... the temp on this

Quote:
... the temp on this node's GPU has been around 66C.


That sounds fine and the problem may not be coming from the GPU.

The next thing I would check would be the CPU cooling system, followed by running a full memory test with something like memtest. On my systems it's one of the boot options, and I have seen an example of a system that seemed to boot OK but did funny things while running, which then showed up as a small number of memory errors during a memtest run.

The other thing I tend to do is swap hardware between systems to see where the problem goes. Makes it a bit easier to properly identify the offending item.

Is the GPU a recent addition to the system? New hardware tends to have a higher failure rate immediately after installation. I've had two brand new GPUs fail recently, one after about 24 hours and the other after a couple of weeks. Both were immediately replaced by the supplier and there has been no further problem.

Cheers,
Gary.

Anonymous

RE: The next thing I

Quote:

The next thing I would check would be the CPU cooling system, followed by running a full memory test with something like memtest. On my systems it's one of the boot options, and I have seen an example of a system that seemed to boot OK but did funny things while running, which then showed up as a small number of memory errors during a memtest run.

Thanks for the "heads-up". I was not aware of memtest.
memtest is present on Ubuntu and installed by default, but I also found another utility called memtester. It runs from the command line and performs the same functions as memtest, I believe. It is not installed by default, but it is easily installed from the software package manager or the command line.
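On Ubuntu, installing it from the command line is just (the package lives in the standard Ubuntu repositories):

sudo apt-get install memtester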

I used:
sudo lshw -C memory
to get the memory size and then ran:
sudo memtester 5G 1    (test 5 GB of memory, one iteration)
followed by:
echo $?
If the echo returns a 0 (zero), the memtester run was clean.

The above command takes a while.

man memtester for more info.

You could script this, checking the "$?" result, and put it in a cron job for periodic memory diagnostics I suppose - something along the lines of the sketch below - but you need to be careful about how much memory you assign to the scan. You could lock up your system.
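A rough sketch of what that might look like - the amount of memory to test and the log location are assumptions to adjust, and remember that memtester locks the memory it tests, so keep the amount well below your free RAM:

#!/bin/sh
# check_memory.sh - run one memtester pass and log the result.
# Run as root so memtester can lock the memory it tests.
AMOUNT=1G                               # how much to test (adjust to suit)
LOGFILE=/var/log/memtester-check.log    # assumed log location

memtester "$AMOUNT" 1 > /dev/null 2>&1
STATUS=$?
if [ "$STATUS" -eq 0 ]; then
    echo "$(date): memtester $AMOUNT pass OK" >> "$LOGFILE"
else
    echo "$(date): memtester $AMOUNT pass FAILED (exit $STATUS)" >> "$LOGFILE"
fi

Dropped into /etc/cron.weekly/ (or called from a crontab entry), that would give a periodic memory check without having to remember to run it.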

I am smarter today than yesterday.

Quote:

The other thing I tend to do is swap hardware between systems to see where the problem goes. Makes it a bit easier to properly identify the offending item.

Is the GPU a recent addition to the system? New hardware tends to have a higher failure rate immediately after installation. I've had two brand new GPUs fail recently, one after about 24 hours and the other after a couple of weeks. Both were immediately replaced by the supplier and there has been no further problem.

The GPU has been in use for about a year on this machine and has performed w/o incident. Just recently have I noticed these issues.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117922067940
RAC: 34544143

RE: ... memtest is present

Quote:
... memtest is present on Ubuntu and installed by default, but I also found another utility called memtester. It runs from the command line and performs the same functions as memtest, I believe.


The nice thing about the memtest I use is that it runs without an OS. It's an item on the boot menu, and the bootloader loads it directly into memory instead of the kernel, so it starts executing immediately. I don't know much about the inner workings, but I believe it relocates itself in memory so that all memory gets tested. You can actually watch it testing different memory blocks as it moves through the full range. Maybe a utility run from a command prompt can do the same - I don't know. It would have to shift both itself and the OS, and that sounds a bit difficult. Maybe it's not.

Quote:
The GPU has been in use for about a year on this machine and has performed w/o incident. Just recently have I noticed these issues.


You could try swapping it with the GTX 770 and see if the problem transfers with the GPU. Even if there were different driver versions on the two machines, I wouldn't imagine there would be a problem.

Cheers,
Gary.

Anonymous

RE: The nice thing about

Quote:

The nice thing about the memtest I use is that it runs without an OS. It's an item on the boot menu, and the bootloader loads it directly into memory instead of the kernel, so it starts executing immediately.

Ubuntu has the same memtest that your distro has. It's an option on the grub menu; I had to do some looking. Soooooo, for you Ubuntu users: if the grub menu does not display automatically on boot, hold down the "Shift" key. It will then display and you can choose the memtest86+ option (there are two - either one). For me this seemed like a more thorough scan than the one I had used earlier from the command line. It took about 34 minutes for a single pass based upon my memory size. Be aware that it outputs only a subtle message that it completed without errors and then begins pass 2. Pressing "Esc" returns you to the boot-up process.

As it turned out, I have no memory issues based upon both of these utility scans. That is always nice to know.

Also, if you Google "memtest86" you can download the latest version, burn it to a bootable DVD, and boot from that if your grub menu does not offer a version.

Anonymous

RE: RE: Noticed several

Quote:
Quote:

Noticed several invalids over the last 9 days involving Binary Radio Pulsar Search (Perseus Arm Survey) v1.39 (BRP5-cuda32-nv270) WUs. This is on a single machine running the NVIDIA driver.

....

Did this WU complete?


Yes it did but I guess you mean is there anything unusual in the std_err.txt output? I'm certainly no expert, but the content you have supplied looks pretty normal to me. I've read it through and nothing immediately jumps out as being abnormal. I'm presuming the

o
o
o

bits were inserted by you to represent large numbers of identical lines you have omitted.

The BOINC client sees a normally completed task but the validator is apparently seeing rubbish - hence the validate error you mention as the outcome. It's possible there may have been a problem with the data but it would seem more likely to be a problem with your hardware. The first thing I would check would be thermal conditions, overclocking, etc. Since some tasks are found to be valid, it seems something is operating 'close to the edge' and is sometimes falling over, but not enough to bring the whole show crashing down :-).

The host in question is running an old alpha version of BOINC (7.0.27) which is most likely responsible for your GPU showing as '134215679MB'. Probably a good idea to upgrade to 7.2.42 to see if that makes any difference to the validate errors - I wouldn't think so but you never know :-).

Problem seems to have been corrected. I updated to a later BOINC version and installed the "fan controller" option from the NVIDIA driver package, then increased the fan speed to improve cooling. The suggested memtest was run and no errors were detected, so it may have been a temperature/BOINC software issue, although I don't think a temp of 66C should have caused a problem. But maybe there is something happening in the GPU and it will eventually generate more errors of this nature. For now the birds are singing and the sun is shining.
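For reference, the command-line equivalent of the fan control is roughly the following - a sketch only, since the Coolbits step and the exact attribute names vary between driver versions:

sudo nvidia-xconfig --cool-bits=4      (then restart X so the option takes effect)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=70"

The first command enables manual fan control in the driver; the second takes control of fan 0 and pins it at 70%.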

Thanks all.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117922067940
RAC: 34544143

RE: Problem seems to have

Quote:
Problem seems to have been corrected...


I'm glad you got it sorted.

A year or two ago (on Milkyway, not Einstein) I had some GPUs that had been in service for a while start to produce lots of errors due to heat. The heat sink and fan were good, so I decided to remove the heat sink and replace the thermal interface material. I found the old stuff had completely dried out in all cases. With freshly applied grease, temperatures dropped quite a bit (~10C) and the errors stopped.

If the problem returns, I suggest you check the TIM, if you haven't already done so.

Cheers,
Gary.
