Sporadic validate errors using Parkes PMPS XT v1.57-BRP6-Beta-cuda55

Ron Kosinski
Joined: 23 Mar 05
Posts: 57
Credit: 1075916534
RAC: 766008
Topic 198674

I have two very similar computers (10810284 and 8183504) with the same GPU card (Gigabyte GTX-760) and motherboard (Asus Z87-A). Both GPU cards use the same driver (352.86) and O/C settings. The biggest differences are the CPUs and OS (i5 W7 Pro and i7 W7 Ult).

I am having a problem with one of the computers (10810284) giving me a lot of validate errors, although some WUs validate OK. No issues with the other computer (8183504). No issues at SETI.

Would someone with more knowledge than I have please look at the WU logs and see what may be causing the problem?

All Binary Radio Pulsar Search (Parkes PMPS XT) tasks for computer 10810284
All Binary Radio Pulsar Search (Parkes PMPS XT) tasks for computer 8183504

If any additional information is wanted, please ask.

Thank you for any and all assistance.
Ron

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118013174648
RAC: 21033974


Quote:
Would someone with more knowledge than I have please look at the WU logs and see what may be causing the problem?


Unfortunately, the only person who can work out the cause of the problem is someone with physical access to your computer.

A validate error simply means that the result returned is so defective that the validator can immediately rule it out without even needing to compare it with a companion result. In the bulk of cases, validate errors point to hardware issues/failures, perhaps caused by excessive temperature, frequency or out of spec/unstable voltage. You need to work through a series of tests to narrow down the cause of the problem.

You should reduce any overclocking back closer to default values to see if the problem goes away. Just because one card works OK doesn't mean that the other one will tolerate the same settings. You should check temperatures and fan speeds and remove any dust/fluff buildup on heat sinks/fans. If none of those make a difference, you should attempt to narrow it down to individual bits of hardware - RAM, PSU, graphics card, etc.

Because you have two similar systems, the easiest thing to do is swap individual components (one at a time) between them, until you find the item that transfers the problem from one system to the other. If the problem doesn't transfer, you probably have a motherboard fault of some sort.

Good luck with finding what is causing the errors.

Cheers,
Gary.

Der Mann mit der Ledertasche
Joined: 12 Dec 05
Posts: 151
Credit: 302594178
RAC: 0


Hi Gary,

Looking over a couple of the results, it seems that these errors come from the wingman running a different platform and a different app version.
Just a suggestion from a first look over about 10 WUs.

BR

DMmdL

Greetings from the North

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118013174648
RAC: 21033974


Quote:
... these errors come from the wingman running a different platform and a different app version.


The errors mentioned by the OP were "validate errors" and not "invalid" results arising from comparisons with other returned tasks. There is quite a difference.

With validate errors, there is no comparison performed. The validator recognises that one result is rubbish without having to do a comparison with another. That is why, when one member of a quorum is found to be a validate error, the other one is marked "Completed, validation inconclusive" (instead of "Completed, waiting for validation"), to indicate that the result looks OK (not a 2nd validate error) but a comparison can't proceed just yet until a further result is returned by a third host.

So as I tried to indicate, validate errors are most likely caused by hardware related problems on the host returning them.
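To make the distinction concrete, here is a minimal Python sketch of the two stages described above. It is only an illustration, not the project's actual validator: the names sanity_check, results_agree and validate_pair, the list-of-floats result format and the tolerance are all assumptions made up for the example.

```python
import math
from enum import Enum


class Outcome(Enum):
    VALIDATE_ERROR = "validate error"
    INCONCLUSIVE = "completed, validation inconclusive"
    VALID = "valid"


def sanity_check(values):
    """Stage 1: is this result usable at all? No companion result is needed."""
    return len(values) > 0 and all(math.isfinite(v) for v in values)


def results_agree(a, b, rel_tol=1e-5):
    """Stage 2: compare two usable results (hypothetical tolerance)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )


def validate_pair(a, b):
    """Return the outcome for each member of a two-result quorum."""
    ok_a, ok_b = sanity_check(a), sanity_check(b)
    if not (ok_a and ok_b):
        # No comparison happens: the rubbish result is a validate error,
        # and a sane companion has to wait for a third result.
        return (
            Outcome.INCONCLUSIVE if ok_a else Outcome.VALIDATE_ERROR,
            Outcome.INCONCLUSIVE if ok_b else Outcome.VALIDATE_ERROR,
        )
    if results_agree(a, b):
        return (Outcome.VALID, Outcome.VALID)
    # Simplification: two sane results that disagree both wait for a third
    # result; only then could one of them be declared "invalid".
    return (Outcome.INCONCLUSIVE, Outcome.INCONCLUSIVE)
```

In this sketch a result containing a NaN never reaches the comparison step, which is why the wingman's platform or app version cannot be the cause of a validate error.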

Cheers,
Gary.

Ron Kosinski
Joined: 23 Mar 05
Posts: 57
Credit: 1075916534
RAC: 766008


Thanks for the replies!

Quote:
You should reduce any overclocking back closer to default values to see if the problem goes away. You should check temperatures and fan speeds and remove any dust/fluff buildup on heat sinks/fans. If none of those make a difference, you should attempt to narrow it down to individual bits of hardware - RAM, PSU, graphics card, etc.

Both GPU and CPU are slightly O/C on both boxes. GPU temps stay below 65 deg C, CPU temps stay below 55 deg C. Dust bunnies are cleaned out at least quarterly. Both boxes run 24/7 with minimal or no gaming. The GPU card on the problem box is about 6 months newer. I will try going back to stock GPU settings to see if that helps. I am including GPU-Z screenshots and BOINC event logs from both boxes in case the additional information is helpful.

Problem Box - 10810284

7/3/2016 9:11:03 PM | | cc_config.xml not found - using defaults
7/3/2016 9:11:03 PM | | Starting BOINC client version 7.6.9 for windows_x86_64
7/3/2016 9:11:03 PM | | log flags: file_xfer, sched_ops, task
7/3/2016 9:11:03 PM | | Libraries: libcurl/7.39.0 OpenSSL/1.0.2a zlib/1.2.8
7/3/2016 9:11:03 PM | | Data directory: C:\ProgramData\BOINC
7/3/2016 9:11:03 PM | | Running under account XXX
7/3/2016 9:11:04 PM | | CUDA: NVIDIA GPU 0: GeForce GTX 760 (driver version 352.86, CUDA version 7.5, compute capability 3.0, 2048MB, 1915MB available, 2650 GFLOPS peak)
7/3/2016 9:11:04 PM | | OpenCL: NVIDIA GPU 0: GeForce GTX 760 (driver version 352.86, device version OpenCL 1.2 CUDA, 2048MB, 1915MB available, 2650 GFLOPS peak)
7/3/2016 9:11:04 PM | | Host name: xxxx3
7/3/2016 9:11:04 PM | | Processor: 4 GenuineIntel Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz [Family 6 Model 60 Stepping 3]
7/3/2016 9:11:04 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx tm2 pbe fsgsbase bmi1 smep bmi2
7/3/2016 9:11:04 PM | | OS: Microsoft Windows 7: Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)
7/3/2016 9:11:04 PM | | Memory: 15.94 GB physical, 31.88 GB virtual
7/3/2016 9:11:04 PM | | Disk: 190.43 GB total, 41.00 GB free

GPU-Z image

Good Box - 8183504

7/4/2016 8:27:32 AM | | cc_config.xml not found - using defaults
7/4/2016 8:27:33 AM | | Starting BOINC client version 7.6.9 for windows_x86_64
7/4/2016 8:27:33 AM | | log flags: file_xfer, sched_ops, task
7/4/2016 8:27:33 AM | | Libraries: libcurl/7.39.0 OpenSSL/1.0.2a zlib/1.2.8
7/4/2016 8:27:33 AM | | Data directory: C:\ProgramData\BOINC
7/4/2016 8:27:33 AM | | Running under account XXX
7/4/2016 8:28:05 AM | | CUDA: NVIDIA GPU 0: GeForce GTX 760 (driver version 352.86, CUDA version 7.5, compute capability 3.0, 2048MB, 1936MB available, 2650 GFLOPS peak)
7/4/2016 8:28:05 AM | | OpenCL: NVIDIA GPU 0: GeForce GTX 760 (driver version 352.86, device version OpenCL 1.2 CUDA, 2048MB, 1936MB available, 2650 GFLOPS peak)
7/4/2016 8:28:07 AM | | Host name: xxxx2
7/4/2016 8:28:07 AM | | Processor: 4 GenuineIntel Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz [Family 6 Model 60 Stepping 3]
7/4/2016 8:28:07 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx tm2 pbe fsgsbase bmi1 smep bmi2
7/4/2016 8:28:07 AM | | OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
7/4/2016 8:28:07 AM | | Memory: 7.94 GB physical, 15.88 GB virtual
7/4/2016 8:28:07 AM | | Disk: 315.04 GB total, 87.37 GB free
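For anyone who wants to compare two startup logs like these field by field rather than by eye, here is a minimal Python sketch. It assumes the "timestamp | project | message" layout shown in the logs above; the function names parse_startup and diff_hosts are made up for illustration and are not part of BOINC.

```python
def parse_startup(log_text):
    """Collect 'Key: value' messages from BOINC event log lines."""
    fields = {}
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # timestamp, project, message
        if len(parts) != 3:
            continue
        key, sep, value = parts[2].strip().partition(": ")
        if sep:  # skip messages that are not 'Key: value'
            fields[key] = value
    return fields


def diff_hosts(a, b):
    """Return only the fields whose values differ between two hosts."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Feeding in the two logs as strings and calling diff_hosts on the parsed results lists only the fields that differ between the boxes, which narrows down what is worth a closer look.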

GPU-Z image

Thanks, Ron

archae86
Joined: 6 Dec 05
Posts: 3159
Credit: 7247796429
RAC: 1354052


No two samples of a GPU chip can be assumed to have the same speed limit.

Nor can they necessarily be assumed to meet their own spec. I had a GTX 460 card which needed to be slightly underclocked for really good long-term stability running Einstein work of a few years ago, though it would run for weeks at a time at stock clock. I had another sample of the exact same brand and model of card which needed no such help.

In your specific case, before I tried Gary's excellent suggestions of swapping components, I'd try an appreciable clock speed reduction (say 10%) one at a time for any remotely suspect component. You are having failures frequently enough that if any of the clocks you try changing are modulating the problem you should be able to tell very quickly.

And, yes, memory clock (including all those bizarre details in the BIOS I usually don't fuss with but once) counts here. However to treat the system main memory as a suspect you can probably just invoke one of the less aggressive sets of settings rather than (at least initially) fiddling with them individually.

Ron Kosinski
Joined: 23 Mar 05
Posts: 57
Credit: 1075916534
RAC: 766008


Quote:
And, yes, memory clock (including all those bizarre details in the BIOS I usually don't fuss with but once) counts here. However to treat the system main memory as a suspect you can probably just invoke one of the less aggressive sets of settings rather than (at least initially) fiddling with them individually.

Do you think the CPU main memory may be an issue, or just the GPU memory?

Thanks, Ron

archae86
Joined: 6 Dec 05
Posts: 3159
Credit: 7247796429
RAC: 1354052


Quote:
Quote:
And, yes, memory clock (including all those bizarre details in the BIOS I usually don't fuss with but once) counts here. However to treat the system main memory as a suspect you can probably just invoke one of the less aggressive sets of settings rather than (at least initially) fiddling with them individually.

Do you think the CPU main memory may be an issue, or just the GPU memory?

Thanks, Ron

Well, I'd certainly start with the GPU things.

But if you reduce both GPU core and GPU memory clock without effect, then you've got a wider net to cast (and swapping the actual CPU chip and RAM sticks is not much fun, so hoping a downclock will expose things is worth a try).

Personally, I'd try it, maybe fourth or so down the list.

If you had a really low failure rate, it might make sense to back off everything and just wait to see if things go better. But your failure rate seems high enough to get some diagnostic power out of backing off things sequentially (but I'd leave the back offs in place while moving on in the diagnostic sequence).
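The point about failure rate and diagnostic power can be put into rough numbers. The sketch below assumes tasks fail independently at a fixed rate, which is a simplification, and the rates used are hypothetical, not measured from the hosts in this thread. It asks how many clean tasks in a row you need before an unchanged error rate becomes an unlikely explanation.

```python
import math


def p_clean_run(error_rate, n_tasks):
    """Chance of n_tasks error-free results if the error rate has NOT changed."""
    return (1.0 - error_rate) ** n_tasks


def tasks_needed(error_rate, confidence=0.95):
    """Smallest clean run that would be this unlikely under the old error rate."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - error_rate))
```

With a hypothetical 30% validate error rate, tasks_needed(0.30) gives 9, so each back-off step can be judged after a handful of tasks. At a 1% rate, tasks_needed(0.01) gives 299, which is why a low failure rate makes step-by-step testing so slow.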

Debugging styles vary, a lot. If you find and fix the problem, with reasonable effort, then your method worked for you.

Ron Kosinski
Joined: 23 Mar 05
Posts: 57
Credit: 1075916534
RAC: 766008


WTF!
Around the same time I cut back on the GPU O/C, they stopped using v1.57 and went back to v1.52! I hate comparing apples to oranges. Oh well, I will leave the stock settings in place for a while to see what happens, then start creeping the clocks up slowly.

Ron

Keith Myers
Joined: 11 Feb 11
Posts: 4992
Credit: 18842788624
RAC: 5885809


Is there a document or thread somewhere that explains how to read the stderr.txt file for E@H tasks? I know how to read the output from my other projects, but I can make no sense of the reported result for E@H tasks that have been declared invalid. I have tried comparing my result with the validated result and the output looks identical, so I must be missing something in the file. I have produced invalid results on this computer, 12291110. Not every task is invalid, just some. I am wondering if these old GTX 670s can't hack processing E@H alongside other projects on the same cards at the same time. My GTX 970 machines have no problems.

Keith Myers
Joined: 11 Feb 11
Posts: 4992
Credit: 18842788624
RAC: 5885809


Quote:

Hi Gary,

Looking over a couple of the results, it seems that these errors come from the wingman running a different platform and a different app version.
Just a suggestion from a first look over about 10 WUs.

BR

DMmdL

I am noticing this also. My invalids are either against a different platform or, more often, arise because I am running Beta apps whose results are marked invalid against stock apps.
