Problem with an R9 390X when running 2 or more WUs at a time.

juan BFP
Joined: 18 Nov 11
Posts: 839
Credit: 421,443,712
RAC: 0

Maybe this could help

The WU crunches normally for the usual time; the problem apparently appears only at the end of the process.

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 34,005,107,084
RAC: 7,602,828

Hi Keith,

Quote:
After looking at your invalid tasks, I would guess that there is some sort of incompatibility or integration problem between the Einstein applications, the ATI drivers and the OpenCL implementation. It looks like a lack of card resources when running 2X that immediately throws an exception handling event at task startup. I think the Einstein developers need to look closely at this. I don't believe it has anything to do with one specific host. The first thing I would do is update to the latest 7.6.6 BOINC Manager, since it includes some specific fixes made for the SETI and MW projects to prevent invalids. It might help with Einstein. The second would be to set some of the debug flags in the cc_config file using the BM interface. I would set coproc_debug, mem_usage_debug, checkpoint_debug, statefile_debug and task_debug. Then post the log results for an invalidated task so we can figure out just what the application or BOINC is complaining about that causes the invalid.


I don't think the problem lies with the BOINC Manager version; I already have several machines on 7.6.6... I also don't think there's a card/system resource issue at play here. The 'Activated exception handling' message in the stderr output is informational only and not a sign of error; all Windows hosts display this line (my Linux hosts do not). Take a look at the output from your own (valid) tasks and you will find that line :-) I also don't believe setting debug flags will help in this instance because, as juan BFP states, the tasks run normally to completion. The validate server then decides the result is rubbish and marks it as an error... From my end all has proceeded correctly.

In answer to John's (chase1902) question, none of the cards applicable to this thread are overclocked beyond factory settings, and the same is true for all but two of my cards (non-'X' 280s running OC'd under Kubuntu with tasks x4 without issue for many months).

Gav.

chase1902
Joined: 13 Aug 11
Posts: 37
Credit: 1,264,094,642
RAC: 0

Gav,
Sorry, my question about overclocking wasn't really related to the thread; I probably should have said that.
I'm just interested in getting better throughput. My systems are in desperate need of updating, cleaning, etc., and at the same time I could do with optimizing them a bit better.
John

Keith Myers
Joined: 11 Feb 11
Posts: 1,062
Credit: 1,172,671,041
RAC: 2,753,886

Quote:

Hi Keith,

I don't think the problem lies with the BOINC Manager version; I already have several machines on 7.6.6... I also don't think there's a card/system resource issue at play here. The 'Activated exception handling' message in the stderr output is informational only and not a sign of error; all Windows hosts display this line (my Linux hosts do not). Take a look at the output from your own (valid) tasks and you will find that line :-) I also don't believe setting debug flags will help in this instance because, as juan BFP states, the tasks run normally to completion. The validate server then decides the result is rubbish and marks it as an error... From my end all has proceeded correctly.

In answer to John's (chase1902) question, none of the cards applicable to this thread are overclocked beyond factory settings, and the same is true for all but two of my cards (non-'X' 280s running OC'd under Kubuntu with tasks x4 without issue for many months).

Gav.

Yes, I agree. I should have looked at my own valid tasks and noticed that all tasks get the exception message. The fact that the task apparently runs normally to completion, and only at the end does the server decide it is invalid, lends credence to my suspicion that the stderr.txt output is getting truncated or not closed correctly at the time of reporting. This is the issue we fought for over half a year at SETI and MilkyWay. The new 7.6.6 BOINC client was created to resolve it.

I would suggest Juan at least try the update, even though you state you are running 7.6.6 BM and seeing the issue on your own machines, Gavin. I would still like to see the logfile output for a failed task. I would also set the slot_debug flag because it helps show just how a task gets moved into and out of a slot. The issue with MW and SETI was that a slot wasn't getting cleared of its previous occupant before a new task was assigned to it, which corrupted the stderr.txt file at MW. The new 7.6.6 client added some code to make sure that Windows had enough time to close out the files properly.

You can read about it in my "What is the cause of these 'validate errors'" thread at MilkyWay@Home and in the "Stderr Truncations" thread over at SETI@Home. I strongly suggest the new 7.6.6 client first.

juan BFP
Joined: 18 Nov 11
Posts: 839
Credit: 421,443,712
RAC: 0

Quote:
I strongly suggest the new 7.6.6 client first.


Thanks, will do that ASAP. The computer is at a remote location.

juan BFP
Joined: 18 Nov 11
Posts: 839
Credit: 421,443,712
RAC: 0

Now running 2 at a time with 7.6.9. Let's see if the error really disappears.

Keith Myers
Joined: 11 Feb 11
Posts: 1,062
Credit: 1,172,671,041
RAC: 2,753,886

Do let us know how you make out with the 7.6.9 client. It rolled up some of the corrections and additions for 7.6.6. Some questions about VBoxWrapper are the only thing outstanding.

juan BFP
Joined: 18 Nov 11
Posts: 839
Credit: 421,443,712
RAC: 0

Changed to BOINC 7.6.9 and the problem remains.

Keith Myers
Joined: 11 Feb 11
Posts: 1,062
Credit: 1,172,671,041
RAC: 2,753,886

Yeah, Juan, I took a look at your tasks today and saw the problem remains. The only thing left to do would be to set some of the cc_config.xml flags I suggested and post to the thread the resulting logfile entries for a task that was marked invalid. I'm no expert, but I do have experience with setting the flags and looking for problems in the way a task moves onto the GPU and off again for reporting, from troubleshooting my own invalids at MilkyWay, which led to the fix in the 7.6.6 client. Once you post the logfile I would like to entice Richard Haselgrove and Jason Gee to look the logs over. They understand the nitty-gritty of how BOINC works and were instrumental in developing the fixes for 7.6.6 that David Anderson implemented.

Have you heard of anyone else having issues that exactly mirror your symptoms? There is always a chance that you have a real hardware problem that only rears its head when the card is stressed with more than one work unit. I recently had a CPU failure, and it took some time to diagnose why it was producing invalids. I would be interested in seeing the mem_usage_debug and coproc_debug results in the logfile.
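
For reference, here is a minimal sketch of turning those flags on by generating a cc_config.xml with Python. It assumes the standard BOINC log flag names (coproc_debug, mem_usage_debug, checkpoint_debug, statefile_debug, task_debug, slot_debug). The output path is only a placeholder: the real file belongs in the BOINC data directory, whose location varies by install. Back up any existing cc_config.xml first, since this sketch overwrites the file rather than merging into it.

```python
import xml.etree.ElementTree as ET

# Debug flags discussed in this thread (assumed to be the standard
# BOINC <log_flags> element names).
DEBUG_FLAGS = [
    "coproc_debug",
    "mem_usage_debug",
    "checkpoint_debug",
    "statefile_debug",
    "task_debug",
    "slot_debug",
]


def build_cc_config(flags):
    """Return cc_config.xml text with each given log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder path: write to the current directory, then copy the
    # file into the BOINC data directory by hand.
    with open("cc_config.xml", "w", encoding="utf-8") as fh:
        fh.write(build_cc_config(DEBUG_FLAGS))
```

After copying the file into place, restart the client or have it re-read its config files so the flags take effect, then turn them off again once the failing task has been logged, because these flags make the event log grow quickly.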

Cheers, Keith

Tom*
Joined: 9 Oct 11
Posts: 53
Credit: 244,789,217
RAC: 8,173

Hi Juan, long time.

One interesting thing in the stderr of the invalids is that the sumspec pages on some of them seem to increase way beyond what I would consider a normal value.

Some of the entries are over 4000 pages, while the successful valids (one at a time) are normal, around 500. If you are only running two at a time, I would not expect sumspec to be much over 1500.

At least that is the way it works on my HD 7950s.
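
The rule of thumb above can be written down as a quick check. This is a minimal Python sketch, assuming the page counts have already been copied out of each task's stderr by hand. The task names and counts are made up for illustration, and the 1500 limit is only the rough ceiling suggested above for running two at a time, not a value taken from the application.

```python
def flag_high_sumspec(page_counts, limit=1500):
    """Return the sorted names of tasks whose sumspec page count exceeds limit."""
    return sorted(name for name, pages in page_counts.items() if pages > limit)


# Hypothetical values for illustration only, not taken from real tasks.
sample = {"task_a": 510, "task_b": 4200, "task_c": 1450}
print(flag_high_sumspec(sample))
```

Comparing the flagged tasks against the list of invalids would show whether a high page count really tracks the failures.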

Glad you are back and posting from Panama; sorry you couldn't bring your GTX 690s with ya.

Tom* aka Bill
