What's your experience with this driver? I removed 16.40, which was working fine. This 16.50 started producing a vast amount of errors, with tasks always crashing at about 7 seconds. I downgraded back to 16.40 and it continues to run fine. Strangely, a second host has managed to avoid the wave of errors even while running 16.50.
Can you point out which host is failing and what hardware each host has?
edit: see also https://einsteinathome.org/content/gamma-ray-pulsar-binary-search-1-gpus#comment-152652
Okay, I'm sorry for not being more precise earlier. Both hosts have pretty much the same hardware: the same motherboard platform and GPU model (R9 390, though from different board manufacturers).
Host #12462839 : https://einsteinathome.org/host/12462839
Host #12469906 : https://einsteinathome.org/host/12469906
Both run Mint 18 and originally had kernel 4.4.0-51 and AMD driver 16.40. Both had already been running FGRPB1G tasks at 3x without problems (only a few errors, no signs of a systematic problem).
Yesterday I applied the same procedure to both hosts: updated the kernel to 4.4.0-53 and the AMD driver to 16.50.
I kept an eye on both for some time afterwards and didn't notice anything worrying, but obviously I missed something. I went to sleep. When I woke up today, I uploaded tasks from both hosts and noticed plenty of fresh errors from #12469906 : https://einsteinathome.org/host/12469906/tasks/error . Watching it live, I saw it trashing many tasks right at the beginning (~7 seconds).
So the first one survived my update circus but the other one didn't. Today I downgraded the AMD driver back to 16.40 on #12469906. It has since been crunching with kernel 4.4.0-53 + AMD 16.40 and tasks haven't errored out anymore (at least not on that abnormal scale). Interestingly, not every task errored out with 16.50; there were some successful and validated results too. But for some reason the GPU wasn't happy running 3 tasks in parallel with driver 16.50. I know that isn't the only way to run them, but I wanted to do it that way because I had already seen it was possible.
Richie_9 wrote:I woke up
OK I had a look around.
I notice quite a difference in integer ops speed between the two hosts. I think I have seen this before for some reason, but I can't recall what it was. It may not be related.
The errored tasks show a Signal 8 (which I'm guessing is SIGFPE, floating point exception), so there may be a core file floating around to debug.
This is a little concerning.
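As a quick sanity check (my own addition, not from the thread), the signal number can be mapped to its name from a shell, and core dumps enabled for later inspection; the application path below is purely illustrative:

```shell
# Map signal number 8 to its name -- on Linux this is FPE
# (SIGFPE, floating point exception), matching the guess above.
kill -l 8

# Allow core files in this shell so a future crash leaves one to inspect.
ulimit -c unlimited

# Illustrative only: open the core file with gdb and get a backtrace.
# gdb /path/to/einstein_gpu_app core
# (gdb) bt
```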
I wonder if it is a memory-leak type of problem: the first tasks are OK but then things degrade. I can't see it easily from the web view; grepping the event log might be easiest. Look for tasks starting, finishing and failing.
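For example, something like this (a sketch with a made-up log excerpt; the exact message wording is typical of the BOINC client event log but worth checking on your own setup):

```shell
# Create a small fake event log purely for illustration.
cat > /tmp/sample_event_log.txt <<'EOF'
14-Dec-2016 02:11:03 [Einstein@Home] Starting task LATeah0011L_1234
14-Dec-2016 02:13:10 [Einstein@Home] Computation for task LATeah0011L_1234 finished
14-Dec-2016 02:13:12 [Einstein@Home] Starting task LATeah0011L_5678
14-Dec-2016 02:13:19 [Einstein@Home] Task LATeah0011L_5678 exited with zero status but no 'finished' file
EOF

# Pull out just the start/finish/failure lines to see where errors cluster.
grep -E "Starting task|finished|exited|error" /tmp/sample_event_log.txt
```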
I think you might want to try running at 1x. I've not seen great stability on BRP4G at 2x or higher, and the performance improvement was not great.
Good luck!
Thanks for sharing those thoughts. I decided to give the problematic host and driver 16.50 a new chance after the local tasks have been crunched. I changed the GPU utilization factor to 0.50, so it will take effect later today.
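For reference, a GPU utilization factor of 0.50 means two tasks per GPU. The same concurrency can also be forced client-side with an app_config.xml in the project directory; a minimal sketch below, where the app name hsgamma_FGRPB1G is my assumption for the FGRPB1G application and should be verified against client_state.xml:

```xml
<app_config>
  <app>
    <!-- assumed app name; verify in client_state.xml -->
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <!-- 0.5 GPU per task = 2 tasks running per GPU -->
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```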
The integer ops speeds were strangely different; I remember thinking that was a bit odd. But I came to the conclusion that the slower host had just been doing something else quite heavily at the very moment the CPU benchmark was running. I re-ran the benchmark and now the numbers are almost identical (as they should be, because the real clock speeds and settings are the same).
On a larger scale, those CPU benchmark results also depend heavily on the computing preference "Use at most X % of the CPUs". I've seen that if I allow only one CPU thread, the result is something like 5600/22000, but if I allow all threads (as I did now) the result is more like 4000/13000.
PS. Another oddity outside this topic is the "daily quota". It too depends on how many threads are allowed by "Use at most X % of the CPUs", even when running only GPU tasks. If I put 10% in that box, the maximum daily quota is several times lower than if I set it to 100%, and this applies even to hosts running only GPU tasks.
I forgot to mention that I still prefer running at least 2x, and I have an explanation for that. Those hosts get somewhat cool supply air from time to time. Running at 1x causes the fans on the GPU to slow down every time a task completes and a new one begins. That moment lasts only a few seconds, but I can hear it. If I run at 2x and stagger the tasks so that they don't complete at the same time, there will always be one task running. That keeps the fans running at a more constant speed (they do, I can hear it), which I believe results in fewer thermal changes and less thermal stress on the hardware.
I notice the errors stopped on Dec 14. Was it the change from 3x to 2x that solved the problem?
AgentB wrote:I notice the
Yes. I just uploaded the latest set of results, and there haven't been any errors since that change.