What's your experience with this driver? I removed 16.40, which was working fine. This 16.50 started producing a vast amount of errors, with tasks always crashing at about 7 seconds. I downgraded back to 16.40 and it continues to run fine. Strangely, a second host has managed to avoid the wave of errors even while running 16.50.
Can you point out which host is failing and what hardware each host has?
edit: see also https://einsteinathome.org/content/gamma-ray-pulsar-binary-search-1-gpus#comment-152652
Okay, I'm sorry for not being more precise earlier. Both hosts have pretty much the same hardware: the same motherboard platform and GPU model (R9 390, though from different board manufacturers).
Host #12462839 : https://einsteinathome.org/host/12462839
Host #12469906 : https://einsteinathome.org/host/12469906
Both run Mint 18 and originally had kernel 4.4.0-51 and AMD driver 16.40. Both had already been running FGRPB1G tasks at 3x without problems (only a few errors, no signs of a systematic problem).
Yesterday I applied the same procedure to both hosts: updated the kernel to 4.4.0-53 and the AMD driver to 16.50.
I kept an eye on both for some time afterwards and didn't notice anything worrying, but obviously I missed something. I went to sleep. When I woke up today, I uploaded tasks from both hosts and noticed plenty of fresh errors from #12469906 : https://einsteinathome.org/host/12469906/tasks/error . Watching it live, I saw it trashing many tasks right at the beginning (~7 seconds).
So the first one survived my update circus but the other one didn't. Today I downgraded the AMD driver back to 16.40 on #12469906. It has since been crunching with kernel 4.4.0-53 + AMD 16.40 and tasks haven't errored out anymore (at least not on that abnormal scale). Interestingly, not every task errored out with 16.50; there were some successful and validated results too. But for some reason the GPU wasn't happy running 3 tasks in parallel with driver 16.50. I know that isn't the only way to run them, but I wanted to do it that way because I had already seen it was possible.
Richie_9 wrote:I woke up
OK I had a look around.
I notice quite a difference in integer ops speed between the two hosts. I think I have seen this before for some reason, but I can't recall what it was. It may not be related.
The errored tasks show a Signal 8 (which I'm guessing is SIGFPE, floating point exception), so there may be a core file floating around to debug.
This is a little concerning.
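As a quick sanity check (my own addition, not from the thread), the signal number can be mapped to its name from a shell, and core dumps enabled for later inspection; the application path below is purely illustrative:

```shell
# Map signal number 8 to its name -- on Linux this is FPE
# (SIGFPE, floating point exception), matching the guess above.
kill -l 8

# Allow core files in this shell so a future crash leaves one to inspect.
ulimit -c unlimited

# Illustrative only: open the core file with gdb and get a backtrace.
# gdb /path/to/einstein_gpu_app core
# (gdb) bt
```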
I wonder if it is a memory-leak type of problem: the first tasks are OK but then things degrade. I can't see it easily from the web view; grepping the event log might be easiest. Look for tasks starting, finishing and failing.
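For example, something like this (a sketch with a made-up log excerpt; the exact message wording is typical of the BOINC client event log but worth checking on your own setup):

```shell
# Create a small fake event log purely for illustration.
cat > /tmp/sample_event_log.txt <<'EOF'
14-Dec-2016 02:11:03 [Einstein@Home] Starting task LATeah0011L_1234
14-Dec-2016 02:13:10 [Einstein@Home] Computation for task LATeah0011L_1234 finished
14-Dec-2016 02:13:12 [Einstein@Home] Starting task LATeah0011L_5678
14-Dec-2016 02:13:19 [Einstein@Home] Task LATeah0011L_5678 exited with zero status but no 'finished' file
EOF

# Pull out just the start/finish/failure lines to see where errors cluster.
grep -E "Starting task|finished|exited|error" /tmp/sample_event_log.txt
```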
I think you might want to try running at 1x. I've not seen great stability on BRP4G at 2x or higher, and the performance improvement was not great.
Good luck!
Thanks for sharing those thoughts. I decided to give the problematic host and driver 16.50 a new chance after the local tasks have been crunched. I changed the GPU utilization factor to 0.50, so it will take effect later today.
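For reference, a GPU utilization factor of 0.50 means two tasks per GPU. The same concurrency can also be forced client-side with an app_config.xml in the project directory; a minimal sketch below, where the app name hsgamma_FGRPB1G is my assumption for the FGRPB1G application and should be verified against client_state.xml:

```xml
<app_config>
  <app>
    <!-- assumed app name; verify in client_state.xml -->
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <!-- 0.5 GPU per task = 2 tasks running per GPU -->
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```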
The integer ops speeds were strangely different; I remember thinking that was a bit odd. But I came to the conclusion that the slower host had just been doing something else quite heavily at the very moment the CPU benchmark was running. I re-ran the benchmark and now the numbers are almost identical (as they should be, because the real clock speeds and settings are the same).
On a larger scale, those CPU benchmark results also depend heavily on the computing preference "Use at most X % of the CPUs". I've seen that if I allow only one CPU thread, the result is something like 5600/22000, but if I allow all threads (as I did now) the result is more like 4000/13000.
PS. Another oddity outside this topic is the "daily quota". It too depends on how many threads are allowed by "Use at most X % of the CPUs", even when running only GPU tasks. If I put 10% in that box, the maximum daily quota is several times lower than if I set it to 100%, and this applies even to hosts running only GPU tasks.
I forgot to mention that I still prefer running at least 2x, and I have an explanation for that. Those hosts get somewhat cool supply air from time to time. Running at 1x causes the fans on the GPU to slow down every time a task completes and a new one begins. That moment lasts only a few seconds, but I can hear it. If I run at 2x and stagger the tasks so that they don't complete at the same time, there will always be one task running. That keeps the fans running at a more constant speed (they do, I can hear it), which I believe results in fewer thermal changes and less thermal stress on the hardware.
I notice the errors stopped on Dec 14. Was it the change from 3x to 2x that solved the problem?
AgentB wrote:I notice the
Yes. I just uploaded the latest set of results, and there haven't been any errors since that change.