Experiences running the GWopencl-ati-Beta app on Linux X86_64

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117254608613
RAC: 36208919

Quote:

[82025.609744] [fglrx] ASIC hang happened
....
....


I stuck that message into Google and found a lot of hits from people doing coin mining of one form or another :-).

However, I found this (quite unrelated) bug report quite interesting. There was a fair bit of discussion and some indication of a fix. There was also a report later in the exchanges (by quentin right near the end) that the hangs went away by turning off the screensaver.

I don't claim to understand any of this but I get the impression that there are lots of things that can cause these problems.

Cheers,
Gary.

Gary Roberts

Quote:
... I had nearly the same problem with an HD6950: when running a Window Manager, the hardware semaphores were screwed up.
It was easier to debug, since the full driver is open and can be debugged. It turned out this was some specific ASIC setup, and it was different from HD7xxx.
It was even different from non-Cayman-based HD6xxx chipsets.


I looked up ASIC in Wikipedia - quite a few meanings - but I guess that "application specific integrated circuit" best fits the bill :-). Also, from the normal meaning of "semaphores" in the English language, I'm getting the impression of different bits of hardware/software wanting to talk to each other and so trying to 'flag' each other. So I can understand that being stuck in sem_wait() means that something has gone wrong: something is waiting for an acknowledgement that is never given or is lost (never arrives).
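
To make the 'flagging' picture concrete, here is a minimal Python sketch. It is not the driver's code, and the names wait_for_ack, ack and lost are made up for illustration. One side waits on a semaphore for an acknowledgement. If the other side never posts it, a plain wait blocks forever, which is what being stuck in sem_wait() looks like. A timeout is used here only so that the example terminates.

```python
import threading

def wait_for_ack(ack: threading.Semaphore, timeout_s: float) -> bool:
    """Return True if the acknowledgement arrived, False if we gave up waiting."""
    # acquire() without a timeout would block forever if nobody calls release().
    return ack.acquire(timeout=timeout_s)

# Case 1: the other side posts the acknowledgement, so the wait returns.
ack = threading.Semaphore(0)
threading.Timer(0.05, ack.release).start()
print("ack received:", wait_for_ack(ack, timeout_s=1.0))

# Case 2: the acknowledgement is lost, so the wait ends only because of the timeout.
lost = threading.Semaphore(0)
print("ack received:", wait_for_ack(lost, timeout_s=0.2))
```

In the second case a wait without a timeout would never return, and the waiting thread would look exactly like a frozen task.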

Quote:
So I would go this way:
- collect all non working configurations (Chipsets & Window Manager)


Unfortunately, I don't have much of a range of chipsets. Apart from a sole HD7770, I have HD7850s - about 20 of them. I'm running KDE on everything so I've only ever used Kwin. I guess I could play with that (after learning how :-).).

It's interesting to note that since starting the current tests with PCLinuxOS (P) and Arch Linux (A), a task freeze has now occurred twice on P whilst A continues to run without problem - about 44 hours so far. I'm not on site so I can't attend to P at the moment - probably not until tomorrow. When I last restarted P (just after collecting the gdb info), I had a look at kernel messages and didn't find any "ASIC hang" reference. Note that A was set up to run from a basic xterm - no window manager - and it continues churning out results. If it keeps going, does that imply the problem is with Kwin?

Quote:
- collect ASIC errors reported by dmesg


Haven't found any yet but will keep looking.
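
For anyone wanting to automate the search, a small Python sketch along these lines could scan saved kernel messages (for example the output of dmesg redirected to a file) for the fglrx "ASIC hang" line quoted earlier in the thread. The function name find_asic_hangs and the first line of the sample text are invented for illustration. The pattern is only a guess based on that single quoted message, so other driver versions may word it differently.

```python
import re

# Matches lines such as "[82025.609744] [fglrx] ASIC hang happened".
ASIC_HANG = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+\[fglrx\]\s+ASIC hang", re.IGNORECASE)

def find_asic_hangs(log_text: str) -> list[float]:
    """Return the kernel timestamps (seconds since boot) of every ASIC hang line."""
    hits = []
    for line in log_text.splitlines():
        m = ASIC_HANG.match(line.strip())
        if m:
            hits.append(float(m.group("ts")))
    return hits

# Made-up sample: only the second line comes from the thread.
sample = """\
[82020.100000] usb 1-1: new high-speed USB device
[82025.609744] [fglrx] ASIC hang happened
"""
print(find_asic_hangs(sample))
```

An empty list means no matching line was found in the text that was scanned. That is not proof that no hang occurred, since older messages may already have rotated out of the kernel buffer.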

Quote:
Then submit the problem to developer.amd.com


OK.

Quote:
The elapsed time / used time is something that I have noticed. On my Debian, as you mentioned, this is almost the same. It means the CPU is always using a core at 100% and thus using a significant amount of electrical power.
I dug a bit and found that the usleep/nanosleep functions are frequently interrupted and thus wake up the CPU again and again.
Your distro seems to indicate that the Linux kernel is handling the wake-up differently. Probably the Linux timers are configured differently.
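
The difference between elapsed time and CPU time described in this quote can be illustrated with a short Python sketch. It only contrasts a sleeping wait with a polling wait in an ordinary process; it says nothing about how the Catalyst driver actually waits, and the helper names are made up.

```python
import time

def measure(wait) -> tuple[float, float]:
    """Run wait() and return (elapsed wall-clock seconds, CPU seconds used)."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    wait()
    return time.perf_counter() - wall0, time.process_time() - cpu0

def sleeping_wait(duration_s: float = 0.2) -> None:
    # The process is descheduled, so almost no CPU time is charged.
    time.sleep(duration_s)

def spinning_wait(duration_s: float = 0.2) -> None:
    # Polling in a tight loop keeps one core busy for the whole wait.
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        pass

for name, fn in (("sleep", sleeping_wait), ("spin", spinning_wait)):
    wall, cpu = measure(fn)
    print(f"{name}: elapsed {wall:.2f}s, CPU {cpu:.2f}s")
```

When CPU time is close to elapsed time, as with the spinning wait, the task is occupying a full core even though it is only waiting.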


The Linux kernels on both machines are quite recent - 3.12.16 on P and 3.14.2 on Arch. Is it worth finding an older kernel - something like what you are using?

Quote:
And that could explain a hang in the Catalyst driver waiting forever on a event that will never occur or was missed.


Would this 'waiting forever' always put an "ASIC hang" message in the logs?

Thanks very much for taking the time to provide guidance to the uninitiated :-).

Cheers,
Gary.

Gary Roberts

Just a short update. I'm on site and have reset P. A is still crunching without error - it's now getting close to 3 days without a freeze.

When I reset P, I did the same as with A - logged out of the full KDE desktop and set the failsafe mode and then logged back in to a blank screen with a single undecorated xterm. I exported the variable (as before) and started BOINC as a daemon.

Using BOINC Manager on a third machine, I can see that both machines are crunching normally.

Cheers,
Gary.

Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

I am still seeing ASIC hangs with the system I dedicated to running the GW Beta application. Yesterday, my system had 16 hours of run time before the hang occurred. Here is what I have tried so far:

1) Adjust kernel timer frequencies from 1000 Hz to 100 Hz. Test with dynamic ticks (NO_HZ) enabled and disabled. Test with BFS scheduler enabled and disabled.
2) Test with different kernel versions and AMD driver versions. I am currently running the latest available 14.4 version.
3) Underclock GPU and memory frequencies.
4) Reduce PCI-E from the 3.0 to the 2.0 specification.
5) Swap CPU with another identical CPU.
6) export GPU_MAX_ALLOC_PERCENT=100
7) Test with a more recent motherboard BIOS version.
8) CPU and memory stability testing.
9) Test with minimal resource usage window managers including Fluxbox and TWM.
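
When comparing machines, it can help to record the timer settings from item 1 alongside each test. The sketch below assumes the kernel configuration is available as text (commonly /boot/config-<version> or a decompressed /proc/config.gz, depending on the distro) and reads the CONFIG_HZ and CONFIG_NO_HZ options from it. The function names and the sample text are illustrative only.

```python
def parse_kernel_config(text: str) -> dict[str, str]:
    """Map CONFIG_* option names to their values, skipping comments and blanks."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# CONFIG_FOO is not set" means the option is disabled
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options

def timer_settings(options: dict[str, str]) -> dict[str, object]:
    """Pull out the tick frequency and whether dynamic ticks are enabled."""
    return {
        "hz": int(options["CONFIG_HZ"]) if "CONFIG_HZ" in options else None,
        "dynamic_ticks": options.get("CONFIG_NO_HZ") == "y",
    }

# Made-up sample of a 100 Hz kernel with dynamic ticks enabled.
sample = """\
CONFIG_HZ_100=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
CONFIG_NO_HZ=y
"""
print(timer_settings(parse_kernel_config(sample)))
```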

Today I swapped the two Gigabyte 7970 GPUs running GW Search with two MSI 7970 GPUs from another system that are running BRP5. This is to help rule out faulty GPUs.

Gary Roberts

A further update. The A machine (after almost 4 days) has now generated a GPU task freeze. The kernel logs show no ASIC hang messages. I ran gdb, but a bit differently from the last time (on the P machine). I used an ssh session from another machine and captured the entire output to a file. A lot easier than what I did last time.

gdb attach 9384
GNU gdb (GDB) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
attach: No such file or directory.
Attaching to process 9384
Reading symbols from /home/gary/BOINC/projects/einstein.phys.uwm.edu/einstein_S6CasA_1.08_x86_64-pc-linux-gnu__GWopencl-ati-Beta...done.

warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
Reading symbols from /usr/lib/libOpenCL.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libOpenCL.so.1
Reading symbols from /usr/lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libdl.so.2
Reading symbols from /usr/lib/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 9386]
[New LWP 9385]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Loaded symbols for /usr/lib/libpthread.so.0
Reading symbols from /usr/lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libm.so.6
Reading symbols from /usr/lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libgcc_s.so.1
Reading symbols from /usr/lib/libamdocl64.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libamdocl64.so
Reading symbols from /usr/lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/librt.so.1
Reading symbols from /usr/lib/libatiuki.so.1...done.
Loaded symbols for /usr/lib/libatiuki.so.1
Reading symbols from /usr/lib/libX11.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libX11.so.6
Reading symbols from /usr/lib/libxcb.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libxcb.so.1
Reading symbols from /usr/lib/libXau.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libXau.so.6
Reading symbols from /usr/lib/libXdmcp.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libXdmcp.so.6
Reading symbols from /usr/lib/libXext.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libXext.so.6
Reading symbols from /usr/lib/libatiadlxx.so...done.
Loaded symbols for /usr/lib/libatiadlxx.so
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /usr/lib/libXinerama.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libXinerama.so.1
Reading symbols from /usr/lib/libGL.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libGL.so.1
0x00007f73ba20cb90 in sem_wait () from /usr/lib/libpthread.so.0
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007f73ba20cb90 in sem_wait () from /usr/lib/libpthread.so.0
(gdb) bt
#0 0x00007f73ba20cb90 in sem_wait () from /usr/lib/libpthread.so.0
#1 0x00007f73b74e5030 in ?? () from /usr/lib/libamdocl64.so
#2 0x00007f73b74e17cf in ?? () from /usr/lib/libamdocl64.so
#3 0x00007f73b74cdfe0 in ?? () from /usr/lib/libamdocl64.so
#4 0x00007f73b74ce74c in ?? () from /usr/lib/libamdocl64.so
#5 0x00007f73b74a60c7 in clFinish () from /usr/lib/libamdocl64.so
#6 0x0000000000422968 in opencl_FstatisticsLoop (doComputeFstats=)
at /home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/LINUX64-OPENCL/TARGET/linux-opencl/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/GCT_opencl.c:1155
#7 0x000000000040ee68 in MAIN (argc=, argv=)
at /home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/LINUX64-OPENCL/TARGET/linux-opencl/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:1525
#8 0x000000000041f041 in worker ()
at /home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/LINUX64-OPENCL/TARGET/linux-opencl/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/hough/src2/EinsteinAtHome/hs_boinc_extras.c:1223
#9 main (argc=, argv=0x0)
at /home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/LINUX64-OPENCL/TARGET/linux-opencl/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/hough/src2/EinsteinAtHome/hs_boinc_extras.c:1532
(gdb)

Looks pretty similar to the one from P. There's a small difference in the line number in GCT_opencl.c - don't know if that's significant - I assume probably not. I'm not a programmer so I don't see any point in trying to look for those lines in the code. The P machine is having its own set of issues as reported here.
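
For repeat captures, the attach-and-backtrace step could be scripted. The sketch below builds a non-interactive gdb command with the standard -p, -batch and -ex options and writes the output to a file. The function names and the output path are placeholders, and PID 9384 is simply the one from the session above. As far as I can tell, the "attach: No such file or directory" line in that log appears because gdb read the word "attach" as a program file name before attaching to the PID; gdb -p PID avoids that.

```python
import shlex
import subprocess

def gdb_backtrace_command(pid: int) -> list[str]:
    """Build a gdb command that attaches to pid, prints all thread backtraces, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]

def capture_backtrace(pid: int, out_path: str) -> int:
    """Run gdb against pid and save everything it prints to out_path."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_backtrace_command(pid), stdout=out, stderr=subprocess.STDOUT
        )
    return result.returncode

# Show the command only; call capture_backtrace(pid, "bt.txt") on the stuck machine.
print(shlex.join(gdb_backtrace_command(9384)))
```

Attaching normally requires running as the same user as the task (or as root), and gdb detaches again when the batch run finishes.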

Cheers,
Gary.

choks
Joined: 24 Feb 05
Posts: 16
Credit: 145440373
RAC: 81771

This is just in another OpenCL kernel. The code is waiting forever for a GPU kernel call to finish.
The GW code is designed to break the GPU calls into small pieces to avoid lags in the mouse motion or sound.
A kernel should execute within a second (20 - 200 ms typical).
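
The idea of splitting the work into short calls can be sketched in Python as follows. This is not the GW application's code: chunk_ranges, run_in_chunks and the work callback are hypothetical stand-ins, with work(start, stop) playing the role of one short GPU kernel launch followed by a wait for it to finish.

```python
import time

def chunk_ranges(total: int, chunk: int):
    """Yield (start, stop) index pairs covering range(total) in pieces of at most chunk."""
    for start in range(0, total, chunk):
        yield start, min(start + chunk, total)

def run_in_chunks(total: int, chunk: int, work, budget_s: float = 1.0) -> list[tuple[int, int]]:
    """Run work(start, stop) per chunk; return the chunks that exceeded the time budget."""
    slow = []
    for start, stop in chunk_ranges(total, chunk):
        t0 = time.perf_counter()
        work(start, stop)  # stand-in for one short GPU kernel launch plus a wait for it
        if time.perf_counter() - t0 > budget_s:
            slow.append((start, stop))
    return slow

print(list(chunk_ranges(10, 4)))
```

A check like this can only flag a piece that finishes late. A piece that never returns, as in the backtraces in this thread, would need a separate watchdog running outside the waiting thread.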

I'm running quite an old kernel: 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 x86_64 GNU/Linux. This is the default kernel for Debian stable.
No screen saver.

I don't know what kernel version you have, but the release notes say up to 3.13 for the 14.4 Release Candidate.
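
As a quick check against that limit, the kernel versions mentioned in this thread can be compared with 3.13 in a few lines of Python. The sketch assumes that "up to 3.13" means any 3.13.x or earlier kernel, which is an interpretation rather than something stated in the release notes, and the helper names are made up.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '3.14.2' into (3, 14, 2); anything after a '-' suffix is ignored."""
    return tuple(int(part) for part in text.split("-")[0].split("."))

def within_supported(kernel: str, max_supported: str) -> bool:
    """Compare only as many components as max_supported has (e.g. major.minor)."""
    limit = parse_version(max_supported)
    return parse_version(kernel)[: len(limit)] <= limit

# Kernels mentioned in the thread: Debian stable, PCLinuxOS (P) and Arch (A).
for kernel in ("3.2.0-3-amd64", "3.12.16", "3.14.2"):
    print(kernel, within_supported(kernel, "3.13"))
```

Under that reading, only the 3.14.2 kernel on the Arch machine falls outside the range named in the release notes.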

Cheers,
Christophe

Jeroen

With the MSI 7970 cards I installed to replace the Gigabyte 7970 cards, I just recently had another ASIC hang. The runtime was a bit under 40 hours before the hang occurred.

Jeroen

This evening, I have been testing the GW Search application with an R9-290x in Linux. The tasks were running normally for several hours. Then I saw a task that started to run very slowly: 2 hours of runtime had elapsed and the task was at approximately 95% complete. When I tried to suspend the task, it was listed as a defunct process and BOINC failed to start the next task. A reboot was required for normal operation. This is similar to the issue I saw with the four 7970 cards and the GW Search application that I tested with in Linux. However, dmesg also reported the ASIC hang with the 7970 cards. I am running a slightly older driver version and a kernel build of 3.12.19 with the R9-290x. I tested the 7970 cards with driver versions ranging from 13.10 to 14.4 and kernel versions ranging from 3.12.2 to 3.14.4.

If I get a chance, I will test the 7970 and GW Search application with an older driver and kernel.

jay
Joined: 25 Jan 07
Posts: 99
Credit: 84044023
RAC: 0

Greetings,

I had the beta test flag set to yes...
[Run beta/test application versions?
This helps us develop applications, but it may cause jobs to fail on your computer.]

I have a Radeon 7750, running Ubuntu Linux..

Had the ASIC hangs last week, or so....

Started looking at syslog and saw mutex errors.

They start to increase; then, ASIC crash.

So, when I see mutex errors start up, I suspend all work, reboot and restart. So far, no more ASIC hang.

I apologize,
I just posted the notes from the Mutex hang in the Einstein bugs forum
http://einsteinathome.org/node/197601

And of course, I had TWO terrible misspellings in the title.
I'll have humble pie tonight.

Hope this helps.

Jay

--edit--
Now that I have re-read this thread, I can say that there may be no connection between the mutex error and the ASIC crash - I was just suspicious.

Jeroen

I recently updated one of my systems to the AMD 14.6 beta driver. The system has dual AMD cards. I set up one card to run GW Search and the other card to run FGRP Search. This is with two separate BOINC clients running, since I had trouble in the past with keeping the GPUs fed with work while using the option.

My system has had approximately three days of runtime with GW Search and there has not been an ASIC hang yet. 281 tasks have validated and 187 tasks are pending validation with no tasks marked as invalid or in an error state. In the past, I had hundreds of tasks fail validation when cross-validating against a CPU but that is not happening currently. Everything looks good at the moment.

The only other thing I noticed is that the tasks run slower in Linux than Windows when using the 1.0 GPU utilization factor. The runtime per task that I have seen is averaged at 522 seconds in Linux compared to 463 seconds in Windows using the same hardware configuration. When running the 0.5 GPU utilization factor, the runtime is similar between the two operating systems.
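
For reference, the size of that gap at the 1.0 utilization factor can be worked out from the two averages quoted above. The helper name below is made up; the figures are the ones from this post.

```python
def percent_slower(slow_s: float, fast_s: float) -> float:
    """How much longer slow_s is than fast_s, as a percentage of fast_s."""
    return (slow_s - fast_s) / fast_s * 100.0

# 522 s per task on Linux versus 463 s on Windows, same hardware.
print(f"{percent_slower(522, 463):.1f}% longer per task on Linux")
```

That works out to roughly 12.7% more runtime per task on Linux for this configuration.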

choks, thank you for the work that you have done with the GW Search application.
