Experiences running the GWopencl-ati-Beta app on Linux X86_64

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117250755305

RAC: 36217432

6 May 2014 10:27:24 UTC

Topic 197574


I don't think the News thread is the place for detailed feedback so I'm starting a new thread. I've done a fair bit of work trying to pinpoint the cause of a particular issue I'm seeing. I haven't been successful so far so it's time to spell out where I've got to and hope that somebody might have some suggestions as to what to try next.

I decided to set up two machines to participate in the beta test soon after the V1.08 app was released. The machines are pretty much hardware identical, same mobo, CPU, RAM and GPU on each. Both have a Q8400 quad core CPU and 2x2GB DDR2 RAM and both had been running E@H CPU tasks for several years without issue. Late last year, I added the POGS project to most of my CPU only hosts, and some GPU hosts. Both these hosts had E@H (FGRP3) and POGS.

Both machines were updated with brand new MSI HD7850 GPUs for the test. Before installing the GPUs, the machines had been running PCLinuxOS 32 bit. PCLinuxOS has only put out a 64 bit version within the last year. It's a rolling release and the 32 bit version has worked well for me for many years. The 64 bit version had spent a long time in testing so I felt very comfortable giving it a go, even though I'd never previously used a 64 bit OS.

Basically, I stopped BOINC, inserted the new graphics cards, loaded the 64 bit OS from a live USB and installed it 'over the top' of the existing root partition. All the user files and configuration are on a separate /home partition and do not get disturbed by the installation. The hardware was properly detected and there was no problem rebooting to a full KDE desktop after the install had completed. I did a full upgrade of the system - very little was needed as the iso image I used was quite recent. I also made sure I was using the most recent Catalyst driver available in the repo (13.12) and that the OpenCL libs were properly installed. I downloaded and installed the 64 bit BOINC V7.2.42 from Berkeley and installed it 'over the top' of the 32 bit version I had been using. As a precaution, I tested all BOINC and project executables with ldd to make sure there were no missing shared libs.

I was able to start BOINC just fine and the GPUs were properly detected. Before shutting down I had set NNT on all projects so the existing CPU tasks started up from where they had left off with the GPU remaining idle. Both machines had been doing E@H and POGS tasks beforehand and there were 4 partly completed tasks ready to be restarted. I wanted to allow a partly completed task or two to finish and be reported before getting any GPU tasks so I now had a bit of time to plan out a 'venues' rearrangement to allow only these two hosts to participate in the beta test. Once that was all sorted, I set very small work caches and allowed the first host to get a small number of new GPU tasks. Since the set of S6CasA data files needed for just a single task is so large, I saved them all and populated the second host with the same data. It's quite easy to seed the state file of the 2nd host with the blocks from the first host so that the 2nd host asks for work from the same frequency set and so avoids a big download. My plan was to run POGS for a while on the available CPU cores whilst testing out the Beta tasks on the GPU.

Both hosts started crunching GWopencl-ati-Beta tasks without apparent issue. The first host was crunching tasks singly in about 17 mins. I set the second host up to crunch 2x. I found I needed to reserve a full CPU per GPU task or else crunching slowed right down. With 2 cores reserved, the GPU was crunching 2x in around 25 mins. I even tried 3x with 3 free cores and that was taking around 33 mins. So after an hour or two of playing around, I left one machine running 2x and the other running 3x. During this playing around period, quite a few tasks were completed and since a lot of those were 'resends', a lot were validated as well. Things were looking very good.
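As a rough check on whether running tasks concurrently pays off, the timings above can be converted to tasks per hour. This is only a back-of-the-envelope sketch; the function name is illustrative and the inputs are the approximate run times quoted in this post.

```python
def tasks_per_hour(concurrent_tasks, minutes_per_batch):
    """Throughput when `concurrent_tasks` tasks finish together every `minutes_per_batch` minutes."""
    return concurrent_tasks * 60.0 / minutes_per_batch

# Approximate timings quoted above: 1x in ~17 min, 2x in ~25 min, 3x in ~33 min.
for n, minutes in [(1, 17), (2, 25), (3, 33)]:
    print(f"{n}x: {tasks_per_hour(n, minutes):.2f} tasks/hour")
```

On these figures 2x gives roughly 36% more throughput than 1x, and 3x roughly 55% more, at the cost of one reserved CPU core per GPU task.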

The 'issue' showed up next morning when I went to check things. Both machines had stopped crunching GPU tasks in the middle of the night, some hours after I had left them. My immediate reaction was to suspect running multiple tasks concurrently. Both machines were running fine and CPU tasks were running fine. The machines were uploading finished tasks and downloading new work. On the machine running 3x, 2 GPU tasks were frozen and the third was showing a time of several hours and still ticking over. The % progress was 99.xxx%. The 2x machine was similar except that only one task was frozen.

I decided I would stop BOINC, wait a bit, and then restart. BOINC runs as a daemon and I have icons on the desktop for stopping and starting. The stop icon executes 'boinccmd --quit'. A small fraction of a second after clicking that icon, the whole system froze. The ONLY thing that would revive it was the 'reset' button. After rebooting, I could start BOINC normally. CPU tasks would restart from saved checkpoints. The GPU task that was at 99.xxx% would restart from the beginning. The frozen in-progress GPU tasks would restart from their saved checkpoints. No tasks were lost or damaged in any way.

I decided to let things run the way they were but after a period of just a few hours of crunching, both had frozen again in exactly the same manner. By checking in the slots directories, I was able to satisfy myself that the problem was occurring immediately after the start of a new GPU task. Because BOINC now has this 'fake progress' scheme where progress is simulated until a checkpoint is written, the GPU task that causes the freeze keeps counting away while waiting for that first checkpoint that is never going to come. At first I couldn't figure out why the task with such a high progress % was resetting right back to the start.

I changed each app_config.xml so that tasks would run singly. This has not solved the issue but it seems to make a difference. When running concurrent tasks, the problem seems to occur after about 4 - 6 hours. When running singly, the uptime seems to be around 10-12 hours or so, and sometimes a bit longer. I've quite recently started wondering if the behaviour is due to the exhaustion of some GPU resource after a certain number of tasks are completed. I've started noticing that about 40-50 tasks get completed between freezes.
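For readers who have not used app_config.xml, here is a minimal sketch of generating one that reserves a full CPU core per GPU task. The app name below is a placeholder (the real name must match what the project reports for this app), and the helper function is hypothetical; only the app_config/app/gpu_versions layout follows BOINC's documented format.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_name, tasks_per_gpu, cpus_per_task=1.0):
    """Return app_config.xml text that runs `tasks_per_gpu` tasks on each GPU."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    # gpu_usage is the fraction of a GPU each task claims: 1.00 -> 1x, 0.50 -> 2x.
    ET.SubElement(gpu, "gpu_usage").text = f"{1.0 / tasks_per_gpu:.2f}"
    # cpu_usage is the number of CPU cores budgeted for each GPU task.
    ET.SubElement(gpu, "cpu_usage").text = f"{cpus_per_task:.2f}"
    return ET.tostring(root, encoding="unicode")

# "APP_NAME_PLACEHOLDER" is hypothetical; substitute the project's real app name.
print(build_app_config("APP_NAME_PLACEHOLDER", tasks_per_gpu=1))
```

Calling it with tasks_per_gpu=2 or 3 would give the 2x and 3x setups described earlier.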

Once I'd established that both machines were suffering exactly the same behaviour even when running tasks singly, I decided to see if it was anything to do with the OS or driver. I decided to find a totally different 64 bit distro and set that up on one of the machines. I did quite a bit of reading and decided I would try Arch Linux. I'd always wanted to try the 'build your own individual system completely from scratch' approach rather than the 'easy' way of the more popular Ubuntu/Mint type distros.

I'll spare you the tedious details but about half way through I was beginning to question my sanity :-). However, I now have a fully functioning system complete with the latest KDE desktop and the latest catalyst drivers and OpenCL libs. I'm really impressed with how good the Arch Wiki is for covering everything you need to know. It's written in true unix style - quite terse:-). As long as you read carefully and don't skip words, it covers every detail very well.

I built this new system on a different disk. Once it was properly running I transferred the complete BOINC hierarchy from the disk of one of the PCLOS machines to this disk. When I did my usual test with ldd on the Berkeley boinc executable, there seemed to be something missing in libcurl.so.4, so I decided to grab the V7.2.42 BOINC from the Arch repo and test the boinc from that package. That was fine so I substituted all the executables (boinc, boincmgr, boinccmd) and then fired up BOINC. The 'in-progress' CPU and GPU tasks all restarted just fine. On the previous system, GPU tasks had been taking 17 mins. On Arch Linux, they were now taking 15 mins.

However, the issue hasn't been resolved. Both systems (PCLinuxOS and Arch Linux) show the same behaviour, with the GPU task freezing after around 8-12 hours of crunching. Both have fully functional CPU task behaviour at all times, even hours after a GPU task has frozen. Every time, the GPU task freeze occurs just as a GPU task is attempting to start. Both hosts show precisely the same behaviour of a complete system freeze at the precise instant I try to stop BOINC or initiate a system reboot when there has been a GPU freeze. This freeze occurs after perhaps 40 or more successful task starts and completions.

Because of the big differences between PCLinuxOS and Arch, it seems to me that the problem is something to do with the GWopencl-ati-Beta app rather than the OS. The drivers used had different versions. On PCLinuxOS it was Catalyst 13.12. On Arch it was Catalyst 14.4.

It seems strange that there aren't other similar reports about this app. At least, I haven't seen anything that looks similar. I have to come to the conclusion that it must be something specific to me, else others would be reporting the same issue. I'm not quite sure what to try next. I've documented everything I can think of in tedious detail so if anyone is still reading, I'm grateful for your persistence :-).

If anyone has any bright ideas, I'd certainly be interested in hearing them.

Cheers,
Gary.

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

6 May 2014 13:11:07 UTC

Message 121413

I'm inclined to think the same as you that some sort of resource is getting exhausted. Maybe some kind of memory leak in the app?
Are there any GPU monitoring utilities available in Linux? Something like GPU-Z in Windows?
It would be interesting to see the amount of free video RAM and maybe even the GPU's memory controller load when a task has frozen.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 311876627

RAC: 123887

6 May 2014 23:28:28 UTC

Message 121414

Hello Gary ! :-)

Wild stab in the dark :

From this page it seems that the Catalyst driver ( daemon ) you mentioned ( 14.4 ) has the note 'DOES NOT SUPPORT SYSTEMD' ( listed in second entry from the top ).

I looked that 'systemd' up and, being a daemon that manages other daemons ( especially during startup and shutdown ) I think this may have relevance to your problem.

Cheers, Mike.

( edit ) Of course you have BOINC running as a daemon too ..

( edit ) Here is another discussion on that topic.

( edit )

Quote:

Every time, the GPU task freeze occurs just as a GPU task is attempting to start. Both hosts show precisely the same behaviour of a complete system freeze at the precise instant I try to stop BOINC or initiate a system reboot when there has been a GPU freeze.

( my red highlight ) : This simply indicates that it knows what you want, is a downright gnarly & malign beast, but is anticipating you very well .... :-) :-) :-)

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117250755305

RAC: 36217432

7 May 2014 7:01:02 UTC

Message 121415 in response to message 121414

Hi Mike,

Thanks very much for taking the time and trouble to research this.

Quote:

From this page it seems that the Catalyst driver ( daemon ) you mentioned ( 14.4 ) has the note 'DOES NOT SUPPORT SYSTEMD' ( listed in second entry from the top ).

I looked that 'systemd' up and, being a daemon that manages other daemons ( especially during startup and shutdown ) I think this may have relevance to your problem.

I don't think so. systemd (about which there has been a lot of controversy) is just a new type of init system and I'm not having any issues with startup or shutdown when things are 'normal' - which is most of the time. I don't have the catalyst-daemon package installed. I didn't even know it existed until you linked to it. The other reason is that the driver on PCLinuxOS (exact same problem behaviour) is 13.12 and I'm pretty sure the init system there isn't systemd.

To get the catalyst drivers installed, I read the instructions in the Wiki and here are a couple of small excerpts.

Owners of ATI/AMD video cards have a choice between AMD's proprietary
driver (catalyst) and the open source driver (xf86-video-ati).
This article covers the proprietary driver.

Catalyst packages are no longer offered in the official repositories.
In the past, Catalyst has been dropped from official Arch support because of
dissatisfaction with the quality and speed of development. After a brief return
they were dropped again in April 2013 and they have not returned since.

When I first saw that, I thought I was in trouble, but I kept reading, and

There are three ways of installing Catalyst on your system.
One way is to use Vi0L0's (Arch's unofficial Catalyst maintainer) repository.
This repository contains all the necessary packages. The second method you can use is
the AUR; PKGBUILDs offered here are also made by Vi0L0 and are the same he uses to build
packages for his repository. Lastly, you can install the driver directly from AMD.

So I decided to use Vi0L0's unofficial repo, although I was tempted to get the package direct from AMD because I'd previously done that with PCLinuxOS before they got their OpenCL stuff in their repo.

For Vi0L0's repo, the Wiki gives full instructions on enabling it, and exactly what packages to get, e.g.:

Once you have added some Catalyst repository, update pacman's database and install
these packages (see #Tools for more information):

catalyst-hook
catalyst-utils
catalyst-libgl
opencl-catalyst - optional, needed for OpenCL support
lib32-catalyst-utils - optional, needed for 32-bit OpenGL support on 64-bit systems
lib32-catalyst-libgl - optional, needed for 32-bit OpenGL support on 64-bit systems
lib32-opencl-catalyst - optional, needed for 32-bit OpenCL support on 64-bit systems

To be safe, I just installed all of those and ran aticonfig --initial to create a suitable xorg.conf which pointed to the fglrx kernel module. I had no trouble when I first restarted BOINC. I could see from the event log that the OpenCL capabilities of the GPU were being properly detected and listed and by running ldd on the science app beforehand, I knew the science app would be happy as well. Even though I had transferred the whole BOINC hierarchy from a different distro, it all just restarted from where it had left off on PCLinuxOS.

Quote:

( edit ) Of course you have BOINC running as a daemon too ..

But not under the control of systemd. I just launch BOINC with the --daemon flag. I deliberately do it this way because I want full personal control over exactly when BOINC is launched. I don't want this to happen automatically.

Quote:

( edit ) Here is another discussion on that topic.

Yes, this is exactly the page I used to install catalyst. The excerpts above come from this page. I should have checked ALL your links first before quoting from something you had already seen :-).

Quote:

( edit )
Quote:
Every time, the GPU task freeze occurs just as a GPU task is attempting to start. Both hosts show precisely the same behaviour of a complete system freeze at the precise instant I try to stop BOINC or initiate a system reboot when there has been a GPU freeze.

( my red highlight ) : This simply indicates that it knows what you want, is a downright gnarly & malign beast, but is anticipating you very well .... :-) :-) :-)

Man, this is UNIX!!! Rogue apps are not supposed to make the whole system freeze up rock solid!!! :-) ;-). It's such a long time since I last saw such a screen - I click the icon and before I can move the mouse pointer more than a cm or so, the image of the whole screen is totally frozen. It's slightly different if I'm clicking 'restart' rather than the stop BOINC icon. I can move the mouse pointer perhaps right across the screen before everything locks up solid. I guess a few other things are being stopped before it tries to stop BOINC :-).

Cheers,
Gary.

choks

Joined: 24 Feb 05

Posts: 16

Credit: 145410373

RAC: 79614

8 May 2014 17:28:15 UTC

Message 121416 in response to message 121415

Hello Gary,

I'm running & developing on Debian 64 with an HD6950 and Catalyst 13.12 and have never experienced this behavior. My CPU config is about the same: a Q9550 and 4 GB of RAM.

I have seen X window lockups when allocating & initializing more memory than the Catalyst driver can provide. I changed the code to allocate before initializing and that seemed to fix the issue.

Could you please try this one option at a time:
- export GPU_MAX_ALLOC_PERCENT=100 before ./run_client and ./run_manager
- run basic Gnome without Unity (the config I use), or better just plain X session without any window manager (./run_client in the single window)
- try to restart boinc before it hangs

This will help us to understand if this is window manager / driver related.

Thanks,
Christophe

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117250755305

RAC: 36217432

10 May 2014 10:25:52 UTC

Message 121417 in response to message 121416

Hi Christophe,

Thanks very much for responding. Sorry if I'm interfering with your holidays!

Quote:

I'm running & developing on Debian 64 with an HD6950 and Catalyst 13.12 and have never experienced this behavior. My CPU config is about the same: a Q9550 and 4 GB of RAM.

Yes, I'd taken a look at your host before I started the thread and had noticed the very similar CPU. My hosts are running PCLinuxOS and Arch Linux. One big difference is that your host doesn't seem to be running any CPU tasks at all so effectively you have 4 cores to support the GPU. I use an app_config.xml which requires 1 CPU + 1 GPU for the GPU task and so there are always 3 CPU-only tasks running at the same time as the GWopencl-ati-Beta. Could that make any difference?

Quote:

I have seen X window lockups when allocating & initializing more memory than the Catalyst driver can provide. I changed the code to allocate before initializing and that seemed to fix the issue.

Could you please try this one option at a time:
- export GPU_MAX_ALLOC_PERCENT=100 before ./run_client and ./run_manager

The two machines are at a business location and (24 hours ago - the machines running normally - 4 tasks crunching) I stopped BOINC, waited 20 secs, restarted BOINC and then went home. My idea was to give them both a 'fresh start' to see if they could last the night without a GPU task freeze. This morning (at home) I could see that they had both stopped returning GPU tasks. One had lasted about 4 hours and the other about 8 hours. Unfortunately, I had other commitments so I haven't been able to visit the machines until now (Saturday evening).

If I catch the machines early enough, even though the GPU task is frozen with the elapsed time just ticking away, the machine is otherwise acting quite normally. I can open and close windows - even BOINC Manager - without any problem. I just can't stop BOINC. If I try to, the whole machine locks up solid and the reset switch is needed. Today, I didn't get here soon enough. The GPU task has a max time limit and when that is reached, BOINC tries to abort the processing. At that point the machine locks up anyway and that is exactly how I found them both when I arrived. Those two tasks should appear in the tasks list as 'aborted by user', I believe. If I catch the machine before the max time has elapsed, the tasks get restarted rather than aborted.

I've now restarted both machines (P and A). For P, I've booted normally to a full KDE desktop, started konsole and exported the variable as suggested. In the same terminal session immediately afterwards, I've run the commands

cd BOINC
./boinc --daemon

which is pretty much how I normally start BOINC. I've also used

ps -A | grep <pattern>

to check that the boinc client and all the science apps are running and to see the PIDs of each, by selecting appropriate values for <pattern>.
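The same start-up sequence (set the variable, then launch the client from its own directory) can be scripted. This is a minimal sketch only: the helper names and the ~/BOINC location are assumptions for illustration, and the variable only takes effect if it is in the environment of the process that starts the client.

```python
import os
import subprocess

def boinc_environment(base_env=None):
    """Copy an environment and add the variable Christophe suggested."""
    env = dict(os.environ if base_env is None else base_env)
    env["GPU_MAX_ALLOC_PERCENT"] = "100"
    return env

def start_boinc(boinc_dir):
    """Start the client as a daemon from `boinc_dir` (the path is an assumption)."""
    return subprocess.run(["./boinc", "--daemon"], cwd=boinc_dir,
                          env=boinc_environment(), check=True)

# Show the variable that the client would inherit.
print(boinc_environment({})["GPU_MAX_ALLOC_PERCENT"])
# To actually launch: start_boinc(os.path.expanduser("~/BOINC"))
```

Science apps started by the client inherit its environment, which is why the export has to happen before the client is launched rather than afterwards.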

Quote:

- run basic Gnome without Unity (the config I use), or better just plain X session without any window manager (./run_client in the single window)

On machine A, once I booted to the desktop, I logged out. I'm running KDE and before logging in again, I selected the failsafe option that is available on the login manager screen. This gives a basic single xterm only with no window decorations or controls. I presume this is the environment you are suggesting. I have launched the boinc client as before and once again checked with ps -A to see that the client and expected tasks are running. So far so good - everything is working normally. I intend to wait around for some hours to see if either one will have a GPU task freeze.

Quote:

- try to restart boinc before it hangs

I have done this many times and it always works. If the GPU task is not frozen, there is no problem stopping and restarting BOINC. The only way for me to tell that a GPU task is frozen is to monitor BOINC Manager and notice if the progress % is ticking away continuously every second once the elapsed time is more than about 2 minutes. Until the first checkpoint is written, both progress and time move continuously. After the first checkpoint, progress increments by about 9% or so in one step when the next checkpoint is written. Tasks normally take around 15-17 mins to complete and if a task has frozen, it seems to take about 12-15 hours before there will be a 'max time limit exceeded' termination where the whole machine locks up.
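The rule of thumb above (progress still creeping every second well after the first checkpoint should have been written) could be expressed as a simple check. This is a hedged sketch, not a tested monitoring tool: the function name, the sample format and the 120 second grace period are assumptions drawn from the description in this post.

```python
def looks_frozen(samples, grace_seconds=120):
    """Guess whether a GPU task is frozen.

    samples: list of (elapsed_seconds, progress_percent) pairs taken about 1 s apart.
    Returns True when, beyond the grace period, progress changed on every
    consecutive sample, i.e. it is still simulated rather than stepping at checkpoints.
    """
    late = [(t, p) for t, p in samples if t > grace_seconds]
    if len(late) < 3:
        return False  # not enough evidence yet
    return all(p2 != p1 for (_, p1), (_, p2) in zip(late, late[1:]))

# A healthy task sits at a fixed value between checkpoints.
print(looks_frozen([(300 + i, 18.0) for i in range(5)]))
# A frozen task keeps creeping upwards every second.
print(looks_frozen([(300 + i, 50.0 + 0.01 * i) for i in range(5)]))
```

The samples would have to come from whatever is being watched (BOINC Manager by eye, in my case); the sketch only captures the decision rule.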

One thing I have noticed is that stopping and restarting BOINC before there has been a freeze seems to allow quite a long run time without a freeze. By doing this a few times, I've managed to get 45 hours and 30 hours out of A and P respectively before a freeze has happened.

Once I see the result of these two experiments, I'll install gdb on both. I haven't used a debugger before (I'm not a programmer) but I'm very happy to learn how to run it. I'll spend tonight getting ready and if one of the two hosts gets a GPU task freeze, I'll follow the instructions. If it hasn't happened before I have to go home, I should be able to catch it in the morning (provided I don't get commandeered into other duties :-).).

Thank you very much for your suggestions. Of course, Murphy's Law says that neither host will now experience a GPU freeze :-).

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117250755305

RAC: 36217432

10 May 2014 11:36:16 UTC

Message 121418

Both machines have been crunching without a freeze for more than two hours now. I'm monitoring them occasionally from the BOINC Manager on another machine on the LAN. Everything is proceeding normally so far.

I've ssh'd into A to run pacman in order to check what is available in the repos. The following packages seem to be of interest

* gdb 7.7-1 -- GNU debugger
* ddd 3.3.12-3 -- Graphical front-end to various command line debuggers like gdb ...
* cgdb 0.6.7-2 -- Curses based interface to gdb
* kdbg -- A gdb GUI for KDE

I'm perfectly happy just to get gdb but if anything else makes life easier I'd be pleased to receive suggestions. I don't think I have any problem with the commands you listed in the PM :-).

Cheers,
Gary.

Jeroen

Joined: 25 Nov 05

Posts: 379

Credit: 740030628

RAC: 0

10 May 2014 20:00:54 UTC

Message 121419

I have also been running the new Beta application in Linux x86_64. In my case, I have been seeing the following in the kernel log after varying run time ranging from 6-48 hours. The system is still accessible after the ASIC hang and I have only seen this with the GW application.

[82025.609744] [fglrx] ASIC hang happened
[82025.609748] CPU: 5 PID: 4474 Comm: einstein_S6CasA Tainted: P O 3.13.11-ck1 #2
[82025.609749] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 2105 08/04/2012
[82025.609751] 0000000104e373e9 ffffffff816bce20 0000000000000000 ffffffffa02963bc
[82025.609752] 0000000000000000 ffffffffa038fa1e ffff880347349a10 ffffffffa038f989
[82025.609754] ffffc9001275b020 ffffc90012b68e80 ffffffffa050fe90 ffffc90012143080
[82025.609755] Call Trace:
[82025.609760] [] ? dump_stack+0x41/0x51
[82025.609794] [] ? firegl_hardwareHangRecovery+0x1c/0x30 [fglrx]
[82025.609839] [] ? _ZN4Asic9WaitUntil15ResetASICIfHungEv+0x1e/0x30 [fglrx]
[82025.609883] [] ? _ZN4Asic9WaitUntil15WaitForCompleteEv+0xb9/0x130 [fglrx]
[82025.609928] [] ? _ZN8AsicR60012IO_QuietdownEv+0x2c/0x40 [fglrx]
[82025.609971] [] ? _ZN15ExecutableUnits10CPRingIdleE15idle_WaitMethod12_QS_CP_RING_+0x13c/0x1e0 [fglrx]
[82025.610015] [] ? _ZN21ExecutableUnitsCayman14AllCPRingsIdleE15idle_WaitMethod+0x1a/0x90 [fglrx]
[82025.610064] [] ? _ZN15ExecutableUnits7PM4idleE15idle_WaitMethod+0x4b/0x90 [fglrx]
[82025.610104] [] ? _ZN10QS_PRIVATE9QsPM4idleE15idle_WaitMethod+0x31/0x60 [fglrx]
[82025.610143] [] ? _ZN10QS_PRIVATE7idleAllE15idle_WaitMethod+0x10/0x40 [fglrx]
[82025.610181] [] ? _ZN3MSF19doGarbageCollectionEv+0x43/0x280 [fglrx]
[82025.610212] [] ? _ZN9CMMlegacy22CMMQS_ProcessTerminateEj+0x39/0x60 [fglrx]
[82025.610242] [] ? CMMQS_ProcessTerminate+0xa/0x10 [fglrx]
[82025.610266] [] ? firegl_cmmqs_ProcessTerminate+0x32/0xc0 [fglrx]
[82025.610286] [] ? firegl_release_helper+0x3e4/0x700 [fglrx]
[82025.610305] [] ? firegl_release+0x60/0x1b0 [fglrx]
[82025.610308] [] ? __fput+0xb0/0x1f0
[82025.610310] [] ? task_work_run+0xac/0xd0
[82025.610312] [] ? do_exit+0x29a/0x9f0
[82025.610314] [] ? do_wp_page+0x511/0x760
[82025.610315] [] ? do_group_exit+0x34/0xa0
[82025.610317] [] ? get_signal_to_deliver+0x165/0x500
[82025.610319] [] ? do_signal+0x3d/0x5c0
[82025.610321] [] ? __remove_hrtimer+0x3e/0x90
[82025.610322] [] ? hrtimer_try_to_cancel+0x64/0x70
[82025.610324] [] ? do_nanosleep+0x82/0x110
[82025.610326] [] ? hrtimer_nanosleep+0x8e/0x140
[82025.610327] [] ? do_notify_resume+0x6d/0x90
[82025.610328] [] ? int_signal+0x12/0x17
[82025.610331] pubdev:0xffffffffa094ec00, num of device:2 , name:fglrx, major 13, minor 35.

I tested out two different AMD 7970 cards, two different Intel 3930K CPUs, and different driver versions. I also ran Memtest86+ and Linpack both of which passed. The newer drivers report the above ASIC hang while the older drivers do not produce as much detail in the kernel log. In some cases, a task will run on for several hours rather than the normal 10-12 minute range when something has gone wrong.

I just set the export GPU_MAX_ALLOC_PERCENT=100 option suggested by choks before starting BOINC, to see if that helps.

Thanks,

Jeroen

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117250755305

RAC: 36217432

11 May 2014 0:19:59 UTC

Message 121420 in response to message 121419

Quote:

I have also been running the new Beta application in Linux x86_64. In my case, I have been seeing the following in the kernel log after varying run time ranging from 6-48 hours. The system is still accessible after the ASIC hang and I have only seen this with the GW application.

Thanks for your report.

I wasn't smart enough to have delved into kernel logs, but I sure will in future when I'm next at the location where my two hosts are.

It's more than 12 hours now since they were last restarted and they have both survived the night. The Arch Linux host (catalyst 14.4, bare xterm - no DM) has actually gained about a further 3% in performance - tasks finishing in 14.5 mins now compared to around 15 mins previously. The PCLinuxOS host (catalyst 13.12) is still taking just over 17 mins on average. I'm assuming the 17/15 mins difference is due to the different driver versions - at least in part.

One thing I have noticed on Christophe's HD6950 system is that his elapsed/CPU times show as almost identical, eg 1044/1020. On my two hosts (both HD7850s) I'm seeing the following (times in seconds)

[pre]Host     Elapsed Time   CPU Time
Arch          870           275
PCLOS        1025           405[/pre]

CPU times are very much lower than elapsed times and the difference between the two hosts is essentially all CPU time. I don't know what the significance of those differences really is.
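To put numbers on that comparison, here is a small worked calculation using only the figures quoted in this post (the host labels and function name are just for illustration):

```python
def cpu_share(elapsed_s, cpu_s):
    """Fraction of the elapsed time during which the task was using CPU."""
    return cpu_s / elapsed_s

# (elapsed seconds, CPU seconds) as quoted above.
hosts = {
    "Arch (HD7850)": (870, 275),
    "PCLOS (HD7850)": (1025, 405),
    "Christophe (HD6950)": (1044, 1020),
}

for name, (elapsed_s, cpu_s) in hosts.items():
    print(f"{name}: CPU busy for {100 * cpu_share(elapsed_s, cpu_s):.1f}% of elapsed time")

# How much of the gap between the two HD7850 hosts is CPU time?
elapsed_gap = 1025 - 870   # 155 s
cpu_gap = 405 - 275        # 130 s
print(f"elapsed gap {elapsed_gap} s, CPU gap {cpu_gap} s")
```

So the two HD7850 hosts keep a core busy for roughly a third of the run, against nearly all of it on the HD6950 host, and 130 of the 155 seconds separating Arch from PCLOS are CPU time.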

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117250755305

RAC: 36217432

11 May 2014 6:30:02 UTC

Message 121421 in response to message 121420

OK, a GWopencl-ati-Beta task running on P has frozen and I've run gdb as instructed and secured a backtrace. Here is a log of the gdb session. I've left out some early stuff to do with reading symbols from various libs. I presume you don't need that.

gdb attach 4493
....
....
Reading symbols from /usr/lib64/libXinerama.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libXinerama.so.1

Reading symbols from /usr/lib64/fglrx-current/libGL.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/fglrx-current/libGL.so.1

Warning: File "/lib64/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".

Warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

0x00007fdb9c69b600 in sem_wait () from /lib64/libpthread.so.0

(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007fdb9c69b600 in sem_wait () from /lib64/libpthread.so.0

(gdb) bt
#0 0x00007fdb9c69b600 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007fdb997e2700 in ?? from /usr/lib64/fglrx-current/libamdocl64.so
#2 0x00007fdb997dea5f in ?? from /usr/lib64/fglrx-current/libamdocl64.so
#3 0x00007fdb997cb690 in ?? from /usr/lib64/fglrx-current/libamdocl64.so
#4 0x00007fdb997cc0fb in ?? from /usr/lib64/fglrx-current/libamdocl64.so
#5 0x00007fdb997a4707 in clFinish from /usr/lib64/fglrx-current/libamdocl64.so
#6 0000000000004226af in opencl_FstatisticsLoop (doComputeFstats=) at /home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/LINUX64-OPENCL/TARGET/linux-opencl/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/GCT_opencl.c:1179
#7 00000000000040ee68 in MAIN (argc=, argv=) at /home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/LINUX64-OPENCL/TARGET/linux-opencl/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HeirarchSearchGCT.c:1525
#8 00000000000041f041 in worker () at /home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/LINUX64-OPENCL/TARGET/linux-opencl/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/hough/src2/EinsteinAtHome/hs_boinc_extras.c:1223
#9 main (argc=, argv=0x0) at /home/jenkins/workspace/workspace/EaH-GW-Testing/SLAVE/LINUX64-OPENCL/TARGET/linux-opencl/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/hough/src2/EinsteinAtHome/hs_boinc_extras.c:1532

(gdb)

I haven't (yet) quit out of gdb but I intend to shortly and then attempt to stop BOINC at which point the machine will totally lock up so a hard reset will be required. The A machine is still chugging along fine and it will be interesting to see how long it lasts.
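For anyone wanting to capture the same kind of backtrace without an interactive session, a non-interactive gdb run is one option. This is a sketch under assumptions: it is not the set of commands I was sent, the helper names are made up, and attaching normally requires the same user as the target process (or root).

```python
import shlex
import subprocess

def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches, prints every thread's backtrace, then exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]

def capture_backtrace(pid):
    """Run gdb against `pid` and return its text output."""
    result = subprocess.run(gdb_backtrace_command(pid), capture_output=True, text=True)
    return result.stdout

# 4493 is the PID from the session above; only the command line is printed here.
print(shlex.join(gdb_backtrace_command(4493)))
```

Because -batch makes gdb exit when the commands finish, the process is detached again and carries on (or stays stuck) exactly as before.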

Cheers,
Gary.

choks

Joined: 24 Feb 05

Posts: 16

Credit: 145410373

RAC: 79614

11 May 2014 13:43:13 UTC

Message 121422 in response to message 121421

Hi Gary,

From the stack this is happening deep inside the Catalyst driver.

I helped the people writing the Open Source version of the Catalyst driver (now part of the Mesa project), and I had nearly the same problem with the HD6950: when running a Window Manager, the hardware semaphores were screwed up.
It was easier to debug there, since the full driver is open and can be debugged. It turned out to be some specific ASIC setup, and it was different from the HD7xxx.
It was even different from the non-Cayman-based HD6xxx chipsets.

When you launch an OpenCL kernel, the driver sets a semaphore and waits for someone (usually the kernel) to release it.
When drawing on the screen (moving windows, drawing chars, ...) you also set other semaphores.
I would not be surprised if the same semaphore problem occurs in the Catalyst drivers.

So I would go this way:
- collect all non working configurations (Chipsets & Window Manager)
- collect ASIC errors reported by dmesg

Then submit the problem to developer.amd.com
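To help with collecting the ASIC errors, a small script can pull the relevant lines out of the kernel log. This is a minimal sketch that assumes the message format shown in Jeroen's log earlier in the thread; other driver versions may word it differently, and the function name is only illustrative.

```python
import re

HANG_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[fglrx\] ASIC hang happened")
PROC_RE = re.compile(r"\[\s*\d+\.\d+\]\s+CPU: \d+ PID: (?P<pid>\d+) Comm: (?P<comm>\S+)")

def find_asic_hangs(log_text):
    """Return one dict per fglrx ASIC hang found in kernel log text."""
    hangs = []
    for line in log_text.splitlines():
        m = HANG_RE.search(line)
        if m:
            hangs.append({"uptime_s": float(m.group("ts")), "pid": None, "comm": None})
            continue
        # The process details follow on the next "CPU: ... PID: ... Comm: ..." line.
        m = PROC_RE.search(line)
        if m and hangs and hangs[-1]["pid"] is None:
            hangs[-1]["pid"] = int(m.group("pid"))
            hangs[-1]["comm"] = m.group("comm")
    return hangs

# Two lines copied from the log posted earlier in this thread.
sample = """[82025.609744] [fglrx] ASIC hang happened
[82025.609748] CPU: 5 PID: 4474 Comm: einstein_S6CasA Tainted: P O 3.13.11-ck1 #2"""
print(find_asic_hangs(sample))
```

The text to scan could come from saving the output of dmesg to a file and reading it in.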

The elapsed time / used time difference is something that I have noticed too. On my Debian, as you mentioned, they are almost the same. It means the CPU is always using a core at 100% and thus using significant electrical power.
I dug a bit and found that the usleep/nanosleep functions are frequently interrupted and thus wake up the CPU again and again.
Your distro seems to indicate that the Linux kernel is handling the wake-ups differently. It's probable that the Linux timers are configured differently.

And that could explain a hang in the Catalyst driver, waiting forever on an event that will never occur or was missed.
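The failure mode can be illustrated with an ordinary user-space semaphore. This is only an analogy, assuming nothing about how Catalyst is actually implemented: a waiter blocks until someone releases the semaphore, and if the release never arrives, an untimed wait never returns. The function name and timings are made up for the demonstration.

```python
import threading

def wait_for_kernel(release_arrives, timeout_s=0.5):
    """Model a driver waiting on a semaphore that the GPU kernel should release.

    Returns True if the release arrived, False if the wait timed out.
    """
    done = threading.Semaphore(0)
    if release_arrives:
        # Stand-in for the kernel finishing and signalling completion.
        threading.Timer(0.05, done.release).start()
    # Without a timeout, acquire() would block forever when the release is missed,
    # which is how a hang inside a call like clFinish would look from outside.
    return done.acquire(timeout=timeout_s)

print(wait_for_kernel(True))
print(wait_for_kernel(False))
```

The timeout is there only so the demonstration terminates; the backtrace Gary posted shows the app sitting in sem_wait with no such escape.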

One possible reason we don't see it in other OpenCL apps is that the kernel is very big (> 2000 lines of code) and the dataset quite big. It is a challenging OpenCL kernel for the drivers.

Christophe
