System instability

kjs
Joined: 21 Apr 14
Posts: 2
Credit: 1654584
RAC: 0
Topic 197553

Hi there,
I started crunching Einstein@Home within the last week. The system I am crunching on was built 4 weeks ago. I did not observe any instability until yesterday, when the computer went down once during the day and again at night.

So, this morning I opted out of beta mode with E@H. I am also wondering: are there logs for the BOINC client, or system logs that might give more information? I'm running BOINC on Debian with 6 virtual CPU cores and 2 NVIDIA CUDA-enabled GPUs.

Thanks,
Kevin

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110037254011
RAC: 22391232

Hi, Welcome to the project.

I have two FX-6300 based systems, each with a single AMD HD7850 GPU. They now seem to be crunching reasonably reliably but they were doing pretty much what you describe when first built, i.e. they ran reliably with no crunching load but crashed or locked up every day or two under full load. The main problem was probably the stock heat sink and fan. Also, I was trying for just a little overclocking, which seems to be a problem without much better cooling.

I removed the overclock and replaced the fan with a higher output one and they've both been running for quite a while now without further issue. I do intend eventually to get around to replacing the stock heat sink with something much better but with them both behaving at the moment, I haven't got around to it yet.

Are you using stock cooling? With two GPUs helping heat things up, you really need to pay attention to cooling.

BOINC's event log is the file stdoutdae.txt (and stdoutdae.old once rotated) and there are some error log files as well. You will find them all in the BOINC data directory. Where that is depends on how you installed BOINC. I wouldn't expect any info about machine crashes to be there though.

I looked back through your list of tasks on the website. You've done *lots* of S6 CasA opencl beta tasks and they seem to be going quite well. I've seen a few examples of BRP4G and BRP5 tasks showing 'validate' errors, which means the validator didn't like what it found when it did a sanity check on the returned results. This can easily be a pointer to hardware under stress. I also saw some FGRP3 opencl results so I guess you are running pretty much everything.

All GPU tasks need some CPU support, some more than others. My experience with FX-6300 CPUs is that you may get better stability by deliberately not using all available cores. It would be a pretty quick test to change your preferences to use, say, 67% of the cores (4 out of 6) to see if the machine becomes more stable. BOINC may further reduce the active cores below 4 if necessary, depending on the mix of GPU tasks crunching at any particular time, so don't be concerned if you see a CPU task 'waiting to run'.
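
As a rough illustration of why 67% rather than 66% is suggested for a 6-core host, here is a minimal sketch of the arithmetic. It assumes the client truncates the product toward zero and never drops below one core; that rounding rule is an assumption for illustration, not something confirmed in this thread, and `usable_cores` is a hypothetical helper name.

```python
def usable_cores(total_cores, percent):
    """Estimate how many cores a 'use at most N% of the CPUs' setting allows.

    ASSUMPTION: the result is truncated toward zero with a floor of one core.
    This rounding rule is illustrative and is not taken from the BOINC source.
    """
    return max(1, int(total_cores * percent / 100))


# 6 cores at 67% gives 4.02, which truncates to 4 cores.
four_of_six = usable_cores(6, 67)
# 6 cores at 66% gives 3.96, which would truncate to 3 cores under this assumption.
three_of_six = usable_cores(6, 66)
```

If the rounding assumption holds, a setting slightly below two thirds would free up one more core than intended, so it is worth checking how many CPU tasks actually run after changing the preference.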

Let us know how you get on. I wouldn't be worried about re-enabling the beta app seeing as it seems to be quite productive and shows no obvious errors.

Cheers,
Gary.

kjs
Joined: 21 Apr 14
Posts: 2
Credit: 1654584
RAC: 0

Hi Gary,

Thank you for the warm welcome. I appreciate all of the advice and experience you have to share.

I see only standard behavior in the BOINC logs.

23-Apr-2014 18:26:27 [Einstein@Home] Finished upload of h1_0960.15_S6Directed__S6CasAf40a_960.4Hz_1161_0_1
24-Apr-2014 00:28:26 [---] Starting BOINC client version 7.2.42 for x86_64-pc-linux-gnu
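
One way to narrow down when a crash happened is to look for each "Starting BOINC client" line and note the timestamp of the message just before it. The following is a minimal sketch, assuming the log keeps the timestamp format shown in the two lines above and that month names are parsed under an English locale. `find_restarts` is a hypothetical helper, not part of BOINC.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the log lines quoted above,
# e.g. "24-Apr-2014 00:28:26 [---] Starting BOINC client version ..."
STAMP_RE = re.compile(r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) ")
START_MARKER = "Starting BOINC client"


def parse_stamp(text):
    """Parse a timestamp like '24-Apr-2014 00:28:26' (English month names assumed)."""
    return datetime.strptime(text, "%d-%b-%Y %H:%M:%S")


def find_restarts(lines):
    """Return (last_message_time, start_time) for every client start in the log.

    last_message_time is None when the start line is the first timestamped line.
    """
    restarts = []
    last_seen = None
    for line in lines:
        match = STAMP_RE.match(line)
        if match is None:
            continue  # skip lines without a leading timestamp
        stamp = parse_stamp(match.group(1))
        if START_MARKER in line:
            restarts.append((last_seen, stamp))
        last_seen = stamp
    return restarts
```

Passing the opened stdoutdae.txt file from the BOINC data directory to `find_restarts` yields one pair per client start. The crash falls somewhere in the gap between the two timestamps, which for the two lines above is between 18:26 and 00:28.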

As a debug measure, I put together a little cron job that appends the output of uptime to a file every hour. The results are a little surprising.

09:00:01 up 1:31, 1 user, load average: 7.54, 7.97, 8.07
10:00:01 up 2:31, 1 user, load average: 7.13, 7.56, 7.65
11:00:01 up 3:31, 1 user, load average: 8.48, 7.62, 7.58
12:00:01 up 4:31, 1 user, load average: 7.34, 7.34, 7.30
13:00:01 up 5:31, 1 user, load average: 7.39, 7.17, 7.19
14:00:01 up 6:31, 1 user, load average: 6.58, 6.64, 6.73
15:00:01 up 7:31, 1 user, load average: 6.43, 6.54, 6.57
16:00:01 up 8:31, 1 user, load average: 7.35, 7.13, 6.99
17:00:01 up 9:31, 1 user, load average: 8.02, 7.95, 7.64
18:00:01 up 10:31, 1 user, load average: 7.64, 7.89, 7.69
19:00:01 up 11:31, 1 user, load average: 7.71, 7.95, 7.79
20:00:01 up 12:31, 2 users, load average: 7.59, 7.35, 7.14
21:00:01 up 13:31, 2 users, load average: 8.60, 8.22, 8.07
22:00:01 up 14:31, 2 users, load average: 8.32, 8.22, 8.19

This shows a load average of 7-8 on a machine with 6 cores, which seems unusual.
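
For reading these numbers, it helps to know that the Linux load average counts tasks that are runnable or in uninterruptible sleep, so it is not the same thing as a count of CPUs at 100%. A minimal sketch for summarizing the hourly file follows, assuming each line has the `uptime` layout shown above. `parse_load` and `summarize` are hypothetical helper names.

```python
import re

# Matches the trailing "load average: 7.54, 7.97, 8.07" part of an uptime line.
LOAD_RE = re.compile(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)")


def parse_load(line):
    """Return the (1 min, 5 min, 15 min) load averages, or None if absent."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def summarize(lines, cores):
    """Summarize the 1-minute load average across all parsable lines."""
    one_minute = [load[0] for load in map(parse_load, lines) if load is not None]
    if not one_minute:
        return None
    mean = sum(one_minute) / len(one_minute)
    return {
        "samples": len(one_minute),
        "mean": mean,
        "max": max(one_minute),
        # Values above 1.0 mean more waiting or running tasks than cores.
        "mean_per_core": mean / cores,
    }
```

Calling `summarize` on the lines of the hourly file with `cores=6` gives the mean and peak load in one place. A per-core value somewhat above 1.0 would be consistent with GPU support processes running alongside a full set of CPU tasks, though that reading is an interpretation rather than something the numbers prove.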

Either way, I'll take greater cooling into consideration. The cards are squished together at the bottom of the case. The case mounts the PSU at the bottom and is designed to work with the natural rise of heat, which is exhausted from the top. However, given the orientation of the PCIe slots, both cards push air downward into the PSU, which I have oriented to push air upward. Maybe I'll turn the PSU around. I could also install 120mm fans on the case wall and on the floor of the case.

I'd like to monitor the temperature of the GPUs. XFCE4 comes with nice goodies to tap into the CPU and hard disk sensors. Do you know a good application to get data off the GPU thermometers?

Thanks much,
Kevin

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110037254011
RAC: 22391232

I don't have my FX-6300 hosts at home but I'll try to remember to check the load averages next time I visit where they are. They are crunching GPU tasks 4x on a HD7850 (which automatically reserves 2 CPU cores) and I've also freed up a core with preferences. So each host is running 4 GPU tasks and 3 CPU tasks concurrently.

I've checked an Intel quad core host at home which is running 4x on a HD7850 with two running CPU tasks and it shows values around 3.5. More CPU power is required to support AMD GPUs so I'm not particularly surprised by that number. At least it's not greater than 4 :-). I've also checked a couple of dual core machines running 2x on a GTX650 and 3x on a 650Ti, both with 2 CPU tasks. They were both showing around 3.3 to 3.5 as well. Lastly, I looked at a quad core with no GPU. It was showing exactly 4.0. GPU crunching must influence the value in some way.

For GPU temperatures on NVIDIA GPUs, there is a display settings tool that comes with the drivers that can show quite a bit of information like clocks, temperatures, fan speed, GPU load, etc. With certain "coolbits" options in xorg.conf, you can make some of these 'settable'. For example by setting coolbits to 4 you can have a fan speed slider control in the settings tool. I haven't played around much with this. I tend to just let the GPUs run at stock settings.
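
For logging temperatures from a script rather than reading them in the settings tool, the `nvidia-smi` command line utility that ships with the NVIDIA driver is another option. Note that this is a different tool from the one described above. The sketch below assumes a driver recent enough to support the `--query-gpu` option and a card that reports its temperature. `parse_temperatures` and `read_temperatures` are hypothetical helper names.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU: index, name, temperature in Celsius.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(output):
    """Turn nvidia-smi CSV output into a list of (index, name, temp_c) tuples.

    temp_c is None when the card does not report a numeric temperature.
    """
    readings = []
    for line in output.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        index = int(fields[0])
        name = ", ".join(fields[1:-1])  # tolerate commas inside the GPU name
        temp = int(fields[-1]) if fields[-1].isdigit() else None
        readings.append((index, name, temp))
    return readings


def read_temperatures():
    """Run nvidia-smi and return the parsed readings (requires the NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)
```

Calling `read_temperatures()` from an hourly job, alongside the uptime logging, would show whether the GPUs were running hot in the hours before a crash.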

For AMD GPUs, the aticonfig utility can be used to measure and control the same sorts of parameters. There is a very extensive man-page which documents the zillions of options.

Cheers,
Gary.
