54+ hour BRP5-cuda32-nv270 jobs

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454483533
RAC: 8532
Topic 197188

I have recently brought online a new Ubuntu box with an NVIDIA 650 Ti. This box only processes E@H GPU WUs in parallel with Rosetta CPU WUs, and all of the GPU WUs seem to have a duration greater than 54 hours. Do WUs of this length exist, or are the times incorrect? I looked at my tasks currently in progress, but they don't show times. One such WU name is PA0022_01281_238_0. It shows 30:02 elapsed with 24:19 remaining.

Is there a problem here?
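One way to see per-task elapsed time and the client's remaining-time estimate from the command line is boinccmd. A minimal sketch, assuming the stock Ubuntu boinc-client data directory and that these grep patterns match the field names in the output:

cd /var/lib/boinc-client
sudo boinccmd --get_tasks | grep -E 'name:|fraction done|time'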

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

54+ hour BRP5-cuda32-nv270 jobs

Quote:

I have recently brought online a new Ubuntu box with an NVIDIA 650 Ti. This box only processes E@H GPU WUs in parallel with Rosetta CPU WUs, and all of the GPU WUs seem to have a duration greater than 54 hours. Do WUs of this length exist, or are the times incorrect? I looked at my tasks currently in progress, but they don't show times. One such WU name is PA0022_01281_238_0. It shows 30:02 elapsed with 24:19 remaining.

Is there a problem here?

Is it this host: http://einsteinathome.org/host/8703676/tasks ? It looks to be processing E@H CPU tasks as well as GPU tasks.

The CPU-only task times (CasA / GRP) look OK.

The GPU tasks look slower than I would expect; a rule of thumb is to ensure one free CPU core per GPU. Lately I find dedicating my host to a single flavour of GPU task (no CPU tasks) gives fewer errors, faster turnover, and overall more output for less heat.

Currently I'm on a steady BRP4G diet, which is less fattening (credit-wise) than the full-fat PAS (BRP5).

Maybe you could try freeing up some CPU resources and see what happens with the GPU task times; it can be quite a dramatic improvement.
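A minimal sketch of one way to do that on a stock Ubuntu boinc-client install: cap the CPU percentage with a preferences override so one core stays free for the GPU task. The data directory path and the 75% figure (right for a 4-core box) are assumptions; adjust to suit:

cat <<'EOF' | sudo tee /var/lib/boinc-client/global_prefs_override.xml
<global_preferences>
    <max_ncpus_pct>75.0</max_ncpus_pct>
</global_preferences>
EOF
cd /var/lib/boinc-client && sudo boinccmd --read_global_prefs_override

The same setting is available in the BOINC Manager computing preferences ("On multiprocessors, use at most ... % of the processors").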

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454483533
RAC: 8532

RE: RE: I have recently

Quote:
Quote:

I have recently brought online a new Ubuntu box with an NVIDIA 650 Ti. This box only processes E@H GPU WUs in parallel with Rosetta CPU WUs, and all of the GPU WUs seem to have a duration greater than 54 hours. Do WUs of this length exist, or are the times incorrect? I looked at my tasks currently in progress, but they don't show times. One such WU name is PA0022_01281_238_0. It shows 30:02 elapsed with 24:19 remaining.

Is there a problem here?

Is it this host: http://einsteinathome.org/host/8703676/tasks ? It looks to be processing E@H CPU tasks as well as GPU tasks.

The CPU-only task times (CasA / GRP) look OK.

The GPU tasks look slower than I would expect; a rule of thumb is to ensure one free CPU core per GPU. Lately I find dedicating my host to a single flavour of GPU task (no CPU tasks) gives fewer errors, faster turnover, and overall more output for less heat.

Currently I'm on a steady BRP4G diet, which is less fattening (credit-wise) than the full-fat PAS (BRP5).

Maybe you could try freeing up some CPU resources and see what happens with the GPU task times; it can be quite a dramatic improvement.

Yes, this is the node that you pointed out. It is running both GPU and CPU tasks for E@H as well as CPU tasks for Rosetta. I just freed up some CPU resources to see if that will help. Will freeing up CPU resources require some settling time before I see results?

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

RE: Will freeing up CPU

Quote:
Will freeing up CPU resources require some settling time before I see results?

Yes, I would expect to see improvement fairly quickly.

The PCI bus width is also worth checking.

I don't have a GTX 650, but I noticed http://einsteinathome.org/node/196380&nowrap=true#127034 has some good suggestions about PCIe 3.0.

If you are running PCIe 2.0 at x4 or x8 you will see a marked drop in performance compared to x16.
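A quick way to check the negotiated link from a terminal; the bus ID 01:00.0 below is only an example, so substitute whatever the first command reports for your card:

lspci | grep -i nvidia
sudo lspci -s 01:00.0 -vv | grep -i 'LnkCap\|LnkSta'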

BackGroundMAN
Joined: 25 Feb 05
Posts: 58
Credit: 246736656
RAC: 0

Hi, I have also a GTX-650 and

Hi, I also have a GTX 650, and I had noticed that there is a problem with the newest NVIDIA drivers on Linux systems (thread).

If you have NVIDIA drivers with a version greater than 319.xx, you can downgrade the drivers and see if this is the problem in your case.
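To check which driver version is currently loaded before deciding, a minimal sketch (either command should work on a stock install once the NVIDIA kernel module is loaded):

cat /proc/driver/nvidia/version
nvidia-smi | head -n 3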

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454483533
RAC: 8532

AgentB: freeing up some of

AgentB: freeing up some of the CPUs did not make a difference in GPU WU processing. Never hurts to try.

AgentB and BackGroundMAN

Your links pointing out issues in the NVIDIA drivers seem to be the key.

BackGroundMAN,

Your "thread" link describes exactly what I am experiencing. 4 day jobs. I feel certain that it is a driver issue because of the following:

host A: NVIDIA 650 Ti 1 gig
NVIDIA drivers 304.88
This node is chewing through GPU WUs

host B: NVIDIA 650 Ti 2 gig
NVIDIA drivers 319.32
This node is slooooooooow.

I have used a PPA to manage drivers on both hosts. I now need to figure out how to uninstall the 319.32 drivers and get the 304.88 drivers installed on host B.
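A minimal sketch of the downgrade on Ubuntu, assuming the versioned packages in the PPA are named nvidia-319 and nvidia-304 (worth confirming with apt-cache before running it):

sudo apt-get remove --purge nvidia-319 nvidia-current
sudo apt-get install nvidia-304
sudo reboot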

I will let you know in the next couple of days. (Aside: my walk-behind lawnmower's transmission died yesterday in the middle of a Florida September day, so I had to brute-force it through two weeks of uncut grass. Repairing it will take priority today over anything else. :>P )

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454483533
RAC: 8532

You guys were right. The

You guys were right.

The "long" processing time is driver related. I have fixed the problem by removing the 319.32 driver and replacing with 304.88 NVIDIA driver. Some interesting observations along with the sequence of events.

Prior to changing drivers I had suspended Rosetta. I had no jobs pending for E@H and had set E@H to no new work; my only running jobs were now E@H GPU work. I then suspended E@H and stopped the boinc-client. I used the Ubuntu Software Center to remove the 319 driver, installed the 304 driver, and rebooted the node. I now had the NVIDIA settings icon, which confirmed that I was running the 304 drivers. I launched the BOINC Manager and, leaving Rosetta suspended, took E@H out of suspend with "no new work" still set. What I noted in the BOINC Manager upon doing this follows:

1. The very long E@H jobs that had been suspended were now shown as running, with elapsed time incrementing at 1-second intervals, BUT the "remaining time" was decrementing in blocks of seconds. It is really crunching through those WUs that had been taking so long.

2. I have Perl scripts which monitor GPU temps and sound audible alarms when temperature limits are exceeded. With the 304 drivers now installed I got audible alarms because the GPU temp had exceeded a previously set value in the script. With the 319 drivers the GPU ran at 44C with a fan speed of 21%; now, with the 304 drivers installed, the temp jumped to 60C with the fan speed holding at 21% (a query sketch follows below).
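For reference, readings like these can be pulled from a terminal; a sketch, assuming nvidia-settings is installed and an X session is running (the nvidia-smi form works without X):

nvidia-smi -q -d TEMPERATURE
nvidia-settings -q GPUCoreTemp -q GPUCurrentFanSpeed -t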

I am waiting to clear out the remaining E@H GPU WUs that had the long completion times before accepting more E@H work and before "unsuspending" Rosetta.

Go figure.

Could not have done it without your help.
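For reference, the suspend / no-new-work / stop sequence described above boils down to roughly the following from a terminal. A sketch only; the project URLs and the service name assume a stock Ubuntu boinc-client install:

cd /var/lib/boinc-client
sudo boinccmd --project http://boinc.bakerlab.org/rosetta/ suspend
sudo boinccmd --project http://einstein.phys.uwm.edu/ nomorework
sudo boinccmd --project http://einstein.phys.uwm.edu/ suspend
sudo service boinc-client stop
# swap the driver packages here, then:
sudo reboot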

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

RE: The "long" processing

Quote:

The "long" processing time is driver related. I have fixed the problem by removing the 319.32 driver and replacing with 304.88 NVIDIA driver.

That´s good news.

Quote:

I used the Ubuntu Software Center to remove the 319 driver and then installed the 304 driver.

I think the general consensus is to get nVidia drivers direct from nVidia rather than from the distros. It's a small hassle, but worth the effort.
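The direct route is roughly: download the .run installer from nvidia.com, stop X, run the installer, restart X. A sketch, assuming the 304.88 installer and lightdm as the display manager:

# run from a text console (e.g. Ctrl+Alt+F1), since stopping lightdm ends the desktop session
sudo service lightdm stop
chmod +x NVIDIA-Linux-x86_64-304.88.run
sudo ./NVIDIA-Linux-x86_64-304.88.run
sudo service lightdm start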

I'd be interested if 325.x fixes this.

http://einsteinathome.org/node/197123

Quote:
Could not have done it without your help.

BackGroundMAN, take a bow.

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454483533
RAC: 8532

RE: RE: The "long"

Quote:
Quote:

The "long" processing time is driver related. I have fixed the problem by removing the 319.32 driver and replacing with 304.88 NVIDIA driver.

That´s good news.

Quote:

I used the Ubuntu Software Center to remove the 319 driver and then installed the 304 driver.

I think the general consensus is to get nVidia drivers direct from nVidia rather than from the distros. It's a small hassle, but worth the effort.


In the past I had installed the drivers directly from NVIDIA, but whenever I installed a new kernel everything fell apart. Hence I install from the PPA.

Quote:

I'd be interested if 325.x fixes this.

http://einsteinathome.org/node/197123

Quote:
Could not have done it without your help.

BackGroundMAN, take a bow.


He does indeed.

BackGroundMAN
Joined: 25 Feb 05
Posts: 58
Credit: 246736656
RAC: 0

RE: I be interested if

Quote:
I'd be interested if 325.x fixes this.

No. I tried nvidia-325.15 on a Gentoo x64 Linux box (kernel 3.8.13) and got very poor performance in EaH. My estimate is that 325.15 is ~6 times slower than 313.30 in EaH.
In other CUDA applications (CUDA samples, MATLAB) there is no performance gap between the various NVIDIA drivers from 313.30 to 325.15.

I am not yet ready to blame the NVIDIA drivers for this issue.
EaH uses an old CUDA SDK (3.2) while other applications use CUDA SDK 5.0.
I have tried to compile the BRP CUDA app with the new CUDA SDK and profile the application with several NVIDIA drivers, but I have some issues with the produced binary.

Quote:
I think the general consensus is to get nVidia drivers direct from nVidia rather than from the distros. It's a small hassle, but worth the effort.

If you have a Debian-like or Gentoo-like distro this is not a good idea, because you bypass all the testing of new drivers by the distro people.

Gentoo's latest stable nvidia-driver is 313.30 while Debian's (testing, jessie) is 304.88. Neither driver has any performance issues with EaH!

Furthermore, you have to create your own scripts for automatic driver update and re-installation in case of a kernel upgrade.

robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454483533
RAC: 8532

RE: RE: I be interested

Quote:
Quote:
I'd be interested if 325.x fixes this.

No. I tried nvidia-325.15 on a Gentoo x64 Linux box (kernel 3.8.13) and got very poor performance in EaH. My estimate is that 325.15 is ~6 times slower than 313.30 in EaH.
In other CUDA applications (CUDA samples, MATLAB) there is no performance gap between the various NVIDIA drivers from 313.30 to 325.15.

I am not yet ready to blame the NVIDIA drivers for this issue.
EaH uses an old CUDA SDK (3.2) while other applications use CUDA SDK 5.0.
I have tried to compile the BRP CUDA app with the new CUDA SDK and profile the application with several NVIDIA drivers, but I have some issues with the produced binary.

Quote:
I think the general consensus is to get nVidia drivers direct from nVidia rather than from the distros. It's a small hassle, but worth the effort.

If you have a Debian-like or Gentoo-like distro this is not a good idea, because you bypass all the testing of new drivers by the distro people.

Gentoo's latest stable nvidia-driver is 313.30 while Debian's (testing, jessie) is 304.88. Neither driver has any performance issues with EaH!

Furthermore, you have to create your own scripts for automatic driver update and re-installation in case of a kernel upgrade.

BackGroundMAN,

During the initial build of this node I did the following to provide NVIDIA support:

sudo add-apt-repository ppa:ubuntu-x-swat/x-updates
sudo apt-get update
sudo apt-get install nvidia-current

These steps brought in the 319 driver, which is "bad". Was my choice of "nvidia-current" the wrong package for the initial install?

Hoping you can shed some light on how this happened.
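For what it's worth, a quick way to see which version nvidia-current resolves to and which versioned driver packages the PPA offers (a sketch):

apt-cache policy nvidia-current
apt-cache search -n '^nvidia-[0-9]+'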
