All things Nvidia GPU

Keith Myers
Joined: 11 Feb 11
Posts: 4895
Credit: 18435203241
RAC: 5712149

Tom, you never answered the question of whether you looked at the ECC errors on the VRAM when you are so heavily overclocking the memory.

The GPU will happily keep correcting errors, but since that takes several retries on every memory transfer, it causes a slowdown in performance.  Exactly what you are experiencing.

Tom M
Joined: 2 Feb 06
Posts: 6258
Credit: 8908723658
RAC: 10241106

Yes, it returns to processing at full speed after a reboot.

It is a dedicated BOINC machine running WCG on the CPU and e@h/grp#1 on the GPU.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)

GWGeorge007
Joined: 8 Jan 18
Posts: 2997
Credit: 4926034438
RAC: 132503

Tom M wrote:

Processing slows down unless I re-boot daily?

Hi, I have an RTX 3080 Ti FE that seems to slow down its processing after being up for several days.

It goes from 2.5+ minutes up to 3.5-4 minutes per task.  It is running 2x on the GPU with petri's optimized app.

This has happened across both the current 525 and 470 Ubuntu/Nvidia drivers, including with an MTM OC of +1700.

This GPU routinely used to do a little over 3M on the e@h grp#1 diet it normally gets.

It is clearly stalling at 2.5M.

I have rebooted and am now running an MTM OC of +900.

What else should I be testing/looking at?

Thank you.

Tom M

Be honest with yourself.  Is this a 'daily driver', or do you do anything else besides BOINC with it?

You may have some other program stealing memory and GPU usage.  Have you run 'nvidia-smi'?

Also, did you buy the GPU in question used?  Do you know how old it is?  I had a couple of EVGA GPUs do that, and I ended up applying new thermal pads to them and they sped up again.  Could be the same with yours.

George

Proud member of the Old Farts Association

Tom M

Keith Myers wrote:

Tom, you never answered the question of whether you looked at the ECC errors on the VRAM when you are so heavily overclocking the memory.

The GPU will happily keep correcting errors, but since that takes several retries on every memory transfer, it causes a slowdown in performance.  Exactly what you are experiencing.

I am pretty sure I looked the last time you asked. I didn't see anything.  Let me research how and add that to my permanent notes.

Tom M


Tom M

Tom M wrote:

Keith Myers wrote:

Tom, you never answered the question of whether you looked at the ECC errors on the VRAM when you are so heavily overclocking the memory.

The GPU will happily keep correcting errors, but since that takes several retries on every memory transfer, it causes a slowdown in performance.  Exactly what you are experiencing.

I am pretty sure I looked the last time you asked. I didn't see anything.  Let me research how and add that to my permanent notes.

tlgalenson@Ryzen-OneHorseShay:~$ nvidia-smi -q -d PAGE_RETIREMENT

==============NVSMI LOG==============

Timestamp                                 : Fri Apr 28 20:45:40 2023
Driver Version                            : 525.105.17
CUDA Version                              : 12.0

Attached GPUs                             : 1
GPU 00000000:09:00.0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A

tlgalenson@Ryzen-OneHorseShay:~$ nvidia-smi
Fri Apr 28 20:48:23 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:09:00.0  On |                  N/A |
| 69%   70C    P2   314W / 350W |   3953MiB / 12288MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1143      G   /usr/lib/xorg/Xorg                 17MiB |
|    0   N/A  N/A      1758      G   /usr/lib/xorg/Xorg                 70MiB |
|    0   N/A  N/A      1886      G   /usr/bin/gnome-shell               71MiB |
|    0   N/A  N/A      2214      G   /usr/bin/nvidia-settings            0MiB |
|    0   N/A  N/A      2683      G   /usr/lib/firefox/firefox           28MiB |
|    0   N/A  N/A      4146      C   ...-pc-linux-gnu-opencl_v1.0     1874MiB |
|    0   N/A  N/A      4172      C   ...-pc-linux-gnu-opencl_v1.0     1874MiB |
+-----------------------------------------------------------------------------+
tlgalenson@Ryzen-OneHorseShay:~$ nvidia-smi -g 0 --ecc-config=0
ECC features not supported for GPU 00000000:09:00.0.
Treating as warning and moving on.
All done.
tlgalenson@Ryzen-OneHorseShay:~$


Boca Raton Community HS
Joined: 4 Nov 15
Posts: 232
Credit: 9497128920
RAC: 23028947

Is there any correlation of what WCG projects are running when the slowdown happens? What happens if you temporarily suspend WCG work when you notice the slowdown? If it speeds back up, then there could be a bottleneck because of the WCG work.

Also, the WCG Open Pandemics project has GPU tasks that get sent out sometimes (a lot of them today). Could those be running at some point?

Tom M

GWGeorge007 wrote:

Be honest with yourself.  Is this a 'daily driver', or do you do anything else besides BOINC with it?

You may have some other program stealing memory and GPU usage.  Have you run 'nvidia-smi'?

Also, did you buy the GPU in question used?  Do you know how old it is?  I had a couple of EVGA GPUs do that, and I ended up applying new thermal pads to them and they sped up again.  Could be the same with yours.

The daily driver runs Windows, not Linux/Ubuntu.  I do use the Firefox web browser on my BOINC boxes, but everything else is Linux utilities.

I got one RTX 3080 Ti FE in a swap from Ian&Steve C. after buying the original one used.

I will jack up the GPU fan speeds and keep an eye on the thermal readings.
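
One way to keep an eye on the thermal readings over several days is to log them, so a slowdown can later be lined up against temperature and clocks. The following is only a sketch: the log path, the one-minute interval, and the 83 C limit are assumptions chosen for illustration, not values from this thread. The query fields are standard `nvidia-smi --query-gpu` properties.

```shell
#!/usr/bin/env bash
# Sketch: log GPU temperature, clocks and power once a minute.
# LOG and TEMP_LIMIT are hypothetical defaults; adjust to taste.
LOG="${LOG:-$HOME/gpu_watch.csv}"
TEMP_LIMIT="${TEMP_LIMIT:-83}"   # assumed alert threshold, degrees C

# Read rows "timestamp, temp, sm_clock, mem_clock, power" on stdin and
# print only the rows whose temperature is at or above the given limit.
hot_rows() {
    awk -F', *' -v limit="$1" '$2 + 0 >= limit + 0'
}

# Append one sample every 60 seconds to the log until interrupted.
log_gpu() {
    nvidia-smi \
        --query-gpu=timestamp,temperature.gpu,clocks.sm,clocks.mem,power.draw \
        --format=csv,noheader,nounits -l 60 >> "$LOG"
}

# Start logging only when asked to and when nvidia-smi is available.
if [ "${1:-}" = "run" ] && command -v nvidia-smi >/dev/null 2>&1; then
    log_gpu
fi
```

After a slowdown, something like `hot_rows "$TEMP_LIMIT" < "$LOG"` would list the samples taken while the card was at or above the limit, and the SM and memory clock columns show whether the clocks dropped at the same time.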


Tom M

Boca Raton Community HS wrote:

Is there any correlation of what WCG projects are running when the slowdown happens? What happens if you temporarily suspend WCG work when you notice the slowdown? If it speeds back up, then there could be a bottleneck because of the WCG work.

Also, the WCG Open Pandemics project has GPU tasks that get sent out sometimes (a lot of them today). Could those be running at some point?

Good question(s).  I think I am running a CPU-only profile on WCG.  I have not gotten GPU tasks from WCG in a very long time, even if I set a profile that only asks for them.

Presuming it slows down again in a couple of days, I will try "suspending" the WCG project and observe what happens to the GPU processing.

Tom M
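
The suspend test described above can also be done from a terminal with `boinccmd`, which makes it easy to note the exact time the CPU work stopped. This is a minimal sketch: the project URL below is an assumption and should be replaced with the master URL that `boinccmd --get_project_status` reports on the host in question. The helper names `project_cmd` and `wcg` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: suspend or resume one BOINC project from the command line.
# WCG_URL is an assumption; check `boinccmd --get_project_status`.
WCG_URL="${WCG_URL:-https://www.worldcommunitygrid.org/}"

# Print the boinccmd invocation for an operation ("suspend" or "resume").
project_cmd() {
    case "$1" in
        suspend|resume)
            printf 'boinccmd --project %s %s\n' "$WCG_URL" "$1" ;;
        *)
            echo "usage: project_cmd suspend|resume" >&2
            return 1 ;;
    esac
}

# Dry run by default: only show the command. Pass "go" as the second
# argument to execute it (needs boinccmd and access to the client).
wcg() {
    project_cmd "$1" >/dev/null || return 1
    if [ "${2:-}" = "go" ]; then
        boinccmd --project "$WCG_URL" "$1"
    else
        project_cmd "$1"
    fi
}
```

Running `wcg suspend go` when the slowdown shows up, watching a few GPU tasks complete, and then running `wcg resume go` would show whether the CPU load is part of the problem.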


Ian&Steve C.
Joined: 19 Jan 20
Posts: 3911
Credit: 43716725976
RAC: 63093498

Tom M wrote:

Tom M wrote:

Keith Myers wrote:

Tom, you never answered the question of whether you looked at the ECC errors on the VRAM when you are so heavily overclocking the memory.

The GPU will happily keep correcting errors, but since that takes several retries on every memory transfer, it causes a slowdown in performance.  Exactly what you are experiencing.

I am pretty sure I looked the last time you asked. I didn't see anything.  Let me research how and add that to my permanent notes.

tlgalenson@Ryzen-OneHorseShay:~$ nvidia-smi -q -d PAGE_RETIREMENT

==============NVSMI LOG==============

Timestamp                                 : Fri Apr 28 20:45:40 2023
Driver Version                            : 525.105.17
CUDA Version                              : 12.0

Attached GPUs                             : 1
GPU 00000000:09:00.0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A

tlgalenson@Ryzen-OneHorseShay:~$ nvidia-smi
Fri Apr 28 20:48:23 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:09:00.0  On |                  N/A |
| 69%   70C    P2   314W / 350W |   3953MiB / 12288MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1143      G   /usr/lib/xorg/Xorg                 17MiB |
|    0   N/A  N/A      1758      G   /usr/lib/xorg/Xorg                 70MiB |
|    0   N/A  N/A      1886      G   /usr/bin/gnome-shell               71MiB |
|    0   N/A  N/A      2214      G   /usr/bin/nvidia-settings            0MiB |
|    0   N/A  N/A      2683      G   /usr/lib/firefox/firefox           28MiB |
|    0   N/A  N/A      4146      C   ...-pc-linux-gnu-opencl_v1.0     1874MiB |
|    0   N/A  N/A      4172      C   ...-pc-linux-gnu-opencl_v1.0     1874MiB |
+-----------------------------------------------------------------------------+
tlgalenson@Ryzen-OneHorseShay:~$ nvidia-smi -g 0 --ecc-config=0
ECC features not supported for GPU 00000000:09:00.0.
Treating as warning and moving on.
All done.
tlgalenson@Ryzen-OneHorseShay:~$

That's not what Keith is referring to. He's talking about memory errors related to a heavy memory overclock. Not sure how you check those errors in Linux, but I know you can see them in Windows.

Are you still overclocking the memory by +1500 or more? Maybe try reducing that to stock speeds to see if the slowdown behavior stops.


Tom M

Ian&Steve C. wrote:

That's not what Keith is referring to. He's talking about memory errors related to a heavy memory overclock. Not sure how you check those errors in Linux, but I know you can see them in Windows.

Are you still overclocking the memory by +1500 or more? Maybe try reducing that to stock speeds to see if the slowdown behavior stops.

Slap Forehead.

I was memory overclocking at +1700.  I am now memory overclocking at +900.  Which means another test, if/when the processing slowdown occurs again, will be dropping the OC entirely.

If I can figure out the right google search.... or if Keith reminds me what the command line is that he had me test....
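
The step-down test (+1700, then +900, then no overclock) can be written out as a short plan. The sketch below only prints the commands and does not apply anything. It assumes the offset is set through `nvidia-settings` with the `GPUMemoryTransferRateOffsetAllPerformanceLevels` attribute on GPU index 0, which normally needs Coolbits enabled in the X configuration; if a different tool is used to set the overclock, the command line will differ. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the nvidia-settings command for each memory offset in a
# step-down test. Offsets are the ones mentioned in the thread plus stock;
# GPU index 0 and the use of nvidia-settings are assumptions.
OFFSETS="${OFFSETS:-1700 900 0}"

# Build the command for one memory transfer-rate offset.
offset_cmd() {
    printf 'nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffsetAllPerformanceLevels=%d"\n' "$1"
}

# List the whole plan, highest offset first, so each step can be applied
# by hand and left running for a few days before moving to the next.
plan() {
    for off in $OFFSETS; do
        offset_cmd "$off"
    done
}
```

Calling `plan` prints one command per step. If the slowdown disappears at a lower offset and returns at a higher one, that points at the memory overclock rather than at drivers or thermals.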

