Einstein FGRPB1G Linux/Nvidia Special app "AIO"

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47013242642
RAC: 64961153


First, I just want to say that there is no guarantee that slowing the memory clock will reduce the invalids; it was only a suggestion to try, to see if it helped. The same goes for slowing the core clock. I just wanted to make that clearer.

Second, yes, you need to edit Coolbits to unlock overclocking (as well as thermal control). I do it by running this command:

sudo nvidia-xconfig --thermal-configuration-check --cool-bits=28 --enable-all-gpus

Then reboot, and you should have the ability to adjust the clocks in the Nvidia Settings app.
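
For reference, that command writes a Coolbits option into the Device section of /etc/X11/xorg.conf. The identifier below is an assumption (yours will vary), but the generated section should look roughly like this:

Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    Option     "Coolbits" "28"
EndSection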

_________________________________________________________________________

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10599935586
RAC: 21065444


Ian&Steve C. wrote:

First, I just want to say that there is no guarantee that slowing the memory clock will reduce the invalids; it was only a suggestion to try, to see if it helped. The same goes for slowing the core clock. I just wanted to make that clearer.

Second, yes, you need to edit Coolbits to unlock overclocking (as well as thermal control). I do it by running this command:

sudo nvidia-xconfig --thermal-configuration-check --cool-bits=28 --enable-all-gpus

Then reboot, and you should have the ability to adjust the clocks in the Nvidia Settings app.

For sure, I am really just playing around with it to see the impact.

I can edit the clock settings, but they do not save when I exit the app and re-open it. I am not even sure they actually update the clock speeds after I change the numbers. What is odd is that if I change the preferred mode, that DOES save when I exit the app and re-open it.


EDIT: Never mind, I think I figured it out... I think it saved this time?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47013242642
RAC: 64961153


The command I posted is the one I use on all my systems. Changing the clock speed persists while the system is up, but it does not persist after a reboot; you will have to re-set the clocks to what you want after each reboot.

I just do this with a script.

GPU_clocks.sh wrote:

#!/bin/bash

# enable persistence mode and allow unrestricted application clock changes
/usr/bin/nvidia-smi -pm 1
/usr/bin/nvidia-smi -acp UNRESTRICTED

# cap the power limit of GPU 0 at 130 W
/usr/bin/nvidia-smi -i 0 -pl 130

# set PowerMizer to "Prefer Maximum Performance"
/usr/bin/nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"

# apply memory (+1000) and core (+50) clock offsets at performance level 4
/usr/bin/nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[4]=1000" -a "[gpu:0]/GPUGraphicsClockOffset[4]=50"

# take manual fan control and run the fan at 100%
/usr/bin/nvidia-settings -a '[gpu:0]/GPUFanControlState=1' -a '[fan:0]/GPUTargetFanSpeed=100'

Save this and adjust the values as needed, and make sure it's executable (either run chmod +x on it, or go into the file properties and check the 'run as a program' box).

If you want to reduce clocks, use a negative offset.
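
If you want the script to run automatically after a reboot, one option (my suggestion, not something from the post above) is a user crontab entry. The nvidia-settings calls need a running X session, so the delay, the DISPLAY/XAUTHORITY values, and the script path below are all assumptions to adjust for your setup:

# hypothetical crontab entry (edit with 'crontab -e'); adjust paths for your user
@reboot sleep 60 && DISPLAY=:0 XAUTHORITY=/home/user/.Xauthority /home/user/GPU_clocks.sh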

_________________________________________________________________________

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10599935586
RAC: 21065444


Okay, thank you. I will play around with this. Initial observation: changing the clocks (up or down, even by 100) causes the work units to fail at about 2%. That happens even with a reduction in core and/or memory clock. It seems to be very... finicky. I will keep messing around with it.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18748790456
RAC: 7072986


I see you are getting hit by a couple of different errors.

First, you are getting the "flushing" errors that I get on my 2080 cards paired with a 5950X. I seemed to be the only one with this issue. I tried extensive troubleshooting against all possible variables and never could pin down the cause on that host. I had other cards in other hosts that were not afflicted. I thought that only the older 2080 cards caused the issue, as none of my 3000 series cards ever had the problem.

Interesting to see a new 4000 series card afflicted also.

The other errors I see suggest that something about Petri's clfft kernel files is not liked by the 4090 card.

This is an interesting snippet from an errored task.

Quote:
Using alternate fft kernel file: ../../clfft.kernel.Transpose2.cl.alt
Using alternate fft kernel file: ../../clfft.kernel.Stockham3.cl.alt
FFTGeneratedStockhamAction::compileKernels failed
Error in OpenCL context: CL_OUT_OF_RESOURCES error executing CL_COMMAND_WRITE_BUFFER on NVIDIA GeForce RTX 4090 (Device 0).

And here again:

Quote:

Using alternate fft kernel file: ../../clfft.kernel.Transpose4.cl.alt
FFTGeneratedTransposeGCNAction::compileKernels failed
ERROR: plan generation("baking") failed: -5
09:05:52 (5558): [CRITICAL]: ERROR: MAIN() returned with error '-5'

I never knew that "baking" plan generation was part of the AIO setup.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47013242642
RAC: 64961153


Boca Raton Community HS wrote:

Okay, thank you. I will play around with this. Initial observation: changing the clocks (up or down, even by 100) causes the work units to fail at about 2%. That happens even with a reduction in core and/or memory clock. It seems to be very... finicky. I will keep messing around with it.

Another question, out of curiosity: how much VRAM is each gamma-ray task using?

You can check with the 'nvidia-smi' command, in the process list at the bottom of the output.
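
If you prefer a machine-readable view, nvidia-smi's standard query flags can print just the per-process memory use (these are stock nvidia-smi flags, nothing Einstein-specific assumed):

# list compute processes with their VRAM use, one per line
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv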

_________________________________________________________________________

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10599935586
RAC: 21065444


Keith Myers wrote:

I seemed to be the only one with this issue. I tried extensive troubleshooting against all possible variables and never could pin down the cause on that host. I had other cards in other hosts that were not afflicted. I thought that only the older 2080 cards caused the issue, as none of my 3000 series cards ever had the problem.

Interesting to see a new 4000 series card afflicted also.

Glad I could join you with this error!

Ian&Steve C. wrote:

Another question, out of curiosity: how much VRAM is each gamma-ray task using?

You can check with the 'nvidia-smi' command, in the process list at the bottom of the output.

Each one is using 2010 MiB.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47013242642
RAC: 64961153


OK, that's not too much, and pretty normal for this app.

_________________________________________________________________________

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 240
Credit: 10599935586
RAC: 21065444


I had a thought over the weekend: what if I enable ECC on the GPU? Do you think this could decrease the invalids? I do not know enough about why a result gets "invalidated" and whether that relates to errors in GPU memory. I can easily test this and let it run for a while, but I wanted to hear some thoughts on it first.

I understand this will slow the work unit down, but if it increases the valid rate, then it might be worth it. 
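
For reference, and assuming the card exposes the setting: ECC is toggled with stock nvidia-smi flags and only takes effect after a reboot. The GPU index 0 below is an assumption.

# enable ECC on GPU 0 (needs root; takes effect on the next reboot)
sudo nvidia-smi -i 0 -e 1
# check the current and pending ECC mode
nvidia-smi -i 0 -q -d ECC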

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47013242642
RAC: 64961153


You can give it a shot, but it will also likely slow down the computation. You'd have to test whether the decrease in invalids (if any) offsets the slower crunch times.

_________________________________________________________________________
