first, I just want to say that there is no guarantee that slowing the memory clock will reduce the invalids. it was only a suggestion to try to see if it helped. same with slowing the core clock. i just wanted that to be more clear.
second, yes you need to edit the coolbits to unlock the overclocking ability (as well as thermal control). I do it by running this command:

sudo nvidia-xconfig --thermal-configuration-check --cool-bits=28 --enable-all-gpus
then reboot. and you should have the ability to adjust the clocks in the Nvidia Settings app.
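For reference, the Coolbits value is a bitmask (per the NVIDIA driver README): 4 unlocks manual fan control, 8 unlocks the clock offset controls, and 16 unlocks overvoltage, so 28 enables all three. A small sketch to decode a value (the `decode_coolbits` name is mine, not anything from the driver):

```shell
#!/bin/bash
# decode_coolbits: print which controls a given Coolbits bitmask unlocks.
# Bit meanings per the NVIDIA driver README: 4 = manual fan control,
# 8 = clock offsets (overclocking), 16 = overvoltage.
decode_coolbits() {
    local bits="$1" out=""
    [ $((bits & 4))  -ne 0 ] && out="$out fan-control"
    [ $((bits & 8))  -ne 0 ] && out="$out clock-offsets"
    [ $((bits & 16)) -ne 0 ] && out="$out overvoltage"
    echo "${out# }"
}

decode_coolbits 28   # the value used above: unlocks all three
```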
For sure- I am really just playing around with it to see the impact.
I can edit the clock settings but they will not save when I exit the app and re-open it. I am not even sure they update the clock speeds after I change the numbers. What is odd is that if I change the preferred mode, that DOES save when I exit the app and re-open it.
EDIT: Nevermind- I think I figured it out... I think it saved this time?
the command i posted is the command i use on all my systems. changing the clock speed persists, but it does not persist after a reboot. you will have to re-set the clocks to what you want after reboots. i just do this with a script:
/usr/bin/nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
/usr/bin/nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[4]=1000" -a "[gpu:0]/GPUGraphicsClockOffset[4]=50"
/usr/bin/nvidia-settings -a '[gpu:0]/GPUFanControlState=1' -a '[fan:0]/GPUTargetFanSpeed=100'
save this, adjust the values as needed. and make sure it's executable. (either run chmod +x on it, or go into the file properties and click the checkbox to 'run as a program')
if you want to reduce clocks, use a negative offset.
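As a sketch, the three commands above can be wrapped into one script with the offsets as parameters. The `apply_clocks` name, the defaults, and the `DRY_RUN` switch are my additions; with `DRY_RUN=1` it only prints the commands, which is a safe way to sanity-check the values before applying them:

```shell
#!/bin/bash
# apply_clocks: set PowerMizer mode, clock offsets, and fan speed on gpu:0.
# Usage: apply_clocks <mem_offset> <core_offset> <fan_speed>
# With DRY_RUN=1 the nvidia-settings commands are printed, not executed.
apply_clocks() {
    local mem_offset="${1:--500}"    # negative offsets reduce clocks
    local core_offset="${2:--100}"
    local fan_speed="${3:-100}"
    run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }
    run /usr/bin/nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
    run /usr/bin/nvidia-settings \
        -a "[gpu:0]/GPUMemoryTransferRateOffset[4]=${mem_offset}" \
        -a "[gpu:0]/GPUGraphicsClockOffset[4]=${core_offset}"
    run /usr/bin/nvidia-settings \
        -a "[gpu:0]/GPUFanControlState=1" \
        -a "[fan:0]/GPUTargetFanSpeed=${fan_speed}"
}

DRY_RUN=1 apply_clocks -500 -100 100   # preview only; drop DRY_RUN to apply
```

Dropped into a startup script, this also solves the "re-set the clocks after every reboot" chore.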
Okay, thank you. I will play around with this. Initial observation- changing the clocks (up or down by even 100) is causing the work units to fail at about ~2%. That is even with a reduction in clock and/or memory speed. It seems to be very... finicky. I will keep messing around with it.
I see you are getting hit by a couple of different errors.
First is you are getting the "flushing" errors that I get on my 2080 cards using a 5950X. I seemed to be the only one with this issue. I extensively tried troubleshooting against all possible variables and never could pin down the cause on that host. I had other cards in other hosts not afflicted. I thought that only the older 2080 cards caused the issue as none of my 3000 series cards ever had the problem.
Interesting to see a new 4000 series card afflicted also.
The other errors I see are something about Petri's clfft files not being liked by the 4090 card.
This is an interesting snippet from an errored task.
Quote:
Using alternate fft kernel file: ../../clfft.kernel.Transpose2.cl.alt
Using alternate fft kernel file: ../../clfft.kernel.Stockham3.cl.alt
FFTGeneratedStockhamAction::compileKernels failed
Error in OpenCL context: CL_OUT_OF_RESOURCES error executing CL_COMMAND_WRITE_BUFFER on NVIDIA GeForce RTX 4090 (Device 0).
And here again:
Quote:
Using alternate fft kernel file: ../../clfft.kernel.Transpose4.cl.alt
FFTGeneratedTransposeGCNAction::compileKernels failed
ERROR: plan generation("baking") failed: -5
09:05:52 (5558): [CRITICAL]: ERROR: MAIN() returned with error '-5'
Never knew that a "baking" plan generation was in the AIO setup.
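For what it's worth, the -5 in that log is just a raw OpenCL status code: -5 is CL_OUT_OF_RESOURCES, the same error as in the first snippet, so both failures point at the card running out of resources while clFFT "bakes" (compiles) its FFT plans. A quick lookup sketch for the codes that tend to show up in these logs (the `cl_errname` helper is mine; the values come from the standard CL/cl.h header):

```shell
#!/bin/bash
# cl_errname: map a few common OpenCL status codes (from CL/cl.h) to names.
cl_errname() {
    case "$1" in
        0)   echo "CL_SUCCESS" ;;
        -4)  echo "CL_MEM_OBJECT_ALLOCATION_FAILURE" ;;
        -5)  echo "CL_OUT_OF_RESOURCES" ;;
        -6)  echo "CL_OUT_OF_HOST_MEMORY" ;;
        -11) echo "CL_BUILD_PROGRAM_FAILURE" ;;
        *)   echo "unknown ($1)" ;;
    esac
}

cl_errname -5   # the code from the failed "baking" step
```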
Boca Raton Community HS wrote:
Okay, thank you. I will play around with this. Initial observation- changing the clocks (up or down by even 100) is causing the work units to fail at about ~2%. That is even with a reduction in clock and/or memory speed. It seems to be very... finicky. I will keep messing around with it.
another question of curiosity.
how much VRAM is each gamma ray task using?
you can check with the 'nvidia-smi' command, in the listed processes at the bottom of the output.
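If you'd rather script it than eyeball the process table, `nvidia-smi` can emit just the per-process memory as CSV via `--query-compute-apps=pid,used_memory --format=csv,noheader`. The summing helper below is my own sketch, and the sample CSV is a made-up illustration of that output's shape:

```shell
#!/bin/bash
# sum_vram_mib: total the used_memory column of
#   nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader
# read from stdin, printed as a plain MiB count.
sum_vram_mib() {
    awk -F', *' '{ gsub(/ *MiB/, "", $2); total += $2 } END { print total+0 }'
}

# hypothetical sample of the CSV output with two tasks running:
sample="12345, 2010 MiB
12346, 2010 MiB"
printf '%s\n' "$sample" | sum_vram_mib   # prints 4020
```

On a live host you would pipe the real command into it instead: `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader | sum_vram_mib`.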
Keith Myers wrote:
I seemed to be the only one with this issue. I extensively tried troubleshooting against all possible variables and never could pin down the cause on that host. I had other cards in other hosts not afflicted. I thought that only the older 2080 cards caused the issue as none of my 3000 series cards ever had the problem.
Interesting to see a new 4000 series card afflicted also.
Glad I could join you with this error!
Ian&Steve C. wrote:
another question of curiosity.
how much VRAM is each gamma ray task using?
you can check with the 'nvidia-smi' command, in the listed processes at the bottom of the output.

Each one is using 2010MiB.

ok that's not too much and pretty normal for this app.
I had a thought during the weekend- what if I enable ECC on the GPU? Do you think this could decrease the invalids? I do not know enough about why a result gets "invalidated" and whether that relates to GPU memory errors. I can easily test this and let it go for a while, but wanted to know some thoughts on this.
I understand this will slow the work unit down, but if it increases the valid rate, then it might be worth it.
you can give it a shot. but it also will likely slow down the computation. you'd have to test if the decrease in invalids (if at all) offsets the slower crunch times.
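One way to make that call concrete is to compare valid results per day rather than raw task times (ECC itself is toggled with `nvidia-smi -e 1` as root, followed by a reboot). A sketch of the arithmetic; the `valid_per_day` helper and all the numbers are placeholders to fill in from real test runs:

```shell
#!/bin/bash
# valid_per_day: tasks per day scaled by the fraction that validate.
# Usage: valid_per_day <seconds_per_task> <invalid_fraction>
valid_per_day() {
    awk -v t="$1" -v inv="$2" 'BEGIN { printf "%.0f\n", (86400 / t) * (1 - inv) }'
}

# hypothetical example: ECC off = 480 s/task with 4% invalid,
# ECC on = 520 s/task with 1% invalid
valid_per_day 480 0.04   # 86400/480 * 0.96 = 172.8 -> prints 173
valid_per_day 520 0.01   # 86400/520 * 0.99 = 164.5 -> prints 164
```

In this made-up example the slower ECC runs do not pay off, but real numbers could go either way; the point is just to compare the two products, not the two crunch times.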