NVidia driver 515.48.07 gives OpenCL errors for 12GB RTX 3060 on Linux

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6587
Credit: 311440207
RAC: 97337

There are indeed 10

There are indeed 10 peaks/troughs in the toplist phase ! I'd wondered about that, it's such a regular pattern.

Well spotted. ;-)

The temps do show asymptotic behaviour with each phase, as the balance b/w heat production and heat loss comes toward a dynamic equilibrium. Reminds me of capacitor charge/discharge curves.
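{ If one wants to push the analogy : a first-order heat balance gives T(t) = T_eq + ( T_0 - T_eq ) * exp( -t/tau ), just as for the charging capacitor. A minimal Python sketch of that curve, where the temperatures and the time constant tau are made-up illustrative numbers, not measurements from my rig :

import math

def temp_at(t, t0=40.0, t_eq=70.0, tau=30.0):
    """Temperature ( deg C ) at time t ( s ) under a first-order heat balance."""
    return t_eq + (t0 - t_eq) * math.exp(-t / tau)

for t in (0, 30, 60, 90, 120):
    print(f"t = {t:3d} s   T = {temp_at(t):.1f} C")

}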

As for skypoints, the stderr mentions for Gamma-ray pulsar search #5:

% fft length: 16777216 (0x1000000)

so I don't think we will discern so many features ( 2^25 ) with the granularity of the temperature sampling. What we might be seeing though is different phases of the FFT and that depends on the methodology for the workunit. I'll look more closely into that.

{ Side note : the 'traditional' FFT length for E@H has been 2^22, for instance stderr for a Binary Radio Pulsar Search (Arecibo, large) sub-unit:

------> Number of samples: 4194304

which is repeated 8 times in such a work unit. That is : 16777216 = 2^25 = 2^3 * 2^22 = 8 * 4194304 }
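( One can confirm that arithmetic in a couple of lines of Python :

assert 2 ** 25 == 16777216        # Gamma-ray #5 FFT length
assert 2 ** 22 == 4194304         # BRP sub-unit sample count
assert 2 ** 25 == 8 * 2 ** 22     # i.e. eight BRP-sized blocks per work unit
print("16777216 = 2^25 = 2^3 * 2^22 = 8 * 4194304")

)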

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6587
Credit: 311440207
RAC: 97337

{ If you will forgive some

{ If you will forgive some further wandering of this thread from its title .... }

Now the FFT phase is a bit more difficult to analyse, as the sampling of the Coolero application doesn't neatly coincide with the peaks/troughs. But in detail one sees something like this ( GPU workload curve ):

For which I have counted out many examples, getting somewhere b/w 40 and 50 peaks/troughs depending on the workunit. I think the actual number is constant across workunits, and it would be evidently so with a higher sampling rate on the workload curve. My bet is that the actual number is 48.

Reasoning ( much is abbreviated here for clarity ) : there are textbooks chock full of FFT algorithms, however one approach is as follows. Any given FFT may be solved by a recursive process whereby an FFT of a certain size is expressed in terms of several FFTs of a smaller size, each of which may be simpler to solve. For FFTs whose length is a power of two, the obvious strategy is to express an FFT of 2^n points in terms of two FFTs of 2^(n-1) points. So if one inputs a 2^25 point FFT then one can express that as two FFTs of 2^24 points, each of those as two FFTs of 2^23 points, and so on until one reaches an FFT size which is readily solved ( this is best viewed as matrix operations IMHO ). The easiest FFT is of size 2^0 = 1, which is trivial, while 2^1 is less so.

Now my guess is that some 24 levels of recursion have occurred, to go from 2^25 down to 2^1. When you get to 2^1 you start inserting some numbers ( multiplications and re-indexing and 'twiddling' ) to solve the 2^2 FFTs, use those to then solve the 2^3 FFTs ( multiplications/re-indexing/twiddling ), and so on back up to the 2^25 FFT : the fully transformed solution to the original question put. Hence I'm implying some 24 levels of recursion 'downwards' to 2^1 and then another 24 levels back 'upwards' to 2^25. That's 24 * 2 = 48 bursts of activity. Exactly what is happening at each peak, and at each trough, I can't be sure of. I only speak of the rhythm of the work.
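To make the level-counting concrete, here is a minimal recursive radix-2 ( Cooley-Tukey ) FFT in Python - a toy sketch only, not the actual E@H code, whose real FFT strategy I am only guessing at. BASE_EXPONENT marks my assumed bottom of the recursion :

import cmath

BASE_EXPONENT = 1               # my assumption : recurse down to FFTs of size 2^1
sizes_touched = set()

def fft(x):
    n = len(x)
    sizes_touched.add(n)
    if n == 2 ** BASE_EXPONENT:                 # base case : solve directly
        return [x[0] + x[1], x[0] - x[1]]
    evens = fft(x[0::2])                        # two half-size sub-FFTs ...
    odds = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):                     # ... recombined with 'twiddles'
        tw = cmath.exp(-2j * cmath.pi * k / n) * odds[k]
        out[k] = evens[k] + tw
        out[k + n // 2] = evens[k] - tw
    return out

fft([complex(i) for i in range(2 ** 10)])       # a 2^10-point toy input
levels = len(sizes_touched) - 1                 # recursion levels below the top size
print("levels down:", levels, "-> predicted bursts:", 2 * levels)
# for a 2^25-point FFT with the same base case, levels = 24 and bursts = 48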

So Gary, in a way I think you are correct : only we count the peaks/troughs as some function of log2(number_of_skypoints) ! ;-O

Cheers, Mike.

( edit ) Of course if 2^2 is nominated as the deepest level to recurse to, then the number of recursive levels is 25 - 2 = 23, and twice that number of peaks ( 46 ) is involved for a full FFT solution. If 2^3 is the nominated deepest level then mutatis mutandis ....

( edit ) Note : since any recursive algorithm may be expressed as an iterated ( looping ) structure, the same considerations apply in that case too.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6587
Credit: 311440207
RAC: 97337

It's not the drivers,

It's not the drivers, probably. The problem has recurred : same error as before ( BOINC not finding the GPU ). Lost some 196 WUs due to that error, the last on 6 Jul 2022 0:41:09 UTC.

What I have since done is shut down, removed the RTX 3060 card, cleaned the edge connector on it, cleaned the motherboard slot, gently but firmly re-inserted the card - this time using only a single finger-tight screw to attach it to the rear of the case - and then powered up again. I also rechecked the seating of the cabling from the modular PSU to the card.

{ In the past I have sometimes noted that winding said screws too tightly can cause the GPU card to pivot slightly, with the result that the card's edge connector does not seat completely & evenly within the mobo PCIe slot. This implies that the plane of the mobo is not necessarily absolutely orthogonal to the plane of the rear of the case, nor precisely orthogonal to the plane of the video card for that matter. }

Since that re-insertion : Gamma-ray pulsar binary search #1 on GPUs v1.28 () x86_64-pc-linux-gnu has four invalids ( out of 251 tasks awarded credit ) but none marked as in error since 6 Jul 2022 0:41:09 UTC.

NB the card manufacturer's power rating on the RTX 3060 is 170W max, and the Ryzen 5950X tests with zenmonitor as topping out at about 145W max for the whole package. I think that should be covered adequately by a Corsair RM850x ( Gold ).

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3931
Credit: 46104542642
RAC: 64009745

Mike Hewson wrote: It's not

Mike Hewson wrote:

It's not the drivers, probably. The problem has recurred : same error as before ( BOINC not finding the GPU ). Lost some 196 WUs due to that error, the last on 6 Jul 2022 0:41:09 UTC.

What I have since done is shut down, removed the RTX 3060 card, cleaned the edge connector on it, cleaned the motherboard slot, gently but firmly re-inserted the card - this time using only a single finger-tight screw to attach it to the rear of the case - and then powered up again. I also rechecked the seating of the cabling from the modular PSU to the card.

{ In the past I have sometimes noted that winding said screws too tightly can cause the GPU card to pivot slightly, with the result that the card's edge connector does not seat completely & evenly within the mobo PCIe slot. This implies that the plane of the mobo is not necessarily absolutely orthogonal to the plane of the rear of the case, nor precisely orthogonal to the plane of the video card for that matter. }

Since that re-insertion : Gamma-ray pulsar binary search #1 on GPUs v1.28 () x86_64-pc-linux-gnu has four invalids ( out of 251 tasks awarded credit ) but none marked as in error since 6 Jul 2022 0:41:09 UTC.

NB the card manufacturer's power rating on the RTX 3060 is 170W max, and the Ryzen 5950X tests with zenmonitor as topping out at about 145W max for the whole package. I think that should be covered adequately by a Corsair RM850x ( Gold ).

Cheers, Mike.

hate to say "I told you so", but... lol.


certainly could be a PSU issue, even if the rated max capacity seems sufficient. how old is it? are you able to get a read on *actual* voltages while the system is under load? the 12V or 3.3V lines could be drooping out of spec, causing the GPU to drop out occasionally. I've had this happen before.


on the software front, what OS exactly do you have? your host identifies as "Linux GNOME", but I'm not familiar with GNOME as an OS (outside of dev/testing); it's really just a desktop environment. a previous comment mentioned Pop!_OS, is that what you're running? how exactly did you install the nvidia drivers? did you use one of their proprietary installers? or did you use a package manager in your OS? all of these things could be a potential source of issues with your specific OS/software environment. *could* be, not necessarily IS, but it's still helpful for readers to understand your exact environment to try to assess the potential issue.

_________________________________________________________________________

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6587
Credit: 311440207
RAC: 97337

Ian&Steve C. wrote: hate to

Ian&Steve C. wrote:

hate to say "I told you so", but... lol.

Yeah, I know ... ;-)

Quote:

certainly could be a PSU issue, even if the rated max capacity seems sufficient. how old is it? are you able to get a read on *actual* voltages while the system is under load? the 12V or 3.3V lines could be drooping out of spec, causing the GPU to drop out occasionally. I've had this happen before.

OK. The Corsair RM850x is brand new out of the box and I'm only using cables out of that same box. The PCIe cable going to the GPU is 8 pins at the PSU end and 6+2 pins at the GPU end. There's only an 8 pin socket at the GPU and only one way to connect the 6+2 pins into it.

I'm able to test the 12V lines on the same cable that goes to the GPU, as it has chained adapters on it. These give, under the load of the E@H workunits, 11.99V to 12.01V on my digital multimeter. Now I'll have to work out which cable to tap for the 3.3V ....

{ I did have a power supply testing gadget that you could just plug in, as it had various cable type inputs. But I lent it out to someone who let the magic smoke out of it ..... }

Quote:

on the software front, what OS exactly do you have? your host identifies as "Linux GNOME", but I'm not familiar with GNOME as an OS (outside of dev/testing); it's really just a desktop environment. a previous comment mentioned Pop!_OS, is that what you're running? how exactly did you install the nvidia drivers? did you use one of their proprietary installers? or did you use a package manager in your OS? all of these things could be a potential source of issues with your specific OS/software environment. *could* be, not necessarily IS, but it's still helpful for readers to understand your exact environment to try to assess the potential issue.

Pop!_OS is built from Ubuntu repositories and has Flatpaks as a package install method via a GUI update manager called Pop!_Shop. It has GNOME by default. I found that flatpaks just install with little or no user input - but with less user control by the same token - and of course you can only update something if a flatpak has been written for it. It's a way of avoiding configuring Linux all the time, which I prefer. 'Sandboxing' is mentioned a lot.

Now the NVidia drivers appear as an install option in Pop!_Shop : you can select an available version from a list and just hit the update button. It remembers all prior driver versions that were installed, making rolling back quite easy - as I have done. It un-installs the prior NVidia driver before putting another in, and I reboot to be sure. Plus an NVidia Control Panel application appears on the lower taskbar which has the usual information.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

GWGeorge007
Joined: 8 Jan 18
Posts: 3048
Credit: 4949891020
RAC: 1223028

Mike Hewson

Mike Hewson wrote:

...snip...

I'm able to test the 12V lines on the same cable that goes to the GPU, as it has chained adapters on it. These give, under the load of the E@H workunits, 11.99 to 12.01 on my digital multimeter. Now I'll have to work out which cable to tap for the 3.3V ....

{ I did have a power supply testing gadget that you could just plug in, as it had various cable type inputs. But I lent it out to someone who let the magic smoke out of it ..... }

...snip...

I certainly hope that you are aware, but just in case : when using one of these "power supply testing gadgets", it does not load the PSU as if it were plugged into a running GPU or other device. It only measures available voltage unloaded, not voltage under load.

While I do have one of these "gadgets", I only use it to see if my PSU can actually supply the necessary voltage. If I need to see what my loaded voltages are, I would need to either find a way to probe into the wiring while it is plugged into a working device ( such as a GPU ), or have a motherboard with voltage probe points on the MB. To my knowledge there are only a few such boards available.

George

Proud member of the Old Farts Association

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3931
Credit: 46104542642
RAC: 64009745

orange cables (if they're

orange cables (if they're colored at all) are the 3.3V ones, found on the SATA connector. but you'd need some kind of SATA adapter to test it in a non-destructive way.

_________________________________________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 4960
Credit: 18645879507
RAC: 5437757

You can always back-probe the

You can always back-probe the ATX 24 pin connector on the motherboard with the host under load.

https://www.etechnog.com/2022/03/atx-power-supply-pinout-diagram.html


Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6587
Credit: 311440207
RAC: 97337

Thank you all for your

Thank you all for your interest. ;~)

OK, so I decided to use a spare SATA cable, in the destructive sense, with a multimeter to carefully ( ! ) measure ( a number of times for each configuration ) the notional 3.3V rail vs common as it comes from the PSU.

- firstly, the BIOS screen on boot reveals a value of 3.324V ( +0.7% ) from whatever onboard sensor, but 3.294V ( -0.2% ) on the SATA cable.

- secondly, with ~10% GPU load and no E@H GPU activity the value is 3.287V ( -0.4% ) using the SATA cable.

- thirdly, with ~85% GPU load and an E@H workunit underway ( in the FFT phase ) the value is 3.315V ( +0.5% ) using the SATA cable.

They all seem pretty acceptable. I don't know what the tolerances of the multimeter are, and thus whether it is valid to quote to 4 significant figures.
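( To make sure I'm reading those deviations right, a few lines of Python checking them against nominal 3.3V - and against the +/- 5% tolerance which, as I understand it, the ATX specification allows on that rail :

NOMINAL = 3.3
readings = {
    "BIOS onboard sensor":         3.324,
    "SATA cable, BIOS screen":     3.294,
    "SATA cable, ~10% GPU load":   3.287,
    "SATA cable, ~85% GPU load":   3.315,
}
for label, volts in readings.items():
    dev = (volts - NOMINAL) / NOMINAL * 100.0
    verdict = "in spec" if abs(dev) <= 5.0 else "OUT OF SPEC"
    print(f"{label:28s} {volts:.3f} V ( {dev:+.1f}% ) {verdict}")

)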

I guess the issue though is : the 3.3V line on the PCIe bus itself hasn't been ( and isn't ) directly measurable under load without some other method. Perhaps, say, the PCIe 3.3V level is derived from the mobo's +5V source ( an unreachable circuit with the tools I have ). I'm not about to go randomly probing the mobo, even with something of high impedance.

Neither the sensors command nor the Coolero app mentions the 3.3V line. But I will research whether there is any Linux software that could do the job - after all, the BIOS knows the value, so there is a sensor somewhere.
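In the meantime, a sketch of what I'll be poking at : the kernel's hwmon sysfs interface reports voltages in millivolts under /sys/class/hwmon, provided a driver for the board's sensor chip ( e.g. nct6775 ) is loaded. Which channels exist, and whether the 3.3V one carries a label, varies by board, so this just dumps whatever voltage channels it finds :

from pathlib import Path

# walk every hwmon chip and print any voltage ( in*_input ) channels, in volts
for chip in sorted(Path("/sys/class/hwmon").glob("hwmon*")):
    name = (chip / "name").read_text().strip()
    for inp in sorted(chip.glob("in*_input")):
        label_file = chip / inp.name.replace("_input", "_label")
        label = (label_file.read_text().strip() if label_file.exists()
                 else inp.name.replace("_input", ""))
        print(f"{name:12s} {label:10s} {int(inp.read_text()) / 1000:.3f} V")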

Cheers, Mike.

( edit ) I've also changed out the video card's 8 pin power cable.

( edit ) The multimeter's probe head is too fat to reach down the back of the connector.

( edit ) CPU-X doesn't yield the 3.3V value.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3931
Credit: 46104542642
RAC: 64009745

all values sound acceptable.

3.3V comes direct from the PSU's 3.3V rail (through the MB). there is no reason for the MB to convert 5V to 3.3V when it has 3.3V already.


all values sound acceptable. if you were getting readings closer to 3.0V it might indicate a problem, but it appears that your PSU is operating normally and is probably not the cause of your particular issue.

your previously mentioned PCIe connection issue is plausible. since you've remounted the card, watch and see if the issue pops up again, then reassess whether something else might be causing the problem.

_________________________________________________________________________
