NVidia driver 515.48.07 gives OpenCL errors for 12GB RTX 3060 on Linux

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4155

Credit: 50084964887

RAC: 42318330

2 Jul 2022 2:37:58 UTC

Message 198369 in response to message 198368

Yes that’s true.

3060Ti - December 2nd 2020

3060 - February 25th 2021

_________________________________________________________________________

Keith Myers

Joined: 11 Feb 11

Posts: 5061

Credit: 19278402871

RAC: 7267361

2 Jul 2022 2:41:20 UTC

Message 198370 in response to message 198369

So my original idea of backleveling to the 470 series drivers was a good one for Mike, then.

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4155

Credit: 50084964887

RAC: 42318330

2 Jul 2022 2:54:07 UTC

Message 198371

I don’t think his issue is the drivers. Looks like the GPU is dropping off the PCIe bus (OpenCL device missing). A reboot brings it back, I bet.

How’s the power and thermal situation? Using risers? What motherboard? Which slot on the motherboard? What PCIe gen is the slot/card running?
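The PCIe generation question can be answered from Linux sysfs without extra tools. What follows is a minimal sketch, not a definitive diagnostic: it assumes the kernel exposes the standard `current_link_speed`, `current_link_width`, `max_link_speed` and `max_link_width` attributes under `/sys/bus/pci/devices`, and the function names `read_attr` and `nvidia_link_status` are made up for illustration. It filters on NVIDIA's PCI vendor ID (`0x10de`) and on display-class devices so the card's audio function is skipped.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"   # PCI vendor ID assigned to NVIDIA
DISPLAY_CLASS_PREFIX = "0x03"  # PCI base class 0x03 = display controller


def read_attr(device_dir, name):
    """Return a sysfs attribute as stripped text, or 'unknown' if unreadable."""
    try:
        return (Path(device_dir) / name).read_text().strip()
    except OSError:
        return "unknown"


def nvidia_link_status(sysfs_root="/sys/bus/pci/devices"):
    """List PCIe link attributes for every NVIDIA display device found."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    results = []
    for dev in sorted(root.iterdir()):
        if read_attr(dev, "vendor") != NVIDIA_VENDOR_ID:
            continue
        if not read_attr(dev, "class").startswith(DISPLAY_CLASS_PREFIX):
            continue
        results.append({
            "address": dev.name,
            "current_link_speed": read_attr(dev, "current_link_speed"),
            "current_link_width": read_attr(dev, "current_link_width"),
            "max_link_speed": read_attr(dev, "max_link_speed"),
            "max_link_width": read_attr(dev, "max_link_width"),
        })
    return results


if __name__ == "__main__":
    for entry in nvidia_link_status():
        print(entry)
```

A speed of 16.0 GT/s corresponds to PCIe 4.0. The GPU may negotiate a lower link speed while idle, so it is worth reading these values while a task is running. If the card has dropped off the bus, it would be expected to be missing from the list entirely.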

_________________________________________________________________________

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6596

Credit: 340168118

RAC: 264521

2 Jul 2022 2:56:18 UTC

Message 198372 in response to message 198370

Keith Myers wrote:

So my original idea of backleveling to the 470 series drivers was a good one for Mike, then.

Indeed ! Since I reverted the driver, 16 consecutive FGRPB1G WUs have finished, validated and been awarded credit. Looking good, i.e. :

12:30:25 (747): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0x1659b30 , 0x16598f0] 
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "NVIDIA GeForce RTX 3060" by: NVIDIA Corporation
Max allocation limit: 3159015424
Global mem size: 12636061696
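
The two memory figures in that log are in bytes and can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic. The byte counts are copied from the log above, and reading them as the OpenCL device's maximum single allocation and total global memory is an assumption based on the log labels.

```python
GIB = 1024 ** 3  # bytes per gibibyte

# Values copied verbatim from the task log above.
MAX_ALLOC_BYTES = 3159015424
GLOBAL_MEM_BYTES = 12636061696


def to_gib(n_bytes):
    """Convert a byte count to gibibytes."""
    return n_bytes / GIB


if __name__ == "__main__":
    print(f"Global memory : {to_gib(GLOBAL_MEM_BYTES):.2f} GiB")
    print(f"Max allocation: {to_gib(MAX_ALLOC_BYTES):.2f} GiB")
    print(f"Ratio         : {MAX_ALLOC_BYTES / GLOBAL_MEM_BYTES}")
```

The ratio works out to exactly one quarter, so with this driver a single buffer can be at most about 2.94 GiB even though the card reports about 11.77 GiB in total.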

Thanks everyone for all your input.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6596

Credit: 340168118

RAC: 264521

2 Jul 2022 3:17:00 UTC

Message 198373 in response to message 198371

Ian&Steve C. wrote:

I don’t think his issue is the drivers. Looks like the GPU is dropping off the PCIe bus (OpenCL device missing). A reboot brings it back, I bet.

How’s the power and thermal situation?

About 60 - 70°C at 80 - 90% load. Power I don't know.

Quote:

Using risers?

Nope.

Quote:

What motherboard?

Gigabyte B550 Aorus Pro AX ( rev 1.0 )

Quote:

Which slot on the motherboard?

The first ( nearest to CPU ) of three, other two are empty. But there are M.2 slots ( both occupied ) either side of that first slot.

Quote:

What PCIe gen is the slot/card running?

4.0

Hmm, I am rebooting after driver changes .....

I'll watch and wait.

Cheers, Mike.

( edit ) I can drop the 3060 temp to ~ 50°C by setting both fans to max speed.

Keith Myers

Joined: 11 Feb 11

Posts: 5061

Credit: 19278402871

RAC: 7267361

2 Jul 2022 4:19:22 UTC

Message 198374

Sounds like the 470 backlevel was the ticket.

The 515 drivers are showing up in a lot of problem tickets across numerous forums.

For the 515 series, NVIDIA did a major rewrite to shift focus to AI and cloud computing clusters.

That appears to be antithetical to standard BOINC computing.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6596

Credit: 340168118

RAC: 264521

2 Jul 2022 7:20:00 UTC

Message 198376

Yep it's working fine, now 32 consecutive successful FGRPB1G WUs finishing, validating and being awarded credit since I reverted the driver.

It's actually fascinating to watch the GPU load and temperature in one ( Coolero ) window vs the work unit progress in another ( BOINC Task pane ). The first ( notionally ) ~ 90% of the workunit goes at 80 - 90% load @ 60°C, the remaining 10% of the workunit at 20 - 60% load and temp down to 45°C. Check this out :

[ Screenshot: Coolero load and temperature graphs beside the BOINC task pane ]

Do I see exponential rise and decay in the GPU temps ? You can even see a little downward notch in load & temp during the reload to a new WU.

Cheers, Mike.

Keith Myers

Joined: 11 Feb 11

Posts: 5061

Credit: 19278402871

RAC: 7267361

2 Jul 2022 7:23:50 UTC

Message 198377 in response to message 198376

The last 10% of the task shifts computation to the CPU to calculate the toplist.

So utilization on the GPU drops to almost nothing.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6596

Credit: 340168118

RAC: 264521

2 Jul 2022 7:38:47 UTC

Message 198378 in response to message 198377

Yeah, the first 90% is essentially an FFT calculation, but in the toplist phase the GPU load varies widely, probably from shuffling data back to general memory space ?

Also it clearly shows that the last notional 10% of the WU is actually over 20% of the time.

Cheers, Mike.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119682540599

RAC: 25318534

3 Jul 2022 22:46:32 UTC

Message 198417 in response to message 198376

A very impressive visualisation of the overall crunching behaviour!

For the rates of rise/fall in GPU temps, I think the word you're looking for is asymptotic rather than exponential :-).
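
The two words need not conflict: a curve can approach its limit asymptotically and do so along an exponential. A minimal sketch, assuming a simple first-order thermal model in which the temperature relaxes toward a steady-state value with a single time constant. The 45 and 60 degree figures are the ones quoted earlier in the thread, while the 30 second time constant and the name `gpu_temp` are purely hypothetical.

```python
import math


def gpu_temp(t_seconds, start_c, steady_c, tau_seconds):
    """First-order model: temperature relaxes exponentially toward steady_c."""
    return steady_c + (start_c - steady_c) * math.exp(-t_seconds / tau_seconds)


# 45 and 60 degrees C come from the thread; TAU_S is a made-up time constant.
START_C, STEADY_C, TAU_S = 45.0, 60.0, 30.0

if __name__ == "__main__":
    for t in (0, 30, 60, 120, 300):
        temp = gpu_temp(t, START_C, STEADY_C, TAU_S)
        print(f"t={t:>3d}s  temp={temp:.2f} C")
```

In this model the gap to the steady-state temperature shrinks by the same factor in every equal time interval, so the curve gets ever closer to 60 without reaching it. Whether the real card follows such a model closely is not something the thread establishes.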

Since the followup stage (last 10%) involves recalculating (on the GPU, not the CPU) the top 10 candidates (toplist) using double precision, the lower GPU load (and temperature) is a function of the relative paucity of the double precision hardware being used. The single precision hardware would be idling during this time so the load being measured (on average) drops.

I was interested to count the peak/trough cycles that show so nicely. Exactly 1 for each of the ten candidates. The bigger drop to almost zero load marks the transition between tasks.

There are similar cycles in load for the main (90%) stage. I reckon if you counted those, they would add up to the number of skypoints in the task being analysed :-).

That software seems to be doing a very nice job of following exactly what is happening. Thanks very much for sharing.

Cheers,
Gary.
