FGRP - High invalid rate on Nvidia 4090?

DF1DX
Joined: 14 Aug 10
Posts: 102
Credit: 3,085,964,491
RAC: 4,024,912
Topic 229374

I see a very high rate of up to 10% of invalid WUs on my Linux host with my NVIDIA 4090:

Linux Mint, NVIDIA driver 525.105.17, FGRPopencl2Pup-nvidia 1.28.

It does not matter whether one or two WUs are computing at the same time. There are no problems with O3MDF WUs. Other BOINC projects such as PrimeGrid and Asteroids@home, as well as Folding@home, are also processed without errors. Only one project is running at a time.

I run the card very conservatively and cool, at 2 GHz with a power limit of 220 W.

My old Radeon VII has an invalid rate of about 3-5%.
An invalid rate of almost 10% on the 4090 seems unusually high to me.

Is there anything I can do?

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 216
Credit: 8,516,247,359
RAC: 4,509,376

I have not found anything with the 4090 that will reduce this. Your ~10% invalid rate matches my ~10% invalid rate on both 4090s. I am running them on Mint. Is it a higher rate than most other GPUs? Yes. Is it high for a 4090? I don't believe so based on what I have seen. I am really curious to see another 40xx GPU and how it performs in the context of the invalid rate. 

GWGeorge007
Joined: 8 Jan 18
Posts: 2,852
Credit: 4,741,221,633
RAC: 3,375,442

Boca Raton Community HS wrote:

I have not found anything with the 4090 that will reduce this. Your ~10% invalid rate matches my ~10% invalid rate on both 4090s. I am running them on Mint. Is it a higher rate than most other GPUs? Yes. Is it high for a 4090? I don't believe so based on what I have seen. I am really curious to see another 40xx GPU and how it performs in the context of the invalid rate. 

I just checked the two 4090s running Einstein's Gamma-Ray Pulsar Binary Search #1 on GPUs, a.k.a. FGRPB1G, and their invalid rate is 17-18%, though both of them are running Windows.

FWIW.....

George

Proud member of the Old Farts Association

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 216
Credit: 8,516,247,359
RAC: 4,509,376

Which hosts are you referring to? 

GWGeorge007
Joined: 8 Jan 18
Posts: 2,852
Credit: 4,741,221,633
RAC: 3,375,442

Boca Raton Community HS wrote:

Which hosts are you referring to? 

https://einsteinathome.org/host/12850873

https://einsteinathome.org/host/13068022

George

Proud member of the Old Farts Association

mikey
Joined: 22 Jan 05
Posts: 12,087
Credit: 1,834,325,298
RAC: 12,906

Boca Raton Community HS wrote:

I have not found anything with the 4090 that will reduce this. Your ~10% invalid rate matches my ~10% invalid rate on both 4090s. I am running them on Mint. Is it a higher rate than most other GPUs? Yes. Is it high for a 4090? I don't believe so based on what I have seen. I am really curious to see another 40xx GPU and how it performs in the context of the invalid rate. 

WAAAAY back in the beginning of crunching, one problem we had was one unit finishing and the next one starting before the first had finished writing its completed data, so we had to add a slight delay between tasks. I have no memory of how we did that, or even whether that's the problem here.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,758
Credit: 36,023,992,734
RAC: 44,082,198

this seems commonly observed with the 40-series cards for some reason. since it's happening with both the custom app and the stock app and windows and linux, I think it must be something in the base code from Einstein that these cards don't like or don't do well with. maybe even an artifact of some unknown bug in the driver. there are a lot of possibilities. validation is driven by matching precision of the results, and since there are more invalids, that seems to indicate that the 40-series cards are producing slightly less precise results for some reason.

just my .02.

it's a shame though. because even though the 40-series are already bottlenecked by their memory bus, the addition of the high invalids makes them basically not any better than the 30-series cards for Einstein FGRPB1G.
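As a rough way to put numbers on that, here is a minimal sketch of how an invalid rate eats into a raw speed advantage. It assumes an invalid task earns nothing. The function names and the 3% baseline (roughly the low end of the Radeon VII figure from the first post) are illustrative choices, not anything taken from the project.

```python
def valid_throughput(raw_tasks_per_day: float, invalid_rate: float) -> float:
    """Tasks per day that validate, assuming invalid tasks earn nothing."""
    if not 0.0 <= invalid_rate < 1.0:
        raise ValueError("invalid_rate must be in [0, 1)")
    return raw_tasks_per_day * (1.0 - invalid_rate)


def break_even_speedup(invalid_rate_new: float, invalid_rate_old: float) -> float:
    """Raw speedup the new card needs just to match the old card's valid output."""
    if not (0.0 <= invalid_rate_new < 1.0 and 0.0 <= invalid_rate_old < 1.0):
        raise ValueError("invalid rates must be in [0, 1)")
    return (1.0 - invalid_rate_old) / (1.0 - invalid_rate_new)


# Assumed baseline: 3% invalids on the older card (hypothetical choice).
BASELINE_INVALID_RATE = 0.03

# Invalid rates reported in this thread for the 4090: ~10% and 17-18%.
for rate in (0.10, 0.18):
    speedup = break_even_speedup(rate, BASELINE_INVALID_RATE)
    print(f"invalid rate {rate:.0%}: needs {speedup:.3f}x raw speed to break even")
```

The invalid rate alone only accounts for part of the gap. The memory-bus bottleneck mentioned above is a separate factor that this sketch does not model.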

_________________________________________________________________________

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 216
Credit: 8,516,247,359
RAC: 4,509,376

Ian&Steve C. wrote:

this seems commonly observed with the 40-series cards for some reason. since it's happening with both the custom app and the stock app and windows and linux, I think it must be something in the base code from Einstein that these cards don't like or don't do well with. maybe even an artifact of some unknown bug in the driver. there are a lot of possibilities. validation is driven by matching precision of the results, and since there are more invalids, that seems to indicate that the 40-series cards are producing slightly less precise results for some reason.

just my .02.

it's a shame though. because even though the 40-series are already bottlenecked by their memory bus, the addition of the high invalids makes them basically not any better than the 30-series cards for Einstein FGRPB1G.

 

After using the 4090 for a while, I agree with you 100%. Pretty soon when we run out of these tasks, I guess it will matter less. I would figure the RTX 6000 Ada would have the same issue since it is the same chip. That would be fun to try. Anyone have ~$6,000 laying around? jk

GWGeorge007
Joined: 8 Jan 18
Posts: 2,852
Credit: 4,741,221,633
RAC: 3,375,442

Boca Raton Community HS wrote:

After using the 4090 for a while, I agree with you 100%. Pretty soon when we run out of these tasks, I guess it will matter less. I would figure the RTX 6000 Ada would have the same issue since it is the same chip. That would be fun to try. Anyone have ~$6,000 laying around? jk

Good one, Jon!  $6,000 laying around??  no...  In the bank?  yes...  I figure that's a good place for it.  LOL

George

Proud member of the Old Farts Association

DF1DX
Joined: 14 Aug 10
Posts: 102
Credit: 3,085,964,491
RAC: 4,024,912

Thank you for your feedback. So I am not alone with this behavior of the 4090.

GWGeorge007 wrote:
I just checked the two 4090s running Einstein's Gamma-Ray Pulsar Binary Search #1 on GPUs, a.k.a. FGRPB1G, and their invalid rate is 17-18%, though both of them are running Windows.

I see up to 20% invalid WUs with the AIO app. That's too many errors for me, so I don't use this app anymore.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,276
Credit: 245,631,304
RAC: 11,236

GWGeorge007 wrote:

I just checked the two 4090s running Einstein's Gamma-Ray Pulsar Binary Search #1 on GPUs, a.k.a. FGRPB1G, and their invalid rate is 17-18%, though both of them are running Windows.

That's a bit above the average - 4090s seem to have about 15% invalids on average (overall FGRPB1G invalid average is 2.5%).

It's pretty hard to track which card produced which result in ~200k results per day. I looked into only a few such results, and it doesn't look like a precision problem to me right now. It could be, though, that the driver (= compiler) or the kernel scheduler changes the execution order of operations so much that the comparisons yield a different result. No idea how to prevent this, though.
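To illustrate the kind of effect meant here, a minimal Python sketch (not the FGRPB1G code, and run on the CPU rather than in OpenCL): floating-point addition is not associative, so summing the same terms in a different order can change the last bit of the result, and a comparison against a threshold can then go the other way. The threshold value and the function name are made up for the example.

```python
def sum_in_order(values):
    """Add the values strictly left to right, as a sequential loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


terms = [0.1, 0.2, 0.3]

# Same three numbers, two different execution orders.
forward = sum_in_order(terms)                    # (0.1 + 0.2) + 0.3
backward = sum_in_order(list(reversed(terms)))   # (0.3 + 0.2) + 0.1

# Hypothetical threshold, standing in for any comparison made on the result.
THRESHOLD = 0.6

print(repr(forward), repr(backward))
print("forward  > threshold:", forward > THRESHOLD)
print("backward > threshold:", backward > THRESHOLD)
```

Both sums are equally "precise" in the sense that each is within one rounding step of the exact answer, yet the two comparisons disagree. That would be consistent with results that fail to match without either one being badly wrong.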

BM
