I see a very high rate of invalid WUs, up to 10%, on my Linux host with an NVIDIA 4090:
Linux Mint, NVIDIA driver 525.105.17, FGRPopencl2Pup-nvidia 1.28.
It does not matter whether one or two WUs are computing at the same time. No problems with O3MDF WUs. Other BOINC projects like PrimeGrid, Asteroids or Folding@home are also processed without errors. Only one project is running at a time.
I run the card very conservatively and cool, at 2 GHz with a power limit of 220 W.
My old Radeon VII has an invalid rate of about 3-5%.
An error rate of almost 10% seems unusually high to me.
Is there anything I can do?
I have not found anything with the 4090 that will reduce this. Your ~10% invalid rate matches my ~10% invalid rate on both 4090s. I am running them on Mint. Is it a higher rate than most other GPUs? Yes. Is it high for a 4090? I don't believe so based on what I have seen. I am really curious to see another 40xx GPU and how it performs in the context of the invalid rate.
Boca Raton Community HS
I just checked the two 4090s running Einstein's Gamma-Ray Pulsar Binary search #1 on GPUs, a.k.a. FGRPB1G, and their invalid rate is 17% - 18%, though both of them are running Windows.
FWIW.....
Proud member of the Old Farts Association
Which hosts are you referring to?
Boca Raton Community HS
https://einsteinathome.org/host/12850873
https://einsteinathome.org/host/13068022
Proud member of the Old Farts Association
Boca Raton Community HS
WAAAAY back in the beginning of crunching, one problem we had was one unit finishing and the next one starting before the first had finished writing its completed data, so we had to add a slight delay between tasks. I have no memory of how we did that, or even if that's the problem here.
this seems commonly observed with the 40-series cards for some reason. since it's happening with both the custom app and the stock app, and on both windows and linux, I think it must be something in the base code from Einstein that these cards don't like or don't do well with. maybe even an artifact of some unknown bug in the driver. there are a lot of possibilities. validation is driven by matching precision of the results, and since there are more invalids, that seems to indicate that the 40-series cards are producing slightly less precise results for some reason.
just my .02.
it's a shame though. because even though the 40-series are already bottlenecked by their memory bus, the addition of the high invalids makes them basically not any better than the 30-series cards for Einstein FGRPB1G.
_________________________________________________________________________
Ian&Steve C. wrote: this ...
After using the 4090 for a while, I agree with you 100%. Pretty soon when we run out of these tasks, I guess it will matter less. I would figure the RTX 6000 Ada would have the same issue since it is the same chip. That would be fun to try. Anyone have ~$6,000 laying around? jk
Boca Raton Community HS
Good one, Jon! $6,000 laying around?? no... In the bank? yes... I figure that's a good place for it. LOL
Proud member of the Old Farts Association
Thank you for your feedback. So I am not alone with this behavior of the 4090.
I see up to 20% invalid WUs with the AIO app. That's too many errors for me, so I don't use this app anymore.
GWGeorge007 wrote: I just ...
That's a bit above the average - 4090s seem to have about 15% invalids on average (overall FGRPB1G invalid average is 2.5%).
It's pretty hard to track which card produced which result in ~200k results per day. I looked into only a few such results, and it doesn't look like a precision problem to me right now. It could be, though, that the driver (= compiler) or the kernel scheduler changes the execution order of operations so much that the comparisons yield a different result. No idea how to prevent this, though.
BM
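For anyone wondering how a changed execution order alone could flip a comparison, here is a minimal sketch in Python. It is not taken from the FGRPB1G code: the values, the `THRESHOLD` constant, and the helper names are all made up for illustration. It only shows the general floating-point effect that single-precision addition is not associative, so summing the same numbers in a different order can land on a different side of a cutoff.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Accumulate left to right, rounding to single precision after every add."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

# Same four numbers, two different orders (hypothetical values).
forward = sum_f32([1.0e8, 1.0, -1.0e8, 1.0])    # the first 1.0 is lost next to 1e8
reordered = sum_f32([1.0e8, -1.0e8, 1.0, 1.0])  # large terms cancel first, nothing is lost

# Hypothetical cutoff standing in for any "is this value above the limit?" test.
THRESHOLD = 1.5

print(forward, reordered)
print(forward > THRESHOLD, reordered > THRESHOLD)
```

Neither result is "less precise" in the sense of a hardware fault. Both orders are legitimate, but a decision that depends on the comparison comes out differently, which would be consistent with results that fail validation against other hosts.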