FGRP - High invalid rate on Nvidia 4090?

DF1DX

Joined: 14 Aug 10

Posts: 108

Credit: 3908823534

RAC: 56220

12 Apr 2023 9:24:14 UTC

Topic 229374

I see a very high invalid rate, up to 10% of WUs, on my Linux host with an NVIDIA 4090:

Linux Mint, NVIDIA driver 525.105.17, FGRPopencl2Pup-nvidia 1.28.

It does not matter whether one or two WUs are computing at the same time. No problems with O3MDF WUs. Other BOINC projects like PrimeGrid, Asteroids or Folding@home are also processed without errors. Only one project is running at a time.

I run the card very conservatively and cool, at 2 GHz with a power limit of 220 W.

My old Radeon VII has an invalid rate of about 3-5%.
An error rate of almost 10% seems unusually high to me.

Is there anything I can do?

Boca Raton Comm...

Joined: 4 Nov 15

Posts: 302

Credit: 11362318425

RAC: 12500777

12 Apr 2023 12:12:02 UTC

Message 210953

I have not found anything with the 4090 that will reduce this. Your ~10% invalid rate matches my ~10% invalid rate on both 4090s. I am running them on Mint. Is it a higher rate than most other GPUs? Yes. Is it high for a 4090? I don't believe so based on what I have seen. I am really curious to see another 40xx GPU and how it performs in the context of the invalid rate.

GWGeorge007

Joined: 8 Jan 18

Posts: 3189

Credit: 5183426723

RAC: 4080244

12 Apr 2023 14:16:57 UTC

Message 210957 in response to message 210953

Boca Raton Community HS wrote:

I have not found anything with the 4090 that will reduce this. Your ~10% invalid rate matches my ~10% invalid rate on both 4090s. I am running them on Mint. Is it a higher rate than most other GPUs? Yes. Is it high for a 4090? I don't believe so based on what I have seen. I am really curious to see another 40xx GPU and how it performs in the context of the invalid rate.

I just checked two 4090s running Einstein's Gamma-Ray Pulsar Binary search #1 on GPUs, a.k.a. FGRPB1G, and their invalid rate is 17-18%, though both of them are running Windows.

FWIW.....

George

Proud member of the Old Farts Association

Boca Raton Comm...

Joined: 4 Nov 15

Posts: 302

Credit: 11362318425

RAC: 12500777

12 Apr 2023 15:43:23 UTC

Message 210963 in response to message 210957

Which hosts are you referring to?

GWGeorge007

Joined: 8 Jan 18

Posts: 3189

Credit: 5183426723

RAC: 4080244

12 Apr 2023 15:51:31 UTC

Message 210964 in response to message 210963

Boca Raton Community HS wrote:

Which hosts are you referring to?

https://einsteinathome.org/host/12850873

https://einsteinathome.org/host/13068022

George

Proud member of the Old Farts Association

mikey

Joined: 22 Jan 05

Posts: 12911

Credit: 1884435265

RAC: 61875

12 Apr 2023 16:10:40 UTC

Message 210967 in response to message 210953

Boca Raton Community HS wrote:

I have not found anything with the 4090 that will reduce this. Your ~10% invalid rate matches my ~10% invalid rate on both 4090s. I am running them on Mint. Is it a higher rate than most other GPUs? Yes. Is it high for a 4090? I don't believe so based on what I have seen. I am really curious to see another 40xx GPU and how it performs in the context of the invalid rate.

WAAAAY back in the beginning of crunching, one problem we had was one unit finishing and the next one starting before the first had finished writing its completed data, so we had to add a slight delay between tasks. I have no memory of how we did that, or even whether that's the problem here.

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4150

Credit: 49668025075

RAC: 38091680

12 Apr 2023 16:28:38 UTC

Message 210969

this seems commonly observed with the 40-series cards for some reason. since it's happening with both the custom app and the stock app, and on both windows and linux, I think it must be something in the base code from Einstein that these cards don't like or don't do well with. maybe even an artifact of some unknown bug in the driver. there are a lot of possibilities. validation is driven by matching precision of the results, and since there are more invalids, that seems to indicate that the 40-series cards are producing slightly less precise results for some reason.

just my .02.

it's a shame though. because even though the 40-series are already bottlenecked by their memory bus, the addition of the high invalids makes them basically not any better than the 30-series cards for Einstein FGRPB1G.
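To put rough numbers on that point, here is a minimal sketch, assuming valid output scales as raw throughput times (1 - invalid fraction). The invalid fractions are the averages quoted later in this thread (about 15% for 4090s, 2.5% overall for FGRPB1G); the raw task rates and the `valid_throughput` helper are hypothetical and only illustrate the arithmetic.

```python
def valid_throughput(raw_tasks_per_day: float, invalid_fraction: float) -> float:
    """Tasks per day that actually validate, given the fraction rejected."""
    if not 0.0 <= invalid_fraction <= 1.0:
        raise ValueError("invalid_fraction must be between 0 and 1")
    return raw_tasks_per_day * (1.0 - invalid_fraction)


# Invalid fractions quoted in this thread.
INVALID_4090 = 0.15      # approximate average for 4090s
INVALID_OVERALL = 0.025  # overall FGRPB1G average

# HYPOTHETICAL raw rates, chosen only for illustration (not measured).
baseline_raw = 1000.0    # tasks/day for some reference card
faster_raw = 1150.0      # a card assumed to be 15% faster in raw terms

baseline_valid = valid_throughput(baseline_raw, INVALID_OVERALL)
faster_valid = valid_throughput(faster_raw, INVALID_4090)

print(f"reference card: {baseline_valid:.1f} valid tasks/day")
print(f"faster card:    {faster_valid:.1f} valid tasks/day")
```

With these assumed rates, a 15% raw speed advantage is almost entirely cancelled by the higher invalid fraction: about 977.5 versus 975 valid tasks per day.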

_________________________________________________________________________

Boca Raton Comm...

Joined: 4 Nov 15

Posts: 302

Credit: 11362318425

RAC: 12500777

12 Apr 2023 16:33:46 UTC

Message 210970 in response to message 210969

Ian&Steve C. wrote:

this seems commonly observed with the 40-series cards for some reason. since it's happening with both the custom app and the stock app, and on both windows and linux, I think it must be something in the base code from Einstein that these cards don't like or don't do well with. maybe even an artifact of some unknown bug in the driver. there are a lot of possibilities. validation is driven by matching precision of the results, and since there are more invalids, that seems to indicate that the 40-series cards are producing slightly less precise results for some reason.

just my .02.

it's a shame though. because even though the 40-series are already bottlenecked by their memory bus, the addition of the high invalids makes them basically not any better than the 30-series cards for Einstein FGRPB1G.

After using the 4090 for a while, I agree with you 100%. Pretty soon when we run out of these tasks, I guess it will matter less. I would figure the RTX 6000 Ada would have the same issue since it is the same chip. That would be fun to try. Anyone have ~$6,000 laying around? jk

GWGeorge007

Joined: 8 Jan 18

Posts: 3189

Credit: 5183426723

RAC: 4080244

12 Apr 2023 17:01:24 UTC

Message 210971 in response to message 210970

Boca Raton Community HS wrote:

After using the 4090 for a while, I agree with you 100%. Pretty soon when we run out of these tasks, I guess it will matter less. I would figure the RTX 6000 Ada would have the same issue since it is the same chip. That would be fun to try. Anyone have ~$6,000 laying around? jk

Good one, Jon! $6,000 laying around?? no... In the bank? yes... I figure that's a good place for it. LOL

George

Proud member of the Old Farts Association

DF1DX

Joined: 14 Aug 10

Posts: 108

Credit: 3908823534

RAC: 56220

13 Apr 2023 8:44:39 UTC

Message 210999 in response to message 210957

Thank you for your feedback. So I am not alone with this behavior of the 4090.

GWGeorge007 wrote:

I just checked two 4090s running Einstein's Gamma-Ray Pulsar Binary search #1 on GPUs, a.k.a. FGRPB1G, and their invalid rate is 17-18%, though both of them are running Windows.

I see up to 20% invalid WUs with the AIO app. That's too many errors for me, so I don't use this app anymore.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253369677

RAC: 37871

27 Apr 2023 13:30:00 UTC

Message 211600 in response to message 210957

GWGeorge007 wrote:

I just checked two 4090s running Einstein's Gamma-Ray Pulsar Binary search #1 on GPUs, a.k.a. FGRPB1G, and their invalid rate is 17-18%, though both of them are running Windows.

That's a bit above the average - 4090s seem to have about 15% invalids on average (overall FGRPB1G invalid average is 2.5%).

It's pretty hard to track which card produced which result in ~200k results per day. I looked into only a few such results, and it doesn't look like a precision problem to me right now. It could be, though, that the driver (= compiler) or the kernel scheduler changes the execution order of operations so much that the comparisons yield a different result. No idea how to prevent this, though.
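The general effect described here can be reproduced on any machine, because floating-point addition is not associative. The sketch below is only an illustration in Python double precision, not Einstein@Home or FGRP application code; the values and the `threshold` are made up for the demonstration. It shows the same three terms summed in two different orders, agreeing to within normal precision yet landing on opposite sides of a comparison.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one possible evaluation order
right_to_left = a + (b + c)   # same terms, different grouping

# The two sums differ only in the last bit ...
print(left_to_right == right_to_left)            # False
print(math.isclose(left_to_right, right_to_left))  # True

# ... but a comparison against a threshold can still flip.
threshold = 0.6  # made-up cutoff, for illustration only
print(left_to_right > threshold)   # True
print(right_to_left > threshold)   # False
```

If a search code keeps or discards a candidate based on such a comparison, two hosts could report different candidate lists even though neither result is meaningfully less precise.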
