I just checked the two 4090s running Einstein's Gamma-Ray Pulsar Binary search #1 on GPUs, a.k.a. FGRPB1G, and their invalid rate is 17% - 18%, though both of them are running Windows.
That's a bit above the average - 4090s seem to have about 15% invalids on average (overall FGRPB1G invalid average is 2.5%).
It's pretty hard to track which card produced which result in ~200k results per day. I looked into only a few such results, and it doesn't look like a precision problem to me right now. It could be, though, that if the driver (=compiler) or the kernel scheduler changes the execution order of operations too much, the comparisons might yield a different result. No idea how to prevent this, though.
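To illustrate what I mean (a minimal sketch with made-up numbers, not anything from the actual application): single-precision addition is not associative, so the same inputs combined in a different order can give a slightly different result, and a strict comparison of two otherwise correct results can then disagree.

#include <stdio.h>

int main(void)
{
    /* made-up values; the real task data is of course different */
    float v[4] = { 1.0e8f, 4.0f, 4.0f, -1.0e8f };

    /* plain left-to-right sum */
    float serial = 0.0f;
    for (int i = 0; i < 4; i++)
        serial += v[i];

    /* the same numbers combined in a different (tree-like) order,
       the way a parallel reduction on a GPU might group them */
    float reordered = (v[0] + v[3]) + (v[1] + v[2]);

    printf("left-to-right sum: %g\n", serial);    /* prints 0 */
    printf("reordered sum:     %g\n", reordered); /* prints 8 */
    return 0;
}

Same numbers, same precision, different grouping, different answer - so two hosts can both be "correct" and still not match bit for bit.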
It seems that my invalid rates on the two 4090 systems have dropped a little - maybe a recent Nvidia driver update had a small impact - I am really not sure (host and host). I have not really changed anything else. If you ever want us to try anything on our end, we are more than willing to experiment with these GPUs. This is definitely a 40xx series issue - take a look at this host running two 4070 GPUs (Pokey - not trying to pick on your PC, just trying to figure out the invalid issue on these GPUs).
Boca Raton Community HS wrote:
It seems that my invalid rates on the two 4090 systems have dropped a little ...
This is definitely a 40xx series issue - take a look at this host running two 4070 GPUs ...
When you examine results on the website, there's an important bit of context that is missing - the ability to easily monitor inconclusive results. It would be very nice to have a separate column to list these, as well as the errors and invalids. At any point in time, if you take the number for "All" and subtract all the other categories (In Progress, Pending, Valid, etc), there may be a 'left-over' amount (hopefully very small) which can be labeled as 'Inconclusive'. These are results described as "Checked but no consensus yet", if you search through the entire list to find them.
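To make the arithmetic explicit, here is a trivial sketch of that 'left-over' calculation, using the figures for the first host in the table below:

#include <stdio.h>

int main(void)
{
    /* counts for host 13125618, taken from the table below */
    int all = 3946;
    int in_progress = 162, pending = 1159, valid = 2284, invalid = 177, error = 16;

    /* whatever "All" doesn't account for in the listed categories is the
       inconclusive count */
    int inconclusive = all - (in_progress + pending + valid + invalid + error);

    printf("inconclusive = %d\n", inconclusive);  /* prints 148 */
    return 0;
}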
For the three hosts you mention, here are the numbers that existed at a certain point a little earlier today:-
HostID      All     In_Progress  Pending  Valid  Invalid  Error  Inconclusive
13125618    3946    162          1159     2284   177      16     148
13126606    3820    177          1136     2197   167      0      143
12986942    11854   888          2180     7375   1066     0      345
As you can see, there are quite a few Inconclusives. Some will ultimately become valid whilst others get rejected. A lot depends on the third host that gets selected as the 'deadlock breaker'.
I've noticed increasing numbers of inconclusives on my hosts. I ran some very limited checks and found a tendency for inconclusives to happen when the normal app was matched against the Petri app run under the 'anonymous platform' mechanism. In that case the outcome depends on the third task. Since the resend is more likely to go to a host using the regular app, the host most likely to lose out is the one running the anonymous platform app. Because those hosts are faster and their numbers are likely to grow, the situation could reverse in future if the FGRPB1G search keeps going for a while.
There might be a solution but it's probably quite unlikely to happen. If the FGRPB1G search was split into two separate streams, one for the regular apps and one for anonymous platform apps, the ultimate rate for invalid results might improve quite a bit. At least it might reverse the rising level of inconclusives that seems to be occurring :-).
Thank you for this post! Definitely helpful. Here is my question (and it is hard to put into words). Could two systems come up with the SAME wrong/inconclusive result? Is a result being inconclusive (because it doesn't match the other system and the tiebreaker) a product of the calculation being done incorrectly on the local machine? Could that calculation be done incorrectly, in the same way, with the same result, on a different system?
I hope that makes sense.
Bernd Machenschalk wrote:
I'm curious - does the 4090 exhibit the same behavior on BRP7 and GW (currently O3MDF)?
I can try next week - these two systems are powered down over the weekend (no AC in the building).
Boca Raton Community HS wrote:
... Could two systems come up with the SAME wrong/inconclusive result?
My simple answer would be NO. If two systems came up with exactly the SAME set of results, they would be declared 'valid' even if (technically) they were actually wrong :-). An inconclusive result is neither right nor wrong. It's impossible to know until further results are analysed.
I'm just an ordinary volunteer like yourself. I have no background in theoretical physics so my knowledge (such as it is) is just from what I've read or listened to. With that disclaimer in mind, here is my take on the validation process.
The aim of a task seems to be to return 'candidate signals', which I interpret to be related to gamma-ray counts coming from parts of the sky where there seems to be a potential peak in the count rate above the supposed background rate. The last part of a task is to re-evaluate the top ten candidates in double precision. This gives the immediate impression that there will always be some variability arising from different hardware/software environments, so every attempt needs to be made to minimise the variations.
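To show the sort of small difference I mean (a toy example with invented numbers, nothing to do with the real search code), the same few terms summed in single and then in double precision already disagree in the decimals:

#include <stdio.h>

int main(void)
{
    /* invented numbers standing in for one candidate's contributions */
    float terms[3] = { 1.0e7f, 3.25f, 0.1f };

    float  single_sum = 0.0f;
    double double_sum = 0.0;
    for (int i = 0; i < 3; i++) {
        single_sum += terms[i];   /* float spacing near 1e7 is 1.0, so the fractions are lost */
        double_sum += terms[i];   /* double precision keeps them */
    }

    printf("single precision: %.9f\n", (double)single_sum);  /* 10000003.000... */
    printf("double precision: %.9f\n", double_sum);          /* 10000003.350... */
    return 0;
}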
I don't know exactly what parameters are compared between the two results undergoing validation but it is expected that there will always be small discrepancies. The validator uses certain 'tolerances' when doing the comparison. If the differences are within these tolerances for all the parameters being assessed, then both tasks are declared valid. If not, the status of both results becomes "Checked but no consensus yet" - in other words 'Inconclusive', rather than immediately invalid.
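I don't know the actual parameters or tolerance values the project uses, but purely to illustrate the idea, a tolerance-based comparison of two results might look something like this (the field names, numbers and tolerances below are all invented):

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

struct candidate {            /* hypothetical per-candidate summary */
    double frequency;
    double power;
};

/* two numbers "agree" if they differ by less than an absolute floor or a
   relative fraction of their magnitude */
static bool close_enough(double a, double b, double rel_tol, double abs_tol)
{
    double diff  = fabs(a - b);
    double scale = fmax(fabs(a), fabs(b));
    return diff <= fmax(abs_tol, rel_tol * scale);
}

static bool candidates_match(const struct candidate *x, const struct candidate *y)
{
    /* invented tolerances - the real ones could be anything */
    return close_enough(x->frequency, y->frequency, 1e-9, 1e-12) &&
           close_enough(x->power,     y->power,     1e-4, 1e-6);
}

int main(void)
{
    struct candidate a = { 123.456789012, 41.302 };   /* invented numbers */
    struct candidate b = { 123.456789015, 41.305 };

    printf("%s\n", candidates_match(&a, &b) ? "within tolerance -> valid pair"
                                            : "outside tolerance -> inconclusive");
    return 0;
}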
A third task is then sent out and when those results are sent back, all three sets are compared again. The most likely outcome is that two will be found that do agree within the tolerances. However it's entirely possible that all three might 'agree' - the third result fell in between the other two so all three are now close enough - or it could be that there is still no 'close enough' agreement and a 4th task is sent out.
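Again, this is only a sketch of that quorum idea as I understand it - certainly not the project's real validator - with each result boiled down to a single made-up number:

#include <math.h>
#include <stdio.h>

#define TOL 1e-3   /* invented tolerance */

/* look for any two results that agree within TOL; returns 0 and their
   (zero-based) indices if found, -1 if another task is needed */
static int find_consensus(const double *results, int n, int *i_out, int *j_out)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (fabs(results[i] - results[j]) <= TOL) {
                *i_out = i;
                *j_out = j;
                return 0;
            }
    return -1;
}

int main(void)
{
    /* hypothetical returns from three hosts: the first two disagree with
       each other, but the third lands close enough to the second */
    double results[3] = { 41.302, 41.309, 41.3091 };
    int i, j;

    if (find_consensus(results, 3, &i, &j) == 0)
        printf("results %d and %d agree -> they set the canonical result\n", i, j);
    else
        printf("no consensus yet -> send out another task\n");
    return 0;
}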
The point of my post was to suggest that the standard app and the anonymous platform app might be returning results with just enough of a difference to be causing a rise in inconclusives. If so, all parties are being disadvantaged. The project needs to send out more 'resend' tasks than it otherwise would, and each of the volunteers involved runs the risk of an otherwise good result being rejected, based purely on which type of app happens to process the resend.
I don't know for sure if there really is a 'rising inconclusives' problem with the FGRPB1G search. If people responding to Bernd's request for information about other GPU searches are not seeing rising numbers of inconclusives that ultimately lead to rising invalids, then it tends to suggest that there might be one that is specific to FGRPB1G.
It would be interesting to see if the "inconclusive" were to go down if a 4090 was using the standard/stock Linux app rather than the AIO/Optimized app.
Tom M
A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor) I want some more patience. RIGHT NOW!
read back in the thread. that's been tried and no it makes no difference. still high invalids with the stock app. Same with Windows, which does not have an optimized app.
I have a gut feeling that the problem lies in the Nvidia driver, not in the application(s).
Thank you. I missed that post.
Tom M
A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor) I want some more patience. RIGHT NOW!
Gary Roberts wrote:
My simple answer would be NO. If two systems came up with exactly the SAME set of results, they would be declared 'valid' even if (technically) they were actually wrong ...
This is an amazingly thought-out explanation and makes complete sense. Thank you for taking the time to write this post. I wonder if the 4090 is coming up with a [somewhat] different list of candidates or if the issue is in the double-precision evaluation.
I will be trying one of the 4090 GPUs on some of the other GPU tasks this upcoming week to see what happens.
Ian&Steve C. wrote:
I have a gut feeling that the problem lies in the Nvidia driver, not in the application(s).
I just installed new drivers a day or two ago - it will be interesting to see if anything changes. Just out of curiosity, would the workstation/professional version of the Nvidia driver work on the 4090 (Linux Mint)? The A6000 Ada is the same chip, but I have no idea if anything would change or if this is possible on Linux.
Thanks for the feedback. I will try BRP7 over the weekend.
BTW: There are no errors with GW-WUs on the 4090.