Strange invalidating of GPU tasks - Linux versus Windows

Bent Vangli
Joined: 6 Apr 11
Posts: 21
Credit: 333283402
RAC: 1211489
Topic 212268

Hi

I am wondering what is happening. I got some invalid tasks from one of my machines. It is watercooled and runs stock frequencies on both GPU and CPU. Memory is running at a safe speed; no overclocking is involved. The tasks in question are:

https://einsteinathome.org/nb/workunit/326424533
https://einsteinathome.org/nb/workunit/329143860
https://einsteinathome.org/nb/workunit/329243420
https://einsteinathome.org/nb/workunit/329626361
https://einsteinathome.org/nb/workunit/329699678
https://einsteinathome.org/nb/workunit/329828430

The peculiar thing is that it is always against Windows machines that my Linux box fails to validate those GPU tasks.

Of course, those tasks may be invalid, but I can't see any trace of misbehaviour in the report file.

Could it be that the Windows and Linux OpenCL versions sometimes generate small differences in the result file? It would be interesting to know, and I hope the Einstein team could check the workunits mentioned above. If mine are in fact invalid, that would hopefully also give some clue as to what is wrong, so I can correct it.

Best regards, Bent, Oslo

Gary Roberts
Joined: 9 Feb 05
Posts: 4189
Credit: 10424436696
RAC: 24369518

Bent Vangli wrote:
The peculiar thing is that it is always against Windows machines that my Linux box fails to validate those GPU tasks.

I followed one of your links and found the machine involved - a Ryzen 7 with a GTX 1080Ti GPU.  In the entire task list showing in the online database, there are currently 8 invalids and 1079 that have validated.  You have two other machines as well, where the corresponding numbers are 7/849 and 2/571.

This tends to be something we all experience - a small number of invalids scattered in a much larger population of valid results.  For your hosts, the invalid results are less than 1% of the total and whilst it's not nice to see any result declared invalid, it's not all that unusual.  On the server status page you can see the ratio of invalid to valid for FGRPB1G, currently 9,192 to 890,746, around the 1% mark.

It's possible that differences between Windows and Linux may be part of the reason, but when I looked at a couple of the 8 listed for your machine, I saw an example of a W+L+L quorum where your L box missed out :-).  So you can't say it's "always" Windows machines, because right there is a counterexample :-).

Because there are a lot more Windows machines than Linux machines, you're going to see many of your completed tasks being matched against a Windows machine, and consequently, for any quorum you look at, your partners will tend to be mainly Windows machines.  Less than 1% of your tasks are failing; more than 99% are passing.  That doesn't really suggest a big problem that could be easily identified and fixed.


Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 2235
Credit: 1124602163
RAC: 1474351

Gary has given you a good answer.  Let me expand on a couple of points.

Since we as a community ALL experience invalid result determinations on our GPU work at an appreciable rate, it is completely unreasonable for an individual user with a half dozen such units to expect personalized analysis from the project.

I think there is a big difference between CPU and GPU work here at Einstein in this matter.  For CPU work, on a properly setup machine not running error-inducing clock rates, one can expect to see zero invalid (and zero error) determinations in many weeks of operation (so thousands of units).  But for GPU work, I don't think we see invalid rates much below a few tenths of a percent on large samples.

I don't think the real source of this is well known, or at least publicized.  Some people imagine GPUs just make mistakes now and then.  Another possibility is that the methods in use produce slight differences depending on the time-slicing of GPU usage during processing of particular work.  Who knows, there could even be an unintended dependence on an uninitialized variable.

That is not to say that all invalid findings are spurious.  The fleet average is above the irreducible base rate because some participant systems at times operate with substantial real error rates.  I suspect the most common reason is operation above the highest clock rates for which the system can currently generate consistently correct results.  But that can arise not because the user twisted up the clock rate knob, but because the chip temperature rose through fan failure, thermal paste degradation, room ambient temperature rise ...  Some users also push up the clocks too high (I have...), although those users may be more likely than most to notice.  Some cards can't quite consistently get the right answer at stock clocks, and need to be underclocked a little to work well.

Just to give a personal example, at this moment a look at my account tasks summary for Einstein shows All tasks at 3661.  But 1060 of these are shown as In Progress, so they can't be considered success or failure.  That leaves about 2600.  Of those, 12 show as Invalid.  Crudely, that suggests I have about a 0.5% invalid rate.  Crude, because not all of the 2600 have actually had a quorum filled and been compared, but more importantly because the retention time in the various categories is not the same.
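The back-of-envelope arithmetic above can be sketched as follows (numbers taken straight from the post; only a rough estimate, for the reasons given):

```python
all_tasks = 3661
in_progress = 1060
invalid = 12

# Only tasks that have actually completed can be judged valid or invalid.
completed = all_tasks - in_progress          # about 2600
invalid_rate = invalid / completed

print(f"{invalid_rate:.2%}")                 # → 0.46%, crudely "about 0.5%"
```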

As part of regular health and sanity checking, I click on the invalid link for my whole account, then click on the sorting link over the Sent column twice.  I want to see if one of my three machines seems recently to have started a much higher problem rate than usual.  As the "no problem" rate varies quite a bit, it takes some time to get a feel for what is different enough to be a concern.

In summary--yes, it is a real problem, and specifically it makes it much harder for users to assess the health of their systems than it should be.  I don't think we can usefully drive improvement by offering examples of a handful of such results on a specific system.

Gary Roberts
Joined: 9 Feb 05
Posts: 4189
Credit: 10424436696
RAC: 24369518

archae86 wrote:
Gary has given you a good answer.  Let me expand on a couple of points.

Thank you very much for doing that.  It improves the answer considerably.

I had considered throwing in some comments about 'factory OC cards' but seeing as I don't have the tools, skill or patience to tweak performance through changing clock speeds, I decided to not go there, having never done it myself :-).  I'm glad you took up the challenge :-).

I've often wondered about the impact of factory overclocking.  I'm sure card manufacturers do it to lure in more customers with the promise of a better-performing product, and they probably push it a bit more than they should.  This is probably quite OK for gaming use - the odd glitch here or there is probably not going to be noticed by most users.  It becomes rather more critical, though, for precision scientific calculations that are closely monitored for correctness.

Maybe the best thing for the OP to do is to experiment with downclocking the card somewhat and then, over time, see if the error rate reduces significantly.  You would need to do this over a considerable number of tasks to have confidence in the significance of any observed change in the error rate.


Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 2235
Credit: 1124602163
RAC: 1474351

Gary Roberts wrote:
Maybe the best thing for the OP to do is to experiment with downclocking the card somewhat and then, over time, see if the error rate reduces significantly.  You would need to do this over a considerable number of tasks to have confidence in the significance of any observed change in the error rate.

If you have invalid rates far above the background noise level (which I'll guess is on the order of 0.1% to 1%), then within a moderate trial time you can tell something.  But the variability in the base rate is enough that people are just kidding themselves if they draw conclusions from the difference between zero invalids in fifty trials and two in fifty trials.
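To put a number on why fifty trials tells you so little, here is a small sketch (assuming, as a simplification, that invalids are independent events with a fixed background probability): even at a 1% base rate, seeing two or more invalids in fifty tasks is not unusual at all.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    invalids in n tasks, given a background invalid probability p."""
    return 1.0 - sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k))

# At a 1% background rate, how often would 50 tasks show 2+ invalids?
print(f"{prob_at_least(2, 50, 0.01):.1%}")   # → 8.9%
```

So roughly one run in eleven would look "twice as bad" purely by chance, which is why a much larger sample is needed before drawing conclusions.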

I spent many weeks inching up the memory and core clocks on two Nvidia Pascal series cards on my main system just recently.  As I approached the cliff of doom in tiny increments, pausing a day or more at each, I never reached a point at which I had a detectable elevation of invalid rates.  Instead I got abnormal early termination with a reported error, or, less commonly, an actual system black screen, blue screen, or unrequested reboot.  Other cards in other systems probably have other behaviors.

AgentB
Joined: 17 Mar 12
Posts: 912
Credit: 511707248
RAC: 483978

To give a good answer about this I suspect the code writers would need to comment, but I don't think these low-level invalids are bad results - just that sometimes the data leads to rounding errors, and occasionally they are significant.

I think of it like choosing the top 20 runners in a race simulation we run on our hosts.

Most times we pick the same 20 for the same race, but now and then it's a dead heat for 20th and 21st.  Depending on rounding errors, it goes one way or the other.  Good or bad luck comes into play when you get a quorum partner.
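A minimal illustration of that dead-heat effect (hypothetical scores, not the actual search code): two mathematically equal values can compare unequal in floating point, flipping which candidate makes the cut.

```python
runner_a = 0.1 + 0.2   # mathematically 0.3, stored as 0.30000000000000004
runner_b = 0.3

print(runner_a == runner_b)        # → False
print(max(runner_a, runner_b))     # runner_a "wins" by about 4e-17
```

Two machines whose arithmetic produces the two values in the opposite roles would pick different "winners" and so fail to validate against each other.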

We have had runs where the data produces a larger number of invalids.

OpenCL is not as robust for numerical calculations as CPU-based ANSI standards, and any differences are then magnified by the larger number of calculations in a GPU task.

OpenCL is in its infancy compared with, say, the classic FFT software running on our CPUs, so we should expect a greater difference in results between different systems.

To add to this, different compilers generate different rounding errors; in fact the same compiler can reorder an instruction to run faster but give a different rounding error.
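Here is a concrete instance of that reordering effect, in any IEEE-754 language (Python shown): floating-point addition is not associative, so a compiler or scheduler that changes evaluation order can change the rounded result.

```python
a, b, c = 1e16, -1e16, 1.0

left_to_right = (a + b) + c   # 0.0 + 1.0 -> 1.0
reordered     = a + (b + c)   # the 1.0 is lost when added to -1e16 -> 0.0

print(left_to_right, reordered)   # → 1.0 0.0
```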

For a technical discussion (lament) about rounding and OpenCL, see here.
