BRP4 Intel GPU app feedback thread

Michal Gust
Joined: 27 Jul 16
Posts: 2
Credit: 3,880,947
RAC: 0

Dear developers, is there any way I can help identify and fix the Skylake issue? I have a relatively powerful Iris 540 and, like many others here, I'm looking for useful work for it.

It's really frustrating to see how much work is thrown away because of invalid WUs. I'm not even among the worst affected, with about 40% invalid WUs, and even after deducting those the GPU still produces a similar amount of work to the CPU, or more.

I'm not a programmer or developer, so I can't write or change the code on my own. I can run any test you want and be the one who files requests with Intel support to fix the issue. But from my long experience as a network engineer and designer at a reputable company working for large enterprises and government agencies, it is much easier to get a fix if you narrow down exactly what is wrong, not just what the symptoms are, before opening a support case. Hence my idea is to run computation tasks in parallel on the CPU and the GPU and compare the results: break a WU down into several steps to find what causes the wrong results, down to particular OpenCL commands/calls.

Would any of the developers be willing to pursue this?

Christian Beer
Moderator
Joined: 9 Feb 05
Posts: 595
Credit: 101,360,266
RAC: 4,666

We recently received feedback from Intel about the problem and possible solutions. I briefly discussed this with Benjamin, and we will change the validation threshold slightly so that results from newer Intel iGPUs validate. I haven't had time to deploy this change yet, but I'll do it as soon as possible.

Edit (13:40 UTC): I deployed a new validator with the increased tolerance. Please test using the Beta application. If the validation rate increases, I'm going to include newer Intel iGPUs in the non-Beta application.

Michal Gust
Joined: 27 Jul 16
Posts: 2
Credit: 3,880,947
RAC: 0

Thank you for the reply. Increasing the tolerance sounds really strange.

I would expect that, even though you work with probabilities, the calculation itself is exact, and that repeating it on the same input data always produces the same results. But this sounds as if the results are close but not identical, as if a random number generator were somehow involved, so that the results could change when its physical characteristics change.

Could you point me to a link that explains why the results differ?

I'll report validation results once a reasonable number of WUs have been processed.

Christian Beer
Moderator
Joined: 9 Feb 05
Posts: 595
Credit: 101,360,266
RAC: 4,666

This goes down to the level of assembler code that is executed on the GPU. Here is the most basic explanation I got from Intel:

Say you have the following:

Answer_mul = float0 * float1;
Answer_add = Answer_mul + float2;

This gets converted to the following in assembly.....

  Mul %answer_mul, %float0, %float1
  Add %answer_add, %answer_mul, %float2

The value in the register "answer_mul" is rounded before the addition is done.
In the Intel case (and on AArch64 too) these two instructions get fused into a single "mad" instruction:

  Mad %answer_mad, %float0, %float1, %float2

The result of the mad instruction is more precise because it does not round after the multiply.

And because we sum a lot of products, the seemingly small rounding differences turn out to be significant in the end. No random numbers are involved.
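
The effect described above can be reproduced on any machine with a small sketch. This is only an illustration, not the project's code: it uses Python doubles rather than the single-precision floats on the GPU, the function names are made up for this example, and the fused operation is emulated with exact rational arithmetic followed by one rounding, which is an assumption about how a fused multiply-add behaves rather than the actual Intel instruction.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    # Two roundings: the product is rounded to a double first,
    # then the sum is rounded again.
    product = a * b
    return product + c


def fused_mul_add(a, b, c):
    # One rounding: a*b + c is computed exactly with rationals,
    # then converted (rounded) to a double a single time.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product, 1 - 2**-60, does not fit in a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = mul_then_add(a, b, c)   # product rounds to 1.0, so the sum is 0.0
fused = fused_mul_add(a, b, c)     # keeps the tiny term: -2**-60

print(separate)
print(fused)
print(separate == mul_then_add(a, b, c))  # True: same input, same output, every time
```

Both variants are fully deterministic. They simply round at different points, so two devices that choose different instructions can disagree in the last digits without any randomness being involved.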

slozomby
Joined: 8 Dec 05
Posts: 15
Credit: 256,213
RAC: 0

good news on the "fix" for skylake. 

running some more WUs. 

slozomby
Joined: 8 Dec 05
Posts: 15
Credit: 256,213
RAC: 0

not looking good. 1 invalid, several inconclusive.

https://www.einsteinathome.org/host/12407179/tasks

Christian Beer
Moderator
Joined: 9 Feb 05
Posts: 595
Credit: 101,360,266
RAC: 4,666

I checked the one invalid task, and the value in question is again just above our new threshold. This is to be expected and is in the nature of thresholds. Let's see what the pending and inconclusive tasks do. At the least, we should see a better ratio of valid to invalid results over time.
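
To illustrate why a result can still land "just above" a threshold after the tolerance is raised, here is a minimal sketch of a tolerance-based comparison. It is not the actual validator: the function name, the relative-difference formula, and all numbers are hypothetical and chosen only to show the behaviour.

```python
def within_tolerance(value, reference, rel_tol):
    # Accept the value if it differs from the reference by at most
    # rel_tol times the magnitude of the reference.
    return abs(value - reference) <= rel_tol * abs(reference)


# Hypothetical numbers, not taken from any real work unit.
reference = 100.0
old_tol = 0.001   # assumed tolerance before the change
new_tol = 0.002   # assumed tolerance after the change

print(within_tolerance(100.12, reference, old_tol))  # False: rejected before
print(within_tolerance(100.12, reference, new_tol))  # True: accepted now
print(within_tolerance(100.21, reference, new_tol))  # False: still just above
```

Raising the tolerance moves the cut-off but does not remove it, so some results will always fall slightly outside. What should change is the share of results that pass.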

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 125,149,196
RAC: 191,026

In light of that I have re-enabled two machines with HD Graphics 530s, and they are now running all the Beta Intel_GPU OpenCL apps (FGRP1, BRP4, BRP4G and BRP6).

Hosts

https://einsteinathome.org/host/6181626

https://einsteinathome.org/host/2871149

Christian Beer wrote:

We recently received feedback from Intel about the problem and possible solutions. I briefly discussed this with Benjamin, and we will change the validation threshold slightly so that results from newer Intel iGPUs validate. I haven't had time to deploy this change yet, but I'll do it as soon as possible.

Edit (13:40 UTC): I deployed a new validator with the increased tolerance. Please test using the Beta application. If the validation rate increases, I'm going to include newer Intel iGPUs in the non-Beta application.

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 125,149,196
RAC: 191,026

MarkJ wrote:

In light of that I have re-enabled two machines with HD Graphics 530s, and they are now running all the Beta Intel_GPU OpenCL apps (FGRP1, BRP4, BRP4G and BRP6).

Hosts

https://einsteinathome.org/host/6181626

https://einsteinathome.org/host/2871149

The 2871149 host totally over-fetched work. I have had it doing nothing else in an attempt to get it under control.

It seems all the BRP6 1.52 tasks are considered invalid, so I've aborted the remaining ones. They've been taking over 8 hours each, and I think there are enough examples of validate errors by now.

It will now process the 12 remaining BRP 1.34 tasks in the hope that they validate.

slozomby
Joined: 8 Dec 05
Posts: 15
Credit: 256,213
RAC: 0

I'm at 3 valid, 5 invalid for the work on my 530. Several are still pending/inconclusive.
