BRP4 Intel GPU app feedback thread

Michal Gust

Joined: 27 Jul 16

Posts: 2

Credit: 3880947

RAC: 0

20 Aug 2016 17:01:43 UTC

Message 149012

Dear developers, is there any way I can help pinpoint and fix the Skylake issue? I have a relatively powerful Iris 540 and, like many others here, I'm looking for useful work for it.

It's really frustrating to see how much work is thrown away because of invalid WUs. I'm not even the worst case, with about 40% invalid WUs, and even after deducting the invalid WUs the GPU still produces a similar amount of work to the CPU, or more…

I'm not a programmer or developer, so I can't write or change the code on my own. I can run any test you want and be the one who files requests with Intel support to get the issue fixed. But from my long experience as a network engineer and designer at a reputable company working for large enterprises and government agencies, it is much easier to get the fix you expect if you narrow down exactly what is wrong, not just what the symptoms are, before opening a support case. Hence my idea is to run computation tasks in parallel on the CPU and the GPU and compare the results: break a WU down into several steps to isolate what is causing the wrong results, down to particular OpenCL commands/calls…

Is any of the developers willing to go ahead with this?

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 197696851

RAC: 16749

22 Aug 2016 5:57:00 UTC

Message 149025

We recently got feedback from Intel about the problem and possible solutions. I briefly discussed this with Benjamin, and we will change the validation threshold slightly so that results from newer Intel iGPUs validate. I haven't had time to deploy this change yet, but I'll do it as soon as possible.

Edit (13:40 UTC): I deployed a new validator with the increased tolerance. Please test using the Beta application. If the validation rate increases there, I'm going to include newer Intel iGPUs in the non-Beta application.

Michal Gust

Joined: 27 Jul 16

Posts: 2

Credit: 3880947

RAC: 0

22 Aug 2016 16:40:45 UTC

Message 149039 in response to message 149025

Thank you for the reply. Increasing the tolerance sounds really strange.

I would expect that, even though you work with probabilities, the calculation itself is exact, and that repeating it on the same input data always produces the same results. But this sounds as if the results are close but not identical, as if a random number generator were somehow involved, so that a change in its physical characteristics could change the calculation results as well.

Could you point me to a link that explains why the results differ?

I'll report validation results once a reasonable number of WUs have been processed.

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 197696851

RAC: 16749

22 Aug 2016 17:50:01 UTC

Message 149041

This goes down to the level of assembler code that is executed on the GPU. Here is the most basic explanation I got from Intel:

Say you have the following:

  Answer_mul = float0 * float1;
  Answer_add = Answer_mul + float2;

This gets converted to the following in assembly.....

  Mul %answer_mul, %float0, %float1
  Add %answer_add, %answer_mul, %float2

The value in the register "answer_mul" is rounded before the addition is done.
In the Intel case (and on AArch64 too) these two instructions get fused into a single "mad" instruction:

  Mad %answer_mad, %float0, %float1, %float2

The result of the mad instruction is more precise because it does not round after the multiply.

And because we do a lot of summing of multiplications, the seemingly small rounding errors turn out to be significant in the end. No random numbers involved.
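
To make the rounding difference concrete, here is a minimal Python sketch of the multiply-then-add versus fused case. It is only an illustration, not the project's code or Intel's: the helper `to_f32` and the input values are assumptions chosen so the effect is visible, and the fused result is emulated by doing the arithmetic in double precision (which is exact for these particular inputs) and rounding to single precision once at the end.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical inputs, all exactly representable in single precision.
float0 = to_f32(1 + 2**-13)
float1 = to_f32(1 + 2**-13)
float2 = to_f32(-(1 + 2**-12))

# Separate Mul and Add: the product is rounded to single precision
# before the addition, so its low-order bits are lost.
answer_mul = to_f32(float0 * float1)
answer_add = to_f32(answer_mul + float2)

# Fused "mad": the exact product feeds the addition and the result is
# rounded only once. The double product is exact for these inputs.
answer_mad = to_f32(float0 * float1 + float2)

print("mul then add:", answer_add)  # 0.0
print("fused mad:   ", answer_mad)  # 2**-26, about 1.49e-08
```

Both runs are fully deterministic; they simply round at different points. Summed over many multiply-add steps, differences like this can add up to more than a tight validation tolerance allows.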

slozomby

Joined: 8 Dec 05

Posts: 15

Credit: 256213

RAC: 0

23 Aug 2016 20:28:51 UTC

Message 149093

good news on the "fix" for skylake.

running some more WUs.

slozomby

Joined: 8 Dec 05

Posts: 15

Credit: 256213

RAC: 0

25 Aug 2016 1:55:28 UTC

Message 149129

not looking good. 1 invalid. several inconclusive.

https://www.einsteinathome.org/host/12407179/tasks

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 197696851

RAC: 16749

25 Aug 2016 8:51:23 UTC

Message 149133

I checked the one invalid task, and the value in question is again just above our new threshold. This is kind of expected and in the nature of thresholds. Let's see what the pending and inconclusive tasks do. At least we should see a better ratio of valid to invalid tasks over time.
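
As a rough illustration of why a result can still land just above a raised threshold, here is a small Python sketch of a tolerance-based comparison. It is not the project's validator: the function `results_agree`, the sample values, and the tolerances are all hypothetical, and a relative tolerance via `math.isclose` is just one possible way to compare results.

```python
import math


def results_agree(reference, candidate, rel_tol):
    """Return True if every pair of values agrees within a relative tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol)
        for r, c in zip(reference, candidate)
    )


# Hypothetical values: the candidate differs from the reference by
# about 4e-6 (relative) in its first entry.
reference = [100.0, 250.0, 75.0]
candidate = [100.0004, 250.0, 75.0]

strict = results_agree(reference, candidate, rel_tol=1e-6)   # old, tighter limit
relaxed = results_agree(reference, candidate, rel_tol=1e-5)  # raised limit

# A result that deviates slightly more still fails the raised limit.
borderline = results_agree(reference, [100.0011, 250.0, 75.0], rel_tol=1e-5)

print(strict, relaxed, borderline)
```

Raising the tolerance moves the cut-off but does not remove it, so a result whose deviation sits just past the new limit is still rejected.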

MarkJ

Joined: 28 Feb 08

Posts: 437

Credit: 139002861

RAC: 0

25 Aug 2016 13:13:38 UTC

Message 149135 in response to message 149025

In light of that I have re-enabled two machines with HD Graphics 530's and they are now running all the Beta Intel_GPU OpenCL apps (FGRP1, BRP4, BRP4G and BRP6).

Hosts

https://einsteinathome.org/host/6181626

https://einsteinathome.org/host/2871149

Christian Beer wrote:

We recently got feedback from Intel about the problem and possible solutions. I briefly discussed this with Benjamin, and we will change the validation threshold slightly so that results from newer Intel iGPUs validate. I haven't had time to deploy this change yet, but I'll do it as soon as possible.
Edit (13:40 UTC): I deployed a new validator with the increased tolerance. Please test using the Beta application. If the validation rate increases there, I'm going to include newer Intel iGPUs in the non-Beta application.

BOINC blog

MarkJ

Joined: 28 Feb 08

Posts: 437

Credit: 139002861

RAC: 0

28 Aug 2016 8:46:46 UTC

Message 149175 in response to message 149135

MarkJ wrote:

In light of that I have re-enabled two machines with HD Graphics 530's and they are now running all the Beta Intel_GPU OpenCL apps (FGRP1, BRP4, BRP4G and BRP6).
Hosts
https://einsteinathome.org/host/6181626
https://einsteinathome.org/host/2871149

The 2871149 host totally over-fetched work. I have had it doing nothing else in an attempt to get it under control.

It seems all the BRP6 1.52 tasks are considered invalid, so I've aborted the remaining ones. They've been taking over 8 hours each, and I think there are enough examples of validate errors by now.

It will now process the 12 remaining BRP 1.34 tasks in the hope they might validate.

BOINC blog

slozomby

Joined: 8 Dec 05

Posts: 15

Credit: 256213

RAC: 0

28 Aug 2016 14:44:45 UTC

Message 149178

im at 3 valid 5 invalid for the work on my 530. several still pending/inconclusive.
