BRP4 Intel GPU app feedback thread

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4266
Credit: 244924643
RAC: 16631

I had a first look at the invalid results from BRP4. I can clearly see a certain type of failure that I haven't seen before. The worst part is that the validator didn't explicitly catch these, so these scientifically unusable results probably made their way into the canonical results. It's unlikely that we missed a pulsar discovery because of that, but still not as good as it should be.

The first thing was to update the validator so results with this property will now be marked "validate error" (instead of ending up as invalid). I'll have another look hopefully tomorrow to correlate these with specific application versions, driver versions etc. and find a way to exclude these machines from getting work and wasting computing time.

The results at first glance look so weird that I currently have little hope of making an application that could deal with these driver oddities in a way that produces acceptable results.

BM

Maximilian Mieth
Joined: 4 Oct 12
Posts: 128
Credit: 9885011
RAC: 2193

Quote:
The first thing was to update the validator so results with this property will now be marked "validate error" (instead of ending up as invalid).


That seems to have worked. As far as I can see, the results of the HD4600s and 4400s are now "validate errors" if they try to validate with a result from an HD4000. One example is here:
http://einsteinathome.org/workunit/211070094

However, the task is then sent out again, and if it goes to another host with an HD4600/4400, it ends up as a validate error again, as in this case:
http://einsteinathome.org/workunit/211111170
I'm afraid this goes on until the task finally is sent out to a host not affected by the problem.

Quote:
I'll have another look hopefully tomorrow to correlate these with specific application versions, driver versions etc. and find a way to exclude these machines from getting work and wasting computing time.


I think for starters this would be very useful!

Quote:
The results at first glance look so weird that I currently have little hope of making an application that could deal with these driver oddities in a way that produces acceptable results.

Ouch. Doesn't sound good :(

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752761155
RAC: 1451532

I'm pleased to note that my HD 4600 'Haswell' (host 8864187, running driver 10.18.10.3621) hasn't had a task declared 'invalid' since yesterday, but has had several validated today. I think that is evidence that the problem is 'Haswell+driver', rather than a straight hardware erratum.

I've tried to be neutral in describing this issue, as being an incompatibility between application and driver - without pointing a finger at either participant. But (and subject to Bernd's further checking in the database about correlations between these anomalous results, and the hardware/application/driver versions employed), I'm beginning to be more convinced that this is something that will have to be addressed at the driver level, with Intel and/or Khronos.

When we were discussing a similar problem between OpenCL applications and a newly-released NVidia driver for pre-Fermi NV cards, Jacob Klein found https://developer.nvidia.com/opencl - a suite of 31 standalone sample/test applications demonstrating various OpenCL capabilities. Although the pre-compiled examples are for NV cards, full sources are supplied, and I'm given to understand that the changes needed to compile for other OpenCL platforms aren't too great.

Jacob and I found that the NV driver problems caused test failures in these cases from the test suite:

oclConvolutionSeparable
oclDXTCompression
oclFDTD3d
oclParticles
oclQuasirandomGenerator
oclVolumeRender

and those simple, but replicable, samples were sufficient to get NVidia to look further into the driver. They say that their problem has now been solved, although we're still waiting for an update/hotfix driver release to incorporate the solution. That form of approach might be useful here too.

Maximilian Mieth
Joined: 4 Oct 12
Posts: 128
Credit: 9885011
RAC: 2193

Richard Haselgrove wrote:
I'm pleased to note that my HD 4600 'Haswell' (host 8864187, running driver 10.18.10.3621) hasn't had a task declared 'invalid' since yesterday

Same for me and my HD4000 (same driver version).

Richard Haselgrove wrote:
I've tried to be neutral in describing this issue, as being an incompatibility between application and driver - without pointing a finger at either participant. But (and subject to Bernd's further checking in the database about correlations between these anomalous results, and the hardware/application/driver versions employed), I'm beginning to be more convinced that this is something that will have to be addressed at the driver level, with Intel and/or Khronos.


I just found a couple of cases in which my HD4000 validated against an HD4600. Interestingly, in this case two HD4600s were not able to validate against each other, but I was able to validate against one of them. I think that supports your opinion that it is a driver issue.

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2694028
RAC: 0

Quote:
I just found a couple of cases in which my HD4000 validated against an HD4600. Interestingly in this case two HD4600s were not able to validate against each other, but I was able to validate against one of them. I think that supports your opinion that it is a driver issue.


Keep an eye on their scheduler logs and you'll eventually see what drivers they're running:

http://einstein.phys.uwm.edu/host_sched_logs/6317/6317416

http://einstein.phys.uwm.edu/host_sched_logs/11671/11671864

Claggy

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752761155
RAC: 1451532

Quote:
Quote:
I just found a couple of cases in which my HD4000 validated against an HD4600. Interestingly in this case two HD4600s were not able to validate against each other, but I was able to validate against one of them. I think that supports your opinion that it is a driver issue.

Keep an eye on their scheduler logs and you'll eventually see what drivers they're running:

http://einstein.phys.uwm.edu/host_sched_logs/6317/6317416

http://einstein.phys.uwm.edu/host_sched_logs/11671/11671864

Claggy


Yes, I've been doing that.

Aleksey (6317416 - valid) is using 9.18.10.3186

Still waiting on tron.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752761155
RAC: 1451532

And tron (11671864 - invalid) is using 10.18.10.3907

I think both cases confirm our previous expectations.
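
For anyone wanting to repeat this check as more hosts report in, here is a minimal Python sketch of the comparison. It only tabulates the four observations mentioned in this thread so far; the labels, the `version_key` helper and the "validated" flag are illustrative assumptions, not anything taken from the project database.

```python
# Observations reported in this thread: (label, driver version, validated?)
# The labels and the boolean flag are illustrative, not database fields.
observations = [
    ("HD 4600, host 8864187", "10.18.10.3621", True),
    ("HD 4000, same driver", "10.18.10.3621", True),
    ("host 6317416", "9.18.10.3186", True),
    ("host 11671864", "10.18.10.3907", False),
]


def version_key(version):
    """Turn '10.18.10.3621' into (10, 18, 10, 3621) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


# Newest driver seen producing valid results, oldest seen producing invalid ones.
newest_good = max(version_key(v) for _, v, ok in observations if ok)
oldest_bad = min(version_key(v) for _, v, ok in observations if not ok)

# If every good driver is older than every bad one, a single cutoff separates them.
consistent = newest_good < oldest_bad
print("newest good:", newest_good)
print("oldest bad: ", oldest_bad)
print("single cutoff separates them:", consistent)
```

Four data points are of course far too few to prove anything; the sketch only shows that the reports so far do not contradict a "newer driver is the problem" reading.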

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4266
Credit: 244924643
RAC: 16631

For now I disabled (automatically) sending work to Intel GPUs with drivers of 10.18.10.3907 and newer.

Later today or early tomorrow I'll make sure that older hardware (up to HD 4000) that can be identified as such gets work as well.

I currently don't have any means in house to find out whether there is a newer driver that works. I'll set up another Beta application version, such that Intel GPUs may get work with any driver version.

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752761155
RAC: 1451532

Ooops, there goes my HD 4000 for the duration - she'll be dry long before midnight:

2015-02-19 16:06:35.3977 [PID=20462] [version] [HOST#5744895] device name: 'Intel(R) HD Graphics 4000'; OpenCL driver version: 10.18.10.4061; platform version: OpenCL 1.2; device version: OpenCL 1.2
2015-02-19 16:06:35.3977 [PID=20462] [version] driver version 1018104061, min: 0, max: 1018103906
2015-02-19 16:06:35.3977 [PID=20462] [version] driver version required max: 1018103906, supplied: 1018104061

Now, where did I leave that copy of 3621? ;)
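
For readers puzzling over the numbers in that log: the scheduler appears to pack the dotted driver version into a single integer and compare it against a min/max pair. The Python sketch below reproduces that under the assumption that the packing is two digits for each of the middle fields and four for the last, which matches the one example in the log (10.18.10.4061 becomes 1018104061). The function names are hypothetical and this is not the actual scheduler code.

```python
def encode_driver_version(version: str) -> int:
    """Pack 'a.b.c.d' into one integer, assuming 2/2/4 digits for b/c/d.

    This layout is inferred from the log line above, not from scheduler source.
    """
    a, b, c, d = (int(part) for part in version.split("."))
    return ((a * 100 + b) * 100 + c) * 10000 + d


# Limits as they appear in the log: "min: 0, max: 1018103906".
MIN_ALLOWED = 0
MAX_ALLOWED = 1018103906


def may_get_work(version: str) -> bool:
    """True if the encoded driver version lies inside the allowed range."""
    return MIN_ALLOWED <= encode_driver_version(version) <= MAX_ALLOWED


for v in ("10.18.10.3621", "10.18.10.3907", "10.18.10.4061"):
    print(v, encode_driver_version(v), may_get_work(v))
```

With these limits, 10.18.10.3621 would still get work, while 10.18.10.3907 and 10.18.10.4061 would not, which is consistent with the cutoff described earlier in the thread.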

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4266
Credit: 244924643
RAC: 16631

I'll run another shift for you.

Should work in a few minutes from now.

BM
