Latest data file for FGRPB1G GPU tasks

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 988

Credit: 25171438

RAC: 0

29 Jan 2019 14:03:00 UTC

Message 169180

Thanks for that detail. I'll dig out a similar test case then to avoid any red herrings...

Update:

I found a fresh workunit LATeah0104Y_1004 that
- FAILS on the RTX 2080 Ti (Turing)
- FAILS on the GV100 (Volta)
- WORKS on the GTX 1080 Ti (Pascal)
Same results across all tested app versions: 1.20, 1.17, 1.12

That's something to work with! Off debugging...

Oliver

Einstein@Home Project

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 988

Credit: 25171438

RAC: 0

1 Feb 2019 14:47:00 UTC

Message 169243

Weekend update:

I could isolate the problem: it's an issue in the thread management where a specific local memory synchronization barrier isn't reached by all threads, so the process waits forever.
I tried a few ways to get around the problem but without any luck. Sure, removing the barrier gets things moving on the Turing (and Volta) but the computations would likely be incorrect, so we need to find the actual root cause.
An educated guess makes me think that NVIDIA's Independent Thread Scheduling (introduced with Volta) might be the culprit here, so that's what I'll look into next. Not sure how that applies to OpenCL, though...
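
To make the failure mode concrete, here is a minimal CPU-side analogy in Python. It is not the FGRPB1G OpenCL kernel and says nothing about how the GPU schedules threads; it only shows what happens when one member of a group never arrives at a barrier the others are waiting on. The names `run_work_group`, `skip_barrier_ids` and `timeout` are made up for this sketch, and the timeout exists only so the demo terminates (a GPU barrier has no timeout, so the real process simply hangs).

```python
import threading


def run_work_group(num_threads, skip_barrier_ids, timeout=0.5):
    """Simulate a group of threads that are all expected to meet at one barrier.

    Threads whose id is in skip_barrier_ids take a path that never reaches
    the barrier. Returns (finished_ids, stuck_ids), both sorted.
    """
    barrier = threading.Barrier(num_threads)
    finished, stuck = [], []
    lock = threading.Lock()

    def worker(tid):
        if tid in skip_barrier_ids:
            # This thread leaves early and never arrives at the barrier.
            with lock:
                finished.append(tid)
            return
        try:
            # Every remaining thread waits here for ALL num_threads to arrive.
            # Without the timeout this call would block forever.
            barrier.wait(timeout=timeout)
            with lock:
                finished.append(tid)
        except threading.BrokenBarrierError:
            # The barrier was never completed: these threads were stuck.
            with lock:
                stuck.append(tid)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(finished), sorted(stuck)
```

Calling `run_work_group(4, set())` lets all four threads pass the barrier, while `run_work_group(4, {3})` leaves threads 0 to 2 stuck because thread 3 never shows up. Simply deleting the barrier would "fix" the hang in the same way, but the threads would then no longer be synchronized at that point, which is why removing it is not a real solution.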

Hang in there, we're on it...

Oliver

Einstein@Home Project

Yeti

Joined: 17 Nov 04

Posts: 59

Credit: 1371204130

RAC: 560

1 Feb 2019 15:46:34 UTC

Message 169245 in response to message 169243

Thanks for the Update

Supporting BOINC, a great concept !

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 558

Credit: 10866246174

RAC: 14188719

7 Feb 2019 9:46:02 UTC

Message 169364

Hope I have the "right" thread for my question/problem:

FGRPopencl1K-nvidia ... LATeah2103L: VERY long elapsed times (around 2 hours!).

WU 388601504

computer 12761622

Should I cancel/abort these ?

I am no "expert". Do you need more information?

THANKS

UPDATE:

Task ID 827356700 still shows "new" and zero time after more than 2 hours?

Used to run under 10 minutes.

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 988

Credit: 25171438

RAC: 0

7 Feb 2019 9:45:55 UTC

Message 169365

If you are running on a Turing or Volta GPU, I'd say yes. You can then opt out of this application until we have fixed the problem on those GPU architectures.

Oliver

Einstein@Home Project

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 558

Credit: 10866246174

RAC: 14188719

7 Feb 2019 9:48:41 UTC

Message 169366 in response to message 169365

OK - tried to read this thread, but couldn't really grasp which tasks are meant.

THANKS.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253590105

RAC: 36264

11 Feb 2019 13:58:00 UTC

Message 169441

The workunits currently being generated (LATeah10xxL) should also work on Turing and Volta cards. We'll continue to investigate.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253590105

RAC: 36264

11 Feb 2019 14:25:32 UTC

Message 169443

We'll try to figure out a way to prevent the scheduler from sending older tasks to such cards.

Keith Myers

Joined: 11 Feb 11

Posts: 5057

Credit: 19254917478

RAC: 6966979

11 Feb 2019 21:09:23 UTC

Message 169448

So is this the new plan of attack for Turing/Volta? Prevent sending incompatible tasks vice fixing the application itself?

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253590105

RAC: 36264

11 Feb 2019 22:25:00 UTC

Message 169449 in response to message 169448

Keith Myers wrote:

So is this the new plan of attack for Turing/Volta? Prevent sending incompatible tasks vice fixing the application itself?

Actually we will do both. The limiting factor is manpower, and producing and sending compatible workunits is the most efficient thing we can do right now.

The reason for this problem appears to be a new feature of Turing/Volta ("independent thread scheduling"), over which we have very limited control in OpenCL. We might again intensify our efforts to develop a CUDA version, which will likely give us more performance on NVIDIA cards and solve this problem as well. But we need more time for this than we currently have.
