Latest data file for FGRPB1G GPU tasks

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 935
Credit: 25166626
RAC: 0

Thanks for that detail. I'll dig out a similar test case then to avoid any red herrings...

Update:

  • I found a fresh workunit LATeah0104Y_1004 that
    • FAILS on the RTX 2080 Ti (Turing)
    • FAILS on the GV100 (Volta)
    • WORKS on the GTX 1080 Ti (Pascal)
  • Same results across all tested app versions: 1.20, 1.17, 1.12

That's something to work with! Off debugging...

Oliver

Einstein@Home Project

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 935
Credit: 25166626
RAC: 0

Weekend update:

  • I could isolate the problem: it's an issue in the thread management where a specific local memory synchronization barrier isn't reached by all threads, so the process waits forever.
  • I tried a few ways to get around the problem but without any luck. Sure, removing the barrier gets things moving on the Turing (and Volta) but the computations would likely be incorrect, so we need to find the actual root cause.
  • An educated guess makes me think that NVIDIA's Independent Thread Scheduling (introduced with Volta) might be the culprit here, so that's what I'll look into next. Not sure how that applies to OpenCL, though...
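To illustrate the failure mode described above, here is a minimal CPU-side analogy in Python. It is only a sketch, not the actual FGRPB1G kernel code: `run_workgroup` and `skip_ids` are hypothetical names, ordinary CPU threads stand in for GPU work-items, and a timeout is added so the demo terminates. A real GPU barrier has no timeout, which is why the task simply waits forever.

```python
import threading


def run_workgroup(num_threads, skip_ids, timeout=1.0):
    """Simulate a work-group whose threads must all meet at one barrier.

    Threads listed in skip_ids take a code path that never reaches the
    barrier. Returns one status string per thread.
    """
    # The barrier expects ALL num_threads participants, like a
    # work-group-wide synchronization barrier on the GPU.
    barrier = threading.Barrier(num_threads)
    results = [None] * num_threads

    def worker(tid):
        if tid in skip_ids:
            # This thread diverges and returns without hitting the barrier.
            results[tid] = "skipped"
            return
        try:
            barrier.wait(timeout=timeout)
            results[tid] = "passed"
        except threading.BrokenBarrierError:
            # Not everyone arrived in time. On a GPU there is no timeout,
            # so this would be an endless wait instead.
            results[tid] = "stuck"

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_workgroup(4, set()))          # every thread reaches the barrier
    print(run_workgroup(4, {3}, timeout=0.3))  # thread 3 never arrives
```

In the second call the three waiting threads can never be released, because the barrier still counts the missing fourth participant. Simply removing the barrier would let them continue, but they might then read data the other threads have not finished writing.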

Hang in there, we're on it...

Oliver

Einstein@Home Project

Yeti
Joined: 17 Nov 04
Posts: 59
Credit: 1242603629
RAC: 332785

Thanks for the Update

Supporting BOINC, a great concept!

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 259
Credit: 6676201637
RAC: 17574378

Hope I have the "right" thread for my question/problem:

FGRPopencl1K-nvidia ... LATeah2103L: VERY long elapsed times (around 2 hours!).

WU 388601504

computer 12761622

Should I cancel/abort these?

I am no "expert". Do you need more information?

THANKS

UPDATE:

Task ID 827356700 says "new" and zero time after more than 2 hours?

Used to run in under 10 minutes.

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 935
Credit: 25166626
RAC: 0

If you are running on a Turing or Volta GPU, I'd say yes. You can then opt out of this application until we have fixed the problem on those GPU architectures.

Oliver

Einstein@Home Project

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 259
Credit: 6676201637
RAC: 17574378

OK - tried to read this thread, but couldn't really grasp which tasks are meant.

THANKS.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4260
Credit: 244784928
RAC: 21047

The workunits currently being generated (LATeah10xxL) should also work on Turing and Volta cards. We'll continue to investigate.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4260
Credit: 244784928
RAC: 21047

We'll try to figure out a way to prevent the scheduler from sending older tasks to such cards.

BM

Keith Myers
Joined: 11 Feb 11
Posts: 4681
Credit: 17490198459
RAC: 6910509

So is this the new plan of attack for Turing/Volta? Prevent sending incompatible tasks vice fixing the application itself?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4260
Credit: 244784928
RAC: 21047

Keith Myers wrote:
So is this the new plan of attack for Turing/Volta?  Prevent sending incompatible tasks vice fixing the application itself?

Actually we will do both. The limiting factor is manpower, and producing and sending compatible workunits is the most efficient thing we can do right now.

The reason for this problem appears to be a new feature of Turing/Volta ("independent thread scheduling"), over which we have very limited control in OpenCL. We might step up our efforts again to develop a CUDA version, which would likely give us more performance on NVIDIA cards and solve this problem as well. But that needs more time than we currently have.

BM
