FGRPopencl2-ati beta test application is broken

Wedge009
Joined: 5 Mar 05
Posts: 77
Credit: 5,880,462,979
RAC: 13,405,650
Topic 225921

I didn't realise I had beta applications enabled - within the last 12 hours I've had three Windows hosts knocked out of commission by the introduction of this application whose tasks all fail within seconds with the error message 'The network BIOS session limit was exceeded'.

As a result those hosts are not able to get work for 24 hours by the too-many-errors safety mechanism.

I see this application is also available for Linux but I don't think I got any tasks assigned to my Linux hosts - I've turned off beta applications for now.

Soli Deo Gloria

Ian&Steve C.
Joined: 19 Jan 20
Posts: 913
Credit: 6,712,937,834
RAC: 25,637,355

yes this is the reason the app is beta. always the possibility that something causes errors. having beta enabled accepts this risk.

there is actually a typo in the plan class for linux app, so it hasn't been sent to anyone.

_____________________________________________

Wedge009

My question is more about why it was deployed with a 100% error rate, at least for AMD GPUs, less about the test application itself. My past experience with Einstein betas (from many years ago) was more positive than this.

Soli Deo Gloria

Ian&Steve C.

Wedge009 wrote:

My question is more about why it was deployed with a 100% error rate, at least for AMD GPUs, less about the test application itself. My past experience with Einstein betas (from many years ago) was more positive than this.

several other users have reported that the application works for them. so not 100% failure rate, just 100% failure on your system.

possible that there's some system incompatibility with your system.

the note in preferences regarding beta tasks is pretty clear:

Quote:

Run test applications?:
This helps us develop applications, but may cause jobs to fail on your computer

it's possible that whatever driver you're using under windows 7 doesn't properly support the necessary opencl 2.0 features that are required by this new app, even if it's reporting to boinc opencl 2.0 support.

arch tried on his Windows 10 machine and the app worked.

richie had errors similar to yours on an RX 580 / Win 10 machine, but polaris opencl 2.0 support seems spotty. and with the cluster that is AMD drivers as a whole, anything is possible.

_____________________________________________

Wedge009

Ian&Steve C. wrote:

several other users have reported that the application works for them. so not 100% failure rate, just 100% failure on your system.

possible that there's some system incompatibility with your system.

Can you point me to where these reports are so that I can contribute to them? I wasn't aware of any such reports. I can appreciate that often problems can be on the user's end, but when I see multiple otherwise-reliable hosts suddenly failing with the same message my past experience suggests to me a problem with the application/server rather than with the user.

To give some context, I had recently been having trouble with my Internet connection (outside of my control) so in response I increased my work cache to 1 day to compensate. That gave me hundreds of tasks in cache, but unfortunately this started including the FGRPopencl2-ati application and when I reached those they all failed and got dumped on the server in one go, causing my hosts to be knocked out as unreliable.

I reiterate that my past experience with beta applications has generally been positive.

Ian&Steve C. wrote:

it's possible that whatever driver you're using under windows 7 doesn't properly support the necessary opencl 2.0 features that are required by this new app, even if it's reporting to boinc opencl 2.0 support. 

It's true that where I'm still using Windows I'm using Windows 7 by choice, but I do have one Windows 10 laptop and the tasks failed on that host as well: https://einsteinathome.org/host/12840516/tasks/6/40

I also wonder what these 'new features' are that require OpenCL 2.0 support. AFAIK OpenCL 2.0 is old, was never really supported by Nvidia GPUs, and was mainly supported only on AMD GPUs. Why suddenly require these specific features?

At least for my case the hardware concerned are Fiji and Vega20 (on Windows 7) and Vega + RDNA1 (Windows 10). So nothing Polaris or newer other than RDNA.

And while many are quick to bemoan AMD's drivers, I've generally found them pretty good on Windows and more recently Linux - support for PAL vs ROCr on Linux notwithstanding.

I'm not sure what the intention here is, but I feel like this is somehow being made out to be 'my fault'. Consider that if I hadn't reported this, the issue could be going unnoticed - of course that assumes there is an issue to begin with, but that's what I'm trying to find out.

Soli Deo Gloria

Ian&Steve C.

Wedge009 wrote:

Can you point me to where these reports are so that I can contribute to them?

here: https://einsteinathome.org/content/improvements-code-clients?page=2#comment-188540

and here: https://einsteinathome.org/content/improvements-code-clients?page=3#comment-188577

give that whole thread a read. this new app is mainly aimed at nvidia (and there's an nvidia app that seems to work for everyone). the main change in the app is the use of "global" address space instead of the constant that was being used before. this caused major slowdowns for nvidia cards. but the use of global is what requires opencl 2.0. and it just so happens that nvidia drivers now support opencl 3.0 (including the necessary 2.0 features). nvidia cards are seeing a 50-100% speed increase.

in pre-release testing by myself and a few others, we saw modest speed improvements in AMD Navi GPUs also (like ~20%) but no improvement on Polaris cards (and getting the drivers to give opencl 2.0 support on an RX 570 required jumping through lots of hoops and trying many different driver versions). We didn't have any Vega cards to test, but according to AMD, Vega is supported for OpenCL 2.0 on their latest drivers with the ROCr stack, or with ROCm drivers. unsure how this support is on Windows 7 however.

I never said it was your "fault" just that the failure rate is not 100% since other users have had success with it, both on windows and Linux. you might just need a different driver.
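
to illustrate the "driver reports opencl 2.0" part: the version string a device reports follows the OpenCL spec format `OpenCL <major>.<minor> <vendor-specific info>`. below is a minimal Python sketch, assuming you have copied that string out of your BOINC event log or a tool like clinfo. the function names and sample strings are hypothetical and are not taken from the Einstein@Home app or from BOINC. it only checks what the driver claims, so a pass here does not prove the 2.0 features the app needs actually work.

```python
import re

# Per the OpenCL spec, CL_DEVICE_VERSION has the form
# "OpenCL <major>.<minor> <vendor-specific information>".
_VERSION_RE = re.compile(r"^OpenCL (\d+)\.(\d+)(?: |$)")


def parse_opencl_version(version_string):
    """Return (major, minor) parsed from a device version string, or None if malformed."""
    match = _VERSION_RE.match(version_string)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def reports_at_least(version_string, required=(2, 0)):
    """True if the reported version is >= required. This is only the driver's claim."""
    version = parse_opencl_version(version_string)
    return version is not None and version >= required


if __name__ == "__main__":
    # Hypothetical sample strings, for illustration only.
    for sample in ("OpenCL 2.0 AMD-APP (1234.5)", "OpenCL 1.2 ExampleVendor"):
        print(sample, "->", reports_at_least(sample))
```

if this says True and the tasks still fail within seconds, that points at the driver's actual 2.0 support (or the app) rather than the reported version.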

_____________________________________________

Richie
Joined: 7 Mar 14
Posts: 638
Credit: 1,701,152,495
RAC: 63,130

What I've seen so far is that all successful v1.28 runs on AMD GPUs have been performed either

1) in Linux (Polaris, Vega and RX 5000/6000 series GPUs have all been OK)

2) in Windows, but then with RX 5000/6000 series GPUs only.

All systems with Polaris or Vega that I've seen have failed to run v1.28 tasks in Windows.

I don't believe that has anything to do with OS being Windows 7. Mine crashed the tasks in Windows 10 (19043).

At the same time I want to say that my observations are based on a very tiny amount of material. But if there's anybody that has been able to run v1.28 tasks with Polaris or Vega GPU in Windows I'd be eager to see that.

Currently I think that for some reason Polaris and Vega series cards may all be incompatible with the 1.28 app, but only in Windows environment. If there's been exceptions to this... let their shine come to me please.

Ian&Steve C.

I too don't believe it's due to Windows 7 directly, I was more suggesting that the drivers available for windows 7 might not have the proper support. He might try using the latest driver package if he's not already. looks like version 21.5.2 was released back in June of this year and is the latest available for Radeon VII/Windows 7.

Richie, can you link me to a Linux system that ran ok with a Polaris card?

I'm still unable to even get sent any 1.28 tasks to my Linux/Polaris card. I have beta enabled, but the scheduler keeps trying to send me the bad/incorrect plan class, and never even checks for the good one. not sure what's wrong. I've even tried resetting the project on this host, with no luck.

_____________________________________________

Richie

Ian&Steve C. wrote:
Richie, can you link me to a Linux system that ran ok with a Polaris card?

Actually I thought that you had been able to run them earlier while you were testing the changes.  "My 570 saw no improvement"... but sorry, I had obviously misunderstood what that was indeed referring to. I'm going to search more.

Ryusennin
Joined: 11 Aug 13
Posts: 1
Credit: 55,672,108
RAC: 67,052

It's not specific to Win7. The new beta client is completely borked on Win10 as well. For the past few days, all my opencl2 tasks have failed with computation errors in less than 10 seconds.

Ryzen 2600X + Vega 56 + Radeon 21.8.2

Ian&Steve C.

Richie wrote:

Ian&Steve C. wrote:
Richie, can you link me to a Linux system that ran ok with a Polaris card?

Actually I thought that you had been able to run them earlier while you were testing the changes. "My 570 saw no improvement"... but sorry, I had obviously misunderstood what that was indeed referring to. I'm going to search more.

I ran my 570 with pre-release code, which was implemented a bit differently than the officially released beta app. And honestly I’m not sure if Bernd included all of the same code that I have.

I’ve been trying to get the new app to test back to back vs my previous code on my 570, but I’ve been unable to get the scheduler to send me the new beta app. Looking at the logs it never even tries to send it. I think there’s a bug in the scheduler surrounding the incorrect plan class name for Linux. Some Linux systems are recognizing the correct plan class, but mine never does. So I’ve never got the new app.

_____________________________________________
