FGRPopencl2-ati beta test application is broken

Wedge009
Joined: 5 Mar 05
Posts: 122
Credit: 17391631162
RAC: 7075186

Ian&Steve C. wrote: here:

Hmm, the thread title and location didn't immediately suggest any relation to the beta application to me - I did try to see if anyone else had issues before making my post.

I'm not using the few NV GPUs I have for E@H at present - the opportunity cost is such that I prefer to use them for other projects where NV architectures do better. But if the beta applications improve things for NV GPUs, that may change.

Regarding OpenCL, my understanding is that 3.0 largely ignores 2.0 (same as what Nvidia and much of the industry did) and uses 1.2 as a starting point. But if the desired feature is specified in OpenCL 2.0 only, then that's an interesting situation. Either way, it looks to me like the current application doesn't cope well on AMD GPUs.

AFAIK, AMD's Windows drivers still use PAL for their OpenCL implementation; only their Linux drivers use ROCr. Unfortunately, I can't get BOINC to run GPU tasks at all with the ROCr implementation on Linux and am stuck with an old driver that still uses the PAL implementation for now.


Richie wrote:

What I've seen so far is that all successful v1.28 runs on AMD GPUs have been performed either

1) in Linux (Polaris, Vega and RX 5000/6000 series GPUs all have been OK)

2) in Windows, but then with RX5000/6000 series GPUs only.

All systems with Polaris or Vega that I've seen have failed to run v1.28 tasks in Windows.

I don't believe that has anything to do with OS being Windows 7. Mine crashed the tasks in Windows 10 (19043).

At the same time I want to say that my observations are based on a very tiny amount of material. But if there's anybody that has been able to run v1.28 tasks with Polaris or Vega GPU in Windows I'd be eager to see that.

Currently I think that for some reason Polaris and Vega series cards may all be incompatible with the 1.28 app, but only in Windows environment. If there's been exceptions to this... let their shine come to me please.

Thanks for that, this information is really helpful. As mentioned above I haven't got any of the new applications for Linux - are you using the PAL or ROCr-based OpenCL implementation?

Given this, I'm pretty sure the issue is hardware-specific rather than OS-specific - I managed to test the beta application on my last currently available Windows host, a GCN3-based APU on Windows 10, and that also failed within seconds with 'The network BIOS session limit was exceeded' error. (What does that even mean?)

Also as mentioned above, I don't have any Polaris or newer GPUs other than an RDNA1-based RX5600M in a Windows 10 laptop (which also failed on the beta application).

 

Ian&Steve C. wrote:

I was more suggesting that the drivers available for Windows 7 might not have the proper support. He might try using the latest driver package if he's not already. It looks like version 21.5.2 was released back in June of this year and is the latest available for Radeon VII/Windows 7.

I should have clarified that I'm already using the latest drivers. Since Windows 7 and GPUs before Vega are now considered 'legacy', that means 21.5.2 for Windows, as you just mentioned.

 

Ryusennin wrote:

It's not specific to Win7. The new beta client is completely borked on Win10 as well. For the past few days, all my opencl2 tasks have failed with computation errors in less than 10 seconds.

Ryzen 2600X + Vega 56 + Radeon 21.8.2

Thanks for this as well. Does sound like it's currently no good with Vega GPUs.

Soli Deo Gloria

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Wedge009 wrote:
As mentioned above I haven't got any of the new applications for Linux - are you using the PAL or ROCr-based OpenCL implementation?

For these tasks I have run Windows systems only. I don't know anything about the driver stuff for AMD on Linux, sorry. I gathered that information just from looking at messages and results that other users running Linux had provided so far.

klepel
Joined: 12 Jun 12
Posts: 2
Credit: 461602977
RAC: 93631

Just to add, these two hosts do not work with the new app either:
https://einsteinathome.org/es/host/12811486/tasks/0/0
https://einsteinathome.org/es/host/12886622/tasks/0/0
These are an AMD 2200G and an AMD 5600G with integrated GPUs.
It is definitely the new app, as both hosts worked perfectly before under Win10!
 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3956
Credit: 46868042642
RAC: 64464009

klepel wrote:

Just to add, these two hosts do not work with the new app either:
https://einsteinathome.org/es/host/12811486/tasks/0/0
https://einsteinathome.org/es/host/12886622/tasks/0/0
These are an AMD 2200G and an AMD 5600G with integrated GPUs.
It is definitely the new app, as both hosts worked perfectly before under Win10!
 

The new app has new features that require different drivers. It's very possible that the Windows drivers, or those devices (or both), do not support the required features.

The old app was perfectly happy running on legacy OpenCL 1.2 features. AMD's driver support is pretty hit or miss. OpenCL 1.2 was pretty universally supported, but documentation and support for some more advanced features can be harder to iron out. I had to really dig to find the setup that worked on my RX 570 Polaris card. At least with Linux you have a few driver models to try, but unfortunately you don't really get that option on Windows.

I wouldn't get my hopes up for APU support here. It's probably best to just turn off beta tasks in your case. Others have reported that Vega (and older) sees no benefit with this new app anyway. The best improvement was seen on Navi 10 cards.

_________________________________________________________________________

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

I had issues with this host. 435 tasks failed almost immediately, even after updating the driver to 21.5.2. (The previous driver was from 2019.) For now, I disabled beta apps. I'll give it a couple weeks and try again.

https://einsteinathome.org/host/12799882

If it is determined that the OS is the problem, upgrading to Windows 10 is not an issue for me. I'm doing that at some point anyway.

Clear skies,
Matt
Ian&Steve C.
Joined: 19 Jan 20
Posts: 3956
Credit: 46868042642
RAC: 64464009

I've tested on my test bench.

  • CPU: Intel Xeon E5-1680v2
  • MB: ASUS P9X79-E WS (GPU plugged into slot connected direct to CPU)
  • GPU: RX 570 4GB (Polaris)

 

Under Linux (Ubuntu 20.04.3 w/5.11.0-27 kernel) with ROCm 4.2 drivers, this system successfully processes the 1.28 tasks. They are a few seconds slower than the 1.18 tasks, but they complete and do not produce errors.

 

The exact same platform booted into Windows (Win 10 x64) produces the same errors that most others are reporting with their Windows hosts and older GPUs. I tried all three latest driver packages (21.8.2, 21.6.1, and even the enterprise Pro driver 21.Q2.1) and they all produced the same error.

 

Since the exact same hardware configuration works under Linux, and a handful of other users have reported success under Windows with Navi GPUs, I think it's likely that the Windows drivers just don't have the proper level of support for the new code features on these older GPUs and only really give legacy support even though they report OpenCL 2.0. I'm more hesitant to think there's a problem inherent to the Windows application itself, since some Navi/Windows users had success with the same app.

 

Windows users with GPUs older than Navi should probably disable beta tasks for now.

_________________________________________________________________________

metalius
Joined: 29 Dec 05
Posts: 44
Credit: 173173853
RAC: 0

The same on Radeon R7 - permanent errors at 9 seconds.

Any solution?

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

metalius wrote:

The same on Radeon R7 - permanent errors at 9 seconds.

Any solution?

This is from elsewhere (written by IAN&STEVE C.) but applies directly to your situation too:

"you need to disable beta tasks in your preferences. These 1.28 are beta. All of your successful tasks have been with the standard 1.22 app. You have no success with 1.28. "

metalius
Joined: 29 Dec 05
Posts: 44
Credit: 173173853
RAC: 0

Thank You, I missed this moment.

ahorek's team
Joined: 16 Dec 05
Posts: 19
Credit: 249306594
RAC: 6819

Any feedback from developers? Here's my output:

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
Byl p - exit code 69 (0x45)</message>
<stderr_txt>
02:07:43 (9648): [normal]: This Einstein@home App was built at: Aug 17 2021 14:12:21

02:07:43 (9648): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.28_windows_x86_64__FGRPopencl2-ati.exe'.
02:07:43 (9648): [debug]: 1e+016 fp, 5.2e+009 fp/s, 2005255 s, 557h00m55s05
02:07:43 (9648): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.28_windows_x86_64__FGRPopencl2-ati.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3011L03.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 620.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.69860773e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3011L03_0628_31597272.dat --debug 0 -o LATeah3011L03_628.0_0_0.0_31597272_1_0.out
output files: 'LATeah3011L03_628.0_0_0.0_31597272_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah3011L03_628.0_0_0.0_31597272_1_0' 'LATeah3011L03_628.0_0_0.0_31597272_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah3011L03_628.0_0_0.0_31597272_1_1'
02:07:43 (9648): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
02:07:43 (9648): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0000000003962490 , 00007ffbf613f000]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "Fiji" by: Advanced Micro Devices, Inc.
Max allocation limit: 3422552064
Global mem size: 0
read_checkpoint(): Couldn't open file 'LATeah3011L03_628.0_0_0.0_31597272_1_0.out.cpt': No error (0)
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
Error during OpenCL FFT (error: -5)
ERROR: gen_fft_execute() returned with error 981817664
02:07:50 (9648): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags: PRECISION
02:07:50 (9648): [normal]: done. calling boinc_finish(69).
02:07:50 (9648): called boinc_finish(69)

</stderr_txt>
]]>

Error -5 should be CL_OUT_OF_RESOURCES, which the OpenCL specification describes as "if there is a failure to allocate resources required by the OpenCL implementation on the device".
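For anyone else decoding these numbers, here is a minimal Python sketch of the two lookups involved. The tables are deliberately partial and the function names are made up for illustration. It also assumes that the 69 passed to boinc_finish() ends up being rendered as a Windows system error code, which would match the 'network BIOS session limit' text reported earlier in the thread; that reading is a guess, not something the developers have confirmed.

```python
# Partial lookup tables: only a handful of well-known codes are listed here.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Windows system error 69 (0x45) is ERROR_TOO_MANY_SESS.
WINDOWS_SYSTEM_ERRORS = {
    69: "ERROR_TOO_MANY_SESS: The network BIOS session limit was exceeded.",
}


def describe_opencl_error(code):
    """Return the symbolic name for an OpenCL status code, if it is in the table."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL error {code}")


def describe_exit_code(code):
    """Return the Windows system message that shares this numeric value, if known."""
    return WINDOWS_SYSTEM_ERRORS.get(code, f"no entry for exit code {code} ({hex(code)})")


if __name__ == "__main__":
    print(describe_opencl_error(-5))
    print(describe_exit_code(0x45))
```

If that assumption holds, the NetBIOS message is just a coincidence of numbering and has nothing to do with networking; the real failure is the OpenCL FFT error further up in the log.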

I have the most recent driver and it reports that OpenCL 2 is supported, but the global mem size looks wrong. Maybe global mem isn't supported on Fiji? Is there any workaround possible? Unfortunately, the source code doesn't seem to be public (or I didn't find it).
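The odd memory figures can be checked mechanically. This is a rough Python sketch, assuming the 'Max allocation limit' and 'Global mem size' lines are byte counts printed exactly as in the output above (the helper names are hypothetical). A device should not report less total global memory than its largest single allocation, so 0 against 3422552064 suggests the value is being reported wrongly rather than that the card has no global memory.

```python
import re


def parse_memory_lines(stderr_text):
    """Pull the two memory figures (assumed to be bytes) out of the app's stderr."""
    labels = (("Max allocation limit", "max_alloc"), ("Global mem size", "global_mem"))
    fields = {}
    for label, key in labels:
        match = re.search(rf"^{re.escape(label)}:\s*(\d+)\s*$", stderr_text, re.MULTILINE)
        if match:
            fields[key] = int(match.group(1))
    return fields


def looks_inconsistent(fields):
    """True when both values are present and total memory is below the max allocation."""
    if "max_alloc" not in fields or "global_mem" not in fields:
        return False
    return fields["global_mem"] < fields["max_alloc"]


if __name__ == "__main__":
    sample = "Max allocation limit: 3422552064\nGlobal mem size: 0\n"
    print(looks_inconsistent(parse_memory_lines(sample)))
```

One unverified possibility is a 32-bit truncation when the value is stored or printed: Fiji cards carry 4 GiB, and 4294967296 wraps to exactly 0 in 32 bits. If so, the zero would be cosmetic and unrelated to the FFT failure.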
