One machine failing gamma beta.

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519758940
RAC: 18171
Topic 227926

I have a Ryzen 9 3900XT with two AMD GPUs: an R9 Nano and an R9 280X.  If I allow beta applications for Gamma, every single task fails with a computation error within seconds of starting.  But all my other computers (a variety of Intels, some old, some new), all with R9 280X cards, work fine.  I noticed the dodgy computer is getting tasks labelled "Gamma-ray pulsar binary search #1 on GPUs (FGRPopencl2-ati)".  The working computers get the same but with "(FGRPopencl1K-ati)" on the end.  Why is one machine getting different ones?  Is it because of the Nano?  Whatever they are, they don't run on the Nano or the 280X.

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519758940
RAC: 18171

I notice the "2" tasks are labelled as beta on the applications page.  The 1K ones are not.  Strange that only one of my computers is given beta work.

 

I'm getting "The network BIOS session limit was exceeded", which is an error usually found when connecting across a network:

 

<core_client_version>7.20.1</core_client_version>
<![CDATA[
<message>
The network BIOS session limit was exceeded.
 (0x45) - exit code 69 (0x45)</message>
<stderr_txt>
14:51:38 (51932): [normal]: This Einstein@home App was built at: Aug 17 2021 14:12:21

14:51:38 (51932): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.28_windows_x86_64__FGRPopencl2-ati.exe'.
14:51:38 (51932): [debug]: 1e+016 fp, 5.2e+009 fp/s, 2021881 s, 561h38m00s77
14:51:38 (51932): [normal]: % CPU usage: 0.010000, GPU usage: 0.333000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.28_windows_x86_64__FGRPopencl2-ati.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3012L12220727.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 892.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.69860773e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3012L12220727_0900_31668201.dat --debug 0 -o LATeah3012L12220727_900.0_0_0.0_31668201_1_0.out
output files: 'LATeah3012L12220727_900.0_0_0.0_31668201_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220727_900.0_0_0.0_31668201_1_0' 'LATeah3012L12220727_900.0_0_0.0_31668201_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220727_900.0_0_0.0_31668201_1_1'
14:51:38 (51932): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
14:51:38 (51932): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0000000003a87680 , 00007ffe6145f000]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "Fiji" by: Advanced Micro Devices, Inc.
Max allocation limit: 3422552064
Global mem size: 0
read_checkpoint(): Couldn't open file 'LATeah3012L12220727_900.0_0_0.0_31668201_1_0.out.cpt': No error (0)
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
Error during OpenCL FFT (error: -5)
ERROR: gen_fft_execute() returned with error -343431264
14:51:46 (51932): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags: PRECISION
14:51:46 (51932): [normal]: done. calling boinc_finish(69).
14:51:46 (51932): called boinc_finish(69)

</stderr_txt>
]]>
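A note on reading the log above: the app called boinc_finish(69), and the "network BIOS session limit" text appears to be nothing more than the Windows system message that happens to belong to error number 69, so it probably says nothing about networking. The more useful line is "Error during OpenCL FFT (error: -5)". Assuming that number is a raw OpenCL status code (the log does not say so explicitly), it can be looked up. The sketch below is only a small hand-written lookup table, not part of BOINC or the Einstein@Home app, and the function name is made up for illustration.

```python
# Hypothetical helper: translate the two numbers seen in the stderr log.
# Only a handful of codes are listed; values follow the OpenCL headers
# and the Windows system error list.
OPENCL_STATUS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

WINDOWS_ERRORS = {
    69: "ERROR_TOO_MANY_SESS (The network BIOS session limit was exceeded.)",
}


def describe(code: int, table: dict) -> str:
    """Return the symbolic name for a code, or a marker if it is not listed."""
    return table.get(code, f"unknown code {code}")


if __name__ == "__main__":
    # Assumption: the FFT error is an OpenCL status code.
    print("FFT error -5 :", describe(-5, OPENCL_STATUS))
    # The exit code is only being rendered as a Windows system error text.
    print("exit code 69 :", describe(69, WINDOWS_ERRORS))
```

If that reading is right, the task is failing on the GPU side while setting up or running the FFT, which fits a driver or OpenCL problem rather than anything to do with the network.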




Ian&Steve C.
Joined: 19 Jan 20
Posts: 4045
Credit: 48080764183
RAC: 33735340

the "2" app means it requires OpenCL 2.0 (this app was built with features from the OpenCL 2.0 spec, which are not supported if your device or driver offers less than 2.0 support). the project checks the driver details to see if OpenCL 2.0 is supported. driver details are communicated to the project via BOINC in the scheduler RPC. if your system reports 2.0 support, and you have beta tasks turned on, then the project green-lights you for the 2.0 app.

the simple explanation for why your R9 Fury gets the "bad" tasks is that it supports OpenCL 2.0 and your other cards do not.

 

the root cause of the errors seems to be that even though the drivers and device claim to support OpenCL 2.0, they don't really support it correctly, or don't support all features of 2.0. driver issues are ubiquitous with AMD.

 

just turn off beta tasks.

or edit/lock your coproc file and force it to report only OpenCL 1.2 support
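A rough sketch of what that edit could look like, under several assumptions: that the client caches detected GPU details in a file named coproc_info.xml in the BOINC data directory, that each OpenCL device entry carries a version string inside an `<opencl_device_version>` tag, and that the scheduler decides on the basis of that string. None of this is confirmed in this thread. The sample XML and its driver build number are invented for illustration, and `cap_opencl_version` is a made-up helper name.

```python
import re

# Assumption: the version tag looks like
#   <opencl_device_version>OpenCL 2.0 AMD-APP (...)</opencl_device_version>
VERSION_TAG = re.compile(r"(<opencl_device_version>\s*OpenCL )\d+\.\d+")


def cap_opencl_version(xml_text: str, reported: str = "1.2") -> str:
    """Rewrite every reported OpenCL device version to `reported`."""
    return VERSION_TAG.sub(lambda m: m.group(1) + reported, xml_text)


# Invented sample; a real file holds many more fields per device.
sample = """<ati_opencl>
   <name>Fiji</name>
   <opencl_device_version>OpenCL 2.0 AMD-APP (0000.0)</opencl_device_version>
</ati_opencl>"""

patched = cap_opencl_version(sample)
print(patched)
```

The client normally rewrites this file when it detects GPUs at startup, which is presumably why the file has to be locked (made read-only) after editing. Back up the original first, and stop BOINC before touching it.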

_________________________________________________________________________

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519758940
RAC: 18171

When the bad machine downloads the "2" tasks, it tries to run them on both cards (Fury and Tahiti) and both fail.  The Tahiti in it (like the cards in the machines that aren't given those tasks) is OpenCL 1.2; the Fury is 2.0.  This is reported in Boinc at startup.  Actually that Fury does have trouble with a couple of games the Tahiti is OK with.  I'll go with AMD bug.

I don't want to turn off beta as I'm doing the BRP7.  I'll try the coproc edit, that sounds like a good idea.


Ian&Steve C.
Joined: 19 Jan 20
Posts: 4045
Credit: 48080764183
RAC: 33735340

Peter Hucker of the Scottish Boinc Team wrote:

This is reported in Boinc at startup.

but as has been mentioned many times before, BOINC only communicates device info for the "first/best" GPU to the projects. the OpenCL 2.0 differentiation happens at the scheduler level, not the application level. as far as the project is concerned, the host has two R9 Fury GPUs and meets the OpenCL 2.0 criteria, so it gets sent the application/task. on the client side, BOINC doesn't inherently know that GPU1/device1 can't run the task. it's just going in the normal FIFO order and running each task as it comes up, which is why it runs on both GPUs. hypothetically, if the drivers were fine and the R9 Fury could actually process it, you'd have to specifically exclude GPU1/device1 from that plan class (in cc_config) to avoid execution on it.
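For reference, the exclusion mentioned above would go into cc_config.xml as an `<exclude_gpu>` entry. The sketch below just builds such a fragment, and every value in it is an assumption: the project URL is guessed from the project directory name in the log and must match what the client shows, the app name is guessed from the executable name in the log, and the device number is a placeholder for whatever index the client's startup messages give the card to be excluded. As far as I know `<exclude_gpu>` selects by app name rather than by plan class, so it would also keep the non-beta version of the same app off that card. `build_exclude_gpu` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu(url, device_num, gpu_type, app=None):
    """Return a cc_config.xml document containing one <exclude_gpu> entry."""
    cc = ET.Element("cc_config")
    options = ET.SubElement(cc, "options")
    ex = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(ex, "url").text = url
    ET.SubElement(ex, "device_num").text = str(device_num)
    ET.SubElement(ex, "type").text = gpu_type
    if app is not None:
        # Optional: without <app>, the device is excluded for the whole project.
        ET.SubElement(ex, "app").text = app
    ET.indent(cc)  # pretty-printing; needs Python 3.9 or newer
    return ET.tostring(cc, encoding="unicode")


if __name__ == "__main__":
    # All four values are assumptions; check them against your own client.
    print(build_exclude_gpu(
        url="http://einstein.phys.uwm.edu/",
        device_num=0,
        gpu_type="ATI",
        app="hsgamma_FGRPB1G",
    ))
```

An existing cc_config.xml should be merged by hand rather than overwritten, and the client has to re-read its config files (or be restarted) before the exclusion applies.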

_________________________________________________________________________

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519758940
RAC: 18171

The Fury is d1, Tahiti d0.  I guess it must be best, not first.

