Still receiving AMD tasks despite being disabled

Bill
Joined: 2 Jun 17
Posts: 38
Credit: 243954730
RAC: 252668
Topic 224170

On this computer I have "disabled" using the AMD GPU for this project.  What I mean is under the project preferences, I have disabled computing on AMD GPUs.  I assumed that this meant that E@H would not send over any AMD GPU tasks for this project, but I discovered today that I have had a lot of tasks recently aborted.  One of them is here, which produced the error 201 (0x000000C9) EXIT_MISSING_COPROC.

I have had these settings for several months now, and I thought everything was working fine.  Am I doing something wrong?  A few other things to note:

1.  Yes, the location of the computer matches the location settings.  This computer's location is 'home', and those are the settings I am editing.

2.  I do not want to disable the AMD GPU for the computer in BOINC's cc_config.xml file.  This GPU works fine with MilkyWay@Home, which does not need more than 2 GB for GPU work (that memory requirement is the reason I have this GPU disabled for this project).

 

Any help would be appreciated, thank you!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410201186
RAC: 35074749

Bill wrote:
On this computer I have "disabled" using the AMD GPU for this project.

Your computer shows as having an internal AMD GPU (Radeon Vega Graphics) and also a discrete NVIDIA GPU (GTX 1660Ti).  Stopping the use of the AMD GPU will not stop the project sending tasks for your 1660Ti.  If you click on the Task ID for any of your current batch of GW GPU tasks, it shows they are for Gravitational Wave search O2 Multi-Directional GPU v2.07 (GW-opencl-nvidia), i.e. your 1660Ti.

Disabling one particular type of GPU isn't the best way.  Instead, you should just remove all GPU searches at Einstein if you don't want Einstein to send you any GPU tasks of any description.  In your settings, scroll down below the GPU types and make sure any applications that use a GPU are not ticked.  Also just in case, make sure the setting for non-preferred apps is set to 'no'.  If the system runs out of work for your 'selected' apps, it might send you something you don't want :-).  Make sure to 'save' any changes you make.

Cheers,
Gary.

Bill
Joined: 2 Jun 17
Posts: 38
Credit: 243954730
RAC: 252668

Correct, I have two GPUs.  The AMD GPU (integral graphics to the Ryzen APU), can only use 2 GB of system memory.  Any tasks that I received for the APU would error out due to insufficient memory.  However, I am able to compute on the 1660Ti without any problems.  So, that's why I chose to turn off AMD computing and leave Nvidia computing on.

I would prefer not to disable all GPU computing.  The most I would be able to contribute with this computer is one CPU core if that was the case.

If disabling the GPU the way I have done isn't the right way to resolve this problem, then I don't understand the point in being able to disable different brand GPUs in the first place.  I think I'm going to ask for a bug fix, but I'm not sure if this is a BOINC problem or an E@H problem.  I'm guessing it is a BOINC problem.

Also, the last task that I had linked with a problem seems to have disappeared.  Here is a link to another task with the same error.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 260
Credit: 6916161637
RAC: 20151144

My two cents:

I suspect that there is a difference in the way that the recognition/differentiation is made between internal/integral and external/discrete GPUs.

.. just a thought ...

 

alanb1951
Joined: 28 Nov 16
Posts: 18
Credit: 642157540
RAC: 424478

Another twopence worth...

Did you disable the GPU in the same "location" as that to which the computer is assigned?  I seem to recall some folks having had issues with that in the past...

Cheers - Al.

 

Bill
Joined: 2 Jun 17
Posts: 38
Credit: 243954730
RAC: 252668

Al-

Yes, the computer is set to home, and that is the location where I have the AMD GPUs disabled.

mikey
Joined: 22 Jan 05
Posts: 11889
Credit: 1828210831
RAC: 202156

Bill wrote:

Al-

Yes, the computer is set to home, and that is the location where I have the AMD GPUs disabled. 

An easy way would be to exclude the AMD gpu from Einstein in the cc_config.xml file by adding these lines:


<options>
    <use_all_gpus>1</use_all_gpus>
    <exclude_gpu>
        <device_num>0</device_num>
        <url>http://einstein.phys.uwm.edu/</url>
    </exclude_gpu>
</options>

What that does first is tell BOINC to use all the GPUs (you probably already have that part); then it tells BOINC to exclude GPU #0 from Einstein but not from any other project.  You can add multiple <exclude_gpu> blocks if you want to, but that's not necessary unless you have problems elsewhere too.

The one problem with this is that I don't know which GPU is #0, your AMD GPU or your NVIDIA GPU, but when you first start BOINC it tells you which is which near the top of the event log.  So if the AMD GPU is not #0 then it will be #1, since you only have 2 GPUs in the box, and you will need to adjust the file.
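
For context, here is a rough sketch of how those lines might sit in a complete cc_config.xml (the file lives in the BOINC data directory).  The <cc_config> root element and the <type> line are not in the post above; they are assumptions based on my reading of the BOINC client configuration documentation, which (as I understand it) expects <type> when a machine has more than one kind of GPU.  The device number is a placeholder, and the comments are only explanatory and can be deleted.

```xml
<cc_config>
  <options>
    <!-- Let BOINC use every GPU it detects, not only the most capable one. -->
    <use_all_gpus>1</use_all_gpus>
    <exclude_gpu>
      <!-- URL as given in the post above; it should match the URL the project is attached with. -->
      <url>http://einstein.phys.uwm.edu/</url>
      <!-- Placeholder: use the number the event log reports for the AMD GPU. -->
      <device_num>0</device_num>
      <!-- Assumption: with both NVIDIA and AMD GPUs present, a type is also wanted. -->
      <type>ATI</type>
    </exclude_gpu>
  </options>
</cc_config>
```

If the documentation is right that GPU exclusions need a client restart, the change would only take effect once BOINC has been fully shut down and started again.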

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17549755376
RAC: 6433681

BOINC should number the Nvidia card first since it has the higher CC capability and more memory.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410201186
RAC: 35074749

Bill wrote:
Correct, I have two GPUs.  The AMD GPU (integral graphics to the Ryzen APU), can only use 2 GB of system memory.

I'm sorry that I didn't fully understand your situation.  There was no mention of the 1660Ti in your original message and I misinterpreted your intention.  For some reason I got it in my head that you wanted to stop all GPU computing at Einstein.

In any case, I'm not sure the real issue with the "Radeon Vega Graphics" is a 2 GB memory limit.  According to BOINC, the capabilities of both GPUs are, "GTX 1660 Ti (4095MB)" and "Vega 8 Graphics (7204MB)".  You obviously have plenty of system memory and BOINC seems to think you have access to quite a lot of it.  If there is a limit lower than what is advertised, it's a BOINC problem.  A project is supposed to be able to use what BOINC says is available.

Even your latest problem task link no longer exists.  That's not surprising since, once the full quorum is completed, the individual tasks tend to disappear from the online database fairly quickly.  After all, as they say, size matters :-), and the project is desperate to trim the online database as soon as possible after the data is safely stored offline.

The easiest way to show people tasks of that type is just to point to the errors list for that particular search and mention to click on any example where the status shows "aborted".

I'm actually quite interested in a bit more detail about exactly what happened with these.  There are two quite different bits of information when you click any particular task ID link, "Client state: Aborted by user" and "Exit status: 201 (0x000000C9) EXIT_MISSING_COPROC".  I'm interested in what exactly happened.  I'm wondering if you did something like disable the AMD GPU with some existing tasks which then became "GPU missing" in BOINC manager, and then you had to abort those tasks to get rid of them, or was the sequence completely different?  Did these tasks definitely arrive after the setting had been changed?  In looking at the BOINC documentation, it seems that excluding various GPU types requires a client restart.  Did you actually do a full restart of the client at some point after the setting got changed and before the client made further work requests?  Is it possible that there wasn't a client restart and could that be why you got more of those tasks?

Bill wrote:
Any tasks that I received for the APU would error out due to insufficient memory.

Do you know that for sure or are you assuming that's the case?  The examples I looked at show 0.00 for both CPU and Run Time.  You only get a memory problem after a short run time and the app then works out it can't store all the data.  There would be some stderr output to show this and there is none of that in your examples.

I'm really not trying to be difficult :-).  I'm just trying to understand the two conflicting bits of information mentioned previously.  I fully agree with you that using the internal GPU is not a good idea and I just want to understand why you had difficulty disabling it.

In the end the <exclude_gpu> directive in cc_config.xml, using the 'type' rather than 'device_num' might be the best way to go.  I don't know for sure as I've never needed to exclude particular GPUs.  Please note the "requires a restart" comment in the documentation if you set this up :-).

Cheers,
Gary.

Bill
Joined: 2 Jun 17
Posts: 38
Credit: 243954730
RAC: 252668

Gary-

Thanks for the response.  Yes, I probably should have mentioned the 1660Ti...I didn't think it was relevant, but perhaps it is.  I don't have much time to respond in too much detail right now, but reading through your post (a little quickly, admittedly), here's some more background.

I had in the past crunched E@H with the Vega 8 graphics.  About...I don't know, maybe six months ago or so, I realized I had been getting frequent BSODs with this computer.  I couldn't replicate it.  I noticed in posts on this forum that a new GPU application for E@H was available, and the discussion was that these applications needed more than 2 GB of GPU memory.  Since the Vega 8 uses 2 GB of the 16 GB my computer has, I assumed that this was the cause of the BSOD.  At that time, I disabled AMD GPU tasks in the settings, and since then, I have not had a BSOD.  I'm sure there are some flaws in how I troubleshot the problem, but my computer has been working fine since.

Fast forward to the past few weeks, I've noticed these AMD GPU tasks being aborted.  I hadn't noticed it before, but then again, I am not babysitting BOINC as much as I used to, so perhaps I had just missed this detail for a while.  That's how we got to where we are.

You mentioned the memory for the APU being greater than 2 GB.  You are correct, I have seen that before.  I don't know how to explain that bit (perhaps someone else can).  What I do know is in my bios settings I have dedicated APU memory set to 2 GB (I can't set that higher without problems), and Windows task manager says I have 2 GB dedicated memory, 7 GB shared, 9 GB total.  Regardless of how that is all divided out, it is all DDR4 memory from the system that is divided off, not some type of on-board GPU memory (which I think you understand).

So, where does that leave me?  I guess I have two problems.  First, it appears E@H AMD tasks are causing a BSOD.  I have not tried running AMD GPU tasks recently, so maybe this has gone away.  Assuming the tasks require more than 2GB of memory, then I don't know that there is much I can do here.  I could disable the specific applications that are causing the BSOD, but that means the Nvidia GPU would not be able to crunch those tasks.  If >2 GB is required for ALL E@H GPU tasks, then that means I can't crunch ANY GPU tasks without constant BSODs.  Ideally, it would be great if E@H could identify that my system should not be running these tasks.

The second problem, regardless of what is going on with the first problem, is that there is a setting in Einstein@Home that appears to disable AMD GPU tasks but, from what I can tell, doesn't.  Here is the output from one of the tasks:

Name: h1_0414.70_O2C02Cl4In0__O2MDFS2_Spotlight_414.85Hz_60_0
Workunit ID: 507331627
Created: 3 Dec 2020 22:25:40 UTC
Sent: 4 Dec 2020 12:03:50 UTC
Report deadline: 10 Dec 2020 22:25:40 UTC
Received: 4 Dec 2020 13:04:33 UTC
Server state: Over
Outcome: Computation error
Client state: Aborted by user
Exit status: 201 (0x000000C9) EXIT_MISSING_COPROC
Computer: 12767141
Run time (sec): 0.00
CPU time (sec): 0.00
Peak working set size (MB): 0
Peak swap size (MB): 0
Peak disk usage (MB): 0
Validation state: Invalid
Granted credit: 0
Application: Gravitational Wave search O2 Multi-Directional GPU v2.07 (GW-opencl-ati) windows_x86_64

Stderr output
<core_client_version>7.16.11</core_client_version>

 

For the record, I did not manually abort any of these tasks.  Also to clarify, I have rebooted this computer several times since changing any settings (which I haven't done for months), so I doubt any of these errors are caused by making sudden changes.

I have not tried editing my cc_config file.  I'll try that a little later.  I'm hesitant to make those changes, though.  Since there is a way of disabling the crunching of certain tasks via the system settings, and this is a little more intuitive and user-friendly, I would think that function should work.  If it doesn't work, then I think we need to figure out how to fix it.

I hope that helps.  I may not have answered all of your questions, so please let me know if you want additional information.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410201186
RAC: 35074749

Bill wrote:
... in my bios settings I have dedicated APU memory set to 2 GB (I can't set that higher without problems), and Windows task manager says I have 2 GB dedicated memory, 7 GB shared, 9 GB total.

OK, thanks for that.  If Windows sees the 2GB dedicated and 7GB shared, BOINC must be looking at the wrong thing if the 'dedicated' value is something you can adjust to be a 'limit'.  The last time I used any version of Windows was XP in 2006 so I have no experience with current Windows behaviour.

Perhaps you should report your observations over on the BOINC boards to see if someone (like Richard Haselgrove for example) might take a look in the BOINC code to see how BOINC comes up with its 'available memory' figure.  If BOINC reported a 2GB limit rather than the much larger figure, then the Einstein scheduler (hopefully) would know not to supply tasks that need more memory.

I have a similar APU style processor (Athlon 3000G) in a few hosts.  They don't use the internal GPU but rather a discrete AMD GPU (eg. HD7850, RX 460, RX 570) and I think I must have disabled the internal GPU in the BIOS/UEFI when I set them up - I don't remember - but I've never had any issue with BOINC trying to use it.  The CPU is identified as "AuthenticAMD AMD Athlon 3000G with Radeon Vega Graphics [Family 23 Model 24 Stepping 1]" and there is only a single coprocessor showing, eg. "AMD AMD Radeon HD 7800 Series (1944MB)" for one with a 2GB HD7850 GPU.  Are you able to disable the internal GPU and would that solve the problem for you?

Bill wrote:
So, where does that leave me?  I guess I have two problems.  First, it appears E@H AMD tasks are causing a BSOD.  I have not tried running AMD GPU tasks recently, so maybe this has gone away.  Assuming the tasks require more than 2GB of memory, then I don't know that there is much I can do here.  I could disable the specific applications that are causing the BSOD, but that means the Nvidia GPU would not be able to crunch those tasks.  If >2 GB is required for ALL E@H GPU tasks, then that means I can't crunch ANY GPU tasks without constant BSODs.  Ideally, it would be great if E@H could identify that my system should not be running these tasks.

In puzzling further about the strange combination of error messages for these AMD tasks ("Client state: Aborted by user" and "EXIT_MISSING_COPROC") here's something more for you to think about.  I've been re-reading your original report and I'm now struck by this bit, "... I discovered today that I have had a lot of tasks recently aborted."  I've just realised that "today" was 8th Dec.  The tasks in question had earlier dates than that.  I had assumed that you had seen these tasks actually on your computer - ie. BOINC Manager tasks tab, and then you either aborted them or they tried to run and immediately failed.  However I'm now thinking that "discovered" means looking at the website list only.  I'm thinking that perhaps you never ever had them sent to you at all.  Direct question - did you ever see these listed on the tasks tab of BOINC Manager?  Or was it just a list of errors you saw on the website?

A little while ago there was a flurry of activity to implement some sort of 'check' at the project end to prevent any 'large memory' tasks being sent to unsuitable GPUs.  The 'mechanism' to do this was not explained.  I'm just wondering if that mechanism just happens to cause a 'MISSING_COPROC' status message and then the error condition gets set to a description that already exists like, "Aborted by user", even though (as you indicated) they weren't actually aborted by you.  If you only ever 'discovered' these later by looking at the website then perhaps you never did actually receive these tasks.  I know this is highly speculative but it would be interesting to know if you've ever seen these tasks listed on your machine.

Bill wrote:
I have not tried editing my cc_config file.  I'll try that a little later.  I'm hesitant to make those changes, though.

If you can't or don't want to disable the internal GPU, the entry in cc_config.xml is not that hard to do - something like

<exclude_gpu>
    <url>einstein.phys.uwm.edu</url>
    <type>ATI</type>
    <app>einstein_O2MDF</app>
</exclude_gpu>

which goes in the options section of the file.  You could go with <device_num> instead but you would need to be sure what value BOINC was using for it.  The 'ATI' type should catch it unambiguously (provided 'ATI' hasn't been changed into a more modern word, or a different one for internal GPUs :-) ).  You wouldn't need to use this if you can disable the internal GPU in the motherboard firmware.
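
A possible variation on the block above, in case the aim is to keep every Einstein app off the internal GPU rather than just the one search: as far as I can tell from the BOINC client configuration documentation, leaving out <app> makes the exclusion apply to all of the project's apps.  This is only a sketch under that assumption.  The URL is copied unchanged from the example above, and the comments are explanatory only.

```xml
<options>
  <exclude_gpu>
    <!-- URL copied from the example above; it should match the attached project's URL. -->
    <url>einstein.phys.uwm.edu</url>
    <!-- ATI is the type name BOINC lists for AMD GPUs. -->
    <type>ATI</type>
    <!-- No app element here: the exclusion is meant to cover every Einstein app. -->
  </exclude_gpu>
</options>
```

With <type> given and no <device_num>, the 1660Ti should be left alone for Einstein, and other projects such as MilkyWay@Home should still be able to use the internal GPU.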

Cheers,
Gary.
