app version refers to missing GPU type

Eugene Stemple
Joined: 9 Feb 11
Posts: 67
Credit: 388614878
RAC: 440937
Topic 229977

It started a couple of days ago...  on routine work fetches I'm getting these lines in the event log:

20-Aug-2023 16:44:06 [Einstein@Home] Sending scheduler request: To fetch work.
20-Aug-2023 16:44:06 [Einstein@Home] Requesting new tasks for CPU and NVIDIA GPU
20-Aug-2023 16:44:15 [---] app version refers to missing GPU type ibo,GBT,long) is not available for your type of computer.
20-Aug-2023 16:44:15 [Einstein@Home] Scheduler request completed: got 2 new tasks
20-Aug-2023 16:44:15 [Einstein@Home] App version uses non-existent ibo,GBT,long) is not available for your type of computer. GPU
 

... bold text added for emphasis ...

It appears that (Arecibo,GBT,long) got mangled somehow and came through as just ...ibo,GBT,long), with probably more text missing, but it's not clear to me whether it's just me or an E@H server thing.  I can't find out which app it is referring to.  If anybody else is seeing something like this, then it's not just me... OTOH, if it IS JUST ME... is a project reset the best way to recover?  I did think of deleting "the app" and letting BOINC reload it.  The only reference to (Arecibo,GBT,long) that I can find is in the client_state.xml file, where it is shown as the "user friendly name" for the BRP4G app.  Would it make sense to delete that app and see if BOINC recovers?  Or just do a project reset to cover all bases?  Meanwhile, FGRP5 and O3AS work is continuing normally.

 

GWGeorge007
Joined: 8 Jan 18
Posts: 3117
Credit: 5010023412
RAC: 1627771

Eugene Stemple wrote:

It started a couple of days ago...  on routine work fetches I'm getting these lines in the event log: [...]

I'm not quite sure what has happened, but I'll try to offer some explanations as to what I think it could be.

First, I'd recommend that you upgrade your client from 7.14 to at least 7.18.  If I'm correct, the outdated client may have something to do with your missing GPU: it no longer recognizes your computer and therefore cannot see your GPU.  In addition, BOINC no longer uses HTTP to address your computer; it only uses HTTPS, which was not in operation at the time you installed 7.14.

Second, I'd recommend that you reduce your project selections to one and see whether it recognizes your GPU again.  Yes, you do have 6GB in your 1060 GPU, but you are using only 4GB of it.  Some of these projects take over 4GB now and therefore won't recognize your GPU.

Try using one at a time until you get one that works.

Whether or not this works, we'd appreciate it if you would get back to us and tell us the good or bad news.

George

Proud member of the Old Farts Association

mikey
Joined: 22 Jan 05
Posts: 12778
Credit: 1862686186
RAC: 1543316

Eugene Stemple wrote:

[...] OTOH, if it IS JUST ME... is a project reset the best way to recover? [...]

A reset will delete every Einstein task you have on your PC and force you to get all new ones.

Keith Myers
Joined: 11 Feb 11
Posts: 5020
Credit: 18924062419
RAC: 6532085

Quote:
Yes, you do have 6GB in your 1060 GPU, but you are using only 4GB of it.  Some of these projects take over 4GB now and therefore won't recognize your GPU.

This is incorrect.  Displaying only 4GB of memory for the 1060 GPU is just a flaw in older BOINC versions.  It was remedied in later versions, which use 64-bit calls to probe memory.  The OP really should upgrade their client.

The GPU applications will use ALL of the card's installed memory, regardless of what BOINC reports, if the application is well designed.
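To illustrate why an older client can report only about 4GB on a 6GB card, here is a minimal Python sketch. It is not BOINC's actual code; it only shows what happens when a memory size is squeezed into a 32-bit unsigned result. Whether an old probe saturates or wraps is an assumption here, so both hypothetical variants are shown; the saturating one is the one consistent with the roughly 4GB figure mentioned in this thread.

```python
GIB = 1024 ** 3
U32_MAX = 2**32 - 1  # largest value a 32-bit unsigned result can hold

def probe_32bit_saturating(total_bytes: int) -> int:
    """Hypothetical 32-bit probe that clamps oversize values to U32_MAX."""
    return min(total_bytes, U32_MAX)

def probe_32bit_wrapping(total_bytes: int) -> int:
    """Hypothetical 32-bit probe that silently drops the high bits."""
    return total_bytes & U32_MAX

def probe_64bit(total_bytes: int) -> int:
    """Keeps the low 64 bits, i.e. the full value for any realistic card."""
    return total_bytes & (2**64 - 1)

card = 6 * GIB  # a 6GB GTX 1060
for name, probe in [("32-bit saturating", probe_32bit_saturating),
                    ("32-bit wrapping", probe_32bit_wrapping),
                    ("64-bit", probe_64bit)]:
    print(f"{name:18s} {probe(card) / GIB:.2f} GiB")
```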

 

Eugene Stemple
Joined: 9 Feb 11
Posts: 67
Credit: 388614878
RAC: 440937

See below for some more recent diagnostic searching...

But first, responding to some of the suggestions/issues in the responses.

[gwgeorge & keith] suggest upgrading the client (to something later than my 7.14.2).  I am intentionally holding at the 7.14.2 version for two reasons: (1) using "project_max_concurrent" in the app_config.xml file fails catastrophically in later versions (see other threads about work units being downloaded endlessly when that parameter is used); (2) the later versions do NOT have "shutdown connected client" and "exit BOINC manager" in the FILE pull-down menu.  I find both of those functions very useful for shutting down and resuming BOINC gracefully in my setup, which has two instances of boinc and boincmgr running different projects.

[gwgeorge] The https: configuration is set up in the project global_prefs.xml file and, as far as I know, is not dependent on the client version.  And anyway, that part of the server link is working properly.  And, to clarify, E@H is not failing to detect the GPU; it is running O3AS (OpenCL) tasks normally.

[mikey]  Yes, I know all the bad things a project "reset" would do.  I would do an NNT and drain the cache before going down that path.  But if nothing else helps then that is always an option.

[keith]  Following up on your 4GB reporting limit in older clients...  I'm finding all kinds of <gpu_ram> parameters reported in the client_state.xml file.  7.864G for the FGRPB1G app down to 2.004G for the O3MDF app.  And in <coproc_cuda> parameters <available_ram> is 4.167G while <coproc_opencl> shows <global_mem_size> as 6.359G.  Never looked at that stuff before and I have no idea where those numbers come from.
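Rather than scrolling through client_state.xml by hand, a short Python script can list each app version with its plan class and <gpu_ram> value. This is a sketch that assumes <gpu_ram> appears as an optional child of <app_version> and holds a byte count; the embedded sample and its app names are made up for testing, so for real use pass in the text of your own client_state.xml.

```python
import xml.etree.ElementTree as ET

def gpu_ram_by_app_version(xml_text: str):
    """Return (app_name, plan_class, gpu_ram_bytes or None) for each <app_version>."""
    root = ET.fromstring(xml_text)
    rows = []
    for av in root.iter("app_version"):
        app = (av.findtext("app_name") or "?").strip()
        plan = (av.findtext("plan_class") or "").strip()
        ram = av.findtext("gpu_ram")  # None when the element is absent
        rows.append((app, plan, float(ram) if ram is not None else None))
    return rows

# Hypothetical sample; real use: open("client_state.xml").read()
SAMPLE = """<client_state>
  <app_version><app_name>appA</app_name><plan_class>gpu_plan</plan_class>
    <gpu_ram>2004000000.000000</gpu_ram></app_version>
  <app_version><app_name>appB</app_name><plan_class>cpu_plan</plan_class></app_version>
</client_state>"""

for app, plan, ram in gpu_ram_by_app_version(SAMPLE):
    shown = f"{ram / 1e9:.3f}G" if ram is not None else "-"
    print(f"{app:10s} {plan:10s} {shown}")
```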

Some additional file scanning gave some interesting (relevant?) information.  These two lines are from client_state.xml:

    <name>einsteinbinary_BRP4G</name>
    <user_friendly_name>Binary Radio Pulsar Search (Arecibo,GBT,long)</user_friendly_name>

And these lines are from sched_reply_einstein.phys.uwm.edu.xml:

<coproc>
        <type>ibo,GBT,long) is not available for your type of computer.</type>
     <count>647500445489094944987862487032585421213412207048870776038754297507971394031017770238076521610590413666285928503412500294475737461178726716849130298534562777624215719772160.000000</count>
    </coproc>

SORRY about that exceedingly long line.  It's what was in the sched_reply file !!!

Something is terribly wrong here.  As I understand it, a sched_request goes up to the server and it responds with a sched_reply.  There is nothing like ...(Arecibo,GBT,long)... in the sched_request, so where does that mangled reply come from?  And what's with that 171-digit "count" in the reply?  That's a lot of coprocessors...<grin>!
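Both symptoms, a <type> that is not a GPU type name and an absurd <count>, can be checked mechanically. The sketch below scans a BOINC XML file (client_state.xml or a sched_reply file) for <coproc> blocks inside <app_version> and flags suspicious ones. The set of expected type names and the count limit are assumptions chosen for illustration, not values taken from BOINC; the sample count is shortened from the 171-digit value above.

```python
import math
import xml.etree.ElementTree as ET

# Assumed set of GPU type names a client would recognise; adjust as needed.
KNOWN_TYPES = {"NVIDIA", "ATI", "intel_gpu", "apple_gpu"}
MAX_PLAUSIBLE_COUNT = 64  # arbitrary sanity limit on GPUs per task

def suspicious_coprocs(xml_text: str):
    """Yield (app_name, type, count, reasons) for <coproc> blocks that look corrupt."""
    root = ET.fromstring(xml_text)
    for av in root.iter("app_version"):
        app = (av.findtext("app_name") or "?").strip()
        for cp in av.findall("coproc"):
            ctype = (cp.findtext("type") or "").strip()
            try:
                count = float(cp.findtext("count") or "nan")
            except ValueError:
                count = math.nan  # unparseable count text
            reasons = []
            if ctype not in KNOWN_TYPES:
                reasons.append("unknown GPU type")
            if not (0 < count <= MAX_PLAUSIBLE_COUNT):  # also catches NaN
                reasons.append("implausible count")
            if reasons:
                yield app, ctype, count, reasons

SAMPLE = """<scheduler_reply>
  <app_version>
    <app_name>hsgamma_FGRP5</app_name>
    <coproc>
      <type>ibo,GBT,long) is not available for your type of computer.</type>
      <count>6.475e+170</count>
    </coproc>
  </app_version>
  <app_version>
    <app_name>some_gpu_app</app_name>
    <coproc><type>NVIDIA</type><count>1.000000</count></coproc>
  </app_version>
</scheduler_reply>"""

for app, ctype, count, reasons in suspicious_coprocs(SAMPLE):
    print(f"{app}: type={ctype!r} count={count:.3g} -> {', '.join(reasons)}")
```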

I've set NNT with the expectation that a project reset may be the best/only recovery.  This error condition does not occur on every work request.  As best as I can deduce, it is only when the server is trying to send me a BRP4G task, which does not happen on every work request.

Aren't computers fun...?

 

Glenn Carver
Joined: 25 Apr 18
Posts: 3
Credit: 36848038
RAC: 239

I was about to post exactly the same issue. I have been seeing this problem on a new machine I just attached to E@H which is failing on the hsgamma_FGRP5 task.

The problem is there's garbage in the <coproc> tag in the client_state.xml file for this app which shouldn't be there:

<app_version>

<app_name>hsgamma_FGRP5</app_name>
<version_num>108</version_num>
<platform>x86_64-pc-linux-gnu</platform>
<avg_ncpus>1.000000</avg_ncpus>
<flops>1000000000.000000</flops>
<plan_class>FGRPSSE</plan_class>
...........
<coproc>
<type>ibo,GBT,long) is not available for your type of computer.</type>

<count>647500445489094944987862487032585421213412207048870776038754297507971394031017770238076521610590413666285928503412500294475737461178726716849130298534562777624215719772160.000000</count>
</coproc> etc

Notice that hsgamma is defined in an <app_version> block.

The text in bold (the <type> value) exactly matches the error message I see in the system logs and in boincmgr.

If I look on another machine I have which is successfully running the hsgamma app, then I do NOT have the <coproc> block for this app_version.
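The comparison between a working and a failing machine can be automated. A minimal sketch, assuming both client_state.xml files have been read into strings: it keys each app version by app name, version number and plan class, and reports those present on both machines whose <coproc> types differ. The two samples are trimmed stand-ins for real files.

```python
import xml.etree.ElementTree as ET

def coproc_map(xml_text: str):
    """Map (app_name, version_num, plan_class) -> sorted list of coproc types."""
    result = {}
    for av in ET.fromstring(xml_text).iter("app_version"):
        key = tuple((av.findtext(tag) or "").strip()
                    for tag in ("app_name", "version_num", "plan_class"))
        result[key] = sorted((cp.findtext("type") or "").strip()
                             for cp in av.findall("coproc"))
    return result

def coproc_differences(good_xml: str, bad_xml: str):
    """Return {key: (good_types, bad_types)} for shared app versions that differ."""
    good, bad = coproc_map(good_xml), coproc_map(bad_xml)
    return {k: (good[k], bad[k]) for k in good.keys() & bad.keys() if good[k] != bad[k]}

GOOD = """<client_state><app_version>
  <app_name>hsgamma_FGRP5</app_name><version_num>108</version_num>
  <plan_class>FGRPSSE</plan_class>
</app_version></client_state>"""

BAD = """<client_state><app_version>
  <app_name>hsgamma_FGRP5</app_name><version_num>108</version_num>
  <plan_class>FGRPSSE</plan_class>
  <coproc><type>ibo,GBT,long) is not available for your type of computer.</type>
  <count>1</count></coproc>
</app_version></client_state>"""

for key, (on_good, on_bad) in coproc_differences(GOOD, BAD).items():
    print(key, "good:", on_good, "bad:", on_bad)
```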

So it looks as if the project is sending out a malformed app description, or, something very weird happened on my machine (but now I know it's not just me!)

I will run down the existing tasks and try a project reset to see if that cures it. However, that may not explain why it happened, which I am curious about since I work with CPDN.

 

 

 

Glenn Carver
Joined: 25 Apr 18
Posts: 3
Credit: 36848038
RAC: 239

Detaching/attaching the project didn't solve the problem.

I cleared running tasks. I then removed the project; checked that all instances of hsgamma had gone from client_state.xml; reattached and watched the log.

And again I see:

Tue 22 Aug 2023 10:06:31 BST | Einstein@Home | Master file download succeeded
Tue 22 Aug 2023 10:06:36 BST | Einstein@Home | Sending scheduler request: Project initialization.
Tue 22 Aug 2023 10:06:36 BST | Einstein@Home | Requesting new tasks for CPU and NVIDIA GPU
Tue 22 Aug 2023 10:06:39 BST |  | app version refers to missing GPU type ibo,GBT,long) is not available for your type of computer.
Tue 22 Aug 2023 10:06:39 BST |  | 
Tue 22 Aug 2023 10:06:39 BST | Einstein@Home | Scheduler request completed: got 2 new tasks
Tue 22 Aug 2023 10:06:39 BST | Einstein@Home | Project requested delay of 60 seconds
Tue 22 Aug 2023 10:06:39 BST | Einstein@Home | [error] App version uses non-existent ibo,GBT,long) is not available for your type of computer. GPU
Tue 22 Aug 2023 10:06:39 BST | Einstein@Home | 
Tue 22 Aug 2023 10:06:39 BST | Einstein@Home | [error] Missing coprocessor for task Ter5_1_dns_cfbf00052_segment_5_dms_200_40000_52_3500000_1; aborting
Tue 22 Aug 2023 10:06:39 BST | Einstein@Home | [error] Missing coprocessor for task LATeah1090F_1208.0_4653420_0.0_2; aborting

It appears E@H is responsible.

Maybe it's related to specific hardware?  In this case the machine is a 5900X with a 1650 card. I have another machine, a 12400 with a 1650, which doesn't have this issue.

Can someone at E@H investigate this? It appears to be adding a corrupt <coproc> XML block. I think I've done all I can here.
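To list which tasks were thrown away because of this error, the "Missing coprocessor for task ...; aborting" lines can be pulled out of the event log. A short Python sketch, assuming the message wording is exactly as in the log above (it may differ between client versions):

```python
import re

# Captures the task name between "for task " and "; aborting".
ABORT_RE = re.compile(r"Missing coprocessor for task ([^;\s]+); aborting")

def aborted_tasks(log_text: str) -> list:
    """Return task names the client aborted for lack of a coprocessor, in log order."""
    return ABORT_RE.findall(log_text)

LOG = """\
Tue 22 Aug 2023 10:06:39 BST | Einstein@Home | [error] Missing coprocessor for task Ter5_1_dns_cfbf00052_segment_5_dms_200_40000_52_3500000_1; aborting
Tue 22 Aug 2023 10:06:39 BST | Einstein@Home | [error] Missing coprocessor for task LATeah1090F_1208.0_4653420_0.0_2; aborting
"""

for task in aborted_tasks(LOG):
    print(task)
```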

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4331
Credit: 251617827
RAC: 36218

I made some changes to the server code last week, in particular to how coproc usage is communicated to the client, in order to get the Apple M GPU app version delivered and working. Likely something went wrong there.

1. Can you find out and report when you started getting this error, as precisely as possible?

2. Does this happen on Macs only?

3. Does this actually hinder work fetch, or is it just a strange error?
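For question 1, the start time can be read from the client's event log, which the client also writes to a file (typically stdoutdae.txt in the BOINC data directory). A sketch that returns the timestamp of the first line containing the error text; it handles only the two timestamp layouts quoted in this thread, ignores any time-zone suffix, and assumes English month abbreviations.

```python
import re
from datetime import datetime

ERROR_TEXT = "app version refers to missing GPU type"

# (regex, strptime format) for the two event-log layouts quoted in this thread.
LAYOUTS = [
    (re.compile(r"^(\d{1,2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) "), "%d-%b-%Y %H:%M:%S"),
    (re.compile(r"^[A-Za-z]{3} (\d{1,2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2}) "), "%d %b %Y %H:%M:%S"),
]

def first_occurrence(lines, needle=ERROR_TEXT):
    """Return the timestamp of the first dated line containing `needle`, or None."""
    for line in lines:
        if needle not in line:
            continue
        for pattern, fmt in LAYOUTS:
            match = pattern.match(line)
            if match:
                return datetime.strptime(match.group(1), fmt)
    return None

# Sample lines from this thread; for a real log, pass an open file instead,
# e.g. open("stdoutdae.txt", encoding="utf-8", errors="replace")
DASH_DATE_LOG = [
    "20-Aug-2023 16:44:06 [Einstein@Home] Requesting new tasks for CPU and NVIDIA GPU",
    "20-Aug-2023 16:44:15 [---] app version refers to missing GPU type ibo,GBT,long) is not available for your type of computer.",
]
WEEKDAY_DATE_LOG = [
    "Tue 22 Aug 2023 10:06:39 BST |  | app version refers to missing GPU type ibo,GBT,long) is not available for your type of computer.",
]
print(first_occurrence(DASH_DATE_LOG))
print(first_occurrence(WEEKDAY_DATE_LOG))
```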

Thanks a lot for reporting!

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4331
Credit: 251617827
RAC: 36218

I just found a flaw in the code (uninitialized variable) and fixed it. Does the problem persist?

BM

Glenn Carver
Joined: 25 Apr 18
Posts: 3
Credit: 36848038
RAC: 239

That's fixed it.  I reattached to E@H; none of the previous errors appear in the logs now, and hsgamma tasks are running normally.

Thanks for the quick response. Appreciated.

Wedge009
Joined: 5 Mar 05
Posts: 128
Credit: 17593662658
RAC: 6986705

I was going nuts, thinking there was a problem on my end. Two machines were suffering this problem over the past week or so. But things seem to be stabilising. Thanks for the information.

Soli Deo Gloria
