Brp7/MeerKat "not enough gpu memory"

Tom M
Joined: 2 Feb 06
Posts: 6460
Credit: 9583794543
RAC: 6953779
Topic 230190

I "finally" got this system to recognize the internal GPU under BOINC (repository version).

When I try to run BRP7/MeerKat tasks, it says "task paused, not enough GPU memory".

I have played with the "UMA Buffer" setting, which seems to be the only BIOS-level way to try to allocate more GPU memory.

No luck. I even tried running the task with no other CPU tasks present. No luck.

Tom M

 

 

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)  I want some more patience. RIGHT NOW!

GWGeorge007
Joined: 8 Jan 18
Posts: 3065
Credit: 4971337686
RAC: 1427864

Tom M wrote:

I "finally" got this system to recognize the internal GPU under BOINC (repository version).

When I try to run BRP7/MeerKat tasks, it says "task paused, not enough GPU memory".

I have played with the "UMA Buffer" setting, which seems to be the only BIOS-level way to try to allocate more GPU memory.

No luck. I even tried running the task with no other CPU tasks present. No luck.

Tom M

...and your question is...???

...and you have "played with the UMA Buffer", meaning what...??? What did you do to play with it?

You have several errors in your "Stderr output" file, including one with a status code:


Level 1: $Id$
	Status code 14: XLAL function call failed


 

George

Proud member of the Old Farts Association

mountkidd
Joined: 14 Jun 12
Posts: 176
Credit: 12600662555
RAC: 8030104

The out-of-memory message is misleading; the real error is:

Couldn't create OpenCL command queue (error: -6)!

And this has everything to do with a parameter in boinc-client.service called ProtectSystem. This parameter was changed or added around BOINC release 7.18, and it needs to be turned off/disabled. There is a thread with more info about this issue, and near the end of that thread, the solution.
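For reference, the usual systemd way to switch a unit setting off without editing the packaged file is a drop-in override: run `sudo systemctl edit boinc-client`, add a `[Service]` section containing `ProtectSystem=no`, save, then `sudo systemctl restart boinc-client`. This has not been checked against the fix in the other thread, so treat it as a sketch. The hypothetical Python helper below only illustrates how the setting resolves: the last `ProtectSystem=` assignment in a `[Service]` section wins, so an override read after the packaged unit replaces its value. The packaged value `strict` in the example is made up; `systemctl cat boinc-client` shows what your unit and its drop-ins actually contain.

```python
from typing import Optional

# Hypothetical helper, not part of BOINC or systemd.
# Simplified parser: ignores line continuations and other unit-file subtleties.

def effective_protect_system(*unit_texts: str) -> Optional[str]:
    """Return the last ProtectSystem= value set in a [Service] section, or None.

    Pass the packaged unit first and drop-in overrides after it, mirroring the
    order systemd applies them (a later assignment replaces an earlier one).
    """
    value = None
    for text in unit_texts:
        section = None
        for raw in text.splitlines():
            line = raw.strip()
            if not line or line.startswith(("#", ";")):
                continue  # skip blank lines and comments
            if line.startswith("[") and line.endswith("]"):
                section = line[1:-1]
                continue
            if section == "Service" and "=" in line:
                key, _, val = line.partition("=")
                if key.strip() == "ProtectSystem":
                    value = val.strip()
    return value


def protect_system_disabled(value: Optional[str]) -> bool:
    """Unset (or empty, which resets to the default) means off; so do boolean false spellings."""
    return value is None or value.lower() in {"", "no", "false", "off", "0"}
```

Usage, with a made-up packaged value:

```python
packaged = "[Unit]\nDescription=BOINC\n\n[Service]\nProtectSystem=strict\n"
override = "[Service]\nProtectSystem=no\n"
print(effective_protect_system(packaged))            # strict
print(effective_protect_system(packaged, override))  # no
```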

Also, BOINC 7.18 isn't intended for Linux and should not be in the repo; it was an Android release. Please use 7.20.5 from the Costamagna PPA.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47036922642
RAC: 65113102

This is your actual error from the only BRP7 task you ran:

[12:53:01][5866][ERROR] Couldn't create OpenCL command queue (error: -6)

The verbiage about not having enough memory is, I'm certain, a byproduct of this earlier failure. It could be a driver issue, or it could be that the application just isn't coded in a way that the APU supports.
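The point here is that the first `[ERROR]` line in the stderr output is the one that matters, and later messages are fallout. Below is a small hypothetical Python helper that pulls out that first line and, assuming the number in parentheses is the raw OpenCL status code, names it from a short, non-exhaustive table of standard OpenCL error values. In the OpenCL headers, -6 is `CL_OUT_OF_HOST_MEMORY`.

```python
import re
from typing import Optional

# A few standard OpenCL status codes (values from the OpenCL headers); not exhaustive.
OPENCL_ERRORS = {
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Matches the "(error: N)" suffix used in the BRP7 stderr lines quoted in this thread.
ERROR_CODE = re.compile(r"\(error:\s*(-?\d+)\)")


def first_error(stderr_text: str) -> Optional[str]:
    """Return the first line tagged [ERROR], i.e. the earliest failure in the log."""
    for line in stderr_text.splitlines():
        if "[ERROR]" in line:
            return line.strip()
    return None


def explain(line: str) -> str:
    """Name the OpenCL code in an error line, if it is one of the codes in the table."""
    match = ERROR_CODE.search(line)
    if not match:
        return "no numeric error code found"
    code = int(match.group(1))
    return OPENCL_ERRORS.get(code, f"unknown code {code}")
```

Usage with the line quoted above:

```python
line = first_error("[12:53:01][5866][ERROR] Couldn't create OpenCL command queue (error: -6)")
print(explain(line))  # CL_OUT_OF_HOST_MEMORY
```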

I would recommend not using the APU for this.

_________________________________________________________________________

HWpecker
Joined: 27 Jan 22
Posts: 25
Credit: 77748827
RAC: 0

---repost from other thread---

Been running only Meerkat x5 units for a few days now.
Once in a while there's a lone unit with a run time of around 800-900 s, a bit off from the rest; those were validated.
I didn't see any of those on screen myself.

Today I had a compute error without me tinkering:
https://einsteinathome.org/workunit/766701870

I'm running the BRP7 Meerkat test app, not Petri's app.
It ends with error 700 (and 1008); seems to be a CUDA illegal memory access?

[09:54:00][31629][ERROR] Error during CUDA device->host time series length transfer (error: 700)
[09:54:00][31629][ERROR] Demodulation failed (error: 1008)!
09:54:00 (31629): called boinc_finish(1008)

Don't know what is really happening there; let's post it in a Meerkat thread.

---/repost from other thread

 

I didn't run into a VRAM shortage, but error 700 during the CUDA device->host transfer seems to be an illegal memory access.
Maybe the memory errors are related; I am running the Meerkat BRP7 test app at the moment.
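CUDA runtime error 700 is `cudaErrorIllegalAddress`, which fits the illegal-memory-access reading. The 1008 that follows does not look like a CUDA runtime code and is presumably the application's own "Demodulation failed" status. Below is a hypothetical Python helper that lists the error codes in the order they were logged, labelling CUDA ones from a small, non-exhaustive table of `cudaError_t` values; anything else is flagged as possibly app-specific.

```python
import re
from typing import List, Tuple

# A handful of CUDA runtime error codes (cudaError_t values); not exhaustive.
CUDA_ERRORS = {
    2: "cudaErrorMemoryAllocation",
    700: "cudaErrorIllegalAddress",
    702: "cudaErrorLaunchTimeout",
    719: "cudaErrorLaunchFailure",
}

# An [ERROR] line ending in "(error: N)", as in the BRP7 stderr quoted above.
ERROR_LINE = re.compile(r"\[ERROR\].*\(error:\s*(-?\d+)\)")


def error_chain(log: str) -> List[Tuple[int, str]]:
    """List (code, label) for every [ERROR] line, in the order they were logged."""
    chain = []
    for line in log.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            code = int(match.group(1))
            label = CUDA_ERRORS.get(code, "not a known CUDA runtime code (app-specific?)")
            chain.append((code, label))
    return chain
```

Fed the three log lines above, the first entry is the CUDA failure and the second is the application's follow-on status; the `boinc_finish` line has no `[ERROR]` tag and is skipped.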

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47036922642
RAC: 65113102

Since it's only one, I wouldn't worry about it.

Maybe it was caused by a random memory error? Any overclocks? Maybe back it off a tad if so.

But again, unless it becomes a persistent thing, I wouldn't worry about it.

_________________________________________________________________________

HWpecker
Joined: 27 Jan 22
Posts: 25
Credit: 77748827
RAC: 0

There wasn't much of an overclock on it, but it does clock itself up during the day.

This card throttles in steps of 15 MHz, visible in the NVsettings GUI.
After I set the clock, it can go up or down a bit (± 15 MHz), but usually stays at 0 throttle.

I was watching it regularly yesterday: it went up 30 MHz and occasionally even 45 MHz, which I think was temperature throttling. Invalids have recently increased to around 4% (up to 6%?), with 60-70% of tasks checked so far. I've now clocked it down by 45 MHz.
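Below is a hypothetical Python sketch for logging the same readings without watching the GUI. It polls `nvidia-smi` for the SM clock, GPU temperature and fan speed, and converts a clock reading into 15 MHz steps relative to the clock you set. The 15 MHz step size is what this card shows; other cards may differ. `sample()` needs the NVIDIA driver's `nvidia-smi` on the PATH; the parsing and step functions do not.

```python
import subprocess
from typing import List, Optional, Tuple

# nvidia-smi query: SM clock (MHz), GPU temperature (C), fan speed (%); "nounits" drops suffixes.
QUERY = ["nvidia-smi", "--query-gpu=clocks.sm,temperature.gpu,fan.speed",
         "--format=csv,noheader,nounits"]

Reading = Tuple[Optional[int], ...]


def parse_row(row: str) -> Reading:
    """Turn '1905, 55, 70' into (1905, 55, 70); fields the driver can't read become None."""
    values = []
    for field in row.split(","):
        field = field.strip()
        values.append(int(field) if field.isdigit() else None)
    return tuple(values)


def clock_steps(set_mhz: int, observed_mhz: int, step_mhz: int = 15) -> int:
    """How many clock bins the card has moved away from the clock you set."""
    return round((observed_mhz - set_mhz) / step_mhz)


def sample() -> List[Reading]:
    """Take one (clock, temperature, fan) reading per GPU by running nvidia-smi."""
    text = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_row(r) for r in text.splitlines() if r.strip()]
```

Calling `sample()` every minute or so and printing `clock_steps(set_clock, clock)` next to the temperature would show whether the +30/+45 MHz jumps line up with temperature changes.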

Then I changed the fan speeds and saw the card throttle down when either warm or cold; the highest throttle is at around 55 °C. While I was at it, I figured why not set the fans to 70%, wait, then 80, 90 and 100% to check noise and temperatures.

When I hit 100% and waited for the spin-up, the fans seemingly overspun: the message in NVsettings became "Unsupported fanspeed", BOINC reported compute errors, and the fans kept running at maximum and would not respond when I tried to set them lower.
My compute errors on 16 Nov 2023 10:07:55 UTC were self induced.

Rebooting worked.

mikey
Joined: 22 Jan 05
Posts: 12693
Credit: 1839099161
RAC: 3720

HWpecker wrote:

---repost from other thread---

Been running only Meerkat x5 units for a few days now.
Once in a while there's a lone unit with a run time of around 800-900 s, a bit off from the rest; those were validated.
I didn't see any of those on screen myself.

Today I had a compute error without me tinkering:
https://einsteinathome.org/workunit/766701870

I'm running the BRP7 Meerkat test app, not Petri's app.
It ends with error 700 (and 1008); seems to be a CUDA illegal memory access?

[09:54:00][31629][ERROR] Error during CUDA device->host time series length transfer (error: 700)
[09:54:00][31629][ERROR] Demodulation failed (error: 1008)!
09:54:00 (31629): called boinc_finish(1008)

Don't know what is really happening there; let's post it in a Meerkat thread.

---/repost from other thread

 

I didn't run into a VRAM shortage, but error 700 during the CUDA device->host transfer seems to be an illegal memory access.
Maybe the memory errors are related; I am running the Meerkat BRP7 test app at the moment.

One person said the new tasks take 4.5 GB of RAM and won't run on 4 GB GPUs any more.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47036922642
RAC: 65113102

mikey wrote:

One person said the new tasks take 4.5 GB of RAM and won't run on 4 GB GPUs any more.

Wrong app and outdated info.

HW is talking about the BRP7 app, which uses a small amount of VRAM (~1 GB).

The app that used "4.5 GB" was the O3AS app, and that was the old version of the app. Bernd released a new application and tasks that use about 2-2.5 GB now.
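Some rough arithmetic with the figures quoted here: at ~1 GB per BRP7 task, running five at once (if that is what "x5" means above) needs on the order of 5 GB of VRAM, while a 4 GB card fits one new O3AS task at 2.5 GB, or two at the 2 GB end. The hypothetical helper below just does that division. The per-task numbers are approximate, and `reserve_gb` (memory left over for the desktop and driver) is an illustrative knob, not a BOINC setting.

```python
import math

# Approximate per-task VRAM figures quoted in this thread (GB); real usage varies.
APPROX_VRAM_GB = {
    "BRP7": 1.0,
    "O3AS (old)": 4.5,
    "O3AS (new, upper estimate)": 2.5,
}


def max_concurrent(vram_gb: float, per_task_gb: float, reserve_gb: float = 0.0) -> int:
    """Whole tasks that fit in VRAM after setting aside reserve_gb (hypothetical parameter)."""
    usable = max(vram_gb - reserve_gb, 0.0)
    return math.floor(usable / per_task_gb)


for app, need in APPROX_VRAM_GB.items():
    print(f"{app}: {max_concurrent(4.0, need)} task(s) on a 4 GB card")
```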

_________________________________________________________________________

mikey
Joined: 22 Jan 05
Posts: 12693
Credit: 1839099161
RAC: 3720

Ian&Steve C. wrote:

mikey wrote:

One person said the new tasks take 4.5 GB of RAM and won't run on 4 GB GPUs any more.

Wrong app and outdated info.

HW is talking about the BRP7 app, which uses a small amount of VRAM (~1 GB).

The app that used "4.5 GB" was the O3AS app, and that was the old version of the app. Bernd released a new application and tasks that use about 2-2.5 GB now.

Thank you very much. I knew Bernd had fixed the new ones but forgot about the rest.
