Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47235802642
RAC: 65370990


Peter Hucker wrote:

Ian&Steve C. wrote:

and the next request:

2020-05-06 15:14:39.1243 [PID=29295]    [version] GPU RAM calculated: min: 1792 MB, use: 1557 MB, WU#452169093 CPU: 1557 MB

 

cross-referencing with actual GPU mem use in nvidia-smi, it looks like that "GPU RAM calculated: min:" value is about how much memory the WU will use on the GPU (I see about 1GB and 1.8GB used on the GPUs running GW tasks on that system). it doesn't look like it's referencing what the card actually has available at all though. at least not for the nvidia app.

https://einsteinathome.org/host/12803486/log

So it's just some kind of internal calculation of what the WU will request from the GPU?  And it's not going to prevent you getting tasks when they're too big?  Then how come it didn't send any GW to TBar's HD7750?  Something made it realise that card couldn't run them.

 

like I mentioned in my post, it doesn't look like it's doing the memory check for nvidia cards. I don't see that line under the check for cuda devices in mine or TBar's logs.
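The cross-reference against nvidia-smi can be scripted. A small sketch (the helper function and sample values are hypothetical; the nvidia-smi query flags are the standard ones):

```python
import csv
import io

def gpu_mem_used_mib(csv_text):
    """Parse the output of:
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    which prints one integer per GPU, in MiB."""
    return [int(row[0]) for row in csv.reader(io.StringIO(csv_text))]

# Hypothetical two-GPU host running GW tasks (values in MiB):
sample = "1024\n1843\n"
print(gpu_mem_used_mib(sample))  # -> [1024, 1843]
```

On a live system the text would come from running nvidia-smi itself (e.g. via `subprocess.check_output`), which of course needs an NVIDIA driver present.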

_________________________________________________________________________

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519371204
RAC: 15292

Ian&Steve C. wrote:

like I mentioned in my post, it doesn't look like it's doing the memory check for nvidia cards. I don't see that line under the check for cuda devices in mine or TBar's logs.

Maybe something else prevents his HD7750 getting them.  Compute capabilities etc.  I'm not sure about GPUs, but BOINC at startup lists all the instruction sets (SSE2 etc.) your CPU can do.

Oh well, unless they can fix it, I'm on Gamma and Milkyway only.

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47235802642
RAC: 65370990

it's also possible that it only spits out that error if it sees a violation.

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47235802642
RAC: 65370990

I made my test bench, with a 4GB GTX 1650, report to BOINC that it only has 1GB of memory. This did not stop the Science App from using more and loading 1.8GB of data onto the GPU, as I said would happen, since what BOINC and the Science App are doing are totally separate from each other.

I'll wait until the next scheduler request for a new task to see if it will send me another one or not, i.e. whether it's doing the check for nvidia cards. The only way I can think the Scheduler would know what the GPU memory is, is by passing along what BOINC detected. so we'll see.

_________________________________________________________________________

TBar
Joined: 3 Apr 20
Posts: 24
Credit: 891961726
RAC: 0


Peter Hucker wrote:

Ian&Steve C. wrote:

like I mentioned in my post, it doesn't look like it's doing the memory check for nvidia cards. I don't see that line under the check for cuda devices in mine or TBar's logs.

Maybe something else prevents his HD7750 getting them.  Compute capabilities etc.  I'm not sure about GPUs, but BOINC at startup lists all the instruction sets (SSE2 etc.) your CPU can do.

Oh well, unless they can fix it, I'm on Gamma and Milkyway only.

It looks pretty straightforward to me. The Server calculates the amount of VRAM needed for the WU it's considering sending, compares it to what's available on the GPU, and doesn't send the task if it won't fit.

2020-05-06 14:17:04.7727 [PID=5325 ]    [version] GPU RAM calculated: min: 1792 MB, use: 1589 MB, WU#452363325 CPU: 1589 MB
2020-05-06 14:17:04.7728 [PID=5325 ]    [version] OpenCL GPU RAM required min: 1879048192.000000, supplied: 1073741824
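The byte counts in those log lines are just the MB figures converted. A quick sanity check (plain arithmetic, not project code):

```python
MIB = 1024 * 1024

required_min = 1792 * MIB   # "min: 1792 MB" from the scheduler log
supplied     = 1073741824   # "supplied: 1073741824" (bytes)

print(required_min)              # -> 1879048192, matching the log line
print(supplied // MIB)           # -> 1024, i.e. the card reports 1 GB
print(supplied >= required_min)  # -> False, so no task is sent
```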

The CPU on that machine is a 6th-gen Intel; it has up to AVX2.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47235802642
RAC: 65370990

my test bench, which is only reporting 1GB of ram, still gets sent work >1GB.

https://einsteinathome.org/host/12830576

looks like it's not doing the ram comparison for nvidia cards. must be only for ATI/AMD cards.

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47235802642
RAC: 65370990

update:

I was finally able to get the scheduler to deny me a task. in my previous attempt to trick the scheduler, I edited the available ram metrics (this is what shows on your host page). it wasn't until I reduced the global_mem_size in the coproc_info.xml file that the scheduler said I don't have enough memory. the Science App still runs the task fine if you already have it, though, since what BOINC says doesn't matter to the Science App
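For reference, the field being edited lives in the OpenCL device section of BOINC's coproc_info.xml. A trimmed sketch (element names recalled from the BOINC client and may differ slightly; the byte value is the one from the log below):

```xml
<coproc_opencl>
   <name>GeForce GTX 1650</name>
   <!-- bytes; this is the value the scheduler compares its per-WU estimate against -->
   <global_mem_size>873741824</global_mem_size>
</coproc_opencl>
```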

2020-05-06 17:27:32.2026 [PID=16673]    [version] GPU RAM calculated: min: 1792 MB, use: 1608 MB, WU#453439406 CPU: 1608 MB
2020-05-06 17:27:32.2026 [PID=16673]    [version] OpenCL GPU RAM required min: 1879048192.000000, supplied: 873741824

 

so it looks like the check happens on nvidia cards too, but it doesn't get flagged in the log unless you actually violate the limit. and it's checking the global_mem_size as detected by BOINC (not the available size) when deciding whether to send you a task.

doesn't explain why 3GB cards are being sent tasks that are too large though, unless the scheduler is underestimating the amount of GPU ram that the WU needs or something. I'll have to watch it a little more closely to see what happens.
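The underestimation theory comes down to arithmetic. A toy sketch of the comparison the scheduler appears to make, using figures quoted in this thread (the 3200 MB actual-use number is the observation; the rest is illustrative):

```python
MIB = 1024 * 1024

global_mem   = 3017 * MIB  # what a 3GB card reports as global_mem_size
estimate_min = 1792 * MIB  # the scheduler's per-WU minimum, per the logs
actual_use   = 3200 * MIB  # observed footprint of an oversized task

print(global_mem >= estimate_min)  # -> True: the check passes, task is sent
print(actual_use <= global_mem)    # -> False: the task can't actually fit
```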

_________________________________________________________________________

TBar
Joined: 3 Apr 20
Posts: 24
Credit: 891961726
RAC: 0

Oh, the Server is checking the NV cards, and when it doesn't find a problem it gives a "plan class ok" instead of a comparison. You don't see that OK in my HD7750 log; I do see it on my NV hosts though. The question is why Your BOINC is giving that false 1650 ram reading. Did you hack something? Are you still running that Highly Edited version of BOINC? You do realize any time you post some problem with BOINC, people are going to ask which BOINC you are running and whether you changed it.

Considering the highest vram estimate I've seen for a GW task is under 2 GB, I'd say it's being underestimated by the tool.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47235802642
RAC: 65370990

TBar wrote:

Oh, the Server is checking the NV cards, and when it doesn't find a problem it gives a "plan class ok" instead of a comparison. You don't see that OK in my HD7750 log; I do see it on my NV hosts though. The question is why Your BOINC is giving that false 1650 ram reading. Did you hack something? Are you still running that Highly Edited version of BOINC? You do realize any time you post some problem with BOINC, people are going to ask which BOINC you are running and whether you changed it.

Considering the highest vram estimate I've seen for a GW task is under 2 GB, I'd say it's being underestimated by the tool.

reading comprehension usually helps in these situations ;)

It should have been clear that I purposefully edited the coproc_info.xml file (using information that YOU posted in the past, no less) as a test to try to trigger the issue (since I do not have any GPUs on hand with less than 3GB of VRAM) and find out EXACTLY what the scheduler is looking at. Which I did: the scheduler is looking at the global_mem_size parameter under the opencl section.

 

the first issue I see is that it's checking global mem size instead of available mem size, since the cards running a monitor or desktop environment will have some of their GPU memory unavailable.

the second issue is that even a task taking up 3200+MB of GPU ram should trigger this conflict and not get sent, since these 3GB cards only show a global mem size of about 3017MB. which is why I think the scheduler might be underestimating the WU size. I need to catch one in the act so I can check the log.
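The global-vs-available distinction is easy to demonstrate with toy numbers (the 300 MB held by the desktop and the 2900 MB task size are hypothetical):

```python
MIB = 1024 * 1024

global_mem  = 3017 * MIB  # global_mem_size reported by a 3GB card
desktop_use = 300 * MIB   # hypothetical memory held by the desktop/monitor
available   = global_mem - desktop_use

wu_need = 2900 * MIB      # hypothetical task just under the global size

print(global_mem >= wu_need)  # -> True: a check against global size passes
print(available >= wu_need)   # -> False: the card can't actually hold it
```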

_________________________________________________________________________

TBar
Joined: 3 Apr 20
Posts: 24
Credit: 891961726
RAC: 0

The first problem I see is that you are trying to troubleshoot something while running Highly Edited versions of everything. That doesn't fly with anyone knowledgeable about such things. If you want to troubleshoot the Project, then use what the project is designed to use. That's what I'm doing, and I made the call just by looking at the vram estimates of a number of tasks the Server was trying to send. None of the estimates were over 2 GB, when I know they should be higher.
