Problem with GPU-CPU tasks

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5876
Credit: 118565867613
RAC: 23329179


Quote:
Quote:
Quote:
...
What is the significance of:
Error allocating device memory: 268435456 bytes (error: -61)
[EDIT] I noticed in the output from above the following:
Max allocation limit: 264241152

It means you just need to stick another 4,194,304 bytes of memory on that GPU card and you'll be sweet :-).

Thanks for reporting this as you've saved me having a go at 4x unnecessarily. Hopefully, as the app matures, the Devs might be able to save some memory somewhere and get a task sufficiently under 0.5GB so that 2 could run on a 1GB card and 4 on a 2GB card.

After looking at the earlier results I decided to use a different WU profile to exclude the Gamma-ray pulsar search #3 v1.11 (FGRPopencl-ati) WUs.

By excluding these WUs I can successfully run 4 GPU jobs on the AMD card at a cool 54C. I live in a warm climate and summer is coming. My NVIDIA cards run at around 66C.


Yes, if you want to use a GPU efficiently, the FGRP3 GPU app doesn't do that yet. I had both of my hosts running FGRP3 3x for the purposes of accumulating performance data. Both eventually produced errors very similar to the one you reported (as listed above). It took about 24 hours for the first one to have a problem but the second one continued for quite a bit longer before also failing similarly.

My theory is that the memory requirement per task is rather variable and that (on average) 3 tasks will fit in 2GB. By chance, circumstances may arise where the 3 tasks have large enough requirements that the available memory is pretty much exhausted. If one task finishes and the next one to start needs a bit more memory .... you get the picture. I have no idea if this is an accurate assessment of the situation or not :-).
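The figures in the error message quoted earlier can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the two numbers exactly as they appear in the log. For what it's worth, -61 is the value of CL_INVALID_BUFFER_SIZE in the OpenCL headers; reading "Max allocation limit" as the card's per-buffer allocation cap is an assumption about what the app prints, not something confirmed in this thread.

```python
# Values copied from the error output quoted in this thread.
requested = 268_435_456   # bytes the app tried to allocate
max_alloc = 264_241_152   # "Max allocation limit" reported in the same output

MIB = 1024 * 1024

# How far the request overshoots the reported limit.
shortfall = requested - max_alloc

print(f"requested: {requested // MIB} MiB")
print(f"limit:     {max_alloc // MIB} MiB")
print(f"shortfall: {shortfall} bytes ({shortfall // MIB} MiB)")
```

So the request is 256 MiB against a 252 MiB limit, which is where the 4,194,304 byte figure comes from.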

When each of my hosts failed, every remaining task in the cache reported the same 'Error allocating memory' and crashed after a few seconds of run time. BOINC then went into a 24 hour backoff with the errored tasks remaining unreported. As well as the error tasks, there were a couple of fully completed tasks waiting to be reported.

I've seen this trashed work cache and 24 hr backoff behaviour in the past with CPU apps when BOINC suddenly decides a required file is missing or has failed the MD5 checksum. I've developed a procedure for recovering the work cache in such circumstances. It even works to recover partly crunched tasks if the checkpoint information is still stored in the state file. It requires a good understanding of the internal structure of the state file and a fair bit of editing to remove/correct the stuff that BOINC has inserted/changed, so it's not for the average volunteer.

I've now discovered the procedure works equally well with GPU tasks since I've fully recovered all tasks for both of my hosts. It took about 5-10mins per host and each one had 20+ trashed GPU tasks in their work caches. Before restarting BOINC, I edited app_config.xml to allow only 2 concurrent GPU tasks. I haven't yet seen any further problems.
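For anyone who hasn't used app_config.xml before, here is a minimal sketch of generating one that limits a GPU to 2 concurrent tasks by giving each task half a GPU. The app name "hsgamma_FGRP3" and the cpu_usage value of 1.0 are assumptions for illustration only; check the app names in your own client_state.xml before using anything like this. The file itself goes in the project's directory under the BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Return the text of an app_config.xml with one <app> entry."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    # gpu_usage 0.5 means each task claims half a GPU, i.e. 2 tasks per GPU.
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    # CPU fraction reserved per GPU task (assumed value, adjust to taste).
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


# "hsgamma_FGRP3" is a placeholder app name, not confirmed in this thread.
config_text = build_app_config("hsgamma_FGRP3", gpu_usage=0.5, cpu_usage=1.0)
print(config_text)
```

After saving the output as app_config.xml, the client has to re-read its config files (or be restarted) before the new limit takes effect.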

Sure, if you exclude FGRP3 GPU tasks and run BRP5 4x on a 7850 GPU, you will get a much better performance. However, my aim is to use some resources to try to assist the Devs to prove the worth of the FGRP3 GPU app, which they don't yet trust. This can only be done by having lots of direct comparisons against the results of the CPU app, which is trusted. For the moment, I'm continuing on with both my hosts crunching FGRP3 GPU tasks 2x, even though the RAC will only be 25% of what it could be by crunching BRP5.

Cheers,
Gary.

Anonymous


Quote:
Quote:
Quote:
Quote:
...
What is the significance of:
Error allocating device memory: 268435456 bytes (error: -61)
[EDIT] I noticed in the output from above the following:
Max allocation limit: 264241152

It means you just need to stick another 4,194,304 bytes of memory on that GPU card and you'll be sweet :-).

Thanks for reporting this as you've saved me having a go at 4x unnecessarily. Hopefully, as the app matures, the Devs might be able to save some memory somewhere and get a task sufficiently under 0.5GB so that 2 could run on a 1GB card and 4 on a 2GB card.

After looking at the earlier results I decided to use a different WU profile to exclude the Gamma-ray pulsar search #3 v1.11 (FGRPopencl-ati) WUs.

By excluding these WUs I can successfully run 4 GPU jobs on the AMD card at a cool 54C. I live in a warm climate and summer is coming. My NVIDIA cards run at around 66C.


Yes, if you want to use a GPU efficiently, the FGRP3 GPU app doesn't do that yet. I had both of my hosts running FGRP3 3x for the purposes of accumulating performance data. Both eventually produced errors very similar to the one you reported (as listed above). It took about 24 hours for the first one to have a problem but the second one continued for quite a bit longer before also failing similarly.

My theory is that the memory requirement per task is rather variable and that (on average) 3 tasks will fit in 2GB. By chance, circumstances may arise where the 3 tasks have large enough requirements that the available memory is pretty much exhausted. If one task finishes and the next one to start needs a bit more memory .... you get the picture. I have no idea if this is an accurate assessment of the situation or not :-).


Seems reasonable. I seem to recall that some of my jobs finished while others did not.

Quote:

When each of my hosts failed, every remaining task in the cache reported the same 'Error allocating memory' and crashed after a few seconds of run time. BOINC then went into a 24 hour backoff with the errored tasks remaining unreported. As well as the error tasks, there were a couple of fully completed tasks waiting to be reported.

I've seen this trashed work cache and 24 hr backoff behaviour in the past with CPU apps when BOINC suddenly decides a required file is missing or has failed the MD5 checksum. I've developed a procedure for recovering the work cache in such circumstances. It even works to recover partly crunched tasks if the checkpoint information is still stored in the state file. It requires a good understanding of the internal structure of the state file and a fair bit of editing to remove/correct the stuff that BOINC has inserted/changed, so it's not for the average volunteer.

I've now discovered the procedure works equally well with GPU tasks since I've fully recovered all tasks for both of my hosts. It took about 5-10mins per host and each one had 20+ trashed GPU tasks in their work caches. Before restarting BOINC, I edited app_config.xml to allow only 2 concurrent GPU tasks. I haven't yet seen any further problems.

Sure, if you exclude FGRP3 GPU tasks and run BRP5 4x on a 7850 GPU, you will get a much better performance. However, my aim is to use some resources to try to assist the Devs to prove the worth of the FGRP3 GPU app, which they don't yet trust. This can only be done by having lots of direct comparisons against the results of the CPU app, which is trusted. For the moment, I'm continuing on with both my hosts crunching FGRP3 GPU tasks 2x, even though the RAC will only be 25% of what it could be by crunching BRP5.

I am running a full complement of WUs on other machines, limiting GPU work to 3 units. My reason for excluding FGRP3 to enable 4 GPU units on this one machine was a desire to determine the "heat" characteristics of the AMD card. All of my NVIDIA cards running 3 GPU units run hotter than does my AMD card running 4 GPU WUs. These machines are all in the same room, so it seems that AMD is better at heat management than NVIDIA. With 4 GPU units running, the AMD is at around 54C while the NVIDIA cards are at around 74C. This is a significant difference and one I can't really explain. As such, I would be hesitant to push 4 GPU WUs on an NVIDIA card.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5876
Credit: 118565867613
RAC: 23329179


Quote:
... All of my NVIDIA cards running 3 GPU units run hotter than does my AMD card running 4 GPU WUs. These machines are all in the same room so it seems that AMD is better at heat management than NVIDIA. With 4 GPU units running the AMD is around 54C while the NVIDIA cards are at around 74C. This is a significant difference and one I can't really explain.


I've been upgrading a few of my hosts which have GTX650s running BRP5 2x. The OS was installed well over a year ago and the nvidia driver was 304.51 - quite old. I had deliberately not upgraded these hosts because of reports of slow crunching with some later versions. The current drivers are now supposed to be OK and 331.49 is now showing in the repos so I figured it was time to upgrade.

Before changing anything I ran the nvidia-settings utility on several hosts to see what the current conditions were. I found temperatures in the range of 55-60C and fan settings showing around 41-44%. There is no third wire so actual RPMs are unknown. The power applied to the fan increases based on temperature and the 'profile' that controls this should be adjustable, if you know the necessary incantation. Have you checked that the fans are running 100% when the temperature is over 70C?

I have 12 hosts running Milkyway on AMD HD4850s. They will soon have completed 4 years running 24/7. Their temps have always been in the range of 70-90C. The fans all run 100% and so far only one fan has failed. If the temperature gets over 100C (I've seen 106C), tasks start failing so when they go above 90C, I pull the card and replace the thermal grease under the heat sink. This always seems to drop the temps back to around 70-75C. When I replace the grease like this I always find the old grease has really dried out into a pretty solid 'cake' which seems a bit too thick. I try to apply a fairly thin new layer. So far I've done about half the cards and I haven't had to 're-do' any a second time (yet) :-).

Cheers,
Gary.

Anonymous


Quote:
I found temperatures in the range of 55-60C and fan settings showing around 41-44%. There is no third wire so actual RPMs are unknown.

I have been running NVIDIA driver 331.38 for quite a while and it provides RPM. I had forgotten that I had uplifted the driver to provide a slider for fan speed. On this machine I currently have it set at 59%; the RPM oscillates around 2490 and the temperature sits at around 59C - 66C. I might now need to uplift the driver on the other NVIDIA machine, which I know does not have it installed.

John Jamulla
Joined: 26 Feb 05
Posts: 32
Credit: 1194347110
RAC: 504271


Hi - Thanks for the info.
I am ranting because I'm putting significant power and dollar resources into this project (because I think it's important science), and all of a sudden I'm getting less than half the amount of work out of my machines, and a lot more problems. I should be able to let these things run for long periods with no involvement.

A few things: it may be coincidence, but it definitely happened to one machine at a time (I have 3 machines) as I loaded a new version of BOINC.
Also, what's weird is that I have 3 machines doing basically the same thing, yet they don't act the same. One machine will run 7 CPU tasks, the other will run 8 (both are 4-core, hyperthreaded), but one runs only 1 GPU task in total while the other runs, I think, 1 per GPU. I have only 1 set of preferences.
One is an i7-2660k, one is an i7-3770k, and one is an i7-3930k (6-core).

The 2660K has been the best performer, OC'd to I think 4.5 GHz, with 3 NVIDIA GPUs: a GTX 470, a GTX 550Ti I think, and a GTX 570. I was running 8 CPU tasks and 9 GPU tasks fine for a long, long time, and getting close to 100k credit per day.
I'm currently down, I think, in the high 39K or low 40K range. Nothing else changed.
Now I think I'm only getting 7 CPU tasks and, I think, 1 GPU task per GPU. I DID NOT change anything.
I don't think that's how mature software should work. Just one day it's completely different?

Coincidentally, this is when I started getting OpenCL apps. Maybe that has nothing to do with it, but if I have a set of NVIDIA GPUs, I don't think I should get all OpenCL apps, especially if it's a "crappy" implementation, should I?
That's a very poor implementation if so; I'm not sure if that's a BOINC thing or an Einstein thing.

Now, I've been a programmer for years, and I've been debugging and so on for a long, long time. I would say this is a "crappy" rollout/implementation.
Especially for a group that WANTS to use as much CPU as possible for their work.

My machine (the best performer) was running "the same" for over 6 months, with a steady (and high) amount of credit per day/week/month. The machine ran 24x7.
Then I loaded a new BOINC, new apps (OpenCL) showed up, and my credit went to Hell: the CPUs were under-utilized and the GPUs weren't being used much either.
I checked this in multiple ways:
a) Task Manager shows less utilization,
b) CPU-Z and
c) GPU-Z show less utilization.
d) There are fewer tasks running in BOINC (einstein@home only).
e) I get a lot less credit.

Now that I'm done with my rant, maybe you guys can give me some advice pretty please. I'll try to start paying attention, but would really like to understand what's going on, and how to fix it.

How do you guys "benchmark" your machines and config file changes with BOINC/Einstein (a step-wise outline would help)? I haven't found a really good way to correlate the individual tasks being run (how long they run, how much credit they earn) without painstakingly writing down individual tasks and then finding them in the results lists. Even that gives me only a basic idea, not a clear picture such as "I ran Nx on a single GPU and it took this long".
Maybe I'm just being stupid, but it seems quite tedious to correlate the data (and I'm not even 100% sure how).
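One way to cut down the hand-copying, sketched here under some assumptions: the BOINC client keeps a per-project job log in its data directory (a file named along the lines of job_log_<project>.txt), with one line per finished task. The layout assumed below, a timestamp followed by key/value pairs where "nm" is the task name and "et" the elapsed seconds, is from memory and should be checked against your own file. The task names in the sample are made up, and the log does not record credit, so that part still has to come from the website.

```python
from collections import defaultdict
from statistics import mean


def parse_job_log_line(line: str) -> dict:
    """Parse one line of the assumed 'timestamp key value key value ...' layout."""
    tokens = line.split()
    record = {"timestamp": int(tokens[0])}
    # The remaining tokens are read as key/value pairs, e.g. "et 4000.0".
    for key, value in zip(tokens[1::2], tokens[2::2]):
        record[key] = value
    return record


def mean_elapsed_by_prefix(lines, prefix_len=4):
    """Average elapsed time, grouped by the first characters of the task name."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rec = parse_job_log_line(line)
        groups[rec["nm"][:prefix_len]].append(float(rec["et"]))
    return {prefix: mean(times) for prefix, times in groups.items()}


# Made-up sample lines in the assumed layout (not real task names).
sample = [
    "1384000000 ue 5000.0 ct 300.0 fe 1e15 nm PB01_example_task_0 et 4000.0 es 0",
    "1384004000 ue 5000.0 ct 320.0 fe 1e15 nm PB01_example_task_1 et 4200.0 es 0",
    "1384009000 ue 9000.0 ct 8000.0 fe 2e15 nm LATe_example_task_0 et 8100.0 es 0",
]
averages = mean_elapsed_by_prefix(sample)
print(averages)
```

Run it once before a config change and once after (splitting on the timestamp), and the per-search averages give a rough Nx comparison without any note-taking.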

John Jamulla
Joined: 26 Feb 05
Posts: 32
Credit: 1194347110
RAC: 504271


Ok - Figured out how to disable FGRP tasks/OpenCL, and the mix of tasks on GPU/CPU and the behavior mostly went back to the way it was before, when I was getting tons of credit (good). But I still don't understand why I can't get all my GPUs working with the number of tasks I set.
I set the GPU tasks to .5, so I expect each GPU to have 2 GPU tasks running all the time, but the 3rd GPU only gets 1 or none.

Does anyone want to point me to a place or explain to me why I am seeing that behavior?

Basically, on both machines now, I have 2 GPU tasks per GPU running on the first two GPUs (set via my account "home" prefs for BRP and GW = .5), and only 1 GPU task running on the last (3rd) GPU. I have all 8 tasks running on the CPUs.

I run only einstein@home with BOINC on all 3 machines (3rd machine Mobo dead and off currently).

The two machines under discussion here are QUAD core machines, an i7-2660k and an i7-3770k. Hyperthreading is ON, so a total of 8 logical cores is available on each of these machines. Both machines show the same behavior.

Currently, BOTH of these machines have 3 NVIDIA GPUs each, in a mix of models per machine.
The i7-3770k has 2 GTX 660ti and 1 GTX770. The i7-2660k has a GTX 560ti, a GTX 460, and a GTX 570.

What I don't understand is this: if I go into the BOINC Manager, Tools|Computing preferences, and set "On multiprocessor systems use at most X", where X is say 90, I get 7 of 8 CPU tasks and no change to GPU tasks. If I set it to 80, I get 6 CPU tasks, and the last GPU gets NO GPU tasks..... I thought it would get MORE (it's set to take 2 GPU tasks).
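The CPU task counts reported here are at least consistent with simple rounding down, as the sketch below shows. It assumes the client takes the percentage of the 8 logical CPUs and truncates to a whole number (never going below 1), and that a GPU setting of .5 means 2 tasks per GPU. Both helper functions are hypothetical, written only for this illustration. The sketch says nothing about why the 3rd GPU comes up short.

```python
import math


def usable_cpus(logical_cpus: int, percent: float) -> int:
    """CPUs the client may use, assuming it rounds down and keeps at least one."""
    return max(1, math.floor(logical_cpus * percent / 100))


def expected_gpu_tasks(num_gpus: int, gpu_fraction: float) -> int:
    """GPU tasks expected if every GPU runs floor(1 / fraction) tasks."""
    return num_gpus * math.floor(1 / gpu_fraction)


# 8 logical CPUs (4 cores with hyperthreading), as described in the post.
for pct in (100, 90, 80):
    print(f"{pct}% of 8 CPUs -> {usable_cpus(8, pct)} CPU tasks")

# 3 GPUs with the GPU setting at .5 per task.
print(f"expected GPU tasks: {expected_gpu_tasks(3, 0.5)}")
```

That gives 7 CPU tasks at 90% and 6 at 80%, matching what is reported above, and 6 GPU tasks expected across the 3 cards.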

I don't understand why my 3rd GPU is not getting 2 tasks.

Also - is there a way to set different numbers of GPU tasks for individual GPUs?

mikey
Joined: 22 Jan 05
Posts: 12799
Credit: 1878878749
RAC: 1475185


Quote:
I don't understand why my 3rd GPU is not getting 2 tasks.

Take out one GPU and see if the other two get 2 tasks each. If so, put the 3rd one back in a different slot and see what happens. There are some setups where multiple NVIDIA GPUs don't do what they are supposed to do, and the BOINC developers are looking into it.

It is a major change that will be coming when they get it right: each GPU will be a stand-alone device, both on our PCs and at the server level. That means we will be able to control each one separately, instead of controlling them all together or maintaining multiple config files, which can cause their own issues.
