not getting GPU WUs anymore

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 409

Credit: 10204593455

RAC: 23364078

7 May 2018 14:04:51 UTC

Topic 214872

Here is another problem that makes me wonder whether I, my rigs, or Einstein are slowly deteriorating ...

I've been getting plenty of GPU WUs since yesterday (a couple of hundred).

Now I'm not receiving any more -

Server is working (generating), so what am I doing wrong?

Maybe I should take a day off?

Would appreciate any tips or help ...

Jim1348

Joined: 19 Jan 06

Posts: 463

Credit: 257957147

RAC: 0

7 May 2018 14:35:22 UTC

Message 165283

Are you also doing the Einstein CPU work units along with some other CPU projects? Then it is probably a BOINCism. The scheduler gets all bent out of shape, since it can't keep track of the resource share of the GPU and CPU separately. Or something like that. No one really knows. I use two separate machines for the Einstein CPU and GPU work. It is silly, I know.

Ace Casino

Joined: 25 Feb 05

Posts: 36

Credit: 1470286967

RAC: 862526

7 May 2018 14:43:39 UTC

Message 165284

Maybe you don't need any more GPU WUs right now.

Look in your BOINC Event log and see if it says: "Not requesting tasks, don't need".

You won't get more WUs if BOINC doesn't think you can finish them by the deadline.
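If the Event log is long, a small script can pull out the relevant lines. This is only a minimal sketch in Python: it assumes you have copied the log text into a list of lines, and the layout of the sample lines is hypothetical. Only the two message texts quoted in this thread are taken as given.

```python
import re

# Matches the "not requesting tasks" messages quoted in this thread,
# whether the reason follows a comma or a colon.
PATTERN = re.compile(r"not requesting tasks[:,]?\s*(.*)", re.IGNORECASE)


def no_fetch_reasons(lines):
    """Return the reason text of every 'not requesting tasks' line."""
    reasons = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            reasons.append(match.group(1).strip())
    return reasons


# Hypothetical sample lines; the real Event log layout may differ.
sample = [
    "Einstein@Home | Not requesting tasks, don't need",
    "Einstein@Home | not requesting tasks: some task is suspended via Manager",
    "Einstein@Home | Sending scheduler request: Requested by user.",
]

for reason in no_fetch_reasons(sample):
    print(reason)
```

The reason text after the comma or colon is the part that tells you why no work was requested.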

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 409

Credit: 10204593455

RAC: 23364078

7 May 2018 14:47:07 UTC

Message 165285 in response to message 165283

Jim1348 wrote:

Are you also doing the Einstein CPU work units along with some other CPU projects?

No, just running Einstein.

Started crunching about two days ago - CPU and GPU - on 6 rigs.

Everything was fine.

Suddenly no more GPU WUs.

I will follow your advice - just have to wait till those long running CPU WUs are finished.

Have a nice day!

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 409

Credit: 10204593455

RAC: 23364078

7 May 2018 15:10:58 UTC

Message 165288 in response to message 165284

Ace Casino wrote:

Maybe you don't need any more GPU WUs right now.

Look in your BOINC Event log and see if it says: "Not requesting tasks, don't need".

You won't get more WUs if BOINC doesn't think you can finish them by the deadline.

Thanks for pointing me to the event log!

I noticed the message "not requesting tasks: some task is suspended via Manager".

I had suspended all CPU WUs (not the project).

What I did not know (I guess I'm just a little dumb) is that the Manager does not differentiate between CPU and GPU tasks (WUs).

As a matter of fact, my logic was that if you want to pause the whole project, you SUSPEND the project.

AND if you just want to suspend a certain task (WU) then you just suspend THAT sole task (WU), expecting that the Manager still delivers!

I guess I'm having a bad day!

Thanks for replying!

rjs5

Joined: 3 Jul 05

Posts: 32

Credit: 612968392

RAC: 857901

7 May 2018 21:46:57 UTC

Message 165289

Did you recently update drivers?

Your computer shows the good 391.35 Nvidia driver. The newer one, 397.31, has problems. Nvidia issued a hotfix, 397.55.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117656962742

RAC: 35186839

8 May 2018 8:02:33 UTC

Message 165295 in response to message 165288

San-Fernando-Valley wrote:

... I had suspended all CPU WUs (not the project).

Were you suspending tasks because you were getting too many of them and you were just trying to get their numbers to deplete a little before allowing any more?

If so, and seeing as you now understand that you can't do it that way, perhaps you might be interested in what you can do to partially mitigate the problem of oversupply of CPU tasks.

There are a couple of reasons why you can get this problem. The main reason is that Einstein uses a duration correction factor (DCF) to manage estimated crunch time, and there's just one DCF, which applies project-wide. For most people, CPU tasks take longer than the estimate, so whenever a CPU task finishes, the DCF increases, which drags the estimates of both CPU and GPU tasks much higher. The CPU task estimate is now correct (for the time being) but the GPU task estimate is way too high. GPU tasks tend to take less than the estimate anyway. A series of fast-finishing GPU tasks will progressively drag down the DCF so that the CPU estimate becomes way too low and BOINC over-fetches CPU tasks as a result. Eventually another long-running CPU task finishes, the DCF has a big jump upwards and the whole cycle repeats. Overall, you will tend to have fewer GPU tasks and more CPU tasks than you really need to match your desired cache size. The problem gets worse with larger cache sizes.
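The see-saw described above can be illustrated with a few lines of Python. This is a rough sketch only: the update rule (jump up at once when a task overruns, drift down by 10% of the gap per task otherwise) is a simplifying assumption, not BOINC's actual code, and the task times are made-up numbers.

```python
def update_dcf(dcf, raw_estimate, actual):
    """Assumed, simplified rule for one project-wide DCF (not BOINC's code)."""
    ratio = actual / raw_estimate  # the DCF that would have made this estimate exact
    if ratio > dcf:
        return ratio               # task overran its estimate: jump straight up
    return 0.9 * dcf + 0.1 * ratio  # task beat its estimate: drift down slowly


def simulate(tasks, dcf=1.0):
    """Apply the rule to a sequence of (raw_estimate, actual) pairs."""
    history = []
    for raw_estimate, actual in tasks:
        dcf = update_dcf(dcf, raw_estimate, actual)
        history.append(dcf)
    return history


# Hypothetical task types, in hours: (raw estimate, actual run time).
CPU_TASK = (10.0, 20.0)  # CPU tasks take twice the estimate
GPU_TASK = (1.0, 0.5)    # GPU tasks take half the estimate

# One CPU task, then ten GPU tasks, then another CPU task.
history = simulate([CPU_TASK] + [GPU_TASK] * 10 + [CPU_TASK])

for step, dcf in enumerate(history):
    print(f"step {step:2d}: DCF {dcf:.3f}, "
          f"CPU estimate {CPU_TASK[0] * dcf:5.1f} h, "
          f"GPU estimate {GPU_TASK[0] * dcf:4.2f} h")
```

With these assumed numbers, the CPU estimate is only right immediately after a CPU task finishes. It then sinks as the GPU tasks complete, which is the point at which too much CPU work would be fetched.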

However, there is another factor which makes the CPU task over-fetch even worse. It becomes quite a severe factor if you are running high-end nVidia GPUs with multiple concurrent GPU tasks. As an example, let's say you have a quad core host with HT disabled, so 4 cores/4 threads, and that you are running 3 concurrent GPU tasks. By default, the nVidia GPU would need 3 of your 4 cores to be 'reserved' for GPU support duties. The problem is that BOINC still sees 4 cores and will fetch CPU tasks for 4 cores even though only one core would be allowed to crunch CPU tasks. Hence a big CPU task over-fetch. You can mitigate this a bit with HT enabled: BOINC then fetches for 8 cores but crunches with only 5. However, the CPU tasks take a lot longer and the DCF swings are even larger than before.
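The arithmetic in this example can be written out as a short Python sketch. The function name is hypothetical, and the model (one full core reserved per GPU task by default) is just the behaviour described above, not something read from BOINC's source.

```python
def fetch_vs_crunch(logical_cpus, gpu_tasks, cpu_per_gpu_task=1.0):
    """Return (cores BOINC fetches CPU work for, cores left to crunch CPU tasks).

    Assumes each concurrent GPU task reserves cpu_per_gpu_task cores and that
    only whole reserved cores are taken away from CPU crunching.
    """
    reserved = min(logical_cpus, int(gpu_tasks * cpu_per_gpu_task))
    return logical_cpus, logical_cpus - reserved


print(fetch_vs_crunch(4, 3))  # quad core, HT disabled, 3 GPU tasks
print(fetch_vs_crunch(8, 3))  # same host with HT enabled
```

The gap between the two numbers in each pair is the number of cores for which CPU work is fetched but never crunched.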

The best way to prevent this second cause of CPU over-fetch is to make sure that BOINC only fetches work for cores that will actually be crunching. Taking the previous example, you could change the BOINC preferences so that BOINC can use only 25% of the cores. That way, BOINC will fetch for only one CPU core of the 4. You now have to stop GPU tasks from also trying to 'reserve' cores to support the 3 concurrent GPU tasks. You do that with an app_config.xml file. So for the example above, a suitable file would look like:

    <app_config>
        <app>
            <name>hsgamma_FGRPB1G</name>
            <gpu_versions>
                <gpu_usage>0.33</gpu_usage>
                <cpu_usage>0.2</cpu_usage>
            </gpu_versions>
        </app>
    </app_config>

The 0.33 allows 3 GPU tasks to share the GPU whilst 3 x 0.2 is less than a full core so no further CPU cores are taken up with GPU support duties. You already have 3 CPU cores left free for GPU support by the 25% of cores setting.
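To double-check numbers like these before dropping a file into the project directory, a few lines of Python will do. This is a sketch under stated assumptions: the `summarize` helper is hypothetical, and rounding the summed CPU usage down to whole cores follows the description above rather than BOINC's own code.

```python
import math
import xml.etree.ElementTree as ET

# The same file contents as in the example above.
APP_CONFIG = """
<app_config>
    <app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
            <gpu_usage>0.33</gpu_usage>
            <cpu_usage>0.2</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
"""


def summarize(xml_text):
    """Return (concurrent GPU tasks per GPU, whole CPU cores reserved for them)."""
    versions = ET.fromstring(xml_text).find("app/gpu_versions")
    gpu_usage = float(versions.findtext("gpu_usage"))
    cpu_usage = float(versions.findtext("cpu_usage"))
    # How many tasks of size gpu_usage fit on one GPU (epsilon guards rounding).
    concurrent = int(1.0 / gpu_usage + 1e-9)
    # Assumption: only whole cores of summed cpu_usage are reserved.
    reserved_cores = math.floor(concurrent * cpu_usage)
    return concurrent, reserved_cores


concurrent, reserved = summarize(APP_CONFIG)
print(f"{concurrent} GPU tasks per GPU, {reserved} CPU cores reserved")
```

Changing the values in the string shows how a different `gpu_usage` or `cpu_usage` would shift the balance.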

As with most things in life, there are pros and cons. The con here is that you can't just get rid of app_config.xml and have things revert to the way they were before. I use the file all the time and changes can be made by editing it. BOINC Manager has an option to 'read config files' to pick up any changes. I've never tried to get rid of a file completely but my understanding is that you can't just delete it. When you first place this file in the project directory, the contents get incorporated into the state file. If you remove the file, that act alone doesn't remove what was incorporated into the state file. You have two choices to fix that. You could manually edit the state file, which is potentially dangerous if you make a mistake or don't really know what to remove. Or you could reset the project, which has its own set of disadvantages.

I love that file because it has solved the problem of how to enable lots of different preferences despite the restriction to just 4 locations (venues). Each of my hosts can essentially be customised independently of all the others. I can't see myself ever wanting to remove it. When things change, I just edit it and click on 'read config files'.

Cheers,
Gary.

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 409

Credit: 10204593455

RAC: 23364078

15 May 2018 16:53:52 UTC

Message 165395 in response to message 165295

Gary Roberts wrote:

San-Fernando-Valley wrote:
... I had suspended all CPU WUs (not the project).

Were you suspending tasks because you were getting too many of them and you were just trying to get their numbers to deplete a little before allowing any more?

If so, and seeing as you now understand that you can't do it that way, perhaps you might be interested in what you can do to partially mitigate the problem of oversupply of CPU tasks.

I've just had time to read your fine reply. Absolutely well explained. THANK YOU!

BUT, I still don't quite understand the particular problem I had. Let me explain a little further, hoping not to take up too much of your time ...:

Let's take just one of my rigs: If I remember correctly, I had 6 cores loaded with CPU WUs and 2 cores with GPU WUs (both cores sharing one GPU).

For some reason that I don't really remember (heat problems?), I suspended all 6 CPU WUs - while the 2 GPU WUs were happily crunching.

Once they (the GPU WUs) had finished (they run about 10 minutes), I had expected to get more GPU WUs delivered (of course I tried UPDATE) -- or in the worst case get more CPU WUs or a mix of CPU + GPU jobs. But nothing happened - no WUs delivered. After looking into the EVENT LOG, I found the message that I have already mentioned.

I guess that was very naive of me, but I don't have much experience with how the MANAGER works!

My goal was to SUSPEND CPU work, but to continue GPU work - till the heat would cool off in the evening.

I'm wondering whether, when running more than one project at the same time, the Manager differentiates and delivers WUs for the other project, or not. Maybe I'll try that out if I ever have time ...

Right now I'm getting a virus detection for an EINSTEIN file - strange - will open a new post/thread!

Have a nice day!

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117656962742

RAC: 35186839

16 May 2018 2:08:55 UTC

Message 165398 in response to message 165395

San-Fernando-Valley wrote:

For some reason that I don't really remember (heat problems?), I suspended all 6 CPU WUs - while the 2 GPU WUs were happily crunching.

Once they (the GPU WUs) had finished (they run about 10 minutes), I had expected to get more GPU WUs delivered (of course I tried UPDATE) -- or in the worst case get more CPU WUs or a mix of CPU + GPU jobs. But nothing happened - no WUs delivered. After looking into the EVENT LOG, I found the message that I have already mentioned.

BOINC regards a suspended task as an abnormal situation and a reason for caution. By design, BOINC is prevented from requesting any work (even work for unrelated searches) from a project if just one task is suspended. It's quite OK to suspend tasks temporarily if you want to test things like heat production with different numbers of tasks crunching. However, you must remove suspension completely if you want BOINC to be able to request any new work from that project. So, what you describe is exactly as per design. I'm sorry I didn't comment specifically about that. I thought the event log message had made it clear.

San-Fernando-Valley wrote:

My goal was to SUSPEND CPU work, but to continue GPU work - till the heat would cool off in the evening.

You could achieve that goal (temporarily at least whilst waiting for cooler temperatures) by using local computing preferences in BOINC Manager. If you click on that, you can find a tab where you can change the % of CPU cores BOINC is allowed to use. With 8 cores and 2 GPU tasks, you could temporarily change the %cores setting from 100% to 25% and apply the change. That would have the immediate effect of suspending the 6 CPU cores that were crunching CPU tasks whilst leaving the 2 GPU tasks (with 2 supporting CPU cores) to continue crunching.
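As a quick check of that arithmetic, here is a minimal Python sketch. The function name is hypothetical, and rounding the %cores figure down to a whole number of cores (with a minimum of one) is an assumption about how BOINC treats the setting.

```python
def cpu_tasks_running(logical_cpus, pct_cores, gpu_support_cores):
    """CPU tasks that can still run once GPU support cores are taken out.

    Assumption: the %cores preference is rounded down to whole cores,
    and BOINC always keeps at least one core usable.
    """
    usable = max(1, int(logical_cpus * pct_cores / 100))
    return max(0, usable - gpu_support_cores)


print(cpu_tasks_running(8, 100, 2))  # the original setup
print(cpu_tasks_running(8, 25, 2))   # after dropping %cores to 25%
```

Trying values between 25% and 100% in the same way gives the intermediate mixes of CPU and GPU tasks.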

The 6 CPU tasks will have stopped crunching and will be listed as 'waiting to run' or something like that. As GPU tasks finish, the results will be returned and new ones will start. BOINC will continue to download new work as needed since nothing is actually suspended by the user. You can't leave it that way forever as BOINC would eventually go into panic mode as the deadline approached. You could play around with the %cores setting to find a mix of CPU and GPU tasks that gave an acceptable level of heat. The disadvantage is that tasks from other projects will not be able to run either if the setting is too restrictive.

If you want a long term solution you need to be prepared to experiment with settings. It may well be that you might choose to 'turn off' (through settings) CPU tasks from Einstein completely, just leaving the GPU tasks. That way, you could allocate some fraction of the remaining 6 cores to CPU tasks from other projects.

San-Fernando-Valley wrote:

I'm wondering whether, when running more than one project at the same time, the Manager differentiates and delivers WUs for the other project, or not. Maybe I'll try that out if I ever have time ...

If tasks for one project are suspended, only that project is prevented from being asked for new work.

San-Fernando-Valley wrote:

Right now I'm getting a virus detection for an EINSTEIN file - strange - will open a new post/thread!

Extremely likely to be a false positive. Complain to your virus software vendor :-).

Cheers,
Gary.

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 409

Credit: 10204593455

RAC: 23364078

16 May 2018 6:50:53 UTC

Message 165399 in response to message 165398

G.R.: Thanks again for your successful effort at a detailed explanation.

I am aware of the possible settings.

I had hoped to be able to do it in a 'simple' way -- but what is really simple in life?

I appreciate the time you took to explain everything so thoroughly.

Have a nice day ...
