not getting GPU WUs anymore

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10205953455
RAC: 23235468
Topic 214872

Here is another problem that makes me wonder whether I, my rigs, or Einstein are slowly deteriorating ...

I've been getting plenty of GPU WUs since yesterday (a couple of hundred).

Now I'm not receiving any more.

Server is working (generating), so what am I doing wrong?

Maybe I should take a day off?

Would appreciate any tips or help ...

 

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Are you also doing the Einstein CPU work units along with some other CPU projects?  Then it is probably a BOINCism.  The scheduler gets all bent out of shape, since it can't keep track of the resource share of the GPU and CPU separately.  Or something like that.  No one really knows.  I use two separate machines for the Einstein CPU and GPU work.  It is silly, I know.

Ace Casino
Joined: 25 Feb 05
Posts: 36
Credit: 1470375283
RAC: 861497

Maybe you don't need any more GPU tasks right now.

Look in your BOINC Event log and see if it says: "Not requesting tasks, don't need".

You won't get more WUs if BOINC doesn't think you can finish them by the deadline.
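If the event log is long, a few lines of Python can pick out the relevant messages from a saved copy. This is only a sketch: the two "not requesting tasks" strings are the ones quoted in this thread, the middle sample line is invented filler, and the function name is made up for illustration.

```python
def work_fetch_refusals(log_lines):
    """Return the log lines in which BOINC says it is not requesting tasks."""
    needle = "not requesting tasks"
    # Compare in lower case because the capitalisation of the message varies.
    return [line for line in log_lines if needle in line.lower()]


# Sample lines: the first and last are the messages quoted in this thread,
# the middle one is illustrative filler and not a real BOINC message.
sample = [
    "Not requesting tasks, don't need",
    "some other log line (illustrative filler)",
    "not requesting tasks: some task is suspended via Manager",
]

hits = work_fetch_refusals(sample)
for line in hits:
    print(line)
```

The text after the colon or comma is the part worth reading, since it gives the reason BOINC decided not to ask for work.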

 

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10205953455
RAC: 23235468

Jim1348 wrote:
Are you also doing the Einstein CPU work units along with some other CPU projects?

No, just running Einstein.

Started crunching about two days ago - CPU and GPU - on 6 rigs.

Everything was fine.

Suddenly no more GPU WUs.

I will follow your advice - just have to wait till those long running CPU WUs are finished.

Have a nice day!

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10205953455
RAC: 23235468

Ace Casino wrote:

Maybe you don't need any more GPU tasks right now.

Look in your BOINC Event log and see if it says: "Not requesting tasks, don't need".

You won't get more WUs if BOINC doesn't think you can finish them by the deadline.

 

Thanks for pointing me to the event log!

I noticed the message "not requesting tasks: some task is suspended via Manager".

I had suspended all CPU WUs (not the project).

What I did not know (I guess I'm just a little dumb) is that the Manager does not differentiate between CPU and GPU tasks (WUs).

As a matter of fact, my logic was that if you want to pause the whole project, you SUSPEND the project.

AND if you just want to suspend a certain task (WU) then you just suspend THAT sole task (WU), expecting that the Manager still delivers!

I guess I'm having a bad day!

Thanks for replying!

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 613261721
RAC: 877966

Did you recently update drivers?

Your computer shows the good 391.35 Nvidia driver.  The newer one, 397.31, has problems.  Nvidia issued a hotfix, 397.55.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117661676051
RAC: 35254838

San-Fernando-Valley wrote:
... I had suspended all CPU WUs (not the project).

Were you suspending tasks because you were getting too many of them and you were just trying to get their numbers to deplete a little before allowing any more?

If so, and seeing as you now understand that you can't do it that way, perhaps you might be interested in what you can do to partially mitigate the problem of oversupply of CPU tasks.

There are a couple of reasons why you can get this problem.  The main reason is that Einstein uses a duration correction factor (DCF) to manage estimated crunch time, and there's just one DCF that applies project-wide.  For most people, CPU tasks take longer than the estimate, so whenever a CPU task finishes, the DCF increases, which drags the estimates of both CPU and GPU tasks much higher.  The CPU task estimate is now correct (for the time being) but the GPU task estimate is way too high.  GPU tasks tend to take less than the estimate anyway.  A series of fast-finishing GPU tasks will progressively drag down the DCF so that the CPU estimate becomes way too low and BOINC over-fetches CPU tasks as a result.  Eventually another long-running CPU task finishes, the DCF has a big jump upwards and the whole cycle repeats.  Overall, you will tend to have fewer GPU tasks and more CPU tasks than you really need to match your desired cache size.  The problem gets worse with larger cache sizes.
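To see the cycle in numbers, here is a rough toy model in Python.  It is a sketch under assumptions, not BOINC's actual code: the update rule (jump straight up when a task overruns, drift down slowly otherwise), the 1.1 and 0.9/0.1 constants, and the task runtimes are all illustrative choices.

```python
def update_dcf(dcf, actual, raw_estimate):
    """Toy asymmetric DCF update: jump up at once, drift down slowly.

    Assumption: this only mimics the behaviour described in the post,
    with illustrative constants. It is not taken from the BOINC client.
    """
    raw_ratio = actual / raw_estimate        # true runtime vs. uncorrected estimate
    if actual > dcf * raw_estimate * 1.1:    # ran well over the corrected estimate
        return raw_ratio                     # big jump upwards
    return 0.9 * dcf + 0.1 * raw_ratio       # otherwise ease towards the ratio


# Hypothetical runtimes in seconds: (actual, uncorrected estimate)
CPU_TASK = (40000.0, 20000.0)   # takes twice its estimate
GPU_TASK = (600.0, 1200.0)      # takes half its estimate

dcf = 1.0
history = []
sequence = [("cpu", CPU_TASK)] + [("gpu", GPU_TASK)] * 10 + [("cpu", CPU_TASK)]
for name, task in sequence:
    dcf = update_dcf(dcf, *task)
    history.append((name, round(dcf, 3)))

for name, value in history:
    print(f"{name}: DCF = {value}")
```

In this model the single project-wide DCF jumps when the first CPU task finishes, sinks back towards 1 while the ten GPU tasks report in, and jumps again on the next CPU task.  While it is low, the CPU estimate is roughly half the real runtime, which is when the over-fetch happens.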

However, there is another factor which makes the CPU task over-fetch even worse.  It becomes quite a severe factor if you are running high end nVidia GPUs with multiple concurrent GPU tasks.  As an example, let's say you have a quad core host with HT disabled.  You have 4 cores/4 threads.  Let's say you are running 3 concurrent GPU tasks.  By default, the nVidia GPU would need 3 of your 4 cores to be 'reserved' for GPU support duties.  The problem is that BOINC still sees 4 cores and will fetch CPU tasks for 4 cores even though only one core would be allowed to crunch CPU tasks.  Hence a big CPU task over-fetch.  You can mitigate this a bit with HT enabled: BOINC then fetches for 8 threads but crunches with only 5.  However, the CPU tasks take a lot longer and the DCF swings are even larger than before.
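The arithmetic in that example can be written down in a couple of lines.  This is just the counting from the paragraph above, assuming one whole core is reserved per GPU task by default; the function name is made up for illustration.

```python
def cpu_overfetch(threads, gpu_tasks, cpu_per_gpu_task=1.0):
    """Return (threads BOINC fetches CPU work for, threads left to crunch it)."""
    reserved = int(gpu_tasks * cpu_per_gpu_task)   # whole cores set aside for GPU support
    crunching = max(threads - reserved, 0)
    return threads, crunching


print(cpu_overfetch(4, 3))   # quad core, HT off: fetch for 4, crunch with 1
print(cpu_overfetch(8, 3))   # same host, HT on:  fetch for 8, crunch with 5
```

The gap between the two numbers in each pair is the amount of CPU work that gets fetched but has no core to run on.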

The best way to prevent this second cause of CPU over-fetch is to make sure that BOINC only fetches work for cores that will actually be crunching.  Taking the previous example, you could change the BOINC preferences so that BOINC can use only 25% of the cores.  That way, BOINC will fetch for only one CPU core of the 4.  You now have to stop GPU tasks from also trying to 'reserve' cores to support the 3 concurrent GPU tasks.  You do that with an app_config.xml file.  So for the example above, a suitable file would look like

    <app_config>
        <app>
            <name>hsgamma_FGRPB1G</name>
            <gpu_versions>
                <gpu_usage>0.33</gpu_usage>
                <cpu_usage>0.2</cpu_usage>
            </gpu_versions>
        </app>
    </app_config>

The 0.33 allows 3 GPU tasks to share the GPU, whilst 3 x 0.2 is less than a full core, so no further CPU cores are taken up with GPU support duties.  The 3 cores left free by the 25% of cores setting are already available for that.
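If you want to try other values before editing the file, a quick check along these lines may help.  It is a sketch of the reasoning in this post only: it assumes the number of concurrent GPU tasks is the whole number of times gpu_usage fits into 1, and that cores are only reserved once the summed cpu_usage reaches a full core.  The function name is hypothetical.

```python
import math


def check_gpu_settings(gpu_usage, cpu_usage):
    """Return (concurrent GPU tasks, whole CPU cores they would reserve)."""
    concurrent = math.floor(1.0 / gpu_usage)   # 1 / 0.33 is about 3.03, so 3 tasks
    cpu_budget = concurrent * cpu_usage        # 3 x 0.2 is about 0.6 of a core
    reserved = math.floor(cpu_budget)          # assumed: only whole cores count
    return concurrent, reserved


print(check_gpu_settings(0.33, 0.2))   # values from the file above
print(check_gpu_settings(0.33, 1.0))   # one full core per GPU task
```

With the values from the file, three tasks share the GPU and no extra core is reserved; with a full core per task, three cores would be.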

As with most things in life, there are pros and cons.  The con here is that you can't just get rid of app_config.xml and have things revert to the way they were before.  I use the file all the time and changes can be made by editing it.  BOINC Manager has an option to 'read config files' to pick up any changes.  I've never tried to get rid of a file completely but my understanding is that you can't just delete it.  When you first place this file in the project directory, the contents get incorporated into the state file.  If you remove the file, that act alone doesn't remove what was incorporated into the state file.  You have two choices to fix that.  You could manually edit the state file - potentially dangerous if you make a mistake or don't really know what to remove.  Or, you could reset the project, which has its own set of disadvantages.

I love that file because it has solved the problem of how to enable lots of different preferences with the restriction to just 4 locations (venues).  Each of my hosts can essentially be customised independent of all the others.  I can't see myself ever wanting to remove it.  When things change, I just edit it and click on 'read config files'.

 

Cheers,
Gary.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10205953455
RAC: 23235468

Gary Roberts wrote:
San-Fernando-Valley wrote:
... I had suspended all CPU WUs (not the project).

Were you suspending tasks because you were getting too many of them and you were just trying to get their numbers to deplete a little before allowing any more?

If so, and seeing as you now understand that you can't do it that way, perhaps you might be interested in what you can do to partially mitigate the problem of oversupply of CPU tasks.

Just had time to read your fine reply - Absolutely well explained. THANK YOU!

BUT, I still don't quite understand the particular problem I had. Let me explain a little further, hoping not to take up too much of your time ...:

Let's take just one of my rigs:  If I remember correctly, I had 6 cores loaded with CPU WUs and 2 cores with GPU WUs (both cores sharing one GPU).

For some reason that I don't really remember (heat problems?), I suspended all 6 CPU WUs - while the 2 GPU WUs were happily crunching.

Once they (the GPU WUs) had finished (they run about 10 minutes), I had expected to get more GPU WUs delivered (of course tried UPDATE) -- or in the worst case get more CPU WUs or a mix of CPU + GPU jobs. But nothing happened - no WUs delivered. After looking into the EVENT LOG, I found the message that I have already mentioned.

I guess that was very naive of me, but I don't have much experience with how the MANAGER works!

My goal was to SUSPEND CPU work, but to continue GPU work - till the heat would cool off in the evening.

I'm wondering whether, when running more than one project at the same time, the Manager differentiates and delivers WUs for the other project.  Maybe I'll try that out if I ever have time ...

Right now I'm getting a virus detection for an EINSTEIN file - strange - will open a new post/thread!

Have a nice day!

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117661676051
RAC: 35254838

San-Fernando-Valley wrote:

For some reason that I don't really remember (heat problems?), I suspended all 6 CPU WUs - while the 2 GPU WUs were happily crunching.

Once they (the GPU WUs) had finished (they run about 10 minutes), I had expected to get more GPU WUs delivered (of course tried UPDATE) -- or in the worst case get more CPU WUs or a mix of CPU + GPU jobs. But nothing happened - no WUs delivered. After looking into the EVENT LOG, I found the message that I have already mentioned.

BOINC regards a suspended task as an abnormal situation and a reason for caution.  By design, BOINC is prevented from requesting any work (even work for unrelated searches) from a project if just one task is suspended.  It's quite OK to suspend tasks temporarily if you want to test things like heat production with different numbers of tasks crunching.  However, you must remove suspension completely if you want BOINC to be able to request any new work from that project.  So, what you describe is exactly as per design.  I'm sorry I didn't comment specifically about that.  I thought the event log message had made it clear.
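Expressed as code, the rule described above looks roughly like this.  It is a simplified model of the behaviour as described in this post, not the client's real logic; the `Task` class, the function name and the second project are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    project: str
    resource: str                    # "cpu" or "gpu"
    suspended_by_user: bool = False


def may_request_work(project, tasks):
    """A project is asked for work only if none of its tasks is user-suspended."""
    return not any(t.suspended_by_user for t in tasks if t.project == project)


tasks = [
    Task("Einstein", "cpu", suspended_by_user=True),   # one suspended CPU task
    Task("Einstein", "gpu"),                           # GPU task still running
    Task("OtherProject", "cpu"),                       # hypothetical second project
]

print(may_request_work("Einstein", tasks))       # False: no CPU or GPU work requested
print(may_request_work("OtherProject", tasks))   # True: other projects are unaffected
```

Note that the resource type never enters the check, which is why a suspended CPU task also stops GPU work from being requested.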

San-Fernando-Valley wrote:
My goal was to SUSPEND CPU work, but to continue GPU work - till the heat would cool off in the evening.

You could achieve that goal (temporarily at least whilst waiting for cooler temperatures) by using local computing preferences in BOINC Manager.  If you click on that, you can find a tab where you can change the % of CPU cores BOINC is allowed to use.  With 8 cores and 2 GPU tasks, you could temporarily change the %cores setting from 100% to 25% and apply the change.  That would have the immediate effect of suspending the 6 CPU cores that were crunching CPU tasks whilst leaving the 2 GPU tasks (with 2 supporting CPU cores) to continue crunching.

The 6 CPU tasks will have stopped crunching and will be listed as 'waiting to run' or something like that.  As GPU tasks finish, the results will be returned and new ones will start.  BOINC will continue to download new work as needed since nothing is actually suspended by the user.  You can't leave it that way forever as BOINC would eventually go into panic mode as the deadline approached.  You could play around with the %cores setting to find a mix of CPU and GPU tasks that gave an acceptable level of heat.  The disadvantage is that tasks from other projects will not be able to run either if the setting is too restrictive.
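To get a feel for what different %cores values would do on the 8 core example, here is a small sketch.  It assumes each GPU task keeps one supporting core, and that the percentage is truncated to whole cores with a minimum of one; that rounding is my assumption, so treat the in-between values as approximate.

```python
def running_cpu_tasks(threads, percent, gpu_tasks, cpu_per_gpu_task=1.0):
    """Estimate how many CPU tasks keep crunching for a given %cores setting."""
    usable = max(int(threads * percent / 100), 1)       # assumed rounding: truncate, minimum 1
    gpu_support = int(gpu_tasks * cpu_per_gpu_task)     # cores kept busy feeding the GPU
    return max(usable - gpu_support, 0)


for percent in (100, 75, 50, 25):
    print(f"{percent}% of cores -> {running_cpu_tasks(8, percent, gpu_tasks=2)} CPU tasks")
```

For 8 cores and 2 GPU tasks this gives 6, 4, 2 and 0 CPU tasks, so the 25% setting is the one that leaves only the GPU tasks running.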

If you want a long term solution you need to be prepared to experiment with settings.  It may well be that you might choose to 'turn off' (through settings) CPU tasks from Einstein completely, just leaving the GPU tasks.  That way, you could allocate some fraction of the remaining 6 cores to CPU tasks from other projects.

San-Fernando-Valley wrote:
I'm wondering whether, when running more than one project at the same time, the Manager differentiates and delivers WUs for the other project.  Maybe I'll try that out if I ever have time ...

If tasks for one project are suspended, only that project is prevented from being asked for new work.

San-Fernando-Valley wrote:
Right now I'm getting a virus detection for an EINSTEIN file - strange - will open a new post/thread!

Extremely likely to be a false positive.  Complain to your virus software vendor :-).

 

Cheers,
Gary.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 409
Credit: 10205953455
RAC: 23235468

G.R.: Thanks again for your successful effort at explaining in detail.

I am aware of the possible settings.

I had hoped to be able to do it in a 'simple' way -- but what is really simple in life?

I appreciate the time you took to explain so thoroughly.

Have a nice day ...
