Looks like there could be a problem with the daily task quota.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 115,702,816,648
RAC: 34,753,358
Topic 217564

As discussed earlier in this message, there is a new data file for GPU tasks, and the tasks being distributed for it are completing close to 4 times faster than previous tasks.  If you have a reasonably modern GPU, and if the tasks for this new data file continue to behave in the same way, you will likely get to the stage where you reach your daily quota.  You could easily run out of work well before any new daily allocation kicks in.

In the past, there have been milder versions of this problem - for example, check out this thread.  Apart from that discussion, there is a further link there to an even earlier example.  It was really only the most productive GPUs that were seeing the issue.  In both those examples, there were suggested workarounds (the creation of extra virtual CPUs) as a way of increasing the daily quota and that provided a way out of the issue at the time.

This time it's likely to be a much bigger problem unless the project admins have drastically increased the daily GPU task limits.  I don't know if that's been done so I thought I'd just try to give people a heads-up about it.  If you start to see messages about "no work sent - reached daily quota of nnnn tasks" (or something similar) then you will understand why.
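
For a rough feel of when the quota might bite, here is a minimal sketch in Python. The function name, the per-task times and the quota value are illustrative assumptions, not figures published by the project. Only the idea (quota divided by completion rate) comes from the discussion above.

```python
def hours_until_quota_used(daily_quota, minutes_per_task, concurrent_tasks=1):
    """Hours of crunching needed to finish `daily_quota` tasks.

    All inputs are hypothetical; plug in the numbers for your own host.
    """
    if daily_quota < 0 or minutes_per_task <= 0 or concurrent_tasks <= 0:
        raise ValueError("quota must be >= 0; task time and concurrency must be > 0")
    tasks_per_hour = concurrent_tasks * 60.0 / minutes_per_task
    return daily_quota / tasks_per_hour


# Hypothetical host: quota of 512 tasks, one task at a time.
# Old tasks take 8 minutes; new tasks take 2 minutes (4 times faster).
before = hours_until_quota_used(512, 8.0)
after = hours_until_quota_used(512, 2.0)
print(f"old tasks: quota lasts {before:.1f} h; new tasks: {after:.1f} h")
```

If the result drops below 24 hours, the host would finish its whole daily allocation before the next one becomes available.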

Apart from problems that are quota related, the big increase in downloads and uploads to support these fast tasks is bound to put extra stress on the servers.  I suspect this may well lead to communications problems/outages or other forms of 'overload' difficulties.  It might be a bit of a rough ride for a while :-).

 EDIT: On thinking more about what could be causing these extremely fast tasks, I'm wondering if perhaps they have inadvertently been created to have the work content of CPU style tasks.  When GPU tasks were originally created, they were made by 'bundling' the content of 5 CPU tasks and so the credit award was set at 5 x 693 = 3465.   Since the new tasks are running not far short of 5 times quicker than usual, perhaps that's the cause and perhaps it might be rectified quickly and things can get back to normal.

 

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4,885
Credit: 18,416,769,057
RAC: 5,850,533

How do you determine what your daily quota of tasks is at Einstein?  The project doesn't have the normal Application Details page on a host like all my other projects.  That is where I find the daily quota figures for a host for Seti, MW and GPUGrid.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 115,702,816,648
RAC: 34,753,358

I don't know for sure.  I did a search for daily quota and came across this particular message from Bernd where he gives a calculation (complete with a wrong answer) that suggests 32*8 per co-processor instance (which should be 256 and not 128 -- maybe it should have been 32*4 which would give the 128 answer).  On top of that you add 32 per CPU core.  The really strange thing is that very message also refers to LATeah2003L.dat and the fact that they are fast running - just like the current lot.
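
The calculation described above can be sketched as follows. This only illustrates the formula as the post reads it (32 per CPU core, plus 32 times a multiplier per co-processor instance). The function and parameter names are made up for the example, and the multiplier is left as a parameter because the quoted message is ambiguous between 8 and 4.

```python
def daily_quota(cpu_cores, gpus, per_core=32, gpu_multiplier=8):
    """Assumed quota formula: per_core for each CPU core, plus
    per_core * gpu_multiplier for each co-processor instance."""
    return per_core * cpu_cores + per_core * gpu_multiplier * gpus


print(daily_quota(cpu_cores=0, gpus=1))                    # 256 (32*8)
print(daily_quota(cpu_cores=0, gpus=1, gpu_multiplier=4))  # 128 (32*4)
print(daily_quota(cpu_cores=8, gpus=1))                    # 512, hypothetical 8-core host
```

Whether the server here actually applies these numbers is not confirmed; the event log message on your own host is the more reliable source.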

Perhaps these tasks have all been done before???

 

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,920,472,634
RAC: 956,393

I'm slightly doubtful about this conversation - I think we might be confusing two separate concepts.

For projects using the current standard BOINC server code, we have two limiters:

Maximum tasks in progress. This keeps a lid on the number of tasks cached by each host, and hence keeps the size of the project's task database under control. It applies to every computer attached to the project.

Maximum daily quota. This keeps individual runaway rogue computers under control - if a computer starts throwing errors in quick succession, the daily quota enforces a breathing space and hopefully prompts the owner to investigate and rectify the problem. As soon as the host starts returning valid work, its individual quota is allowed to start rising again.

Under the circumstances described, 'maximum tasks in progress' would be the tool of choice. But of course, Einstein doesn't use the current server code, and I don't know if it's available here.

There's another problem with the old server code in use here: it can't track estimated runtime across applications. We're still relying on the 'one size fits all' Duration Correction Factor. On my machine, the 2003L tasks are estimating at 30 minutes, with DCF controlled by the BRP tasks running on my intel GPU. But the 2003L tasks are actually finishing in 7:30. Unless the <rsc_fpops_est> is controlled by the workunit generator, hosts without DCF control from another app will request more, and more, and more work until they stabilise with four times the normal cache. And then they'll go 'pop', and run into deadline trouble, when normal-length tasks start to work their way to the top of the queue again.
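
The cache growth described above can be illustrated with a small Python sketch. It is a deliberate simplification, not the real BOINC client work-fetch logic: it assumes the client simply asks for enough tasks to fill a cache of a given number of days at the current runtime estimate. The function names and the one-day cache are hypothetical; the 30 minute and 7:30 figures are the ones quoted in the post.

```python
def tasks_requested(cache_days, estimated_minutes, concurrent_tasks=1):
    """Tasks needed to fill the cache if each task takes `estimated_minutes`."""
    cache_minutes = cache_days * 24 * 60 * concurrent_tasks
    return int(cache_minutes // estimated_minutes)


def real_days_of_work(task_count, actual_minutes, concurrent_tasks=1):
    """Days actually needed to crunch `task_count` tasks of `actual_minutes` each."""
    return task_count * actual_minutes / (24 * 60 * concurrent_tasks)


cache_days = 1.0                             # hypothetical cache setting
normal = tasks_requested(cache_days, 30.0)   # estimate stays at 30 minutes
fast = tasks_requested(cache_days, 7.5)      # estimate has dropped to 7:30
print(normal, fast)                          # 48 192

# If normal-length (30 minute) tasks replace the fast ones in that cache:
print(real_days_of_work(fast, 30.0))         # 4.0
```

In this simplified model, a one-day cache filled at the fast estimate turns into four days of real work once normal-length tasks come back, which is where the deadline trouble would come from.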

It's the dynamic control of cache sizes that we need now, not a mad dash through the mornings followed by a drought in the evenings.

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3,311,333,208
RAC: 550,357

I am getting an event message saying it's 512 daily tasks, and I'm already through nearly 200 of them as my queue is cut short at 339. That is with an RX580.

JugNut
Joined: 27 Feb 15
Posts: 12
Credit: 1,135,288,353
RAC: 0

One of my PCs has a single GTX 980 Ti that has now reached its daily quota of 480 tasks. Another PC with a pair of 1080 Tis has reached its daily quota of 768 tasks. They then get deferred for 8.5 hours.

Both PCs will be out of work within the next 5 hours.

 

PS: I only crunch Einstein on GPUs.

 

DanNeely
Joined: 4 Sep 05
Posts: 1,364
Credit: 3,562,358,667
RAC: 0

This isn't precisely the problem I had in mind when I set up a several-day work queue, but hopefully it'll give me enough buffer to keep going until we switch to a less problematic data file.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 115,702,816,648
RAC: 34,753,358

I thought I should spell this out for everybody's benefit.

If you've been smart enough to set up a multi-day cache before 2003L came along, you won't be getting any work supply refusals ... yet :-).  You'll still be fetching the new tasks in normal quantities only, until all the cached older (and slower) tasks are finished.  The new tasks will still show the previous (and 4+ times longer) estimate until that happens.

When you get to crunching the first of the new tasks, the estimate will be corrected downwards and you will start to fetch lots more tasks.  That's OK (for a while) as the new tasks will be crunching quickly anyway.  This is when daily quota limits might affect work fetch.

You should keep an eye out for whatever follows 2003L.  As soon as you get any, you should run one to see how they compare (or watch the boards because it's bound to be reported).  If they turn out to have the 'normal' elapsed time for crunching, you'd better find out about that quickly so that you can reduce the work cache to about a fifth of what you really want.  If you don't immediately do that, you will likely end up with far more than you can handle by the time those tasks reach the top of the queue and the estimate for them blows out by a rather large factor.
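
As a rough illustration of the "about a fifth" advice, the sketch below scales a cache setting by the ratio between normal and fast task times. The function name and the example values are hypothetical; use the ratio you actually observe on your own host.

```python
def reduced_cache_days(desired_days, slowdown_factor):
    """Cache setting to use while runtime estimates are still based on the
    fast tasks, so the real workload ends up close to `desired_days`.

    `slowdown_factor` is how many times longer normal tasks take than the
    fast ones (about 5 in the advice above; an assumption for any given host).
    """
    if slowdown_factor <= 0:
        raise ValueError("slowdown_factor must be positive")
    return desired_days / slowdown_factor


# Hypothetical: you really want 2 days of work and normal tasks run 5x longer.
print(reduced_cache_days(2.0, 5.0))  # 0.4
```

Once the estimates for the normal-length tasks have corrected themselves, the cache setting can be raised back to the value you really want.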

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 115,702,816,648
RAC: 34,753,358

Richard Haselgrove wrote:
I'm slightly doubtful about this conversation - I think we might be confusing two separate concepts.
....

It's probably a lot more complicated than just the difference between current server code and the much older code that Einstein is based on.

The reason for that is the Devs here added (and have continued to add) a fair bit of local customisation to whatever version they originally started with.  It's been mentioned a few times over the years that the reason for not upgrading to a newer code base was the volume of local customisations that would need to be ported as well.  That was seen to be a much bigger task than just making further additions/modifications to the original base as and when needed for a particular purpose.

When something like the current big change in crunch time comes along and I can see that there will be problems for the inexperienced, I just feel compelled to provide some sort of explanation/warning about what might happen as a result.  I understand that very few actively read the boards and probably even fewer understand the complexities of what is happening, but I try anyway.

There are several others who contribute regularly as well and I'm always pleased to see their comments and responses to people asking for help.  In particular, because of your great understanding of the issues, your thoughts and suggestions are most welcome.

 

Cheers,
Gary.

Phillip Spencer
Joined: 7 Oct 11
Posts: 10
Credit: 51,483,402
RAC: 0

Gary Roberts wrote:

When something like the current big change in crunch time comes along and I can see that there will be problems for the inexperienced, I just feel compelled to provide some sort of explanation/warning about what might happen as a result.  I understand that very few actively read the boards and probably even fewer understand the complexities of what is happening, but I try anyway.

There are several others who contribute regularly as well and I'm always pleased to see their comments and responses to people asking for help.  In particular, because of your great understanding of the issues, your thoughts and suggestions are most welcome.

 

I certainly don't understand many of the complexities but I do appreciate the comments and advice that you and others such as Richard H provide.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1,702,989,778
RAC: 0

Now would be a good time to check whether a host really needs a maxed-out work cache anymore... :-D
