Looks like there could be a problem with the daily task quota.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117837745027

RAC: 34781703

18 Dec 2018 6:35:14 UTC

Topic 217564

As discussed earlier in this message there is a new data file for GPU tasks and the tasks being distributed for it are completing close to 4 times faster than previous tasks. If you have a reasonably modern GPU, and if the tasks for this new data file continue to behave in the same way, you will likely get to the stage where you reach your daily quota. You could easily run out of work well before any new daily allocation kicks in.

In the past, there have been milder versions of this problem - for example, check out this thread. Apart from that discussion, there is a further link there to an even earlier example. It was really only the most productive GPUs that were seeing the issue. In both those examples, there were suggested workarounds (the creation of extra virtual CPUs) as a way of increasing the daily quota and that provided a way out of the issue at the time.

This time it's likely to be a much bigger problem unless the project admins have drastically increased the daily GPU task limits. I don't know if that's been done so I thought I'd just try to give people a heads-up about it. If you start to see messages about "no work sent - reached daily quota of nnnn tasks" (or something similar) then you will understand why.
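To make the concern concrete, here is a rough sketch of how long a host takes to crunch through a full day's allocation. The function name and all the numbers (a quota of 480 tasks, 8-minute versus 2-minute tasks, one task at a time) are illustrative assumptions, not figures published by the project.

```python
def hours_until_quota_reached(daily_quota, task_minutes, concurrent_tasks=1):
    """Hours of crunching needed to finish `daily_quota` tasks.

    Hypothetical helper: assumes every task takes `task_minutes` and that
    `concurrent_tasks` tasks run side by side on the GPU(s).
    """
    tasks_per_hour = concurrent_tasks * 60.0 / task_minutes
    return daily_quota / tasks_per_hour


# Assumed example values, chosen only to show the effect of a 4x speed-up.
slow = hours_until_quota_reached(480, 8.0)  # 64.0 hours: quota is never the limit
fast = hours_until_quota_reached(480, 2.0)  # 16.0 hours: idle for the rest of the day

print(f"old tasks: {slow:.1f} h, new tasks: {fast:.1f} h")
```

If the result is below 24 hours, the host can finish its whole daily allocation before the next one becomes available.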

Apart from problems that are quota related, the big increase in downloads and uploads to support these fast tasks is bound to put extra stress on the servers. I suspect this may well lead to communications problems/outages or other forms of 'overload' difficulties. It might be a bit of a rough ride for a while :-).

EDIT: On thinking more about what could be causing these extremely fast tasks, I'm wondering if perhaps they have inadvertently been created to have the work content of CPU style tasks. When GPU tasks were originally created, they were made by 'bundling' the content of 5 CPU tasks and so the credit award was set at 5 x 693 = 3465. Since the new tasks are running not far short of 5 times quicker than usual, perhaps that's the cause and perhaps it might be rectified quickly and things can get back to normal.

Cheers,
Gary.

Keith Myers

Joined: 11 Feb 11

Posts: 4976

Credit: 18784177396

RAC: 7355114

18 Dec 2018 9:07:40 UTC

Message 168317

How do you determine what your daily quota of tasks is at Einstein? The project doesn't have the normal Application Details page for a host like all my other projects do. That is where I find the daily quota figures for a host for Seti, MW and GPUGrid.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117837745027

RAC: 34781703

18 Dec 2018 9:48:35 UTC

Message 168318

I don't know for sure. I searched for "daily quota" and came across this particular message from Bernd, where he gives a calculation (complete with a wrong answer) that suggests 32*8 per co-processor instance. That should be 256, not 128 -- maybe it should have been 32*4, which would give the 128 answer. On top of that, you add 32 per CPU core. The really strange thing is that the very same message also refers to LATeah2003L.dat and the fact that its tasks are fast running - just like the current lot.
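As a sketch only, the arithmetic from that message can be written out as below. The function name and the example host (1 GPU, 4 CPU cores) are assumptions for illustration, and I can't confirm that the server still uses these multipliers.

```python
CPU_TASKS_PER_CORE = 32  # "32 per CPU core", as quoted from the older message


def estimated_daily_quota(gpu_instances, cpu_cores, gpu_multiplier=8):
    """Estimate a host's daily quota as 32 * multiplier per GPU plus 32 per core.

    `gpu_multiplier` is 8 for the 32*8 reading and 4 for the 32*4 reading;
    which one the server really applies is not known here.
    """
    per_gpu = 32 * gpu_multiplier
    return gpu_instances * per_gpu + cpu_cores * CPU_TASKS_PER_CORE


# Hypothetical host with one GPU and four CPU cores.
print(estimated_daily_quota(1, 4, gpu_multiplier=8))  # 256 + 128 = 384
print(estimated_daily_quota(1, 4, gpu_multiplier=4))  # 128 + 128 = 256
```

Comparing the result with the figure in the "reached daily quota" event message for a host would show which reading, if either, applies.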

Perhaps these tasks have all been done before???

Cheers,
Gary.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2962069208

RAC: 697047

18 Dec 2018 10:55:35 UTC

Message 168321

I'm slightly doubtful about this conversation - I think we might be confusing two separate concepts.

For projects using the current standard BOINC server code, we have two limiters:

Maximum tasks in progress. This keeps a lid on the number of tasks cached by each host, and hence keeps the size of the project's task database under control. It applies to every computer attached to the project.

Maximum daily quota. This keeps individual runaway rogue computers under control - if a computer starts throwing errors in quick succession, the daily quota enforces a breathing space and hopefully prompts the owner to investigate and rectify the problem. As soon as the host starts returning valid work, its individual quota is allowed to start rising again.

Under the circumstances described, 'maximum tasks in progress' would be the tool of choice. But of course, Einstein doesn't use the current server code, and I don't know if it's available here.

There's another problem with the old server code in use here: it can't track estimated runtime across applications. We're still relying on the 'one size fits all' Duration Correction Factor. On my machine, the 2003L tasks are estimating at 30 minutes, with DCF controlled by the BRP tasks running on my Intel GPU. But the 2003L tasks are actually finishing in 7:30. Unless the <rsc_fpops_est> is controlled by the workunit generator, hosts without DCF control from another app will request more, and more, and more work until they stabilise with four times the normal cache. And then they'll go 'pop', and run into deadline trouble, when normal-length tasks start to work their way to the top of the queue again.
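A much simplified model of that over-fetch, for illustration: the client fills its cache according to the estimated runtime, so a shorter estimate means more tasks on hand. The function names and the 2-day cache are assumptions, and the real BOINC work-fetch logic is considerably more involved than this.

```python
MINUTES_PER_DAY = 24 * 60


def tasks_to_fill_cache(cache_days, estimated_minutes):
    """Tasks the client would hold if each one is estimated at `estimated_minutes`."""
    return int(cache_days * MINUTES_PER_DAY // estimated_minutes)


def real_days_of_work(task_count, actual_minutes):
    """Days needed to crunch `task_count` tasks that really take `actual_minutes`."""
    return task_count * actual_minutes / MINUTES_PER_DAY


# Assumed 2-day cache, using the 30-minute and 7:30 figures from the post.
normal = tasks_to_fill_cache(2, 30.0)   # 96 tasks at the usual estimate
bloated = tasks_to_fill_cache(2, 7.5)   # 384 tasks once the estimate drops

# If normal-length tasks return, those 384 slots represent 8 days of work.
print(normal, bloated, real_days_of_work(bloated, 30.0))
```

In this model the nominal 2-day cache turns into 8 days of real work, which is where the deadline trouble would come from.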

It's the dynamic control of cache sizes that we need now, not a mad dash through the mornings followed by a drought in the evenings.

mmonnin

Joined: 29 May 16

Posts: 291

Credit: 3435646540

RAC: 4123566

18 Dec 2018 11:35:00 UTC

Message 168323

I am getting an event message saying it's 512 daily tasks, and I'm already through nearly 200 of them as my queue is cut short at 339. That is with an RX580.

JugNut

Joined: 27 Feb 15

Posts: 12

Credit: 1135288353

RAC: 0

18 Dec 2018 16:30:19 UTC

Message 168339

One of my PCs has a single GTX 980 Ti that has now reached its daily quota of 480 tasks. Another PC with a pair of 1080 Tis has reached its daily quota of 768 tasks. They then get deferred for 8.5 hours.

Both PCs will be out of work within the next 5 hours.

PS: I only crunch Einstein on GPUs.

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

19 Dec 2018 5:07:02 UTC

Message 168353

This isn't precisely the problem I had in mind when I set up a several-day work queue, but hopefully it'll give me enough buffer to keep going until we switch to a less problematic data file.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117837745027

RAC: 34781703

19 Dec 2018 5:44:42 UTC

Message 168354

I thought I should spell this out for everybody's benefit.

If you've been smart enough to set up a multi-day cache before 2003L came along, you won't be getting any work supply refusals ... yet :-). You'll still be fetching the new tasks in normal quantities only, until all the cached older (and slower) tasks are finished. The new tasks will still show the previous (and 4+ times longer) estimate until that happens.

When you get to crunching the first of the new tasks, the estimate will be corrected downwards and you will start to fetch lots more tasks. That's OK (for a while) as the new tasks will be crunching quickly anyway. This is when daily quota limits might affect work fetch.

You should keep an eye out for whatever follows 2003L. As soon as you get any, you should run one to see how they compare (or watch the boards because it's bound to be reported). If they turn out to have the 'normal' elapsed time for crunching, you'd better find out about that quickly so that you can reduce the work cache to about a fifth of what you really want. If you don't immediately do that, you will likely end up with far more than you can handle by the time those tasks reach the top of the queue and the estimate for them blows out by a rather large factor.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117837745027

RAC: 34781703

19 Dec 2018 6:42:57 UTC

Message 168355 in response to message 168321

Richard Haselgrove wrote:

I'm slightly doubtful about this conversation - I think we might be confusing two separate concepts.
....

It's probably a lot more complicated than just the difference between current server code and the much older code that Einstein is based on.

The reason for that is the Devs here added (and have continued to add) a fair bit of local customisation to whatever version they originally started with. It's been mentioned a few times over the years that the reason for not upgrading to a newer code base was the volume of local customisations that would need to be ported as well. That was seen to be a much bigger task than just making further additions/modifications to the original base as and when needed for a particular purpose.

When something like the current big change in crunch time comes along and I can see that there will be problems for the inexperienced, I just feel compelled to provide some sort of explanation/warning about what might happen as a result. I understand that very few actively read the boards and probably even fewer understand the complexities of what is happening, but I try anyway.

There are several others who contribute regularly as well and I'm always pleased to see their comments and responses to people asking for help. In particular, because of your great understanding of the issues, your thoughts and suggestions are most welcome.

Cheers,
Gary.

Phillip Spencer

Joined: 7 Oct 11

Posts: 10

Credit: 51483402

RAC: 0

21 Dec 2018 11:37:25 UTC

Message 168459

Gary Roberts wrote:

When something like the current big change in crunch time comes along and I can see that there will be problems for the inexperienced, I just feel compelled to provide some sort of explanation/warning about what might happen as a result. I understand that very few actively read the boards and probably even fewer understand the complexities of what is happening, but I try anyway.

There are several others who contribute regularly as well and I'm always pleased to see their comments and responses to people asking for help. In particular, because of your great understanding of the issues, your thoughts and suggestions are most welcome.

I certainly don't understand many of the complexities but I do appreciate the comments and advice that you and others such as Richard H provide.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

21 Dec 2018 11:39:33 UTC

Message 168461

Now would be a good time to check whether a host really needs a maxed-out work cache any more...
