Suddenly getting huge number of tasks sent to my system

hadron
Joined: 27 Jan 23
Posts: 62
Credit: 101735690
RAC: 590006
Topic 229717

OS: openSUSE Leap 15.4

Boinc version: 7.18.1

NOTE: No comments about the Boinc version, please. I know it was released for Android only, but SUSE seems to have made it work under Linux. It has been working without problems here for ages.

CPU: 12-core AMD Ryzen 9 5900X; of 24 threads, 22 are allotted to Boinc.

Other projects: Cosmology@H, LHC@H running Atlas and Theory

All of a sudden, I have been getting a huge number of FGRP5 tasks assigned to my system. Right now, I am up to 226 tasks "in progress". Until this is resolved, I've turned FGRP5 off in my project preferences.

I was limiting Einstein@H tasks in an app_config.xml file as follows:

<app_config>
   <project_max_concurrent>8</project_max_concurrent>
</app_config>

Everything was working fine with this; I would have up to 8 tasks running, with a small number (up to 6 or so) waiting to start. Not knowing if CPU-based gravitational wave tasks will be returning, I set up for the possibility that they will. This involves changes to the app_config file, which at present looks like this:


<app_config>
   <app>
       <name>hsgamma_FGRP5</name>
       <max_concurrent>6</max_concurrent>
   </app>
   <project_max_concurrent>8</project_max_concurrent>
</app_config>
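
For anyone wanting to check which limits a given app_config.xml actually sets, here is a minimal sketch in Python. It only reads the XML shown above; the function name `summarize_limits` and the `uses_both` flag are made up for illustration and are not part of BOINC. The flag simply records whether a per-app max_concurrent and a project_max_concurrent appear together, which is the combination discussed in the replies below.

```python
import xml.etree.ElementTree as ET

# The app_config.xml contents quoted in the post above.
APP_CONFIG = """<app_config>
   <app>
       <name>hsgamma_FGRP5</name>
       <max_concurrent>6</max_concurrent>
   </app>
   <project_max_concurrent>8</project_max_concurrent>
</app_config>"""


def summarize_limits(xml_text):
    """Collect the concurrency limits found in an app_config document.

    Hypothetical helper for illustration only; BOINC does not ship it.
    """
    root = ET.fromstring(xml_text)
    project_limit = root.findtext("project_max_concurrent")
    per_app = {}
    for app in root.findall("app"):
        name = app.findtext("name")
        limit = app.findtext("max_concurrent")
        if name is not None and limit is not None:
            per_app[name.strip()] = int(limit)
    return {
        "project_max_concurrent": (
            int(project_limit) if project_limit is not None else None
        ),
        "per_app": per_app,
        # True when both kinds of limit are present in the same file.
        "uses_both": project_limit is not None and bool(per_app),
    }


summary = summarize_limits(APP_CONFIG)
print(summary)
```

Run against the earlier, simpler file (only project_max_concurrent), the same helper reports `uses_both` as False.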

The sudden deluge of assigned tasks began very soon after I made that change.

It will be greatly appreciated if anyone can tell me why this is happening, and what I need to do to stop it. With 220 tasks waiting to run right now, I doubt they will all be able to complete before their expiry time. I can, of course, clear them out faster by limiting the number of running Cosmology and LHC tasks, but I really do want to make sure this situation will not return in the future.
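
A back-of-the-envelope way to judge whether a backlog like this can clear in time is sketched below. The task count (226) and the limit of 8 concurrent tasks come from this post, and 20 is the figure mentioned later in the thread; the 6 hours per task is purely a placeholder assumption, since no runtime is given anywhere here. Substitute the runtime you actually observe and compare the result with the deadline shown for your tasks.

```python
import math


def days_to_clear(tasks, concurrent, hours_per_task):
    """Rough wall-clock days needed to run `tasks` tasks.

    Assumes every task takes the same time and that exactly
    `concurrent` tasks run at once, which is a simplification.
    """
    batches = math.ceil(tasks / concurrent)
    return batches * hours_per_task / 24.0


# hours_per_task=6.0 is a hypothetical value, not a measured one.
at_8_slots = days_to_clear(226, 8, 6.0)
at_20_slots = days_to_clear(226, 20, 6.0)
print(at_8_slots, at_20_slots)
```

The point of the sketch is only the shape of the calculation: raising the number of tasks allowed to run at once shortens the time to clear the queue roughly in proportion.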

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2959246157
RAC: 707345

I've looked (briefly - it's late in the evening here) at your host 13083685. It shows that it was issued with new tasks repeatedly, at one-minute intervals.

This is an obscure problem which I've seen a few times before, and which mainly seems to manifest itself on the Einstein project - but the origin is on your machine.

If you study the event log using BOINC Manager, I expect you'll find that your client requested new work again and again and again, asking for the same amount each time. This being Einstein, it will have received new tasks every time.

You've done the right thing by preventing the client requesting any more - 'no new tasks' would have been enough. Let them run through, but don't be hard on yourself - this problem has existed for years, and Einstein will survive if you can't quite finish them.

I find that making a small change to your client configuration, like setting an extra Event Log option in an attempt to work out what's going on, is enough to cure it - which is intensely frustrating. If you can work out what's going wrong, please tell us - but otherwise, just stop work fetch if you ever see it again.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117695532482
RAC: 35105633

Richard Haselgrove wrote:
I find that making a small change to your client configuration, like setting an extra Event Log option in an attempt to work out what's going on, is enough to cure it - which is intensely frustrating. If you can work out what's going wrong, please tell us - but otherwise, just stop work fetch if you ever see it again

Hi Richard, thanks for your insights.

This problem has indeed shown up quite a few times at Einstein.  I've never experienced it myself but a search using "max_concurrent" shows lots of examples where others have.

The release notes for BOINC 7.20.0 contain an entry:-

  • Client: fix work-fetch logic when max concurrent limits are used

It was my understanding that this was a genuine BOINC problem (not just specific to Einstein) that was now fixed.  Are you implying that the issue hasn't been resolved?

My impression was that the problem tended to be triggered when max_concurrent was used in addition to project_max_concurrent, which seems to line up with what the OP saw.

I have no issue with the OP desiring to use an older BOINC version but if both max_concurrent options are also needed, wouldn't it be best to upgrade to 7.20.x (or later) if that truly does fix the issue?

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18751705606
RAC: 7106856

I concur with Gary's assessment.  The problem is the older 7.18.1 client which does not have the fix for the recurring work request problem when utilizing the project max and max concurrent variables.

If the OP wishes to continue using these project task limit variables, he needs to update to the newer clients which have fixed this issue.

hadron
Joined: 27 Jan 23
Posts: 62
Credit: 101735690
RAC: 590006

Thanks to all for the replies. I'm upgrading sometime this week to Leap 15.5, which has Boinc 7.22.1, so from what has been said, this should resolve the issue. In the meantime, I'm going to restrict Cosmology and LHC to 1 running task each, and let Einstein have the remaining 20. No new tasks from Einstein until after the update.

Once all that is done, I'll post again to let everyone know how it went -- using the existing app_config, of course. If it fails to resolve the issue, I'll just clear out all Einstein tasks, remove the project limit, restart Einstein, and enable task fetch again.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117695532482
RAC: 35105633

hadron wrote:
Once all that is done, I'll post again to let everyone know how it went ...

Thanks very much for offering to report back.  It's very important to have confirmation that the issue truly is resolved.

Good luck with all the upgrades!

Cheers,
Gary.

Eugene Stemple
Joined: 9 Feb 11
Posts: 67
Credit: 377080635
RAC: 587436

Keith Myers wrote:

I concur with Gary's assessment.  The problem is the older 7.18.1 client which does not have the fix for the recurring work request problem when utilizing the project max and max concurrent variables.

I am running a Debian (12.0) system, which was upgraded from 11.7.0 on June 10. That release provided the boinc client/manager at version 7.20.5, and I am also using project max and concurrent settings for Einstein, with a mix of gpu and cpu tasks to make things even worse. The 7.20.5 version gave me exactly the symptoms reported by the OP, i.e. seemingly endless work fetches until a task limit was reached, leaving far more tasks than could be completed by their deadlines. I have reverted to the 7.14.2 version, which has been stable for me for many years. Whatever shortcomings that older version may have, they are not obvious, and the work cache limits are observed even if somewhat wider than specified. (That is the effect of DCF being corrupted by the gpu/cpu task mix.) I've set a work buffer size of 0.3+0.1 day.

hadron
Joined: 27 Jan 23
Posts: 62
Credit: 101735690
RAC: 590006

Update:

For a couple of reasons which aren't important here, I decided to defer the system upgrade to next week. Therefore, I let Boinc clear out all pending tasks for all projects (Cosmology, Einstein, LHC and Rosetta). Then I cleared the <project_max_concurrent> out of the Einstein app_config file, did a reset on Einstein, and set all my app_config files back to what they were before this fiasco started. Then a restart of Boinc to read in the new settings, set all projects back to fetch new tasks, and I thought all would return to how it was before.

How wrong I was. I went off to do other things, then came back after about an hour to find that I was now in possession of 700 Einstein tasks in progress. The system upgrade is now on hold until I can figure this out. If I cannot, then I will simply do the upgrade including the Boinc version upgrade, and hope for the best.

Back to the drawing board. Once all the Einstein tasks are cleared out, since simply resetting Einstein didn't work, I will delete the project and add it back. Most likely this won't happen until I've had a chance to restart Boinc again (LHC tasks are notorious for bombing out if they are stopped or suspended while in progress, and I really don't want to lose them).

Obviously, something other than just the <project_max_concurrent> is at fault here, but I have no idea what that might be -- there is nothing anywhere to give me any hint.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18751705606
RAC: 7106856

No, the simple reason for the excessive work grab was your project reset.  You threw away all the DCF calculations you had previously established for the project and application.  After the restart, Boinc had no clue how fast your host is, and it will not know until it gets some validated tasks again from which to calculate DCF and how many tasks you can complete within a given timeframe.  Boinc had to start from scratch again.
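
To picture why a lost DCF (duration correction factor) can inflate a work fetch, here is a deliberately simplified sketch. It is not BOINC's actual work-fetch code: it just assumes that a request for some number of seconds of work gets filled with tasks whose estimated runtime is a raw estimate multiplied by DCF. All the numbers (8 slots, half a day of buffer, a one-hour raw estimate, a learned DCF of 4.0) are hypothetical.

```python
import math


def tasks_to_fill(request_seconds, raw_estimate_seconds, dcf):
    """Tasks needed to cover a work request in this toy model.

    The estimated runtime per task is taken to be the raw
    estimate scaled by DCF; the real client and scheduler
    logic is considerably more involved.
    """
    per_task = raw_estimate_seconds * dcf
    return math.ceil(request_seconds / per_task)


# Hypothetical request: 8 slots, each to be kept busy for half a day.
request = 8 * 12 * 3600

after_reset = tasks_to_fill(request, 3600, dcf=1.0)   # DCF back at its default
with_history = tasks_to_fill(request, 3600, dcf=4.0)  # DCF learned from past tasks
print(after_reset, with_history)
```

In this toy model the reset host is sent four times as many tasks for the same request, because each task looks four times shorter than it really is.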

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2959246157
RAC: 707345

Once again, the Event Log is your friend. Look back either in the Event Log itself, or in the persistent file copies stdoutdae.txt and stdoutdae.old, to try to work out when and how the tasks were fetched.

If you see a massive single fetch, then a smaller one, then an even smaller one, and so on until they diminish to 'tiny', and finally the work requests stop entirely, then Keith's explanation is the correct one.

If you see a reasonable number being fetched to start with, but then the same number being requested and delivered repeatedly at intervals of just over a minute, then I refer you to the answer I gave to your initial post.
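
The two patterns described above can be told apart with a short script. The sketch below is only that, a sketch: it assumes log lines of the form "DD-Mon-YYYY HH:MM:SS [Project] Scheduler request completed: got N new tasks", so adjust the pattern if your log looks different. The function names, the two-minute gap threshold and the sample lines are all invented for illustration.

```python
import re
from datetime import datetime

# Assumed line layout; change the pattern if your log differs.
LINE_RE = re.compile(
    r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.+?)\] "
    r"Scheduler request completed: got (\d+) new tasks"
)


def parse_fetches(lines, project):
    """Return (timestamp, task_count) pairs for one project."""
    fetches = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == project:
            when = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")
            fetches.append((when, int(match.group(3))))
    return fetches


def classify(fetches, max_gap_seconds=120):
    """Label a run of fetches as 'diminishing', 'repeating' or 'unclear'.

    max_gap_seconds is an arbitrary illustrative threshold.
    """
    hits = [(t, n) for t, n in fetches if n > 0]
    if len(hits) < 3:
        return "too few fetches to tell"
    counts = [n for _, n in hits]
    gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(hits, hits[1:])]
    if len(set(counts)) == 1 and all(g <= max_gap_seconds for g in gaps):
        return "repeating"
    if all(a > b for a, b in zip(counts, counts[1:])):
        return "diminishing"
    return "unclear"


# Invented sample lines, not taken from any real log.
repeating_log = [
    "27-Jun-2023 21:00:05 [Einstein@Home] Scheduler request completed: got 12 new tasks",
    "27-Jun-2023 21:01:10 [Einstein@Home] Sending scheduler request: To fetch work.",
    "27-Jun-2023 21:01:12 [Einstein@Home] Scheduler request completed: got 12 new tasks",
    "27-Jun-2023 21:02:18 [Einstein@Home] Scheduler request completed: got 12 new tasks",
]
diminishing_log = [
    "27-Jun-2023 21:00:05 [Einstein@Home] Scheduler request completed: got 40 new tasks",
    "27-Jun-2023 21:05:30 [Einstein@Home] Scheduler request completed: got 15 new tasks",
    "27-Jun-2023 21:11:02 [Einstein@Home] Scheduler request completed: got 4 new tasks",
]

repeating_result = classify(parse_fetches(repeating_log, "Einstein@Home"))
diminishing_result = classify(parse_fetches(diminishing_log, "Einstein@Home"))
print(repeating_result, diminishing_result)
```

To try it on a real log, pass the lines of stdoutdae.txt to `parse_fetches` instead of the sample lists.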

But I agree with Keith on one thing: project resets are very, very rarely needed. Usually, simply stopping and re-starting the BOINC client is the most drastic step needed.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18751705606
RAC: 7106856

The only reason I have ever needed to do a project reset was to try and clear the database of "ghost" tasks.

Unfortunately, that is broken on most projects now; a reset no longer clears the ghost tasks.  So there is no point in ever using a project reset again.
