one machine not requesting work

archae86
Joined: 6 Dec 05
Posts: 3,107
Credit: 6,205,083,648
RAC: 1,982,140
Topic 228039

While some things are odd at Einstein right now, an oddity I don't understand is that one of my three machines has not requested work for the past few hours. At first I thought it was because it had too many tasks with uploads pending. But over time the other two machines finished enough tasks to have even more uploads pending, and they continue to request work.

I've reviewed the preferences comparison page (both project and computing) and don't see a mismatch among my three machines.

Yet whenever I click "Update project", the log includes this discouraging text:

[send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
[send] ATI: req 0.00 sec, 0.00 instances; est delay 0.00
[send] work_req_seconds: 0.00 secs

No, I don't have suspended tasks. I did have a time-of-day "Compute only during a particular period each day" restriction, but I have disabled it the same way I did for the other two machines.

I even tried enabling CPU tasks and ticking an application type, but I still get a zero request for both CPU and GPU.

I'm all ears for pet ways to get into (and out of) this condition.

Thanks

 

Keith Myers
Joined: 11 Feb 11
Posts: 3,070
Credit: 9,543,827,049
RAC: 24,320,073

Try the logging option work_fetch_debug.
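The flag can also be set by hand in cc_config.xml in the BOINC data directory. A minimal sketch, assuming no cc_config.xml exists yet (if one does, merge the <log_flags> entry into it rather than replacing the file):

```xml
<cc_config>
  <log_flags>
    <!-- log the client's reasoning each time it decides whether to fetch work -->
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>
```

After saving, re-read the config files from the Manager (or restart the client) for the flag to take effect.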

 

archae86
Joined: 6 Dec 05
Posts: 3,107
Credit: 6,205,083,648
RAC: 1,982,140

Keith Myers wrote:

Try the logging option work_fetch_debug.

Thanks

I tried it the neophyte's way, by going to:

BOINCmgr|Options|Event Log options

I ticked the box for work_fetch_debug and hit Apply, then Save for good measure.

Then I did a project update.

The log gives two clear indications that too many uploads are my problem:

8/27/2022 5:23:20 PM | Einstein@Home | [work_fetch] REC 350013.644 prio -1.000 can't request work: too many uploads in progress
 

is one, and the other is:

8/27/2022 5:23:20 PM | Einstein@Home | skip: too many uploads in progress
8/27/2022 5:23:20 PM |  | [work_fetch] No project chosen for work fetch
 

I remain puzzled as to why the other two machines, which have more tasks in the "upload pending" state than the troubled one does, have nevertheless continued to request work (and sometimes to receive it).

But I am much more hopeful that once the general project problem with pending uploads on MeerKAT tasks is fixed, this machine will probably resume requesting work.  I'll leave it alone until then.

Thanks

archae86
Joined: 6 Dec 05
Posts: 3,107
Credit: 6,205,083,648
RAC: 1,982,140

archae86 wrote:

I remain puzzled as to why the other two machines, which have more tasks in the "upload pending" state than the troubled one does, have nevertheless continued to request work (and sometimes to receive it).

Well, now a second machine has stopped getting work. This one actually shows a message in the ordinary event log (without any debug flag set): "Not requesting tasks: too many uploads in progress".

That second machine currently has 58 MeerKAT tasks uploading, while the first to stop has only 17. I remain a bit puzzled, but plan to be patient, and am hopeful that things will clear up sometime Monday.

 

Keith Myers
Joined: 11 Feb 11
Posts: 3,070
Credit: 9,543,827,049
RAC: 24,320,073

You can try increasing the number of connections allowed per host and per project in the cc_config.xml file.

The default is only 8 allowed connections per host and only 2 per project.

Try raising the limits to 32 connections per host and 16 per project.

As an example:

 <max_file_xfers>18</max_file_xfers>
 <max_file_xfers_per_project>8</max_file_xfers_per_project>

Then re-read your config files in the Manager.
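For reference, both tags belong inside the <options> section of cc_config.xml. A minimal sketch of a complete file, reusing the example values above; if you already have a cc_config.xml (for example one with <log_flags>), add the two lines to its existing <options> section instead of creating a second file:

```xml
<cc_config>
  <options>
    <!-- total simultaneous file transfers for this host (default 8) -->
    <max_file_xfers>18</max_file_xfers>
    <!-- simultaneous file transfers per project (default 2) -->
    <max_file_xfers_per_project>8</max_file_xfers_per_project>
  </options>
</cc_config>
```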

 

Harri Liljeroos
Joined: 10 Dec 05
Posts: 2,205
Credit: 2,184,635,180
RAC: 1,489,666

BOINC has a built-in feature: when your hanging uploads reach 2 x the number of CPU cores, you won't be allowed to download any new work from that project. That's a good idea when no uploads are going through, but not so good when the upload problem affects only one sub-project.
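Read literally, the rule described here is the following condition, where $n_{\text{up}}$ is the number of a project's tasks waiting to upload and $n_{\text{cpu}}$ is the CPU count the client believes the host has. The thread does not say whether the boundary case counts (greater-or-equal versus strictly greater), so treat that detail as an assumption:

$$
n_{\text{up}} \ge 2\,n_{\text{cpu}} \;\Longrightarrow\; \text{no new work requested from that project}
$$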

archae86
Joined: 6 Dec 05
Posts: 3,107
Credit: 6,205,083,648
RAC: 1,982,140

Harri Liljeroos wrote:
BOINC has a built-in feature: when your hanging uploads reach 2 x the number of CPU cores, you won't be allowed to download any new work from that project.

Thanks. You have explained the mystery of the different failure thresholds on my three machines. As it turns out, my three machines report a considerable range of CPU core counts, because at some point in the past I raised the reported number on some of them to get an adequate daily work allotment, and then forgot about it.

reported cores   uploads pending
       4               17
      28               58
      16               35
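Checking each machine against twice its reported core count shows all three past the limit, and shows why the 4-core machine was the first to stop: its limit is only 8.

$$
\begin{aligned}
4 \text{ cores:}\quad & 2 \times 4 = 8 < 17\\
28 \text{ cores:}\quad & 2 \times 28 = 56 < 58\\
16 \text{ cores:}\quad & 2 \times 16 = 32 < 35
\end{aligned}
$$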

Yes, all three are currently not getting any work for this reason.  I had all three set to get MeerKAT only.  Now, of course, they can't get anything else until the upload problem gets fixed.  I hope that may be Monday.

 

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 2,381
Credit: 19,295,730,724
RAC: 37,496,568

A short-term fix could be to use the ncpus option in cc_config to assign a large number of CPU cores. Change the computer preferences to not get any MeerKAT BRP7 work before attempting to get more work, to prevent the situation from getting worse.
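A minimal sketch of that change, assuming a host that otherwise has no cc_config.xml. The value 64 is purely illustrative; pick a number large enough that twice it exceeds the count of stuck uploads. If the file already exists, add the <ncpus> line to its existing <options> section:

```xml
<cc_config>
  <options>
    <!-- make the client act as if the host has 64 CPUs (illustrative value) -->
    <ncpus>64</ncpus>
  </options>
</cc_config>
```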
 

If you run any CPU work, you'll need to adjust the max_concurrent setting for that application in app_config.xml to prevent too many CPU tasks from running on fake CPU cores.
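A minimal sketch of such an app_config.xml, placed in the project's folder under projects/ in the BOINC data directory. APP_NAME is a hypothetical placeholder, not a real application name: substitute the CPU application's short name (as listed in client_state.xml), and set max_concurrent to the number of real cores you want used:

```xml
<app_config>
  <app>
    <!-- APP_NAME is a placeholder for the CPU application's short name -->
    <name>APP_NAME</name>
    <!-- run at most 4 tasks of this app at once (illustrative value) -->
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>
```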

archae86
Joined: 6 Dec 05
Posts: 3,107
Credit: 6,205,083,648
RAC: 1,982,140

Ian&Steve C. wrote:
A short term fix could be to use the ncpus variable in cc_config to assign a large number of CPU cores. Change the computer preferences to not get any MeerKAT BRP7 work before attempting to get more work, to prevent the situation getting worse. 

It took me a while to think of it, but I did think of all that (including turning off accepting Beta work, just in case) and was in the process of implementing it at the time you posted.

It worked.

Thanks

 
