FGRPB1G work shortage

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109956260620
RAC: 31303876

Link wrote:
I meant setting additional 0.15 (or more) days limits BOINC scheduler requests to 1 hour after finishing a WU (unless you also disable network access)

This doesn't make sense to me.  How could making a 0.15 (or more) days addition to the cache size cause a 1 hour limit to scheduler requests?  And how could disabling network access override this limiting action?  And what is the relevance of "after finishing a WU"??  Surely work requests will continue (subject to any imposed back-offs) for as long as the work on hand is less than what the cache size specifies, irrespective of whether tasks in progress are finishing or not??

Link wrote:
... you can't stop BOINC from reporting tasks latest 1 hour after the result file was uploaded

I haven't made any comment about reporting completed work.  I'm very happy for BOINC to do its own thing with reporting.

My aim is to minimise my impact on the scheduler if possible.  When work completes, uploading of results is handled by an upload server, not the scheduler.  In my experience, if no work fetch is needed when a task completes, BOINC seems to accumulate around 6 to 10 of them before reporting is attempted.  In many cases this is a good deal longer than 1 hour and I'm happy with that.  My main concern is with work fetch, which has a much greater impact on the scheduler.  Ask yourself this question.  Which has the lower impact: a single request for 30 new tasks followed by 3 hours of silence  -OR-  30 individual requests for single tasks, spread over the same total time period?  It used to be easy to get 20-30 tasks in a single fetch event.  Not possible with the current situation.

Way back when I first started adding lots of hosts, I thought about how to minimise impact.  Getting the required work with the smallest number of scheduler requests was one way to do that.  The other was to 'download once and deploy' rather than allow each individual host to download their own personal copies of the exact same set of files.  My scripts were designed with that in mind.
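The scripts themselves aren't posted here, but the 'download once and deploy' idea can be sketched roughly as below: one host downloads the large data files, then pushes them to the others so each client finds them already in place instead of fetching its own copy.  Treat this as an illustration only, not the actual code; the data directory is the Debian default, the host names are placeholders, and it assumes passwordless rsync/ssh between hosts.

    #!/usr/bin/env python3
    # Rough sketch of 'download once and deploy' (illustration only).
    # Push already-downloaded Einstein@Home data files from this host
    # to the others so their clients don't re-download the same files.
    import subprocess

    # Assumptions: Debian-style BOINC data dir, passwordless ssh/rsync,
    # placeholder host names.
    PROJECT_DIR = "/var/lib/boinc-client/projects/einstein.phys.uwm.edu/"
    HOSTS = ["host01", "host02", "host03"]

    for host in HOSTS:
        # --ignore-existing: only copy files the target doesn't have yet.
        subprocess.run(
            ["rsync", "-a", "--ignore-existing",
             PROJECT_DIR, f"{host}:{PROJECT_DIR}"],
            check=True,
        )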

I share my internet access with an operating commercial business which relies on quite heavy internet access.  These days speeds are much better and costs are lower, so it's not as important now to be as frugal with my use as it was when I first started.  They pay the bills so I try not to abuse the privilege.

Link wrote:
So your points 3 & 4 in your script (changing cache settings) are actually unnecessary (unless you forgot to mention, that you disable net access until next run of the script or run an old version of BOINC), because BOINC will contact the scheduler anyway about once an hour and not once in 4 hours, so it might also request tasks at the same time. Your script might even create one or more additional scheduler requests if the last reporting was shortly before the script runs.

I didn't forget to mention disabling net access because it never needs to happen.  You seem to be saying that reporting of completed work might cause BOINC to request work.  How can that happen if the work cache size (eg. now changed to 0.05 days) is way below the amount of work on hand???

The points 1. to 4. summarised what happened for many years prior to the current restricted work situation, and it worked very well to reduce the total number of work requests as well as unnecessary file downloads.  I used to run the script every 8 hours (3 times per day), so after allowing each host to fill up to the limit, reducing the cache to 0.05 days absolutely guaranteed there would be no further fetching until a script run increased it again (eg 8 hours later).  Some years ago the 8 hour cycle was reduced to 4 hours because it allowed earlier warnings of any misbehaving hosts.  Also, for much of that time the settings were more like 0.05 and 1.0 or a little higher.  I started increasing above that value when Bernd first announced that FGRPB1G was ending.  I wondered if there might be occasional work supply issues as the search wound down.  Fortunately I had got to the 3 day level before the current work supply issues arose.

Just over a month ago, when many work requests were getting the response "Scheduler request completed: got 0 new tasks", it was obvious that if I continued to minimise scheduler requests, the hosts would run out of work.  I decided to change the work cache values that the script was manipulating from 0.05 low and 3.0 high to 3.0 low and 3.1 high.  I also changed from 4 hourly run intervals to 3 hourly runs.  The aim was to allow BOINC to ask for work as soon as there was less than 3.0 days worth on board.  By bumping up to 3.1, each host had a chance (albeit a very small one) to get an extra 2.4 hrs of work and so not need to ask again for most of the balance of the 3 hr interval.  For the more likely case that a host got little or even no work for that run, the 3.0 days limit would allow that host to keep trying for as long as necessary.
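The cache manipulation described above maps onto BOINC's standard override mechanism: write the two work-buffer preferences into global_prefs_override.xml and have the running client re-read them.  A minimal sketch of that idea follows (not the actual script; the data directory is the Debian default and would need adjusting elsewhere):

    #!/usr/bin/env python3
    # Minimal sketch of the cache-toggling idea (illustration only).
    # Writes the work-buffer sizes into global_prefs_override.xml, then
    # tells the running client to re-read the file via boinccmd.
    import subprocess

    BOINC_DIR = "/var/lib/boinc-client"   # assumption: Debian-style install

    def set_cache(min_days, extra_days):
        """Set 'store at least' / 'store up to additional' days of work."""
        xml = (
            "<global_preferences>\n"
            f"  <work_buf_min_days>{min_days}</work_buf_min_days>\n"
            f"  <work_buf_additional_days>{extra_days}</work_buf_additional_days>\n"
            "</global_preferences>\n"
        )
        with open(f"{BOINC_DIR}/global_prefs_override.xml", "w") as f:
            f.write(xml)
        # Standard boinccmd operation to apply the override file
        # (may also need --passwd with the client's RPC password).
        subprocess.run(["boinccmd", "--read_global_prefs_override"], check=True)

    # Old scheme: fill to 3.0 days, then slam shut at 0.05 until the next run.
    # New scheme (run every 3 hours): 3.0 days minimum plus 0.1 days headroom.
    set_cache(3.0, 0.1)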

By continuing to use the script every 3 hours, I still get the benefit of keeping the ancillary files up to date (no unnecessary downloads for resends) and I get even more frequent checks that all hosts are running as they should.  The fact that I haven't found a single host in trouble with work fetch (every one still seems to always have around 3 days of work) tells me that the scheme is working as intended.

Cheers,
Gary.

mikey
Joined: 22 Jan 05
Posts: 11944
Credit: 1832481918
RAC: 216643

Ian&Steve C. wrote:
what I really need is a scheme of “worker bee” hosts to get work en masse, and shuffle them along to my big hosts for crunching and reporting. I “could” split my multi-GPU hosts to single GPU logical hosts with separate clients, but I really don’t like split stats. 

I thought each task had to be returned by the host that downloaded it, as a security measure BOINC put in a while back.  Has that changed?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3709
Credit: 34635586729
RAC: 42675484

mikey wrote:

Ian&Steve C. wrote:
what I really need is a scheme of “worker bee” hosts to get work en masse, and shuffle them along to my big hosts for crunching and reporting. I “could” split my multi-GPU hosts to single GPU logical hosts with separate clients, but I really don’t like split stats. 

I thought each task had to be returned by the host that downloaded it, as a security measure BOINC put in a while back.  Has that changed?

Not necessarily.  I’ve seen it happen in a few cases, both intentionally and unintentionally.

It might be down to how some projects handle it.  But I’ve seen some cases anecdotally where a task was manually moved (files moved, client_state recrafted, etc.) and reported by a system that didn’t download it, and the reporting system received the credit.  This was on SETI.  I believe there’s a user on PrimeGrid doing something similar.
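For anyone curious what "client_state recrafted" means in practice: the task's input files get copied across, and the matching <workunit>, <result> and <file_info> blocks get transplanted from one host's client_state.xml into the other's.  A rough, untested sketch of just the extraction step (the element names are from a stock client_state.xml; everything else is an assumption):

    #!/usr/bin/env python3
    # Rough, untested sketch: print the <workunit> and <result> blocks
    # for one task from a client_state.xml, as a starting point for
    # grafting them into another host's file.  The related <file_info>
    # entries and the input files themselves would also need copying.
    import re
    import sys

    state = open(sys.argv[1]).read()   # path to client_state.xml
    result_name = sys.argv[2]          # e.g. a task name ending in _0
    wu_name = result_name.rsplit("_", 1)[0]   # BOINC result = WU name + _N

    for tag in ("workunit", "result"):
        for block in re.findall(rf"<{tag}>.*?</{tag}>", state, re.S):
            if wu_name in block:
                print(block)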


Keith Myers
Joined: 11 Feb 11
Posts: 4751
Credit: 17675372781
RAC: 5783714

I was thinking of the SETI Rescheduler when you mentioned your "worker bee" scenario.  But that was just for moving cpu tasks to the gpu if I remember correctly.

Not for moving one campaign's tasks to another campaign.

 

magic_sam
Joined: 30 Dec 21
Posts: 23
Credit: 350013187
RAC: 1277735

Hi all,

What's the status regarding that FGRPB1G work shortage?

I just picked up 4 new tasks, even though the "tasks to send" counter remains at 0:

https://einsteinathome.org/server_status.php

Cheers, Sam

Link
Joined: 15 Mar 20
Posts: 97
Credit: 605105
RAC: 528

Gary Roberts wrote:
This doesn't make sense to me.  How could making a 0.15 (or more) days addition to the cache size cause a 1 hour limit to scheduler requests?  And how could disabling network access override this limiting action?  And what is the relevance of "after finishing a WU"??

1) 0.15 days = 3.6 hours. That means if no WU finishes before then, BOINC will not contact the scheduler to request more work within the next 3.6 hours (minimum), which is nearly your 4 hours (0.17 days would be better if someone really wants at least 4 hours). So without any WUs being completed, BOINC would be able to request work every 4 hours (or whatever you like, up to 10 days) without using any scripts. That's what the additional days are for: to limit scheduler requests.
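The arithmetic, for anyone checking: the 'additional days' value multiplied by 24 gives the minimum gap in hours between work-fetch requests while nothing is finishing.

    # Days-to-hours check for the buffer values discussed above.
    for days in (0.15, 0.17):
        print(f"{days} days = {days * 24:.2f} hours")
    # 0.15 days = 3.60 hours; 0.17 days = 4.08 hours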

2) Newer versions of BOINC however report completed work automatically at latest 1 hour after uploading the result file; they won't wait any longer unless you disable network access. So with network access and GPUs completing a WU every few minutes, BOINC will contact the scheduler about every hour no matter if it needs more work or not (and the additional 0.15 days make sure it won't need any work before that automatic report).
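For completeness: that roughly one-hour batching is built into the client, and the documented knob goes the other way. If someone wanted each result reported as soon as it uploads instead, the standard cc_config.xml option is:

    <cc_config>
      <options>
        <report_results_immediately>1</report_results_immediately>
      </options>
    </cc_config>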

 

Gary Roberts wrote:
My aim is to minimise my impact on the scheduler if possible.  When work completes, uploading of results is handled by an upload server, not the scheduler.  In my experience, if no work fetch is needed when a task completes, BOINC seems to accumulate around 6 to 10 of them before reporting is attempted.  In many cases this is a good deal longer than 1 hour and I'm happy with that.

Yes, it simply waits 1 hour after upload of the first result. So it's 1 hour + the time between last scheduler request and first result upload.

 

Gary Roberts wrote:
My main concern is with work fetch, which has a much greater impact on the scheduler.  Ask yourself this question.  Which has the lower impact: a single request for 30 new tasks followed by 3 hours of silence  -OR-  30 individual requests for single tasks, spread over the same total time period?

With an additional (for example) 0.15 days it would be just 2-3 requests for 10+ tasks in those 3 hours, piggybacked on the automatic requests that report completed tasks. I think the scheduler will be able to handle it. ;-) 30 individual requests would happen only with 0.00 additional days in the settings; that's indeed not optimal, even without as many hosts as you have.
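A quick back-of-the-envelope version of that claim, assuming the roughly hourly automatic report contacts described earlier:

    # With auto-reports roughly hourly, work fetch piggybacks on those
    # contacts, so a 3-hour window sees about 3 scheduler contacts
    # regardless of how many tasks finished in it.
    window_hours = 3
    report_interval_hours = 1          # per the ~1 hour auto-report rule
    contacts = window_hours // report_interval_hours
    print(contacts, "scheduler contacts, ~", 30 // contacts, "tasks each")
    # -> 3 scheduler contacts, ~10 tasks each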

 

Gary Roberts wrote:

I share my internet access with an operating commercial business which relies on quite heavy internet access.  These days speeds are much better and costs are lower, so it's not as important now to be as frugal with my use as it was when I first started.  They pay the bills so I try not to abuse the privilege. (...)

By continuing to use the script every 3 hours, I still get the benefit of keeping the ancillary files up to date (no unnecessary downloads for resends) and I get even more frequent checks that all hosts are running as they should.

That's of course something different, and monitoring your hosts with that script might indeed be necessary with that number of hosts, but your micromanagement of scheduler requests and downloads still seems a bit more complicated than necessary in 2022. Never mind.


Keith Myers
Joined: 11 Feb 11
Posts: 4751
Credit: 17675372781
RAC: 5783714

magic_sam wrote:

Hi all,

What's the status regarding that FGRPB1G work shortage?

I just picked up 4 new tasks, even though the "tasks to send" counter remains at 0:

https://einsteinathome.org/server_status.php

Cheers, Sam

Work is being created so slowly that the RTS buffer never has a chance to fill, because any produced task is downloaded immediately by all the starving hosts.

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3709
Credit: 34635586729
RAC: 42675484

Keith Myers wrote:

I was thinking of the SETI Rescheduler when you mentioned your "worker bee" scenario.  But that was just for moving cpu tasks to the gpu if I remember correctly.

Not for moving one campaign's tasks to another campaign.

No, I was thinking of Ville's custom efforts around the time that SETI closed down. He was playing around with what was possible and specifically stated that he had successfully moved tasks from one system to another for reporting and that it worked. I can't link his post because it's in the private team forum.

Ville Saari wrote:

I made a scientific experiment. I had earlier moved one task from one host to another and tested what happens when it is returned by a different computer than what received it. The computer id on the task changed on the web site and the task seemed to receive credit normally. But I could only see that the task got validated and received credit but I couldn't be sure the credit of either computer actually got updated.

So now I moved enough tasks to another computer so that it could crunch those exclusively for the full 30 minute scheduler request period. Then I checked the credit of that computer on the web site couple of minutes after it had returned those results and the host credits in the local client_state.xml that was last updated in the same scheduler request.

The web site credit minus client_state.xml credit was exactly the same as the sum of all the immediately validated task credits from that batch of returned tasks on the web site. So this confirmed that when tasks downloaded by host A are returned by host B, host B really receives credit for those tasks.

this was on SETI, Apr 17, 2020.

 

There is also a user on PrimeGrid doing something like this, where many hosts on the backend are presented to the project as a single host.

http://www.primegrid.com/hosts_user.php?userid=914937

He runs all his backend systems through a "scheduling proxy", so to speak, similar to the proposed SuperHost setup that was never implemented: https://boinc.berkeley.edu/trac/wiki/SuperHost

 

There was also that post not too long ago where a guy had several Raspberry Pis here at Einstein, and they all acted independently, but on the website they were all viewed as a single host for some reason. So this gives some evidence that something like this "should" be possible if I can figure out how it works.


Ian&Steve C.
Joined: 19 Jan 20
Posts: 3709
Credit: 34635586729
RAC: 42675484

magic_sam wrote:

Hi all,

What's the status regarding that FGRPB1G work shortage?

I just picked up 4 new tasks, even though the "tasks to send" counter remains at 0:

https://einsteinathome.org/server_status.php

Cheers, Sam

Still the same situation.

The status page only updates every 5 mins or so, but the server is working to generate and send tasks continuously. The tasks come and go so quickly that the periodic status snapshots usually only catch the counter when it says 0, but occasionally you'll catch it saying 1, or 5, or 100. Most of the time, though, it's 0.


mikey
Joined: 22 Jan 05
Posts: 11944
Credit: 1832481918
RAC: 216643

Ian&Steve C. wrote:

mikey wrote:

Ian&Steve C. wrote:
what I really need is a scheme of “worker bee” hosts to get work en masse, and shuffle them along to my big hosts for crunching and reporting. I “could” split my multi-GPU hosts to single GPU logical hosts with separate clients, but I really don’t like split stats. 

I thought each task had to be returned by the host that downloaded it, as a security measure BOINC put in a while back.  Has that changed?

Not necessarily.  I’ve seen it happen in a few cases, both intentionally and unintentionally.

It might be down to how some projects handle it.  But I’ve seen some cases anecdotally where a task was manually moved (files moved, client_state recrafted, etc.) and reported by a system that didn’t download it, and the reporting system received the credit.  This was on SETI.  I believe there’s a user on PrimeGrid doing something similar.

Yes, I too used to do 'sneaker net' computing back on the original SETI, but later on they said there were protections against it. It's interesting that PrimeGrid still allows it.
