Locality scheduling, not working for me

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,027
Credit: 216,091,153
RAC: 87,084

There certainly is an old post somewhere that explains the "locality scheduling" of the GW tasks in general. However, it must be more than 12 years old or so. With more time I may dig it out. In general:

The name of a workunit (example: "h1_0960.35_O2C02Cl5In0__O2MD1S3_Spotlight_961.25Hz_1485") consists of four parts:

1. A data file basename. This is the part before the double underscore ("__"), here "h1_0960.35_O2C02Cl5In0".

2. A "run label". In the example this is "O2MD1S3". This stands for data from LIGO's Observation run #2, Multi-Directional search (as opposed to "AS" for All-Sky), "S" for "Spotlight" (targeting a group of adjacent skypoints, not only one) and frequency range #3.

3. A "comment" - this is irrelevant to the workunit generator and scheduler, here it contains "Spotlight" again and the base analysis frequency.

4. A "job number".
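
The four parts listed above can be pulled apart mechanically. The following is only a minimal sketch, not the project's own code: the function and class names are hypothetical, and it assumes that the run label contains no underscore and that the job number is always the last underscore-separated field.

```python
from typing import NamedTuple


class WorkunitName(NamedTuple):
    data_file_basename: str
    run_label: str
    comment: str
    job_number: int


def parse_workunit_name(name: str) -> WorkunitName:
    # Part 1: everything before the first double underscore.
    basename, sep, rest = name.partition("__")
    if not sep:
        raise ValueError(f"no '__' separator in {name!r}")
    # Assumption: the run label has no underscore in it, and the job number
    # is the last underscore-separated field. The comment is what remains.
    run_label, _, tail = rest.partition("_")
    comment, _, job = tail.rpartition("_")
    return WorkunitName(basename, run_label, comment, int(job))


wu = parse_workunit_name("h1_0960.35_O2C02Cl5In0__O2MD1S3_Spotlight_961.25Hz_1485")
print(wu)
```

Splitting from both ends keeps the comment intact even though it contains underscores itself.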

The "data file basename" actually identifies a _group_ of data files that are necessary for the analysis. The "basename" is the name of the data file which is lexicographically smallest in that group.

Data file names do indeed contain a frequency. Each data file covers a range of 50 mHz of frequency-domain observational data. The range of data files needed for an analysis of a given frequency depends on a few parameters, most (but not all) of which are visible on the command-line. These are, of course, the analysis frequency, the max. "spindown" value (abs(f1dot)+f1dotBand values on the command-line) and Tspan (the overall observation time covered by the data, basically the difference between min and max values in the segmentList file mentioned on the command-line). Rest assured that every file you downloaded for a workunit is actually necessary to perform this particular analysis.
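
As a small illustration of the basename rule, here is a hedged sketch. It assumes the frequency is the second underscore-separated field of the file name, in Hz, as the example name suggests; the helper names are hypothetical, and only the first file name in the list comes from this post (the other two are made up for the example).

```python
def file_frequency_hz(data_file_name: str) -> float:
    # Assumption: the second underscore-separated field is the frequency in Hz.
    return float(data_file_name.split("_")[1])


def group_basename(data_files: list[str]) -> str:
    # The basename is the lexicographically smallest name in the group.
    return min(data_files)


group = [
    "h1_0960.45_O2C02Cl5In0",  # made up for illustration
    "h1_0960.35_O2C02Cl5In0",  # from the example workunit name
    "h1_0960.40_O2C02Cl5In0",  # made up for illustration
]
base = group_basename(group)
print(base, file_frequency_hz(base))
```

Because the frequency field is zero-padded, plain string comparison orders the files by ascending frequency, so the lexicographically smallest name is also the lowest-frequency file.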

There are hundreds of millions or even billions of "templates" (combinations of parameters) to be searched / calculated for the same set of data files. Thus these are split into several (usually thousands of) individual workunits. These use the same set of data files, hence have the same "data file basename", but differ in the job number (and possibly in the comment). The job number uniquely identifies a workunit within such a group of workunits that share the same set of data files.

The scheduler tries to minimize the download volume for you (i.e. your client). This means it tries to assign you tasks that need files you already have, ideally ones with the same data file basename as tasks you already had, or, if that is not possible, at least adjacent ones.
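
To make the preference order concrete, here is a toy ranking, which is not the BOINC scheduler's actual logic. It assumes "adjacent" means "closest in frequency" and that the frequency is the second underscore-separated field of the basename; all function names and all file names except the first are hypothetical.

```python
def basename_frequency(basename: str) -> float:
    # Assumption: the second underscore-separated field is the frequency in Hz.
    return float(basename.split("_")[1])


def rank_candidates(candidates: list[str], files_on_host: set[str]) -> list[str]:
    """Order basenames: already on the host first, then by frequency distance."""
    host_freqs = [basename_frequency(f) for f in files_on_host]

    def cost(basename: str) -> float:
        if basename in files_on_host:
            return 0.0  # nothing new to download for this basename
        if not host_freqs:
            return float("inf")  # empty host: every candidate is equally far
        f = basename_frequency(basename)
        return min(abs(f - h) for h in host_freqs)

    return sorted(candidates, key=cost)


on_host = {"h1_0960.35_O2C02Cl5In0"}
ranked = rank_candidates(
    ["h1_1200.00_O2C02Cl5In0", "h1_0960.40_O2C02Cl5In0", "h1_0960.35_O2C02Cl5In0"],
    on_host,
)
print(ranked)
```

Near the end of a run, when only a few candidates are left, a ranking like this has little to choose from, which is why far-away frequencies can still end up being assigned.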

Data files are "sticky" (BOINC term), that is, they stay on your host until the client gets sent a "delete" request from the project (or the project is reset).

Usually the workunit generator generates the workunits in ascending order of frequency. When it has generated all work for a set of data files, it concludes that the first = lexicographically smallest = smallest frequency file can be deleted (no work that still requires this file should be generated after that). This might already happen with the same scheduler contact with which the client receives the last task of this workunit group. The client will not delete the file as long as it has a workunit in the cache that needs it.
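
The interplay of sticky files, delete requests and cached tasks can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not the BOINC client's implementation; the class, its methods and the file and task names are all hypothetical.

```python
class HostFileCache:
    """Toy model: sticky files are removed only after a delete request,
    and only once no cached task still needs them."""

    def __init__(self) -> None:
        self.files: set[str] = set()
        self.delete_requested: set[str] = set()
        self.cached_tasks: dict[str, set[str]] = {}  # task name -> needed files

    def add_task(self, task: str, needed_files: set[str]) -> None:
        self.files |= needed_files
        self.cached_tasks[task] = set(needed_files)

    def request_delete(self, file_name: str) -> None:
        self.delete_requested.add(file_name)
        self._purge()

    def finish_task(self, task: str) -> None:
        self.cached_tasks.pop(task, None)
        self._purge()

    def _purge(self) -> None:
        in_use = set().union(*self.cached_tasks.values())
        removable = self.delete_requested - in_use
        self.files -= removable
        self.delete_requested -= removable


cache = HostFileCache()
cache.add_task("job_1485", {"file_a", "file_b"})
cache.request_delete("file_a")       # deferred: job_1485 still needs file_a
still_there = "file_a" in cache.files
cache.finish_task("job_1485")        # now file_a goes, file_b stays (sticky)
print(still_there, sorted(cache.files))
```

Note that finishing a task on its own removes nothing: without a delete request the files stay on the host.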

Now, it does happen occasionally that this scheme doesn't work completely. First of all, at the end of a run when there are only a few tasks left to be sent, it's difficult for the scheduler to find tasks that match the files you have. It will assign tasks from anywhere in the frequency range, including ones that need files that you already had and that were deleted.

We also did a couple of manual manipulations to suit our needs and requirements that may have created more such ("few tasks") situations than necessary. I'll try to think a bit harder on what happened and how to avoid this.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,027
Credit: 216,091,153
RAC: 87,084

Actually - there is at least one serious error in the system's configuration that is related to locality scheduling.

While I fix it - thanks for starting this thread to point me to it.

BM

Raistmer*
Joined: 20 Feb 05
Posts: 198
Credit: 57,222,966
RAC: 118,001

Thank you, Bernd, for the detailed file name parsing, that's very much appreciated!

Regarding locality scheduling: yes, that's how we expect it to work, but as you have already found, it currently doesn't work very well. And considering how much data is required for each task, this could drain a lot of bandwidth, causing many participants on limited internet plans to hold back from the project's main objective: the search for gravitational waves.

MarkJ
Joined: 28 Feb 08
Posts: 436
Credit: 96,539,496
RAC: 43,397

It's looking a lot better now. Instead of changing frequency on every scheduler request it seems to be getting additional work for the frequency (data files) it has on board, like it's supposed to. Thank you Bernd.

Eugene Stemple
Joined: 9 Feb 11
Posts: 36
Credit: 94,936,163
RAC: 38,163

@Gary & Bernd

I am into a new month for data usage limits.  I think I'll wait a few more days, just to see other comments here that confirm a "nicer" locality scheduling, before dipping my toes back into the fire-hose of download data.  An initial torrent is expected since (practically) nothing has been retained.  There is a setting in Preferences -> Computing -> Advanced -> Network Usage that says "Limit usage to ____ Mbytes every ___ days."  It may be prudent to put some reasonable numbers in there and see if that feature works.  Even if a side effect might be to restrain the number of tasks that can be downloaded - until the data set gets built back up.

 

Raistmer*
Joined: 20 Feb 05
Posts: 198
Credit: 57,222,966
RAC: 118,001

Attempting to get GW work - still no tasks. Have they become that popular, or is something broken?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 887
Credit: 5,643,604,388
RAC: 32,137,756

A problem I previously had with GW GPU locality scheduling was that the server refused to send me tasks because I didn't already have the required data files for the frequency range available on the server. The server showed many GW tasks available, and my GPUs were sitting idle and requesting work in the scheduler request, but it wouldn't give me anything.

It wasn't until I reset the project to clear out the existing old data files that the server sent me data files for tasks that were available and resumed sending me work.

A bit of a bug in the operation there.

_____________________________________________

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,027
Credit: 216,091,153
RAC: 87,084

Most likely you don't get work because the available / allowed disk space on the client is exhausted (see your scheduler log) as a result of the server's misconfiguration. A project reset will help, but I'm also trying to find a way to do this less invasively from the server side.

BM

Ian&Steve C.
Joined: 19 Jan 20
Posts: 887
Credit: 5,643,604,388
RAC: 32,137,756

Bernd Machenschalk wrote:

Most likely you don't get work because the available / allowed disk space on the client is exhausted (see your scheduler log) as a result of the server's misconfiguration. A project reset will help, but I'm also trying to find a way to do this less invasively from the server side.

That wasn't the case for me. I allow BOINC to use up to 90% of my disk (250 GB) and Einstein was only using a few GB. Other projects weren't using much either, so there was plenty of space available.

But this wasn't recent, maybe a few months ago. I haven't run GW in a while.

_____________________________________________

Eugene Stemple
Joined: 9 Feb 11
Posts: 36
Credit: 94,936,163
RAC: 38,163

Good news on this issue, I think.  I resumed work fetch 32 hours ago and have downloaded more than 200 GW tasks and 2.45 GB of data files.  Not a single "BOINC will delete..." event in the log file.  And also a good sign that I am frequently seeing, for example, "got 13 new tasks" without any additional data downloads.  So, presumably, the new tasks are using the existing data set, i.e. locality scheduling, as they are supposed to.

Before resuming, from a NNT state, I set my E@H project preferences to "download no more than 500 MB per 1 day."  I have exceeded that limit but I'm willing to give the servers more time - maybe the user specified limit is applied as an average over "N" days.  And, of course, my UTC-7 "day" boundaries are different than the server's day boundaries.  Watching this data usage statistic pretty closely and ready to hit the NNT button to throttle back if necessary.  Anyway, 2.4 GB per 32 hours is much better than the 9 GB per day I was seeing before.
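
For reference, the figures quoted in this post can be converted to a daily rate with a few lines of arithmetic. This is just a back-of-the-envelope sketch; it assumes 1 GB = 1000 MB and treats the 32 hours as a steady average, which an initial download burst would distort.

```python
downloaded_gb = 2.45        # data files downloaded since resuming
elapsed_hours = 32
limit_mb_per_day = 500      # the configured preference
previous_gb_per_day = 9     # rate observed before the fix

rate_gb_per_day = downloaded_gb / elapsed_hours * 24
over_limit_factor = rate_gb_per_day * 1000 / limit_mb_per_day
improvement_factor = previous_gb_per_day / rate_gb_per_day

print(f"{rate_gb_per_day:.2f} GB/day")
print(f"{over_limit_factor:.1f}x the configured limit")
print(f"{improvement_factor:.1f}x lower than before")
```

That works out to roughly 1.8 GB per day: well above the 500 MB setting, but several times below the earlier 9 GB per day.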

 
