Locality scheduling, not working for me

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244927581
RAC: 16505

There certainly is an old post somewhere that explains the "locality scheduling" of the GW tasks in general. However, it must be more than 12 years old or so; with more time I may dig it out. In general:

The name of a workunit (example: "h1_0960.35_O2C02Cl5In0__O2MD1S3_Spotlight_961.25Hz_1485") consists of four parts (a parsing sketch follows the list):

1. A data file basename. This is the part before the double underscore ("__"), here "h1_0960.35_O2C02Cl5In0".

2. A "run label". In the example this is "O2MD1S3". This stands for Data from Ligo's Observation run #2, Multi-Directional search (as opposed to "AS" for All-Sky), "S" for "Spotlight" (targeting a group of adjacent skypoints, not only one) and frequency range #3.

3. A "comment" - this is irrelevant to the workunit generator and scheduler, here it contains "Spotlight" again and the base analysis frequency.

4. A "job number".

The "data file basename" identifies actually a _group_ of data files that are necessary for the analysis. The "basename" is the name of the data file which is lexicographically smallest in that group.

Data file names do indeed contain a frequency. Each data file covers 50 mHz of frequency-domain observational data. The range of data files needed for an analysis at a given frequency depends on a few parameters, most (but not all) of which are visible on the command line: the analysis frequency, the maximum "spindown" value (the abs(f1dot)+f1dotBand values on the command line), and Tspan (the overall observation time covered by the data, essentially the difference between the min and max values in the segmentList file mentioned on the command line). Rest assured that every file you downloaded for a workunit is actually necessary to perform that particular analysis.
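
As a rough illustration, one could estimate which 50 mHz files a search touches with a back-of-the-envelope sketch like the one below. This is my own simplification, not the actual search code; in particular, the spindown and Doppler broadening terms are assumptions for illustration only.

    import math

    FILE_BAND_HZ = 0.05  # each data file covers 50 mHz of frequency band

    def needed_files(f0, f_band, f1dot_abs_max, tspan, doppler_frac=1e-4):
        # Broaden the searched band [f0, f0 + f_band] by the maximum
        # spindown drift over the observation (|f1dot| * Tspan) and by
        # an assumed Doppler modulation of order 1e-4 * f. The real
        # code is more careful; these terms are illustrative.
        lo = f0 - f1dot_abs_max * tspan - doppler_frac * f0
        hi = f0 + f_band + f1dot_abs_max * tspan + doppler_frac * (f0 + f_band)
        first = math.floor(lo / FILE_BAND_HZ)
        last = math.floor(hi / FILE_BAND_HZ)
        # Start frequency of every 50 mHz file overlapping the band.
        return [round(i * FILE_BAND_HZ, 2) for i in range(first, last + 1)]

    # Example: needed_files(961.25, 0.05, 1e-9, 1.5e7)
    # -> [961.1, 961.15, 961.2, 961.25, 961.3, 961.35, 961.4]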

There are some hundreds of millions or even billions of "templates" (combinations of parameters) to be searched / calculated over the same set of data files. These are therefore split into several (usually thousands of) individual workunits. These use the same set of data files, hence have the same "data file basename", but differ in the job number (and possibly in the comment). The job number uniquely identifies a workunit within such a group of workunits that share the same set of data files.

The scheduler tries to minimize the download volume for you (i.e. your client). This means it tries to assign to you tasks that need files you already have, ideally tasks with the same data file basename as ones you already had, or, if possible, at least adjacent ones.
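
Conceptually, the matching amounts to something like the following simplified sketch; the real BOINC scheduler is considerably more involved:

    def locality_score(task_files, host_files):
        # Fraction of the task's required data files already on the host.
        task_files = set(task_files)
        return len(task_files & set(host_files)) / len(task_files)

    def pick_task(candidates, host_files):
        # candidates: dict mapping workunit name -> list of required files.
        # Prefer the task whose file set overlaps most with what the host has.
        return max(candidates,
                   key=lambda name: locality_score(candidates[name], host_files))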

Data files are "sticky" (BOINC term), this is they stay on your host until the client gets sent a "delete" request from the project (or the project is reset).

Usually the workunit generator generates the workunits in ascending order of frequency. When it has generated all work for a set of data files, it concludes that the first (lexicographically smallest, i.e. lowest-frequency) file can be deleted, since no more work can be generated that still requires this file. This may happen within the very scheduler contact in which the client receives the last task of this group. The client will not delete the file as long as it has a workunit in its cache that needs it.
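
Put as a sketch, the client-side deletion rule described above might look like this (illustrative names only, not the real BOINC client internals):

    def files_to_delete(sticky_files, delete_requests, cached_tasks):
        # A sticky file goes away only if the project has asked for its
        # deletion AND no task still in the client's cache needs it.
        still_needed = {f for task in cached_tasks for f in task["input_files"]}
        return [f for f in sticky_files
                if f in delete_requests and f not in still_needed]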

Now, it does happen occasionally that this scheme doesn't work completely. First of all, at the end of a run, when only a few tasks are left to be sent, it is difficult for the scheduler to find tasks that match the files you have. It will then assign tasks from anywhere in the frequency range, including ones that need files you already had but which have since been deleted.

We also did a couple of manual manipulations to suit our needs and requirements, which may have created more such "few tasks left" situations than necessary. I'll try to think a bit harder about what happened and how to avoid this.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244927581
RAC: 16505

Actually - there is at least one serious error in the system's configuration that is related to locality scheduling.

While I fix it - thanks for starting this thread to point me to it.

BM

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 177657575
RAC: 217961

Thank you, Bernd, for the detailed file name parsing, that's much appreciated!

Regarding locality scheduling: yep, that's how we expect it to work, but as you already found, it currently doesn't work very nicely. And considering how much data is required for each task, this can drain a lot of bandwidth, causing many participants on limited internet plans to stay away from the main project objective, the gravitational wave search.

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 137327847
RAC: 18476

It's looking a lot better now. Instead of changing frequency on every scheduler request, it seems to be getting additional work for the frequency (data files) it already has on board, like it's supposed to. Thank you Bernd.

Eugene Stemple
Joined: 9 Feb 11
Posts: 58
Credit: 262949237
RAC: 588688

@Gary & Bernd

I am into a new month for data usage limits. I think I'll wait a few more days, just to see other comments here confirming a "nicer" locality scheduling, before dipping my toes back into the fire-hose of download data. An initial torrent is expected, since (practically) nothing has been retained. There is a setting under Preferences -> Computing -> Advanced -> Network Usage that says "Limit usage to ____ Mbytes every ___ days." It may be prudent to put some reasonable numbers in there and see if that feature works, even if a side effect might be to restrict the number of tasks that can be downloaded until the data set gets built back up.

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 177657575
RAC: 217961

Attempting to get GW work, but still no tasks. Have they become so popular, or is something broken?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33835946405
RAC: 37460846

A problem I had previously with GW GPU locality scheduling was that the server refused to send me tasks because I didn't already have the required data files for the frequency range available on the server. The server showed many GW tasks available, and my GPUs were sitting idle and requesting work in the scheduler request, but it wouldn't give me anything.

It wasn't until I reset the project, clearing out the existing old data files, that the server sent me data files for tasks that were available and resumed sending me work.

A bit of a bug in the operation there.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244927581
RAC: 16505

Most likely you don't get work because the available / allowed disk space on the client is exhausted (see your scheduler log) as a result of the server's misconfiguration. A project reset will help, but I'm also trying to find a way to fix this less invasively from the server side.

BM

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33835946405
RAC: 37460846

Bernd Machenschalk wrote:

Most likely you don't get work because the available / allowed disk space on the client is exhausted (see your scheduler log) as a result of the server's misconfiguration. A project reset will help, but I'm also trying to find a way to fix this less invasively from the server side.

That wasn't the case for me. I allow BOINC to use up to 90% of my disk (250 GB), and Einstein was only using a few GB; other projects weren't using much either, so there was plenty of available space.

But this wasn't recent, maybe a few months ago. I haven't run GW in a while.

Eugene Stemple
Joined: 9 Feb 11
Posts: 58
Credit: 262949237
RAC: 588688

Good news on this issue, I think. I resumed work fetch 32 hours ago and have downloaded more than 200 GW tasks and 2.45 GB of data files. Not a single "BOINC will delete..." event in the log file. It's also a good sign that I am frequently seeing, for example, "got 13 new tasks" without any additional data downloads. So, presumably, the new tasks are using the existing data set, i.e. locality scheduling is working as it's supposed to.

Before resuming from an NNT state, I set my E@H project preferences to "download no more than 500 MB per 1 day." I have exceeded that limit, but I'm willing to give the servers more time; maybe the user-specified limit is applied as an average over "N" days. And, of course, my UTC-7 "day" boundaries are different from the server's day boundaries. I'm watching this data usage statistic pretty closely and am ready to hit the NNT button to throttle back if necessary. Anyway, 2.4 GB per 32 hours is much better than the 9 GB per day I was seeing before.