Locality scheduling, not working for me

Eugene Stemple
Joined: 9 Feb 11
Posts: 58
Credit: 271382475
RAC: 321353
Topic 224918

At least it (locality scheduling) is not working the way I "think" it should be working with GW tasks.  My understanding is that data files, of the h1_ and l1_ paired form, should hang around for a while after downloading, with the hope/expectation that subsequent processing tasks will make use of them without the need to download them again.  I was recently troubled by the "bloat" of data files in the E@h directory, >5000 files (20 GB) dating back to August 2020, and chose to do a "Project Reset" on Feb. 12.  Yes, it got rid of all the stored data, and, yes, I was prepared for an interval of heavy data downloads to rebuild the inventory of data files which "locality scheduling" might use to some advantage.

The first 24 hours yielded 90 "resent lost tasks" and 9.2 GB of downloads.  I was not totally surprised by that data volume but thought it would be interesting to see how quickly the daily download volume would taper off.  And so I have been monitoring the downloads over the 14 days since the reset.  The download volume is NOT DECLINING.  In a 12-hour span yesterday E@h downloaded 586 data files (pairs of h1 and l1) for a total of 4.6 GB, i.e. a 9.2 GB daily rate.  How many of those data files are being retained?  Practically none of them.  As I examine the E@h directory just now, there are just 74 h1/l1 pairs of data files, and the oldest of them was downloaded Feb. 24 at 23:31 - not even one day ago!

The troublesome pattern I have been seeing for these two weeks is this:  a block of data files, for example h1_0577.30 through h1_0577.95, is downloaded; within 2 hours (usually less) it is "tagged" with "BOINC will delete..." (not actually deleted at that point, of course); but within 48 hours the entire block really has been deleted.  It is hard to log exactly when the actual delete takes place, as I can't find anything in the event log options to record that information.  But when files no longer appear in the directory listing I think it is safe to assume they've been (really) deleted.
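Since the event log has no option to record the actual deletions, one way to catch them is to poll the project directory and log file additions/removals yourself.  Here is a minimal Python sketch of that idea - the PROJECT_DIR path is only an assumption for a Linux BOINC install, so point it at your own data directory:

    #!/usr/bin/env python3
    # Poll the Einstein@Home project directory and log when h1_*/l1_* data
    # files appear or disappear.  PROJECT_DIR is an assumption - adjust it
    # to your own BOINC data directory.
    import time
    from datetime import datetime
    from pathlib import Path

    PROJECT_DIR = Path("/var/lib/boinc-client/projects/einstein.phys.uwm.edu")
    POLL_SECONDS = 60

    def snapshot():
        return {p.name for p in PROJECT_DIR.glob("[hl]1_*")}

    seen = snapshot()
    print(f"{datetime.now()}  start: {len(seen)} data files present")
    while True:
        time.sleep(POLL_SECONDS)
        now = snapshot()
        for name in sorted(now - seen):
            print(f"{datetime.now()}  ADDED   {name}")
        for name in sorted(seen - now):
            print(f"{datetime.now()}  DELETED {name}")
        seen = now

That gives deletion timestamps accurate to the polling interval.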

As if that weren't bad enough, I am tracking numerous occasions in which data files are being downloaded, then deleted, and then downloaded again.

Here are a couple of examples.  All times/dates are local host times, UTC-7.  The "delete" times are tagged with ### simply to indicate that the relevant files were gone at the time indicated but were presumably deleted much earlier.

data block 0907.35 - 0909.20      downloaded 2/23 02:00
tagged "BOINC will delete..."     2/23 03:04
really gone                       2/23 16:41 ###
data block 0907.35 - 0907.50      downloaded 2/23 22:35   * a partial overlap with the original block *
really gone                       2/25 20:10 ###

------------------------------------------

data block 0901.85 - 0902.00      downloaded 2/23 03:04
data block 0902.05 - 0903.00      downloaded 2/23 03:05
tagged "BOINC will delete..."     2/23 04:16   * both of the above *
really gone                       2/23 16:42 ###
data block 0901.30 - 0902.70      downloaded 2/24 06:15   * an overlap with the original blocks *
really gone                       2/25 20:10 ###

------------------------------------------

I am looking at, and comparing, the FULL file names, e.g. h1_0944.50_02C02Cl5In0.9Izv, although in the above examples I'm just abbreviating in a format that I think everybody will understand.

I really hope someone can offer insight into anything I can do (or should have done...) to make locality scheduling more efficient in the use of download bandwidth.  I'm running a GTX 1060 which completes a task in roughly 12 minutes - so about 120 tasks per day.  In very "round numbers" this translates to 75 MB of download resource per task completed.  That is not a sustainable data rate for my ISP connection.

(I am now in NNT mode, buffer empty, waiting for some guidance.)

 

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17697333320
RAC: 5562340

Forget everything you knew about GW tasks.  These new S3 tasks are completely different.

The data files do not hang around.  They are deleted as soon as you finish the work units that needed them.

That could be after as little as a single task, or a few more.  They do not hang around for months.

Read through this thread. https://einsteinathome.org/content/do-some-tasks-have-much-larger-download-size-others

So the end result is a constant daily renewal of data files.  You need to be aware of this issue if you are on a limited download plan with your ISP.

I stopped S3 GW work as soon as I figured this out. GR work is not so demanding though.

 

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 179935435
RAC: 64019

Well, if the exact same file is re-downloaded (it would be good if the topic starter or someone else checked not only the full names but also the MD5 of the files, to be absolutely sure it's the same data chunk), it would mean locality scheduling FAILS to work.  Most probably due to a misconfiguration on the server side.
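For what it's worth, that check is easy to script.  A minimal Python sketch (the PROJECT_DIR path is only an assumption for a Linux BOINC install) that records the name, size and MD5 of every h1_/l1_ file, so a later "new" download can be compared against the earlier log:

    #!/usr/bin/env python3
    # Append name, size and MD5 of each h1_*/l1_* data file to a log so a
    # re-downloaded file can be compared with the earlier copy.
    # PROJECT_DIR is an assumption - adjust to your BOINC data directory.
    import hashlib
    from pathlib import Path

    PROJECT_DIR = Path("/var/lib/boinc-client/projects/einstein.phys.uwm.edu")

    def md5sum(path, chunk=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    with open("eah_md5_log.txt", "a") as log:
        for p in sorted(PROJECT_DIR.glob("[hl]1_*")):
            log.write(f"{p.name}\t{p.stat().st_size}\t{md5sum(p)}\n")

Run it after each download burst; if a file name reappears later with the same MD5, it really was a wasted re-download.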

Maybe the project did this deliberately: it's possible that the total amount of "support" data for the tasks is so huge that keeping it all on the host's local drive would prevent new tasks from being processed.

So the OP could check whether the current BOINC directory size is near the disk limit.  If it is, one could observe the same effect as cache thrashing, where a data set that is too big causes constant cache line evictions - and the cache can't provide any benefit at all.

 

What if BOINC were allowed to store much more on the HDD than is currently allowed?

 

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 179935435
RAC: 64019

And another suggestion: if BOINC disk usage is far from the limit, try increasing the task queue size.

This will increase the chances that tasks in the longer queue happen to share the same support data set, so that the data set serves both of them without deletion and re-downloading.  Hence locality scheduling would, at least partially, work again.
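If disk usage allows it, the buffer can be raised either in BOINC Manager's computing preferences ("store at least ... days of work") or with a global_prefs_override.xml in the BOINC data directory.  A sketch, with purely illustrative values - a larger buffer means more tasks on hand that might share the same data files:

    <!-- global_prefs_override.xml in the BOINC data directory; the numbers
         below are only examples, not recommendations. -->
    <global_preferences>
       <work_buf_min_days>1.0</work_buf_min_days>
       <work_buf_additional_days>0.5</work_buf_additional_days>
    </global_preferences>

After editing, tell the client to re-read local preferences (or restart it) for the change to take effect.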

 

Eugene Stemple
Joined: 9 Feb 11
Posts: 58
Credit: 271382475
RAC: 321353

@Raistmer

BOINC computing preferences:  "use no more than 30 GB"

E@H computing preferences: "Disk: use no more than 20 GB"

Buffering (cache) limits are: 0.3 days plus 0.1 additional day.

Host disk usage was bouncing around 3 to 4 GB.  It is now 757 MB after all tasks previously downloaded have finished.  I didn't think to do an MD5 on the data files.  It would be a good way to verify that a second download is, in fact, a duplicate.  I "assumed" that an identical file name, and size, would be sufficient indication of duplication.  I'll keep that in mind if I enable GW tasks in the future.  Based on Keith M's comments (previous post in the thread) I think I will drop GW tasks from my active applications, then monitor these forum posts for some indication that GW tasks can be done with a less burdensome data download requirement.  I am trying to imagine what the SERVER network loading must be if there are 10,000 active hosts, each one sucking up 8 GB per day!  Maybe most of them are smarter than I am and dropped out of the GW workunits long ago.

With regard to increasing the queue size...  I am afraid to try it.  On one hand, the larger data set (as you suggest) should increase the probability of a task "hit" on existing data; on the other hand it might just mean downloading even more data files - and discarding them as is being done now.

 

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17697333320
RAC: 5562340

Yes, might be wise.  I was very alarmed at the volume of data files being repeatedly downloaded for just 10 tasks.

I use a custom client to hard-limit the number of tasks in the cache at any time, irrespective of any cache limit.

I just could not tolerate the number of one-shot data files being downloaded for every new task and immediately deleted after the task finished.

I have only 1 TB per month of data downloads per my contract and the Einstein GW work was eating up 25% of that in only one week of allowing GW S3 work.

I have 3 other projects besides Einstein that need to download work also.

 

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 179935435
RAC: 64019

Yep, being on a limited plan, that's understandable.

I have an unlimited one, so I will attempt to clarify this issue a little more in time.  I'm still learning the structure of the data for this project.

Currently I have other problems with GW data (4 tasks at once don't fit in RAM, and app_config caused an enormous number of task downloads, without respect to any client limits).
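For the RAM side at least, a per-project app_config.xml can cap how many GW tasks run at once.  A sketch - the app name below is only a placeholder, the real GW app name has to be taken from client_state.xml:

    <!-- app_config.xml in projects/einstein.phys.uwm.edu/ ; "einstein_O3AS"
         is a placeholder app name - substitute the actual GW app name. -->
    <app_config>
       <app>
          <name>einstein_O3AS</name>
          <max_concurrent>2</max_concurrent>
       </app>
    </app_config>

Note that max_concurrent only limits how many tasks run at the same time; it does not restrain how much work the scheduler sends, which fits the over-download behaviour described above.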

 

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 179935435
RAC: 64019

Something is definitely not right in this area.

I got my first "BOINC will delete" lines in the log... when the host ASKED for another task but had NOT FINISHED the previous one yet (!!!).

So, if that bunch of files is not needed for the task currently being computed, why was it downloaded at all???

And it's quite a "clean" experiment - there were no GW tasks at all for a few days before, and only a single GW task before the host asked for another one....

 

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17697333320
RAC: 5562340

That's the same kind of idiocy I was seeing with the new GW work.  Huh??

 

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 179935435
RAC: 64019

Hm... a "funny" question came to my mind looking at all of this: did anyone check whether the E@h GW app process really accesses all those downloaded files during task processing?

Maybe some of them are downloaded and then deleted without a single access during the whole task execution?...

 

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17697333320
RAC: 5562340

Never got that deep into that kind of investigation.  You would have to look for the recently downloaded filename and then search through all the slots and try and find it being used on a GW task.
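On Linux, one shortcut is to look at the open file descriptors of the running science apps instead of digging through the slots.  A rough Python sketch - it assumes the Einstein app binaries have "einstein" in their command line, and it needs to run as root or as the boinc user:

    #!/usr/bin/env python3
    # List which h1_*/l1_* files the running Einstein science apps currently
    # hold open, by reading /proc/<pid>/fd.  Linux only; run as root or as
    # the boinc user.  The "einstein" match on the command line is an
    # assumption about the app binary names.
    import os
    from pathlib import Path

    for proc in Path("/proc").iterdir():
        if not proc.name.isdigit():
            continue
        try:
            cmdline = (proc / "cmdline").read_bytes().decode(errors="ignore")
            if "einstein" not in cmdline.lower():
                continue
            for fd in (proc / "fd").iterdir():
                target = os.readlink(fd)
                if "/h1_" in target or "/l1_" in target:
                    print(f"pid {proc.name}: {target}")
        except (PermissionError, FileNotFoundError):
            continue  # process exited, or not ours to inspect

This only shows files open at the instant of sampling, so it would need to be run a few times per task; tracing the app with strace -e trace=openat would give the complete list of files it ever touches.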

 
