Locality scheduling, not working for me

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117730522254

RAC: 34964759

5 Mar 2021 9:37:55 UTC

Message 183954 in response to message 183950

Eugene Stemple wrote:

I resumed work fetch 32 hours ago and have downloaded more than 200 GW tasks and 2.45 GB of data files.

I have a group of hosts on which I had been running the 'S2' variant of the search at a multiplicity of 4 (x4), several weeks ago. There was the sudden transition to S3 (and soon afterwards to S3a) and there were huge downloads (and file deletes) for each work fetch. There was no way I could support that volume of data so I changed all the GW hosts immediately to the GRP search, where they have stayed ever since.

As soon as Bernd announced that he'd found "... one serious error in the system's configuration ..." I allowed the remaining GRP to finish on one host and then changed back to GW so I could test if the problem had been resolved for me. When I was running the S2 search, the frequency term in the task name was around 470Hz and there had been no problem running x4 on an 8GB RX 570.

With the new S3a tasks, the two frequencies in a task name were 658.95Hz and 659.70Hz, so the DF value was 0.75. The first task received had an issue number of 861, so not the top of the range by any means. I have continued to receive this same series of tasks and the issue number is now down close to 300. The DF is now down to 0.4. I seem to be getting pretty much a complete consecutive series of numbers with no gaps.
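As a small illustration of the DF arithmetic described above, here is a minimal sketch. It assumes only what the post states, namely that DF is the difference between the two frequencies in the task name. The function name `delta_f` is hypothetical, and the task name format itself is not parsed because the thread does not give it.

```python
def delta_f(freq_low_hz: float, freq_high_hz: float, ndigits: int = 2) -> float:
    """Return DF, the gap between the two frequencies quoted in a task name.

    Rounding hides the tiny binary floating-point error in the subtraction.
    """
    return round(freq_high_hz - freq_low_hz, ndigits)


# The two frequencies quoted in the post
print(delta_f(658.95, 659.70))  # 0.75
```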

As I write this, the website tells me I have 548 GW tasks in total with 152 pending, 50 validated and zero errors or invalids. I checked the data files that were issued at the start of that series. There were 64 files (each roughly 4MB) totaling around 0.25GB. The scheduler continues to send tasks for this one set of data.

There are a couple of possible reasons for the much larger volume of downloads that you mention. Either the scheduler gave you lots of resends for a variety of different frequency ranges, or perhaps your initial requests were for a large enough batch of tasks that the scheduler was forced to source these from around ten or so different frequency ranges. That's just a guess based on my single frequency range only needing 0.25GB of data.

The biggest trick you can employ at the very start of a new work fetch is to set the cache size to 0.01 days so that you ask for the smallest possible number of tasks. Once you get that initial single batch of data files for the one frequency range, the scheduler will honour it for all subsequent work requests, as long as you don't ask for too many tasks at a time. I've always found that asking for batches of 5 to 10 new tasks at most doesn't trigger the scheduler to switch to different frequency ranges.

If you take care in slowly working up to your desired work cache size (mine is 1.6 days now) you too should be able to keep the data downloads well under control like I have. I don't see any need for daily limits.
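The ramp-up described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the post does not say how the cache size was changed, so the use of a local `global_prefs_override.xml` file with `work_buf_min_days` and `work_buf_additional_days` elements is an assumption to check against your own BOINC client, and the 0.1 day step size is hypothetical (the post speaks of 5 to 10 tasks per request, not of days).

```python
import xml.etree.ElementTree as ET


def build_override(min_days: float, extra_days: float = 0.0) -> str:
    """Build the text of a local preferences override with a given cache size.

    Assumption: the client reads these two elements from
    global_prefs_override.xml in its data directory.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:g}"
    ET.SubElement(root, "work_buf_additional_days").text = f"{extra_days:g}"
    return ET.tostring(root, encoding="unicode")


def ramp_steps(target_days: float, step_days: float = 0.1,
               start_days: float = 0.01) -> list:
    """List cache sizes from the tiny starting value up to the target.

    The step size is a hypothetical choice, not a value from the thread.
    """
    steps = [start_days]
    while steps[-1] < target_days:
        steps.append(round(min(steps[-1] + step_days, target_days), 2))
    return steps


# Start with the smallest possible request, then creep up towards 1.6 days
print(build_override(0.01))
print(ramp_steps(1.6)[:4])
```

Each step would be applied only after the previous work request had been satisfied from the same frequency range, and the client has to re-read its local preferences (or be restarted) for a change to take effect.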

I anticipate being able to get pretty much the full series of these tasks right down to the zero issue number and that should allow me to check for numbers of tasks and any crunch time variations for all the DF ranges from the initial 0.75 down to whatever the low limit turns out to be. I have tested all multiplicities from x1 to x4 on DF 0.75 tasks. All work without error, but for my GPU, the 658.95Hz frequency, and the 0.75 DF, the optimum multiplicity was x3. This should continue to be safe for a while anyway :-).

Cheers,
Gary.

Raistmer*

Joined: 20 Feb 05

Posts: 208

Credit: 181428947

RAC: 6029

5 Mar 2021 18:42:02 UTC

Message 183969

Gary, can you log the amount of memory needed for tasks with different DF values? Is it roughly the same, or is there a noticeable decrease/increase?

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117730522254

RAC: 34964759

6 Mar 2021 2:15:00 UTC

Message 183987 in response to message 183969

All my hosts run Linux (not a 'major' distro) and I've never taken the time to see if there are any utilities around that could give that sort of information. I haven't seen anything in my distro's repo that would be suitable.

I run lots of hosts and can usually quickly work out suitable conditions where they crunch efficiently without issue. I don't really have time (or a great deal of interest) in tweaking things to be close to the edge, so I tend to be conservative in choosing operating conditions. Long term stable running is my goal.

I put most effort into writing monitoring scripts that loop continuously to detect potential issues. This is working surprisingly well so even with the large number of hosts, failures are relatively infrequent and I get notified very quickly when something does happen.

Many of the hosts have long uptimes (e.g. 250+ days) and the main limit to that is thunderstorm season (like now) or when the electricity authority causes a power glitch, usually around midnight to 1am local time and usually only on one phase of the 3-phase system :-). The glitches are quite short, so mostly only the older (or lower quality) PSUs, which don't have enough hold-up time to survive the glitch, are affected.

Archae86 did a very nice study of the effect of different DF values on memory use at different multiplicities for one of his systems. That was last year and we are doing a different search (and currently different frequency ranges) but I would expect there to be similar behaviour (higher memory use at higher DF and higher multiplicity) to what he reported.
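For anyone who does want to log GPU memory while varying DF and multiplicity, a minimal sketch might look like the one below. It is not something used in this thread. It assumes a Linux host with the amdgpu driver, which exposes a byte counter as text in a sysfs file such as `/sys/class/drm/card0/device/mem_info_vram_used`; that path and the card index are assumptions to verify on your own system, and the function names are hypothetical.

```python
import time
from pathlib import Path

# Assumed location of the amdgpu VRAM counter; the card index may differ.
ASSUMED_VRAM_FILE = "/sys/class/drm/card0/device/mem_info_vram_used"


def read_vram_used_mib(sysfs_file: str) -> float:
    """Read a sysfs-style file holding a byte count and return MiB."""
    return int(Path(sysfs_file).read_text().strip()) / (1024 * 1024)


def peak_vram_mib(sysfs_file: str, samples: int, interval_s: float) -> float:
    """Poll the counter a fixed number of times and return the highest reading."""
    peak = 0.0
    for _ in range(samples):
        peak = max(peak, read_vram_used_mib(sysfs_file))
        time.sleep(interval_s)
    return peak


# Only touch the real counter if it exists on this machine
if Path(ASSUMED_VRAM_FILE).exists():
    print(f"VRAM in use: {read_vram_used_mib(ASSUMED_VRAM_FILE):.0f} MiB")
```

Running `peak_vram_mib` for the length of one task at each multiplicity, and noting the DF of the tasks in progress, would give the kind of table being asked about. Note that this measures total VRAM in use on the card, not per-task usage.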

Cheers,
Gary.

Raistmer*

Joined: 20 Feb 05

Posts: 208

Credit: 181428947

RAC: 6029

6 Mar 2021 21:38:02 UTC

Message 184003

I'll look at it, thanks.

Eugene Stemple

Joined: 9 Feb 11

Posts: 67

Credit: 377736799

RAC: 593312

6 Jul 2021 0:09:12 UTC

Message 187007

...reviving this thread just to note a recent discouraging development. First, the -good news- : I do not see any cases where data chunks are redundantly reloaded; now the -bad news- : my host is downloading big chunks of data, anywhere from 16 to 34 pairs of h1_ and l1_ files (i.e. 125 MB to 265 MB), running ONE TASK using some subset of the data and then deleting the files. It came to my attention when I exceeded my data usage quota for the month of June. There had been no excess usage in March, April, and May.

E@home is keeping practically nothing in the local disk cache (presently only 2.6 GB) whereas it used to hang on to 15 to 20 GB. It is not building up, or retaining, enough data for locality scheduling to have anything to work with.

As best as I can estimate, the servers began acting up around June 7. Was there a transition around that time to sending "S3b" (note: B!) CPU tasks? They used to be "S3a" tasks. My task list does not go back far enough - there is a "S3" task on June 5 (no data before that) and then everything is "S3b" since then.

I've enabled a BOINC network usage limit of 800 MB/day and I'll let my host do the best it can for E@home with that restriction.
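To put the numbers above side by side, here is a rough back-of-the-envelope sketch. It uses only figures from this post (16 to 34 file pairs, 125 MB to 265 MB per batch, an 800 MB/day limit) and makes the pessimistic assumption that every task needs a completely fresh batch; the function names are hypothetical.

```python
def mb_per_file(batch_mb: float, pairs: int) -> float:
    """Average size of one data file, given a batch of h1_/l1_ pairs."""
    return batch_mb / (2 * pairs)


def batches_per_day(limit_mb: float, batch_mb: float) -> int:
    """Whole batches that fit inside the daily transfer limit."""
    return int(limit_mb // batch_mb)


# Both ends of the reported range work out to roughly 3.9 MB per file
print(round(mb_per_file(125, 16), 1), round(mb_per_file(265, 34), 1))

# Fresh batches allowed per day under the 800 MB limit
print(batches_per_day(800, 125))  # 6
print(batches_per_day(800, 265))  # 3
```

If each batch really is used for one task only, the limit allows somewhere between 3 and 6 such tasks per day.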

Gene.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6590

Credit: 318830268

RAC: 410737

6 Jul 2021 2:29:00 UTC

Message 187008 in response to message 187007

It looks like the O2 data set analysis is winding down and this usually means that there are 'odds & ends' to clean up within the search database. Hence it is much more likely that unrelated tasks may be offered, which can't be dealt with efficiently by locality scheduling, alas. If the downloads are an issue for you, then maybe you could simply uncheck/deselect the O2 applications on the /prefs/project page for your account.

Cheers, Mike.

( edit ) ... not forgetting to Save Changes on the account/prefs/project page, and then trigger an 'Update' via BOINC Manager, say, in order to transmit this new preference change to your machine.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Harri Liljeroos

Joined: 10 Dec 05

Posts: 4360

Credit: 3216593852

RAC: 2049743

6 Jul 2021 8:37:44 UTC

Message 187013

There are download limits available on the preferences/computing page if you wish to experiment with them. Setting a limit might leave your computer without work sometimes though.

Eugene Stemple

Joined: 9 Feb 11

Posts: 67

Credit: 377736799

RAC: 593312

7 Jul 2021 5:35:14 UTC

Message 187034

@Mike... Yes, regarding O2 winding down, I had some suspicion that might be contributing to the distribution of tasks and data blocks. Checking the server status just now it shows ~1300 O2 tasks to send; and the O3 tasks to send are much higher than I've seen them recently. I don't feel too bad about the CPU effort going into the GW search. They are long-running tasks, of course, and RAM considerations limit me to 2 concurrent tasks. But I assume they (CPU results) serve some useful purpose as a validation cross-check with GPU results. Somebody's got to do it!

@Harri... Two days ago I set an 800 MB/day limit. I'm hitting that limit about 8 to 10 hours into each new day (from midnight) but getting enough work (barely) to last through the day. It's not exactly the mix of GW and GRP work that would be ideal, but I'm just letting BOINC and E@home manage the restriction until, as Mike suggests, the run of O2 work is completed and the project moves on to O3.

--Off Topic-- I see O3AS work generator active and O3ASE work generator no longer listed. I infer that O3ASE has been completed. I do have an O3ASE_1.00 (GPU) executable in the E@home directory; so I suppose an O3AS executable will auto download in due time. I'll monitor the forum for messages/discussion on that transition.

EDIT: I see a new application in project preferences: GW O3 All Sky. I have checked/enabled that.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250620969

RAC: 34586

7 Jul 2021 16:11:27 UTC

Message 187038

There was indeed a wrong configuration on the O2MD / GW CPU side. Thanks for the reports, fixed today, should be better now.
