Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

Keith Myers
Keith Myers
Joined: 11 Feb 11
Posts: 4968
Credit: 18759625494
RAC: 7160230

Zalster wrote:
Haven't seen any O2MD1 yet, all I keep getting are the O2AS20-500.  

I am not getting any O2MD1 cpu work either.  See, I am in similar company as you.

Zalster
Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Think I will have to play with my computer and see why we aren't getting any of these. Richie is correct in that he is getting some of these

h1_0057.90_O2C02Cl1In0__O2MD1G_G34731_57.95Hz_0_0

very unusual.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117730368921
RAC: 34970139

Keith Myers wrote:

Gary, I still have not seen any log resembling what I first saw, but I do have a similar log from Zalster currently.  He is not getting any O2MD1 cpu tasks either.

https://einsteinathome.org/host/12789230/log

OK, thanks for that.  I just had a look at the latest log for that host and it's rather large and convoluted.  Here are some key entries together with my (possibly quite erroneous) interpretation.  The log starts with the successful return of 18 FGRPB1G tasks with a "SCHEDULER_REQUEST::parse(): unrecognized: <allow_multiple_clients>0</allow_multiple_clients>" thrown in for good measure.  Looks like the Einstein scheduler doesn't understand that particular option.

That is followed by

2019-10-06 06:43:27.5173 [PID=6927 ] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2019-10-06 06:43:27.5173 [PID=6927 ] [send] CUDA: req 7932.99 sec, 0.00 instances; est delay 0.00

which shows no request for CPU work but a request for GPU work.  As an aside, there must be something at your end that is preventing your client from requesting CPU work.  Then there is a request for locality scheduling (ie GW GPU) work - I've trimmed the timestamps to show just the actual messages.  There is always some sort of 'lottery' that goes on to check what gets sent first - locality or non-locality.  Locality won this time.  You can also see checks for 'old' work - I assume generated work in particular frequency bins that is getting 'stale' - not enough hosts requesting that particular bin so let's add this host to one of these bins if stale work exists.  Of course, with a brand new search, no stale work will exist.

[mixed] sending locality work first (0.3650)
[locality] [HOST#12789230] ignoring file JPLEPH.405
[locality] [HOST#12789230] ignoring file earth00-19-DE405.dat
[locality] [HOST#12789230] ignoring file sun00-19-DE405.dat
[locality] [HOST#12789230] ignoring file 20180220_O2_excludebadepochs_min5sft_9m60h.seg
[locality] send_work_locality(): locality_appid_filter 'appid IN (52,53) AND'
[send] send_old_work() no feasible result older than 336.0 hours
[locality] send_work_locality(): sending work for new file(s)
[locality] send_new_file_work(): try to send old work
[send] send_old_work() no feasible result younger than 324.9 hours and older than 168.0 hours
[locality] send_new_file_work(0): try to send from working set
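
Gary's reading of those log lines could be sketched as a decision sequence. This is pure guesswork reconstructed from the log messages above - the real scheduler is C++ and far more involved, and every function name and threshold here is an assumption:

```python
import random

def plan_locality_send(locality_share, result_ages_hours, working_set_bins):
    """Hypothetical sketch of the decision sequence visible in the log.
    Names and thresholds are guessed from the log messages, not taken
    from the actual scheduler source."""
    steps = []
    # 1. The 'lottery': a random draw decides whether locality (GW)
    #    or non-locality (FGRPB1G) work is offered first.
    draw = random.random()
    kind = "locality" if draw < locality_share else "non-locality"
    steps.append(f"sending {kind} work first ({draw:.4f})")
    # 2. Look for very stale results first (older than 336 h = 14 days).
    if any(age > 336.0 for age in result_ages_hours):
        steps.append("sent result older than 336.0 hours")
        return steps
    steps.append("no feasible result older than 336.0 hours")
    # 3. Then moderately old work (older than 168 h = 7 days).
    if any(age > 168.0 for age in result_ages_hours):
        steps.append("sent result older than 168.0 hours")
        return steps
    steps.append("no feasible result older than 168.0 hours")
    # 4. Finally fall back to the 'working set' of frequency bins.
    steps.append(f"trying working set: {working_set_bins}")
    return steps
```

With a brand new search there are no old results at all, so every request falls straight through to the working-set stage, which matches the log.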

Then there are a zillion lines where it tries to find available tasks belonging to particular large data files.  I'm not going to reproduce it all - I'll just list the filenames that were checked

[locality] send_new_file_working_set will try filename h1_0115.55_O2C02Cl1In0
[locality] send_new_file_working_set will try filename h1_0094.55_O2C02Cl1In0
[locality] send_new_file_working_set will try filename h1_0114.50_O2C02Cl1In0
[locality] send_new_file_working_set will try filename h1_0101.70_O2C02Cl1In0
[locality] send_new_file_working_set will try filename h1_0072.05_O2C02Cl1In0

Each of these searches fails.  Looks like 5 separate frequency bins will be tried before the scheduler gives up.  Here is the last fail invoking a potential order to the work generator (maybe it needs =1 rather than =0 to actually get work generated), together with what the scheduler then does.

[locality] make_more_work_for_file(h1_0072.05_O2C02Cl1In0, 1)=0
[locality] send_work_locality(): returning with work still needed
[mixed] sending non-locality work second

So the scheduler gives up the hunt for locality scheduling work and moves on to fill your request with non-locality work - ie. FGRPB1G.
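
That give-up behaviour might look something like this loop - a guess from the five 'will try filename' lines above; `has_tasks` is a hypothetical lookup, not a real scheduler function:

```python
def try_working_set(bins, has_tasks, max_tries=5):
    """Try up to five frequency-bin data files from the working set;
    return the first one that still has tasks, or None so the scheduler
    can fall back to non-locality (FGRPB1G) work."""
    for filename in bins[:max_tries]:
        if has_tasks(filename):
            return filename
    return None  # "returning with work still needed"
```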

At no point is there any complaint about the O2MD1 CPU app but it was obviously happy to try to fill a request for O2MD1 GPU work - which Bernd has disabled until sometime next week when he has worked out if any further mod to the GPU app is needed.

I've had several looks at the server status page over the last several hours.  My memory may be faulty but I seem to recall the tasks in progress (1828) and number available to send (6292) as being at those same values at each previous inspection.  Maybe none are being sent out anyway.  That still doesn't explain why your host is requesting 0.00 secs of CPU work - a puzzle for you to solve :-).

Cheers,
Gary.

floyd
floyd
Joined: 12 Sep 11
Posts: 133
Credit: 186610495
RAC: 0

Zalster wrote:
Think I will have to play with my computer and see why we aren't getting any of these.

I think you'll just have to wait. Looking at the logs, though I don't really understand them, I can see the scheduler changes behaviour depending on the age of the work available. The first step is at 168h/7d, then 205.1h (variable) and 336h/14d. I guess work will only be available for everybody once it's been sitting in the queue for 14 days. Before the age of 7 days very tight (and unknown) restrictions will apply. Personally I've only got a single task and it is a resend. I'm sure something will change next week when the first work will have been waiting for 7 days.
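
floyd's reading of the thresholds could be summarised like this - my own guess from the log, not verified against the scheduler source, and the tier names are made up:

```python
def restriction_tier(age_hours):
    """Classify queued work by age, following the thresholds visible
    in the scheduler log (168 h = 7 days, 336 h = 14 days)."""
    if age_hours >= 336.0:
        return "unrestricted"   # sent to anyone who asks
    if age_hours >= 168.0:
        return "relaxed"        # some data-file checks skipped
    return "tight"              # strict (and unknown) restrictions
```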

Quote:
very unusual

Not at all. It was the same with O2AS, and the same confusion. In the first days you can't get any work and you don't understand why. Then suddenly it starts coming when nothing seems to have changed. Just be patient and wait. Oh, and a (short) explanation by someone who knows the scheduler's details would be welcome, I'm sure this situation will come again.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117730368921
RAC: 34970139

floyd wrote:
I guess work will only be available for everybody once it's been sitting in the queue for 14 days. Before the age of 7 days very tight (and unknown) restrictions will apply.

No, quite the contrary.  If an initial task for particular data has been sent out and a second task hasn't been sent (ie. the quorum is becoming stale and the original recipient might start complaining), then prioritise sending the second task to anyone asking for work for the first time, because they won't already have existing data files.  I'm guessing the 7 day mark will target just 'new' hosts, whereas by 14 days things are getting desperate so the very next request will be targeted (existing data or not) just to get the very stale quorums moving.

With a new search like O2MD1, there will be no stale quorums so people will keep getting work aligned to data they already have.
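
That resend targeting might be sketched as follows - entirely hypothetical, and `host_has_data` stands in for whatever locality check the scheduler really does:

```python
def send_second_task(quorum_age_hours, host_has_data):
    """Decide whether this host should get the second (resend) task of
    a staling quorum, per the interpretation above."""
    if quorum_age_hours >= 336.0:
        return True                 # desperate: send to the very next request
    if quorum_age_hours >= 168.0:
        return not host_has_data    # target 'new' hosts without existing data
    return host_has_data            # normal locality matching
```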

Cheers,
Gary.

floyd
floyd
Joined: 12 Sep 11
Posts: 133
Credit: 186610495
RAC: 0

Disclaimer: I'm guessing all this just from the scheduler's output. My guesses seem reasonable to me but of course I have no access to the actual code to verify them.

Gary Roberts wrote:
floyd wrote:
I guess work will only be available for everybody once it's been sitting in the queue for 14 days. Before the age of 7 days very tight (and unknown) restrictions will apply.
No, quite the contrary.  If an initial task for particular data has been sent out and a second task hasn't been sent (ie. the quorum is becoming stale and the original recipient might start complaining) then prioritise the sending of the second task to anyone asking for new work for the first time because they won't already have existing data files.

To anyone without checking for existing files, and that's after 14 days. (14 days after what I wonder. After generating the WU? After sending the first task? Anything else? The log doesn't tell.)

During those 14 days the check is done, but if that doesn't get you work there's a second stage where the term "working set" comes into play, with lower, flexible time limits. Again old work is sent without further check, but for fresh work below 7 days another check for the presence of data is done. More restrictive or less restrictive than the previous one? Less restrictive would be reasonable. Anyhow, it seems to be skipped after 7 days (7 days after what?), so I expect getting work to become easier then, and possibly unrestricted after 14 days as outlined above.

Quote:
With a new search like O2MD1, there will be no stale quorums so people will keep getting work aligned to data they already have.

And initially nobody has any data or am I missing something? Where's that stupid hen hiding?

Matt White
Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

I haven't seen any O2MD1 tasks show up yet.

Clear skies,
Matt
Richie
Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I had this dual-boot system that crunched fourteen O2MD1 tasks in Windows (12329446). It still has an old 160GB IDE HDD that had already become quite slow for running the system (can't get SMART info, but this disk must have seen a large number of GBs over the years.... it has had all of the Win 10 builds so far installed, etc). I got bored and thought this would be a good day for wiping the disk. Then I installed the Ubuntu 19.10 "daily-live" build on it, at least temporarily: 12464604
I did that mainly to see if this "new" system could get O2MD1 tasks. The record of this host doesn't look new because I used the old name from the previous setup while it still had Linux Mint.
This host is now running AS20 cpu tasks fine, but 12 hours have gone by and it hasn't been able to get a single task from O2MD1.

Another host (12779576) that earlier got 35 tasks and crunched them hasn't been able to get any additional O2MD1 tasks. I'll change it to accept a few AS20 cpu tasks and will observe what the situation is next week.

Anonymous

results/credits are coming in  now.  0 invalids, 0 errors on 2 pcs.  I am taking back my "credits" statement because I believe it to be fundamentally wrong.  I am seeing credits in the range of 120 to 1000 across two pcs.

feels like an optimistic start to O2MD1.

Richie
Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Richie wrote:
12464604
This host is now running AS20 cpu tasks fine but 12 hours have gone by and it hasn't been able to get a single task from O2MD1.

Today at 02:30 PM local time (UTC +3) the first tasks started to flow in. With a work cache setting of '5 cores / 1 day / no additional tasks' this host had received 160 tasks while I was sleeping. The log now says "reached daily quota". That is insane.

Another host that had already crunched a few tasks and had its work cache set to '5 cores / 2 days / no additional tasks' had received 416 tasks. "Reached daily quota". I'm going to abort about 200 of them as they won't be starting before the deadline. I see there are many tasks with a 'frequency' value of over 120 Hz, and I assume those will take even more time to run than the 70-80 Hz tasks I've seen results for so far.

edit: Work was downloaded within one hour, at 11-18 tasks per contact.

The logic behind that scheduling would be interesting to know. Why would a host that has already run a few tasks of a particular type of work (and has produced basic info about the run time of that type of work on that computer) then receive work for almost a month... with a work cache setting of '2 days'?
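
A sane fetch for that cache setting is easy to estimate - this is my own back-of-envelope arithmetic, not BOINC's actual work-fetch logic:

```python
def expected_cache_tasks(cache_days, cores, est_task_hours):
    """Tasks needed to fill the cache if each task really takes
    est_task_hours on one core."""
    return int(cache_days * 24 * cores / est_task_hours)
```

With '5 cores / 2 days' and, say, a 3-hour task estimate, that's `expected_cache_tasks(2, 5, 3)` = 80 tasks - nowhere near the 416 actually sent, which suggests the server's runtime estimate for the new search was far too low.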

Quote:
Another host (12779576) that earlier got 35 tasks and crunched them hasn't been able to get any additional O2MD1 tasks.

This one still isn't receiving them.

But tasks have validated well. No errors or invalids so far.
