Scheduler Bug with work requests for O2MD1 work - ### Staff Please Read ###

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119595617898

RAC: 24826098

13 Oct 2019 0:17:38 UTC

Topic 219768

This is a problem where the scheduler refuses to send work for a spurious reason. This exact problem has happened on two separate hosts both of which had GPU work for the V1.10 app and both of which have transitioned to requesting GPU work for the V2.01 app. The details for the latest occurrence of the issue are listed below.

Host ID: 536520

This host was running FGRPB1G work. It still has suspended tasks for that search, left over from testing first the V1.10 O2MD1 app and then the later V2.01 version. The scheduler now refuses to send further V2.01 work, with the following log entries.


2019-10-12 23:10:27.5947 [PID=16254] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2019-10-12 23:10:27.5947 [PID=16254] [send] ATI: req 17130.29 sec, 0.00 instances; est delay 0.00
2019-10-12 23:10:27.5947 [PID=16254] [send] work_req_seconds: 0.00 secs
2019-10-12 23:10:27.5947 [PID=16254] [send] available disk 8.13 GB, work_buf_min 125280
2019-10-12 23:10:27.5947 [PID=16254] [send] active_frac 1.000000 on_frac 1.000000 DCF 0.512215
2019-10-12 23:10:27.6009 [PID=16254] [mixed] sending locality work first (0.6038)
2019-10-12 23:10:27.6091 [PID=16254] [send] send_old_work() no feasible result older than 336.0 hours
2019-10-12 23:10:27.6185 [PID=16254] [version] Checking plan class 'GWold'
2019-10-12 23:10:27.6215 [PID=16254] [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2019-10-12 23:10:27.6216 [PID=16254] [version] plan class ok
2019-10-12 23:10:27.6216 [PID=16254] [version] Don't need CPU jobs, skipping version 102 for einstein_O2MD1 (GWold)
2019-10-12 23:10:27.6216 [PID=16254] [version] Checking plan class 'GWnew'
2019-10-12 23:10:27.6216 [PID=16254] [version] WU#421990229 too old
2019-10-12 23:10:27.6216 [PID=16254] [version] Checking plan class 'GW-opencl-ati'
2019-10-12 23:10:27.6216 [PID=16254] [version] WU#421990229 too old
2019-10-12 23:10:27.6216 [PID=16254] [version] Checking plan class 'GW-opencl-nvidia'
2019-10-12 23:10:27.6216 [PID=16254] [version] WU#421990229 too old
2019-10-12 23:10:27.6217 [PID=16254] [version] no app version available: APP#53 (einstein_O2MD1) PLATFORM#7 (x86_64-pc-linux-gnu) min_version 0
2019-10-12 23:10:27.6217 [PID=16254] [version] no app version available: APP#53 (einstein_O2MD1) PLATFORM#1 (i686-pc-linux-gnu) min_version 0

You can see a reference to WU#421990229, which is claimed to be "too old". This host crunched and returned a task belonging to that quorum almost 4 days ago. The task was crunched with the V1.10 app and has a status of "Completed, waiting for validation". Frustratingly, there is no way I can check what is happening in that quorum as the details are hidden. It seems the scheduler is looking at that quorum because I have the correct data files. I'm guessing that the other task for WU#421990229 has failed and the scheduler needs to send a 'resend' task. Since I've already completed a member of that quorum, why is it even considering my host as a potential candidate? Whatever the reason, that seems to kill any prospect of sending new O2MD1 work. On repeated work requests, I get the very same message time after time, followed by a complete refusal to send any new GW GPU work.
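For anyone wanting to check their own host log for the same symptom, here is a minimal sketch that counts the "too old" / "too new" rejections per workunit. The regular expression is inferred only from the log lines quoted in this thread, and the function name `blocking_workunits` is made up for illustration; it is not part of any BOINC tool.

```python
import re
from collections import Counter

# Pattern inferred from lines such as:
# "2019-10-12 23:10:27.6216 [PID=16254] [version] WU#421990229 too old"
# This is an assumption about the log format, not a documented specification.
TOO_OLD_RE = re.compile(r"\[version\]\s+WU#(\d+) too (old|new)")


def blocking_workunits(log_text):
    """Count how often each workunit is rejected as 'too old' or 'too new'."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TOO_OLD_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = """\
2019-10-12 23:10:27.6216 [PID=16254] [version] Checking plan class 'GWnew'
2019-10-12 23:10:27.6216 [PID=16254] [version] WU#421990229 too old
2019-10-12 23:10:27.6216 [PID=16254] [version] Checking plan class 'GW-opencl-ati'
2019-10-12 23:10:27.6216 [PID=16254] [version] WU#421990229 too old
2019-10-12 23:10:27.6216 [PID=16254] [version] Checking plan class 'GW-opencl-nvidia'
2019-10-12 23:10:27.6216 [PID=16254] [version] WU#421990229 too old
"""

for (wu_id, verdict), n in blocking_workunits(sample).items():
    print(f"WU#{wu_id}: 'too {verdict}' x{n}")
```

If the same workunit ID keeps turning up across repeated requests, that is the one to quote when reporting the problem.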

As I mentioned, this very same behaviour occurred on a different host under similar circumstances. The problem lasted for a few days and I put that host back on FGRPB1G work. The problem then suddenly resolved itself. I'm not sure how, but perhaps the incomplete quorum in that case got completed and that caused the problem to go away. Since it has now happened again, perhaps someone could investigate while the behaviour is still occurring. If this continues, the current host will need to go back to FGRPB1G shortly.

Cheers,
Gary.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7388801687

RAC: 2013488

13 Oct 2019 0:24:12 UTC

Message 173846

Gary Roberts wrote:

It still has suspended tasks

Not saying it matches the log you posted, but I thought it was standard behavior for no new work to be sent when a host has any tasks suspended.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

13 Oct 2019 2:17:49 UTC

Message 173848

I think I've seen a somewhat similar situation with several hosts. At some point the server has temporarily refused to send GW GPU tasks even if the host had nothing at all to crunch. Currently I see this happening for host 12779576. It is running nothing and has 0 tasks in its queue now. The cache setting is 3 CPUs / 0.5 days.


2019-10-13 00:51:21.7961 [PID=16073] [handle] [HOST#12779576] [RESULT#889213623] [WU#422432642] got result (DB: server_state=4 outcome=0 client_state=0 validate_state=0 delete_state=0)

<pre>2019-10-13 01:10:18.7341 [PID=22115] [debug] have_master:1 have_working: 1 have_db: 1
 2019-10-13 01:10:18.7342 [PID=22115] [debug] using working prefs
 2019-10-13 01:10:18.7342 [PID=22115] [debug] have db 1; dbmod 1539560187.000000; global mod 1539560187.000000
 2019-10-13 01:10:18.7342 [PID=22115] [send] effective_ncpus 3 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
 2019-10-13 01:10:18.7342 [PID=22115] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
 2019-10-13 01:10:18.7342 [PID=22115] [send] Not using matchmaker scheduling; Not using EDF sim
 2019-10-13 01:10:18.7342 [PID=22115] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
 2019-10-13 01:10:18.7342 [PID=22115] [send] CUDA: req 43200.00 sec, 0.00 instances; est delay 0.00
 2019-10-13 01:10:18.7342 [PID=22115] [send] work_req_seconds: 0.00 secs
 2019-10-13 01:10:18.7342 [PID=22115] [send] available disk 6.54 GB, work_buf_min 43200
 2019-10-13 01:10:18.7342 [PID=22115] [send] active_frac 0.999540 on_frac 0.980263 DCF 0.577028
 2019-10-13 01:10:18.7351 [PID=22115] [mixed] sending locality work first (0.2429)
 2019-10-13 01:10:18.7492 [PID=22115] [send] send_old_work() no feasible result older than 336.0 hours
 2019-10-13 01:10:20.4783 [PID=22115] [version] Checking plan class 'GWold'
 2019-10-13 01:10:20.4813 [PID=22115] [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
 2019-10-13 01:10:20.4813 [PID=22115] [version] plan class ok
 2019-10-13 01:10:20.4813 [PID=22115] [version] Don't need CPU jobs, skipping version 101 for einstein_O2MD1 (GWold)
 2019-10-13 01:10:20.4813 [PID=22115] [version] Checking plan class 'GWnew'
 2019-10-13 01:10:20.4813 [PID=22115] [version] WU#421825419 too old
 2019-10-13 01:10:20.4813 [PID=22115] [version] Checking plan class 'GW-opencl-ati'
 2019-10-13 01:10:20.4813 [PID=22115] [version] WU#421825419 too old
 2019-10-13 01:10:20.4813 [PID=22115] [version] Checking plan class 'GW-opencl-nvidia'
 2019-10-13 01:10:20.4814 [PID=22115] [version] WU#421825419 too old
 2019-10-13 01:10:20.4814 [PID=22115] [version] no app version available: APP#53 (einstein_O2MD1) PLATFORM#9 (windows_x86_64) min_version 0
 2019-10-13 01:10:20.4814 [PID=22115] [version] no app version available: APP#53 (einstein_O2MD1) PLATFORM#2 (windows_intelx86) min_version 0
 2019-10-13 01:10:20.4815 [PID=22115] [mixed] sending non-locality work second
 2019-10-13 01:10:20.5062 [PID=22115] [send] [HOST#12779576] will accept beta work. Scanning for beta work.
 2019-10-13 01:10:20.5218 [PID=22115] [debug] [HOST#12779576] MSG(high) No work sent
 2019-10-13 01:10:20.5219 [PID=22115] [debug] [HOST#12779576] MSG(high) see scheduler log messages on https://einsteinathome.org/host/12779576/log
 2019-10-13 01:10:20.5219 [PID=22115] Sending reply to [HOST#12779576]: 0 results, delay req 60.00

* that 'pre' tag at the beginning of second line in the log wasn't there, but I can't get rid of it on this preview

EDIT: And while I was fighting to make the text display correctly... that host got work again. But that's how it was.

Another host 12761897 hasn't been downloading work for some time now but the reason may be different. This host is currently running only one GW 1.01 cpu task and there are 0 tasks in queue. It was running GW 2.01 gpu tasks but isn't getting them now. Cache setting is 4 cpus and 0.5 days.



2019-10-13 02:00:51.5652 [PID=5574 ] [debug] have db 1; dbmod 1539560187.000000; global mod 1539560187.000000
2019-10-13 02:00:51.5652 [PID=5574 ] [send] effective_ncpus 4 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2019-10-13 02:00:51.5652 [PID=5574 ] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2019-10-13 02:00:51.5653 [PID=5574 ] [send] Not using matchmaker scheduling; Not using EDF sim
2019-10-13 02:00:51.5653 [PID=5574 ] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2019-10-13 02:00:51.5653 [PID=5574 ] [send] ATI: req 43200.00 sec, 0.00 instances; est delay 0.00
2019-10-13 02:00:51.5653 [PID=5574 ] [send] work_req_seconds: 0.00 secs
2019-10-13 02:00:51.5653 [PID=5574 ] [send] available disk 5.59 GB, work_buf_min 43200
2019-10-13 02:00:51.5653 [PID=5574 ] [send] active_frac 0.999530 on_frac 0.978877 DCF 0.339174
2019-10-13 02:00:51.5664 [PID=5574 ] [resend] [HOST#12761897] found lost [RESULT#889070610]: h1_0226.25_O2C02Cl1In0__O2MD1Gn_G34731_226.40Hz_25_1
2019-10-13 02:00:51.5675 [PID=5574 ] [version] Checking plan class 'GWold'
2019-10-13 02:00:51.5708 [PID=5574 ] [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2019-10-13 02:00:51.5708 [PID=5574 ] [version] WU#422358820 too new
2019-10-13 02:00:51.5708 [PID=5574 ] [version] Checking plan class 'GWnew'
2019-10-13 02:00:51.5708 [PID=5574 ] [version] plan class ok
2019-10-13 02:00:51.5708 [PID=5574 ] [version] Don't need CPU jobs, skipping version 200 for einstein_O2MD1 (GWnew)
2019-10-13 02:00:51.5709 [PID=5574 ] [version] Checking plan class 'GW-opencl-ati'
2019-10-13 02:00:51.5709 [PID=5574 ] [version] parsed project prefs setting 'gpu_util_gw': 1.000000
2019-10-13 02:00:51.5709 [PID=5574 ] [version] Peak flops supplied: 5e+10
2019-10-13 02:00:51.5709 [PID=5574 ] [version] plan class ok
2019-10-13 02:00:51.5709 [PID=5574 ] [version] beta test app versions not allowed in project prefs.
2019-10-13 02:00:51.5709 [PID=5574 ] [version] Checking plan class 'GW-opencl-nvidia'
2019-10-13 02:00:51.5709 [PID=5574 ] [version] parsed project prefs setting 'gpu_util_gw': 1.000000
2019-10-13 02:00:51.5709 [PID=5574 ] [version] No CUDA devices found
2019-10-13 02:00:51.5710 [PID=5574 ] [version] no app version available: APP#53 (einstein_O2MD1) PLATFORM#9 (windows_x86_64) min_version 0
2019-10-13 02:00:51.5710 [PID=5574 ] [version] no app version available: APP#53 (einstein_O2MD1) PLATFORM#2 (windows_intelx86) min_version 0
2019-10-13 02:00:51.5710 [PID=5574 ] [CRITICAL] [HOST#12761897] can't resend [RESULT#889070610]: no app version for einstein_O2MD1
2019-10-13 02:00:51.5710 [PID=5574 ] [resend] [HOST#12761897] 1 lost results, resent 0
2019-10-13 02:00:51.5711 [PID=5574 ] [mixed] sending locality work first (0.5974)
2019-10-13 02:00:51.6006 [PID=5574 ] [send] send_old_wo
2019-10-13 02:00:51.8991 [PID=5576 ] SCHEDULER_REQUEST::parse(): unrecognized: <allow_multiple_clients>0</allow_multiple_clients>
2019-10-13 02:00:52.6495 [PID=5574 ] [mixed] sending non-locality work second
2019-10-13 02:00:52.6745 [PID=5574 ] [send] [HOST#12761897] will accept beta work. Scanning for beta work.
2019-10-13 02:00:52.6908 [PID=5574 ] [debug] [HOST#12761897] MSG(high) No work sent
2019-10-13 02:00:52.6909 [PID=5574 ] [debug] [HOST#12761897] MSG(high) see scheduler log messages on https://einsteinathome.org/host/12761897/log
2019-10-13 02:00:52.6909 [PID=5574 ] Sending reply to [HOST#12761897]: 0 results, delay req 60.00

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119595617898

RAC: 24826098

13 Oct 2019 2:48:11 UTC

Message 173849 in response to message 173846

archae86 wrote:

... standard behavior for no new work to be sent when a host has any tasks suspended.

Absolutely, so I need to manage a bit of a delicate dance to get what I want when I want it :-). Every host uses an app_config.xml file to control each app and the task multiplicity. I use 'locations' on top of that to allow only a particular type of work for a particular location. It's very quick and easy to change location for a given host. I have only 4 machines under test so all the others just carry on as usual. It takes me maybe 15-20 mins to do a big top-up of those 4 hosts.

Most of the time, a host being tested runs with 'no new work' set. If I want a fresh batch of a particular type of work, I make sure to change to the appropriate location, adjust the cache setting to allow a work request and 'resume' all suspended tasks for the duration of the work request. Finally, I 'allow new work'. When the new tasks have arrived, I set 'no new work' and again 'suspend' those tasks that I don't want to be running for the time being.

It sounds complicated but it's fairly trivial to do. The work requests are large enough that I might only need to do something like this once or twice per day. Of course, it becomes a real pain if the scheduler refuses to cooperate when it's supposed to :-).
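For those who prefer the command line, the resume / allow / fetch / stop / suspend part of that routine could be sketched roughly as below. This is only an illustration under stated assumptions: it assumes `boinccmd` is on the PATH, that the client is attached with the project URL shown, and it uses a placeholder task name. Changing the location and the cache setting is done in the website preferences and is not covered. The helper `top_up_commands` is hypothetical and only builds the command lists; it does not run anything.

```python
# Assumption: the URL the client is attached with; check your own client.
PROJECT_URL = "https://einsteinathome.org/"


def top_up_commands(project_url, suspended_tasks):
    """Build boinccmd invocations for one top-up, split into two phases.

    Phase 1 is run before the work request, phase 2 once the new tasks
    have arrived. Nothing is executed here; the lists are only returned.
    """
    before = [
        ["boinccmd", "--task", project_url, name, "resume"]
        for name in suspended_tasks
    ]
    before.append(["boinccmd", "--project", project_url, "allowmorework"])
    before.append(["boinccmd", "--project", project_url, "update"])

    after = [["boinccmd", "--project", project_url, "nomorework"]]
    after += [
        ["boinccmd", "--task", project_url, name, "suspend"]
        for name in suspended_tasks
    ]
    return before, after


# "example_task_0" is a placeholder, not a real task name.
before, after = top_up_commands(PROJECT_URL, ["example_task_0"])
for cmd in before + after:
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run` one at a time, with a pause between the two phases while the scheduler reply comes in.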

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119595617898

RAC: 24826098

13 Oct 2019 4:32:47 UTC

Message 173851 in response to message 173848

Richie wrote:

I think I've seen a somewhat similar situation with several hosts. At some point the server has temporarily refused to send GW GPU tasks even if the host had nothing at all to crunch. Currently I see this happening for host 12779576.

2019-10-13 01:10:20.4814 [PID=22115] [version] WU#421825419 too old

Yes, that looks pretty much the same sort of thing that I've got.

If you check through your hosts to see if that workunit belongs to one of them (any one of them) then I think I can explain what has happened to allow the host 12779576 to suddenly get work. I had a quick look (via your link) and couldn't find that particular workunit on that particular host but I may have missed it or it may be on one of your other hosts. If so, none of your hosts should be allowed to get a resend of that particular quorum. That shouldn't be a reason for not allocating some other tasks, though.

I checked the workunit itself in case it had already been completed. It's still currently 'hidden' so I can't know the hosts that have tasks in that quorum. What I suspect may have happened is that the scheduler has found a different host to give the extra task to and so your host is no longer of interest. Your host therefore is suddenly able to get new work rather than continue being denied. Of course, this is all just conjecture but it would be interesting to know if any of your hosts had a task belonging to that #421825419 quorum.

The other example you show seems to be something a bit different since it involves a supposedly lost result. It could easily be part of the same problem so I hope the Devs can have a look at this promptly in case lots more examples start showing up. There have been some reports of inability to get work already which was why I jumped on this as soon as I saw some examples. Thanks for adding your information to this thread.

Cheers,
Gary.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

13 Oct 2019 5:51:27 UTC

Message 173852

Okay, I went through all tasks from all my hosts but didn't find any of them to belong to WU 421825419. Theoretically it's possible though for that task to have slipped from the next page to that previous page where I was moving away from at the same moment... as the database kept updating.

edit: Now that "too old" joke is going on with another host. WU is just different... #421966858

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119595617898

RAC: 24826098

13 Oct 2019 7:08:48 UTC

Message 173853 in response to message 173852

Richie wrote:

Okay, I went through all tasks from all my hosts but didn't find any of them to belong to WU 421825419. Theoretically it's possible though for that task to have slipped from the next page to that previous page where I was moving away from at the same moment... as the database kept updating.

If you click on the Workunit ID header on the first page, they will all be sorted into ascending numerical order which makes it quicker to zero in on the page containing the closest numbers.

Richie wrote:

edit: Now that "too old" joke is going on with another host. WU is just different... #421966858

I took a look at that particular ID. This is what I got.

Workunit 421966858
Name: h1_0229.65_O2C02Cl1In0__O2MD1G_G34731_229.75Hz_14
Application: Gravitational Wave search O2 Multi-Directional
Created: 8 Oct 2019 18:49:11 UTC

Tasks are pending for this workunit.

Obviously not helpful in trying to see which host IDs have been allocated tasks for that quorum and if there are unsent resends. The task frequency 0229.65Hz is a clue. Does the host in question already have large data files around that particular value and ranging slightly above there? That would be a reason for the scheduler to be interested in sending that sort of a task to you. That's how locality scheduling is supposed to work to minimise large data file downloads.
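To make the frequency check less of an eyeball job, here is a rough sketch that pulls the two frequencies out of a workunit or task name and compares the leading one against frequencies a host already holds. The field layout is inferred only from the names quoted in this thread, the group names are my own labels rather than official terminology, and the 0.5 Hz tolerance is an arbitrary assumption, not the scheduler's actual rule.

```python
import re

# Layout inferred from names like
# "h1_0229.65_O2C02Cl1In0__O2MD1G_G34731_229.75Hz_14" (workunit) and
# "h1_0226.25_O2C02Cl1In0__O2MD1Gn_G34731_226.40Hz_25_1" (task).
# Group names are hypothetical labels.
NAME_RE = re.compile(
    r"^(?P<detector>[a-z]\d)_"
    r"(?P<leading_freq>\d+\.\d+)_"
    r"(?P<dataset>[^_]+)__"
    r"(?P<search>[^_]+)_"
    r"(?P<tag>[^_]+)_"
    r"(?P<trailing_freq>\d+\.\d+)Hz_"
    r"(?P<sequence>\d+)"
    r"(?:_(?P<result_index>\d+))?$"
)


def parse_name(name):
    """Split a name into its fields, or return None if it does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["leading_freq"] = float(fields["leading_freq"])
    fields["trailing_freq"] = float(fields["trailing_freq"])
    return fields


def near_held_data(name, held_freqs_hz, tolerance_hz=0.5):
    """True if the leading frequency is within tolerance of any held frequency.

    tolerance_hz is an illustrative guess, not a documented scheduler value.
    """
    fields = parse_name(name)
    if fields is None:
        return False
    return any(abs(fields["leading_freq"] - f) <= tolerance_hz for f in held_freqs_hz)


wu = "h1_0229.65_O2C02Cl1In0__O2MD1G_G34731_229.75Hz_14"
print(parse_name(wu)["leading_freq"], parse_name(wu)["trailing_freq"])
print(near_held_data(wu, [229.75, 229.90]))
```

The list passed as `held_freqs_hz` would have to be filled in by hand from the data files or task names already on the host.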

Cheers,
Gary.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

13 Oct 2019 12:24:26 UTC

Message 173854 in response to message 173853

Gary Roberts wrote:

If you click on the Workunit ID header on the first page, they will all be sorted into ascending numerical order which makes it quicker to zero in on the page containing the closest numbers.

Well, that made it easier for sure. Thanks again. I hadn't thought clearly enough. Well... there is 421824xxx and 421827xxx but nothing in between. No match (WU 421825419).

Gary Roberts wrote:

I took a look at that particular ID. This is what I got.

Workunit 421966858
Name: h1_0229.65_O2C02Cl1In0__O2MD1G_G34731_229.75Hz_14
Application: Gravitational Wave search O2 Multi-Directional
Created: 8 Oct 2019 18:49:11 UTC

Tasks are pending for this workunit.

Obviously not helpful in trying to see which host IDs have been allocated tasks for that quorum and if there are unsent resends. The task frequency 0229.65Hz is a clue. Does the host in question already have large data files around that particular value and ranging slightly above there? That would be a reason for the scheduler to be interested in sending that sort of a task to you. That's how locality scheduling is supposed to work to minimise large data file downloads.

Yes it does!
I did not see any 229.65Hz tasks but there are plenty of invalids and errors on the 229.75-229.90Hz ones. Also, the last tasks that it ran this morning were that same stuff. I expect those latter ones to validate quite well though.

(That host had experienced some kind of Radeon GPU driver reset already on Thursday but it kept crunching in a seemingly normal way. Sadly I didn't notice that until later. Errors and invalids started to accumulate and finally I noticed a notification that had come from Radeon Settings. There will still be more errors and invalids coming but I changed this GPU to run in 1x configuration instead of 2x and set a power limit of -20 % about a day and a half ago. I haven't seen tasks ending up with validation errors after that. On the positive side, there are already successful validations instead. That card just can't seem to run successfully here at full power and 2x; I should've remembered that from not so long ago.)

edit: A moment ago, out of nowhere, this host 12329334 got 36 GW gpu tasks at once. No more "WU # too old". I had nothing to do with that. Everything is fine again...

And a couple of hours earlier:

Host 12761897 that had a problem with a task being lost.
I tried to shake and wake the scheduler to take a look at that one in a new perspective. I moved the host from "GW gpu only" venue to "FGRP gpu only" venue. I set a very low work cache and let the host download about 15 FGRP gpu tasks. Then I changed venue back to "GW gpu only" and increased the work cache setting. I have allowed both AMD and Nvidia on that venue but no cpu. Beta is allowed.

I thought there would be coming only GW gpu tasks if anything then... but next... there was immediately one "resent lost task". It was a GWnew 2.00 cpu task and it started running.

Right after that... GW gpu tasks started to flow in. About 40 of them... with work cache set to 3 cpus / 1 day. Scheduler is currently happy and says "don't need" when requesting for more work.

I don't know at all if this micro-managing was at all crucial in fixing the problem with that lost task. I think a user at least shouldn't need to fiddle back and forth with the settings to make tasks flow again.

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0

13 Oct 2019 18:30:36 UTC

Message 173859

Hey Gary,

I'm seeing this when I try to find out why my computer won't download any new work units

[version] beta test app versions not allowed in project prefs.
2019-10-13 18:16:11.6344 [PID=4640 ]    [version] no app version available: APP#53 (einstein_O2MD1) PLATFORM#7 (x86_64-pc-linux-gnu) min_version 0
2019-10-13 18:16:11.6344 [PID=4640 ]    [version] no app version available: APP#53 (einstein_O2MD1) PLATFORM#1 (i686-pc-linux-gnu) min_version 0
2019-10-13 18:16:11.6344 [PID=4640 ] [CRITICAL]   [HOST#12789230] can't resend [RESULT#888673591]: no app version for einstein_O2MD1
2019-10-13 18:16:11.6345 [PID=4640 ]    [resend] [HOST#12789230] found lost [RESULT#888674470]: h1_0216.40_O2C02Cl1In0__O2MD1Gn_G34731_216.55Hz_25_1
2019-10-13 18:16:11.6351 [PID=4640 ] [CRITICAL]   [HOST#12789230] can't resend [RESULT#888674470]: no app version for einstein_O2MD1
2019-10-13 18:16:11.6351 [PID=4640 ]    [resend] [HOST#12789230] found lost [RESULT#888674494]: h1_0216.20_O2C02Cl1In0__O2MD1Gn_G34731_216.35Hz_24_1
2019-10-13 18:16:11.6357 [PID=4640 ] [CRITICAL]   [HOST#12789230] can't resend [RESULT#888674494]: no app version for einstein_O2MD1
2019-10-13 18:16:11.6357 [PID=4640 ]    [resend] [HOST#12789230] found lost [RESULT#888677718]: h1_0164.10_O2C02Cl1In0__O2MD1Gn_G34731_164.20Hz_14_1
2019-10-13 18:16:11.6363 [PID=4640 ] [CRITICAL]   [HOST#12789230] can't resend [RESULT#888677718]: no app version for einstein_O2MD1
2019-10-13 18:16:11.6363 [PID=4640 ]    [resend] [HOST#12789230] found lost [RESULT#888678487]: h1_0216.30_O2C02Cl1In0__O2MD1Gn_G34731_216.45Hz_24_0
2019-10-13 18:16:11.6368 [PID=4640 ] [CRITICAL]   [HOST#12789230] can't resend [RESULT#888678487]: no app version for einstein_O2MD1
2019-10-13 18:16:11.6369 [PID=4640 ]    [resend] [HOST#12789230] found lost [RESULT#888678528]: h1_0223.10_O2C02Cl1In0__O2MD1Gn_G34731_223.25Hz_28_0
2019-10-13 18:16:11.6374 [PID=4640 ] [CRITICAL]   [HOST#12789230] can't resend [RESULT#888678528]: no app version for einstein_O2MD1

version for einstein_O2MD1
2019-10-13 18:25:31.1503 [PID=7360 ] [resend] [HOST#12789230] 288 lost results, resent 0
2019-10-13 18:25:31.1504 [PID=7360 ] [mixed] sending locality work first (0.1729)
2019-10-13 18:25:31.1602 [PID=7360 ] [send] send_old_work() no feasible result older than 336.0 hours
2019-10-13 18:25:31.1704 [PID=7360 ] [send] send_old_work() no feasible result younger than 260.3 hours and older than 168.0 hours
2019-10-13 18:25:31.1896 [PID=7360 ] [mixed] sending non-locality work second
2019-10-13 18:25:31.2155 [PID=7360 ] [send] [HOST#12789230] will accept beta work. Scanning for beta work

Tried to reset the project after I saw that but no go.....
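With that many lost results, a small script can help see which ones the scheduler found and why it would not resend them. The sketch below pairs each "found lost" entry with its matching "can't resend" entry by result ID. The patterns are inferred from the log excerpt above only, and `summarize_lost_results` is a made-up helper name.

```python
import re

# Patterns inferred from the excerpt above; the real log format may vary.
LOST_RE = re.compile(
    r"\[resend\]\s+\[HOST#(\d+)\] found lost \[RESULT#(\d+)\]: (\S+)"
)
FAILED_RE = re.compile(
    r"\[CRITICAL\]\s+\[HOST#(\d+)\] can't resend \[RESULT#(\d+)\]: (.+)"
)


def summarize_lost_results(log_text):
    """Map result ID -> {'task': name or None, 'reason': text or None}."""
    results = {}
    for line in log_text.splitlines():
        lost = LOST_RE.search(line)
        failed = FAILED_RE.search(line)
        if lost:
            entry = results.setdefault(lost.group(2), {"task": None, "reason": None})
            entry["task"] = lost.group(3)
        elif failed:
            entry = results.setdefault(failed.group(2), {"task": None, "reason": None})
            entry["reason"] = failed.group(3).strip()
    return results


sample = """\
2019-10-13 18:16:11.6344 [PID=4640 ] [CRITICAL]   [HOST#12789230] can't resend [RESULT#888673591]: no app version for einstein_O2MD1
2019-10-13 18:16:11.6345 [PID=4640 ]    [resend] [HOST#12789230] found lost [RESULT#888674470]: h1_0216.40_O2C02Cl1In0__O2MD1Gn_G34731_216.55Hz_25_1
2019-10-13 18:16:11.6351 [PID=4640 ] [CRITICAL]   [HOST#12789230] can't resend [RESULT#888674470]: no app version for einstein_O2MD1
"""

for result_id, info in summarize_lost_results(sample).items():
    print(result_id, info["task"], "->", info["reason"])
```

A result with a reason but no task name simply means the "found lost" line fell outside the pasted excerpt, as with the first entry here.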

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119595617898

RAC: 24826098

13 Oct 2019 22:06:20 UTC

Message 173861 in response to message 173854

Richie wrote:

....
Host 12761897 that had a problem with a task being lost.
I tried to shake and wake the scheduler to take a look at that one in a new perspective. I moved the host from "GW gpu only" venue to "FGRP gpu only" venue. I set a very low work cache and let the host download about 15 FGRP gpu tasks. Then I changed venue back to "GW gpu only" and increased the work cache setting. I have allowed both AMD and Nvidia on that venue but no cpu. Beta is allowed.

I thought there would be coming only GW gpu tasks if anything then... but next... there was immediately one "resent lost task". It was a GWnew 2.00 cpu task and it started running.

Earlier, in the O2MD1 discussion thread, I reported about deliberately creating a couple of lost GPU tasks. The idea was to see if the scheduler would resend them as tasks for the new V2.01 GPU app. The scheduler did resend them - but as CPU tasks and NOT GPU tasks. I aborted them.

This was probably due to the rule of "only one beta task per quorum". Even though the lost task was the beta one, the scheduler must have still counted the original beta allocation and so sent a non-beta CPU task as the replacement. Maybe something like that happened to you, since you also got a CPU task. With your lost task issue resolved, perhaps that was what allowed the normal flow of tasks again.

Richie wrote:

I don't know at all if this micro-managing was at all crucial in fixing the problem with that lost task. I think a user at least shouldn't need to fiddle back and forth with the settings to make tasks flow again.

I don't think any micro-managing activities had anything to do with resuming the flow of tasks. It seems to me that something else had to happen at the scheduler end to clear what was causing the blockage. Hopefully one of the staff will get to take a look shortly.

The host I reported on in starting this thread still gets refused any requests for O2MD1 GPU work. It now has a big fat cache of FGRPB1G and is making occasional requests for O2MD1. Sooner or later, whatever is the issue with WU#421990229 (which is still claimed to be "too old") will get resolved and I will be able to get back to running some more GW work. I'm happy for the problem to remain so that the staff can still see the evidence.

Cheers,
Gary.

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0

13 Oct 2019 22:35:35 UTC

Message 173862

I detached and reattached and was sent 20 resends but as CPU instead of GPU.
