error condition i am not understanding

Anonymous

13 Nov 2019 11:52:27 UTC

Topic 219969

I have 100s of these errors in the "last contacts log" for computer 116 and I cannot get/download any O2MDFG2 (Gravitational Wave search O2 Multi-Directional GPU v2.02) work units.  Don't know if they are related or not.

What does "no app version for einstein_O2MDF" mean?

2019-11-13 11:35:36.3975 [PID=18951] [CRITICAL]   [HOST#12619116] can't resend [RESULT#896857655]: no app version for einstein_O2MDF
2019-11-13 11:35:36.3975 [PID=18951]    [resend] [HOST#12619116] found lost [RESULT#896900828]: h1_0658.50_O2C02Cl2In0__O2MDFG2_G34731_658.75Hz_82_1
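With hundreds of these lines in the contact log, a quick tally can show how many results the server considers lost and how many it refuses to resend. The following is only a rough sketch: the regular expression is inferred from the two log lines above rather than from any server documentation, and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Two sample lines copied from the host's scheduler contact log.
LOG = """\
2019-11-13 11:35:36.3975 [PID=18951] [CRITICAL]   [HOST#12619116] can't resend [RESULT#896857655]: no app version for einstein_O2MDF
2019-11-13 11:35:36.3975 [PID=18951]    [resend] [HOST#12619116] found lost [RESULT#896900828]: h1_0658.50_O2C02Cl2In0__O2MDFG2_G34731_658.75Hz_82_1
"""

# Assumption: every relevant line has the shape
#   [HOST#<id>] <event> [RESULT#<id>]: <detail>
LINE_RE = re.compile(r"\[HOST#(\d+)\] (can't resend|found lost) \[RESULT#(\d+)\]: (.*)")


def summarize(log_text):
    """Count 'found lost' and "can't resend" events and collect result IDs per event."""
    counts = Counter()
    result_ids = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # not one of the two event types we care about
        _host, event, result_id, _detail = match.groups()
        counts[event] += 1
        result_ids.setdefault(event, []).append(int(result_id))
    return counts, result_ids


counts, result_ids = summarize(LOG)
print(dict(counts))
print(result_ids)
```

Pasting the full contact log into `LOG` (or reading it from a saved copy) should give the totals for the host.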

EDIT:  also of interest.  "in progress tasks" for computer 116 shows 54 "Gravitational Wave search O2 Multi-Directional GPU v2.02 () x86_64-pc-linux-gnu" but in Boinc manager on my pc there are no such tasks showing.

It downloads "pulsar binary search #1 (GPU)" WU but not what I have checked --> "Gravitational Wave search O2 Multi-Directional GPU".

mikey

Joined: 22 Jan 05

Posts: 12780

Credit: 1868190499

RAC: 1861869

I tried everything you suggested but nothing has fixed the problem. I deleted the project and added it back in. It seemed to ignore my venue/location. At present I have zero WUs being processed. But E@H says I have 54 GW GPU tasks in progress, and yet BOINC is showing zero tasks.

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

I recall a similar discussion

13 Nov 2019 23:32:19 UTC

Message 174330

I recall a similar discussion about the O2MD1 search and the solution then was to accept CPU work and abort the lost tasks as they got downloaded. But the O2MDF doesn't have any CPU version according to this message.

I think Bernd has to look at this to help you clear the lost tasks before you can get new tasks.
I've sent him a PM.

Anonymous

Holmis wrote:I recall a

14 Nov 2019 0:32:39 UTC

Message 174331 in response to message 174330

Holmis wrote:

I recall a similar discussion about the O2MD1 search and the solution then was to accept CPU work and abort the lost tasks as they got downloaded. But the O2MDF doesn't have any CPU version according to this message.

I think Bernd has to look at this to help you clear the lost tasks before you can get new tasks.
I've sent him a PM.

Holmis,

Thanks for pinging Bernd. I always know that it was "something I did".

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5874

Credit: 118375602148

RAC: 25544375

robl wrote:What is the no app

14 Nov 2019 4:24:46 UTC

Message 174335

robl wrote:

What does "no app version for einstein_O2MDF" mean?

Here's the thing. Now that there is a new 'sub-run' with the plan class O2MDF (the run that is for CPUs is O2MD1), it is a run entirely for GPUs with no CPU app for this plan class. So the message means exactly what it says - there is no CPU app version for O2MDF. The implication of this is that the scheduler can't redirect a lost GPU test task to a CPU core any more.

The problem arises because of the rules for test apps - only 1 test task per quorum. In this case nvidia tasks are trusted and AMD tasks are test tasks. If an AMD task becomes lost, the scheduler isn't smart enough to understand that there is currently no test task in play. If the nvidia task became lost, the scheduler would resend it. If the AMD task becomes lost, the scheduler is not allowed to and so tries to send the lost task as a CPU task - hence the message you get. The solution is for someone to fix the scheduler bug and allow the scheduler to understand that it's OK to send a second test task if the first test task is the one that has become lost.
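To make the rule and the suggested fix concrete, here is a toy model. It is emphatically not BOINC scheduler code: the `Task` class and both function names are invented for illustration, and the only thing it encodes is the behaviour described in the paragraph above (a lost test task still counts against the 'one test task per quorum' limit).

```python
from dataclasses import dataclass


@dataclass
class Task:
    host_id: int
    is_test: bool       # True if the task runs under a test app version (AMD here)
    lost: bool = False  # True if the server sent it but the client never received it


def can_resend_current(quorum, lost_task):
    """Behaviour as described above: the lost test task is itself counted
    as the quorum's one allowed test task, so it can never be resent."""
    if not lost_task.is_test:
        return True  # trusted (e.g. nvidia) tasks are simply resent
    test_tasks = [t for t in quorum if t.is_test]
    return len(test_tasks) == 0  # never true, because lost_task is in the quorum


def can_resend_proposed(quorum, lost_task):
    """Suggested fix: ignore lost test tasks when counting, so the lost one
    can be replaced with a further copy on the same host."""
    if not lost_task.is_test:
        return True
    live_tests = [t for t in quorum if t.is_test and not t.lost]
    return len(live_tests) == 0


# A quorum of two: one trusted task and one test task that became lost.
quorum = [Task(host_id=1, is_test=False), Task(host_id=2, is_test=True, lost=True)]
print(can_resend_current(quorum, quorum[1]))   # blocked under the current rule
print(can_resend_proposed(quorum, quorum[1]))  # allowed under the proposed rule
```

The only difference between the two functions is whether a lost test task is counted as being 'in play', which is the whole of the bug as I understand it.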

So why do tasks become lost? I understand one of the possible answers to that very well because in the last day or so I've seen it happen whilst I was in the process of transitioning a host from FGRPB1G to O2MDF. The host was building up a stock of tasks for O2MDF. I had made several successful small work cache increments, each resulting in about 5 - 10 tasks at a time. Then there was a comms failure showing no scheduler reply to the next work request.

The server often tends to be quite slow so it can take a minute or more for the advice confirming the number of new tasks allocated. This leaves a 'window of opportunity' for a temporary network glitch to prevent the scheduler reply from arriving.

So once the wait dragged on to the point that the request timed out, I knew there would be trouble. After a couple of mins, the glitch was fixed and service was restored and in most cases this presents no problem. The next scheduler contact should cause the 'lost tasks' to be noticed and resent, so you would normally be unaware of the temporary glitch. However, increasingly, I have been noticing such glitches and I think quite a few are at the server end rather than somewhere else. In this case however, I think it was my ISP.

So, how to recover from this? I really don't know. If you reset the project, the server will still be blocked by the 'one test task' rule. The only way I can think of is to get an entirely new host ID if you want to immediately keep crunching these tasks on that host. In my case, I'm allowing the remaining tasks to be finished and then it will go back to FGRPB1G until such time as the blocked resend tasks time out and are reallocated. I imagine that might take 2 weeks.

robl wrote:

EDIT: also of interest. "in progress tasks" for computer 116 shows 54 "Gravitational Wave search O2 Multi-Directional GPU v2.02 () x86_64-pc-linux-gnu" but in Boinc manager on my pc there are no such tasks showing.

Well, of course you don't have any - they've all become 'lost' :-). Check the server contact log properly and you'll see that you have a total of 54 tasks that the server is trying to resend to you but the rules of the game are preventing it from doing so. If you go back through your event log (stdoutdae.txt) for the time when the big work request was made (that should have given you 54 new tasks) I'm sure you'll find evidence of failure to receive a reply to the original request at that time - i.e. some sort of network glitch that caused 54 tasks to become lost and thus created this entire difficulty.

As an example of what to look for, here is exactly what I saw in the event log that alerted me to the problem. I've annotated the log snippet with markers for a number of key points to make it easy to follow.

(##1) A successful work fetch triggered by a work cache size increase in BOINC Manager.
(##2) Lines of output resulting from a further local cache size increase.
(##3) While counting down to the next work fetch, a task is completed and uploaded.
(##4) 3 secs after upload completes, a new work fetch which also reports the completed task.
(##5) While awaiting the scheduler response, another task completes.
(##6) If you look carefully, this next task doesn't complete the upload before BOINC states comms failed.
(##7) Just over a minute later the upload is retried and succeeds (the glitch was evidently quite short).
(##8) A new work fetch is initiated which also tries to report both completed tasks. The response indicates that the scheduler had already received the report for the first task and acted on it. That also means that there would have been new tasks sent which never made it to the client because of the network glitch at ##6. So these are genuine lost tasks which under normal circumstances could easily have been resent if the scheduler had understood that there were no other test tasks except these already in existence.

12-Nov-2019 10:12:47 [E@H] Sending scheduler request: To fetch work.                                                     ##1
12-Nov-2019 10:12:47 [E@H] Requesting new tasks for AMD/ATI GPU
12-Nov-2019 10:13:15 [E@H] Scheduler request completed: got 7 new tasks
12-Nov-2019 10:13:36 [E@H] General prefs: from Einstein@Home (last modified 26-Nov-2016 15:12:08)                        ##2
12-Nov-2019 10:13:36 [E@H] Computer location: generic
12-Nov-2019 10:13:36 [E@H] General prefs: no separate prefs for generic; using your defaults
12-Nov-2019 10:13:36 [---] Reading preferences override file
12-Nov-2019 10:13:36 [---] Preferences:
12-Nov-2019 10:13:36 [---] max memory usage when active: 7509.97 MB
12-Nov-2019 10:13:36 [---] max memory usage when idle: 7905.23 MB
12-Nov-2019 10:13:36 [---] max disk usage: 20.00 GB
12-Nov-2019 10:13:36 [---] (to change preferences, visit a project web site or select Preferences in the Manager)
12-Nov-2019 10:13:58 [E@H] Computation for task h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0 finished                        ##3
12-Nov-2019 10:13:59 [E@H] Starting task h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_3_0
12-Nov-2019 10:14:01 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0_0
12-Nov-2019 10:14:01 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0_1
12-Nov-2019 10:14:07 [E@H] Finished upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0_1
12-Nov-2019 10:14:07 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0_2
12-Nov-2019 10:14:13 [E@H] Finished upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0_0
12-Nov-2019 10:14:13 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0_3
12-Nov-2019 10:14:15 [E@H] Finished upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0_3
12-Nov-2019 10:14:18 [E@H] Finished upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0_2
12-Nov-2019 10:14:21 [E@H] Sending scheduler request: To fetch work.                                                               ##4
12-Nov-2019 10:14:21 [E@H] Reporting 1 completed tasks
12-Nov-2019 10:14:21 [E@H] Requesting new tasks for AMD/ATI GPU
12-Nov-2019 10:14:38 [E@H] Computation for task h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0 finished                       ##5
12-Nov-2019 10:14:39 [E@H] Starting task h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_1_0
12-Nov-2019 10:14:41 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_0
12-Nov-2019 10:14:41 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_1
12-Nov-2019 10:14:52 [E@H] Finished upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_0
12-Nov-2019 10:14:52 [E@H] Finished upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_1
12-Nov-2019 10:14:52 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_2
12-Nov-2019 10:14:52 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_3
12-Nov-2019 10:19:29 [---] Project communication failed: attempting access to reference site                                     ##6
12-Nov-2019 10:19:29 [E@H] Scheduler request failed: Timeout was reached
12-Nov-2019 10:19:40 [---] BOINC can't access Internet - check network connection or proxy configuration.
12-Nov-2019 10:19:59 [E@H] Temporarily failed upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_2: transient HTTP error
12-Nov-2019 10:19:59 [E@H] Backing off 00:01:15 on upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_2
12-Nov-2019 10:20:00 [E@H] Temporarily failed upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_3: transient HTTP error
12-Nov-2019 10:20:00 [E@H] Backing off 00:01:15 on upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_3
12-Nov-2019 10:21:15 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_2                                ##7
12-Nov-2019 10:21:16 [E@H] Started upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_3
12-Nov-2019 10:21:18 [E@H] Finished upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_3
12-Nov-2019 10:21:19 [E@H] Finished upload of h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_6_0_2
12-Nov-2019 10:21:29 [E@H] Sending scheduler request: To fetch work.                                                               ##8
12-Nov-2019 10:21:29 [E@H] Reporting 2 completed tasks
12-Nov-2019 10:21:29 [E@H] Requesting new tasks for AMD/ATI GPU
12-Nov-2019 10:21:33 [E@H] Scheduler request completed: got 0 new tasks
12-Nov-2019 10:21:33 [E@H] Completed result h1_0653.45_O2C02Cl2In0__O2MDFG2_G34731_653.60Hz_7_0 refused: result already reported as success
12-Nov-2019 10:21:33 [E@H] No work sent
12-Nov-2019 10:21:33 [E@H] No work is available for Gravitational Wave search O2 Multi-Directional
12-Nov-2019 10:21:33 [E@H] see scheduler log messages on https://einsteinathome.org/host/541492/log
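If your event log is long, searching it by eye for the request that never got a reply is tedious. The sketch below pairs each "Sending scheduler request" line with the next "completed" or "failed" line and lists the failures. The line format is assumed from the snippet above (date, time, project tag in brackets, message); `failed_requests` is a made-up name, and the sample is just the relevant lines from my log with the ## markers removed.

```python
import re

SAMPLE = """\
12-Nov-2019 10:12:47 [E@H] Sending scheduler request: To fetch work.
12-Nov-2019 10:13:15 [E@H] Scheduler request completed: got 7 new tasks
12-Nov-2019 10:14:21 [E@H] Sending scheduler request: To fetch work.
12-Nov-2019 10:19:29 [E@H] Scheduler request failed: Timeout was reached
12-Nov-2019 10:21:29 [E@H] Sending scheduler request: To fetch work.
12-Nov-2019 10:21:33 [E@H] Scheduler request completed: got 0 new tasks
"""

# Assumed line shape: "<date> <time> [<project>] <message>"
LINE_RE = re.compile(r"^(\S+ \S+) \[(.+?)\] (.*)$")


def failed_requests(log_text):
    """Return (sent_at, failed_at, reason) for each scheduler request that failed."""
    pending = None  # timestamp of the request still awaiting a reply
    failures = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        stamp, _project, msg = match.groups()
        if msg.startswith("Sending scheduler request"):
            pending = stamp
        elif msg.startswith("Scheduler request completed"):
            pending = None
        elif msg.startswith("Scheduler request failed"):
            reason = msg.partition(": ")[2].strip()
            failures.append((pending, stamp, reason))
            pending = None
    return failures


for sent_at, failed_at, reason in failed_requests(SAMPLE):
    print(f"request sent {sent_at} failed at {failed_at}: {reason}")
```

Feed it the contents of stdoutdae.txt instead of `SAMPLE`. Any failure that falls on a work fetch is a candidate for the moment the tasks became lost.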

I've seen this exact same problem several times before. I've tried to report it before. I had less of a clue about what was going on then than I do now. In that same thread, Zalster had 288 lost tasks that eventually came back as CPU tasks and then were aborted to get rid of them so that he could get fresh GPU tasks. Other people have seen similar problems. I would be very surprised if it's not the same issue of a test task becoming lost (for whatever reason) and then not being allowed to be resent.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5874

Credit: 118375602148

RAC: 25544375

Holmis wrote:I recall a

14 Nov 2019 5:05:58 UTC

Message 174336 in response to message 174330

Holmis wrote:

I recall a similar discussion about the O2MD1 search ....

Thanks for chiming in. I started my marathon response about 8 hours ago and due to non-BOINC-related distractions (otherwise known as real life) I didn't get it finished in a timely manner and hadn't noticed your response until after I had eventually posted mine. I would be interested in your opinion of the 'comprehensibility' of the response I've written to robl.

To my way of thinking, the 'resend lost tasks' feature is very worthwhile - particularly to someone like me. I would hate to see it removed. I see evidence of lots of short network glitches on a daily basis. With FGRPB1G they always get resolved by subsequent scheduler exchanges that fix any 'lost tasks'. It's just the 'test app' cases that cause issues, probably because the scheduler doesn't understand that a lost test task really means the quorum in question has now got zero of these.

By definition, if a test task becomes 'lost', then the quorum of two that it belongs to cannot contain a different test task assigned to a different host. So the scheduler just needs to be told that a lost test task can be replaced with a further copy of that test task if that particular host needs it replaced. You would think it should be possible to fix this.

Cheers,
Gary.

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

Gary Roberts wrote:Holmis

14 Nov 2019 7:56:04 UTC

Message 174338 in response to message 174336

Gary Roberts wrote:

Holmis wrote:
I recall a similar discussion about the O2MD1 search ....
Thanks for chiming in. I started my marathon response about 8 hours ago and due to non-BOINC-related distractions (otherwise known as real life) I didn't get it finished in a timely manner and hadn't noticed your response until after I had eventually posted mine. I would be interested in your opinion of the 'comprehensibility' of the response I've written to robl.

I think you give a very plausible explanation for what seems to be going on, and I also think the response should be understandable even for someone who hasn't spent much time reading logs or thinking about how BOINC works.

Quote:

To my way of thinking, the 'resend lost tasks' feature is very worthwhile - particularly to someone like me. I would hate to see it removed. I see evidence of lots of short network glitches on a daily basis. With FGRPB1G they always get resolved by subsequent scheduler exchanges that fix any 'lost tasks'. It's just the 'test app' cases that cause issues, probably because the scheduler doesn't understand that a lost test task really means the quorum in question has now got zero of these.

By definition, if a test task becomes 'lost', then the quorum of two that it belongs to cannot contain a different test task assigned to a different host. So the scheduler just needs to be told that a lost test task can be replaced with a further copy of that test task if that particular host needs it replaced. You would think it should be possible to fix this.

I agree that the resend lost task feature is nice to have and if it were to be removed then the lost tasks would become ghosts that need to time out before the next copy gets sent out to try and form the quorum. It would slow down validation and cause the database to bloat if the network problems get frequent.

Anonymous

Gary/Holmis:Thank you both

14 Nov 2019 14:16:44 UTC

Message 174341

Gary/Holmis:

Thank you both for your responses but a couple of questions.

1. Does the "resend lost tasks" feature exist, and if so, where?

2. Do I have to wait for some timer to elapse in order for my PC to acquire new jobs?

I have sent Bernd a copy of my pc's "sched_reply_einstein.phys.uwm.edu.xml" file. I noticed he was requesting this file in another thread dealing with the same issue.

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

robl

14 Nov 2019 15:55:02 UTC

Message 174344 in response to message 174341

robl wrote:

Gary/Holmis:

Thank you both for your responses but a couple of questions.

1. Does the "resend lost tasks" feature exist, and if so, where?

Yes, it exists, and it's a feature on the server. When BOINC contacts the server it sends a list of all tasks on your computer, and the server compares this to what it thinks you should have. If the lists don't match up, the server tries to resend the missing tasks to your computer. Those missing tasks are called "lost tasks" as they became lost somehow.
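In essence the comparison is a set difference. A minimal sketch, assuming both sides are just lists of task names (the real scheduler works on database records; `find_lost_tasks` and `example_task_a` are hypothetical names used only here):

```python
def find_lost_tasks(server_in_progress, client_reported):
    """Return the tasks the server believes the host has but the client did not report."""
    return sorted(set(server_in_progress) - set(client_reported))


# The server's view: one hypothetical task plus one real name from the contact log.
server_in_progress = [
    "example_task_a",  # hypothetical name, for illustration only
    "h1_0658.50_O2C02Cl2In0__O2MDFG2_G34731_658.75Hz_82_1",
]
# The client's view: it only knows about the first task.
client_reported = ["example_task_a"]

for name in find_lost_tasks(server_in_progress, client_reported):
    print("lost:", name)
```

Whatever ends up in that difference is what the server then tries to resend.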

Quote:

2. Do I have to wait for some timer to elapse in order for my PC to acquire new jobs?

If Bernd can't or won't make a manual intervention then you will probably have to wait for the deadlines of the tasks to expire. That would normally be 14 days here but might differ sometimes; I don't run the O2MDF search so I can't look it up.
Another option would be to force the server to give your host a new host ID. It involves editing client_state.xml, so ask if you want to go down this route.

Anonymous

Thanks. Rather then sit idle

14 Nov 2019 16:51:36 UTC

Message 174345

Thanks.

Rather than sit idle I have selected "pulsar binary search #1". Did an update. I now have work for this PC. Waiting 14 days for a timer to expire is a bit much, I think.
