Scheduler Bug with work requests for O2MD1 work - ### Staff Please Read ###

Jim1348

Joined: 19 Jan 06

Posts: 463

Credit: 257957147

RAC: 0

24 Oct 2019 0:39:51 UTC

Message 173995 in response to message 173993

Thanks for looking into it. Yes, we need a clue.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250502298

RAC: 34476

24 Oct 2019 14:08:00 UTC

Message 174003

I configured some additional debugging to analyze this problem on the server side.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250502298

RAC: 34476

24 Oct 2019 14:48:15 UTC

Message 174005

Basically the O2MD1 workunit generator got stuck and has now run out of work. Currently only "resends" are being sent.

A fix will probably not happen this week.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250502298

RAC: 34476

24 Oct 2019 16:07:37 UTC

Message 174007

Update: seems it can be done tomorrow. I'll try.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250502298

RAC: 34476

25 Oct 2019 8:56:04 UTC

Message 174019

I am aware, however, that this doesn't yet address the "resent lost tasks" issue, but I'm afraid I won't get to look into that until Monday.

It would probably help me if someone could make a "sched_reply_einstein.phys.uwm.edu.xml" file available to me from an affected host.

My rough guess is that as the GPU tasks run pretty fast and need a relatively large number of input files, the XML gets too large for some internal buffers and the client is unable to correctly parse it, meaning that it doesn't "get" the tasks sent by the scheduler because it doesn't "understand" it. But I need some clue about how the XML is messed up in order to know what exactly is going wrong and how to fix it.
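
One way an affected volunteer could look for the kind of damage guessed at above is to run the saved reply through a strict XML parser and see where it stops. The following is a minimal sketch only, not BOINC's own code: the `<scheduler_reply>` and `<result>` element names in the sample strings are assumed for illustration, `inspect_reply` is a hypothetical helper, and the BOINC client may use its own parser that tolerates things a strict parser rejects.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def inspect_reply(xml_text):
    """Try to parse a scheduler reply and summarise what was found.

    Returns (ok, info). On success, info is a Counter of element tags.
    On failure, info is the parser's error message, which includes the
    line and column where parsing stopped.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, Counter(elem.tag for elem in root.iter())


# Tiny made-up replies; a real reply file is far larger.
good = "<scheduler_reply><result><name>a</name></result></scheduler_reply>"
truncated = good[:40]  # simulate a reply that was cut off part-way through

print(inspect_reply(good))
print(inspect_reply(truncated))
```

For a real file, the text would be read from disk first and passed to `inspect_reply`; a reported parse error only says where a strict parser gave up, not necessarily where the client itself had trouble.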

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0

26 Oct 2019 19:26:10 UTC

Message 174061 in response to message 174019

Bernd Machenschalk wrote:

I am aware, however, that this doesn't yet address the "resent lost tasks" issue, but I'm afraid I won't get to look into that until Monday.

It would probably help me if someone could make a "sched_reply_einstein.phys.uwm.edu.xml" file available to me from an affected host.

My rough guess is that as the GPU tasks run pretty fast and need a relatively large number of input files, the XML gets too large for some internal buffers and the client is unable to correctly parse it, meaning that it doesn't "get" the tasks sent by the scheduler because it doesn't "understand" it. But I need some clue about how the XML is messed up in order to know what exactly is going wrong and how to fix it.

Copied and sent you my "sched_reply_einstein.phys.uwm.edu.xml" via PM. The error still exists. I have 34 work units on the computer but the server thinks I have 241. I'm going to have to finish out these work units, then set NNW so it forces 12 CPU downloads of the GPU tasks, abort them, and repeat for all 241 work units that are "lost".

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

27 Oct 2019 21:18:54 UTC

Message 174086

Can the scheduler normally force a 'resend' to a host, even if the host has been set for 'no new tasks'? Or does that setting have nothing to do with resends?
One host had 1 CPU task + 1 GPU task running and I suspended work fetch. Here's the log then:

So another CPU task started and there are now 2 CPU tasks + 1 GPU task running.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7222674931

RAC: 962604

27 Oct 2019 21:29:20 UTC

Message 174088 in response to message 174086

Richie wrote:

the host has been set for 'no new tasks'? Or does that setting have nothing to do with resends

This has come up before, and I believe the answer is that in BOINC terminology the type of resend you are describing is not a "new task" and thus not blocked by the NNT setting.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

27 Oct 2019 21:51:01 UTC

Message 174090

Okay, thanks. That sounds quite logical in a way.

* I got the corresponding server contact log for that event and there was this line:

2019-10-27 20:32:08.6797 [PID=21554]    [resend] [RESULT#892584449] [HOST#12684310] Updated report_deadline (resend lost worksion result per WU (#424059187, re#75)

"worksion" ... ?? lol, what is that

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117612713007

RAC: 35238263

27 Oct 2019 23:05:00 UTC

Message 174094 in response to message 174088

archae86 wrote:

Richie wrote:

the host has been set for 'no new tasks'? Or does that setting have nothing to do with resends

This has come up before, and I believe the answer is that in BOINC terminology the type of resend you are describing is not a "new task" and thus not blocked by the NNT setting.

That is exactly how lost tasks are treated. They are not new tasks but tasks that have already been allocated to you and have become 'lost' (by whatever mechanism).

I actually use this 'feature' to my (and the project's) advantage. Because the performance of all my hosts is closely monitored automatically, I get alerted very quickly when something catastrophic happens. A recent example for one host was that just as one task was finishing and a new one was about to start, BOINC suddenly decided that the file JPLEPH.405 was corrupt (MD5 check failed). Of course, the whole work cache of maybe 200 tasks was immediately declared as computation errors, since all tasks rely on that file.

The nice thing is that for major problems like this, the tasks sit there and don't get reported until the balance of (at max) a 24hr backoff counts down. Once alerted that the machine has stopped crunching, it's about a 5 min job to shut BOINC down and cause all the failed tasks sitting in the state file to become 'lost'. This is achieved by editing the state file so you really need to understand what you are doing.

Whenever I investigate failures like this, the data file that has failed the MD5 check seems OK, but as a precaution (in case a disk sector belonging to the file gave a transient read error) I rename the file to <filename>.BAD and then install a fresh copy of the file to a new spot on the disk.
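
The check-and-rename precaution described above could be scripted roughly as follows. This is a sketch under stated assumptions, not part of BOINC: `md5_of_file` and `quarantine_if_bad` are hypothetical helper names, and the expected MD5 value has to be supplied by the user from a trusted source, since the thread does not give one. Installing the fresh copy is left as a manual step.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def quarantine_if_bad(path, expected_md5):
    """Rename path to <filename>.BAD if its MD5 differs from expected_md5.

    Returns the new Path if the file was renamed, or None if the
    checksum matched and the file was left alone.
    """
    path = Path(path)
    if md5_of_file(path) == expected_md5.lower():
        return None
    bad = path.with_name(path.name + ".BAD")
    path.rename(bad)
    return bad
```

Keeping the renamed file around, rather than deleting it, means the old disk sectors stay occupied, so the fresh copy lands elsewhere on the disk as the post describes.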

Then I restart BOINC, which shows no compute errors any more but may show tasks completed prior to the error. The restart automatically clears the backoff and BOINC contacts the server. Immediately, 12 'lost tasks' are resent and crunching resumes. Periodically, I will 'update' the project to get further batches of 12 until all the lost tasks have been fully recovered. Everything is back to normal. It's as if the problem never occurred in the first place.
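
For a sense of scale, the number of scheduler contacts needed to recover a cache at 12 resends per contact is simple arithmetic. The sketch below assumes the batch size of 12 mentioned in this thread stays fixed and that every contact returns a full batch; `updates_needed` is a hypothetical helper name.

```python
def updates_needed(lost_tasks, batch_size=12):
    """Scheduler contacts needed if each one resends up to batch_size tasks."""
    # Ceiling division done with integers to avoid float rounding.
    return -(-lost_tasks // batch_size)


print(updates_needed(200))  # the roughly 200-task cache described above
print(updates_needed(241))  # the 241 tasks reported as lost earlier in the thread
```

With those assumptions, a 200-task cache takes 17 contacts and 241 lost tasks take 21.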

The machine that I'm referring to in the above description was crunching FGRPB1G and not O2MD1 tasks. Since the O2MD1 GPU tasks are test tasks where only 1 test task per quorum is 'allowed', the above procedure doesn't work. You would think it should work if all the 'lost' tasks were GPU tasks. Each quorum would no longer have its one GPU task so it should be OK for the scheduler to resend the task as a GPU variant. I have specifically tested this, and others have reported as well, so I know that such a lost task gets replaced by a CPU version and each affected quorum will no longer contain its allowance of 1 test task.

Hopefully, when Bernd gets to look at all this, something might be done about that as well. In other words, a 'lost' GPU task should be able to be resent as a GPU task and not a CPU task, particularly if a volunteer has deliberately not allowed CPU tasks in the preferences. I know from actual testing that NOT allowing CPU tasks does NOT stop the scheduler from resending a lost GPU task as a CPU task.

Cheers,
Gary.
