Thanks for looking into it. Yes, we need a clue.
I configured some additional debugging to analyze this problem on the server side.
BM
Basically the O2MD1 workunit generator got stuck and has now run out of work. Currently only "resends" are being sent.
Fixing that will likely not happen this week.
BM
Update: seems it can be done tomorrow. I'll try.
BM
I am aware, however, that this doesn't yet address the "resent lost tasks" issue, but I'm afraid I won't get to look into that until Monday.
It would probably help me if someone could make a "sched_reply_einstein.phys.uwm.edu.xml" file available to me from an affected host.
My rough guess is that, because the GPU tasks run pretty fast and need a relatively large number of input files, the XML gets too large for some internal buffers and the client can't parse it correctly, so it doesn't "get" the tasks sent by the scheduler because it doesn't "understand" the reply. But I need some clue about how the XML is messed up in order to know exactly what is going wrong and how to fix it.
BM
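For anyone who wants a first look at their own copy before sending it on, here is a minimal read-only sketch in Python (not an official BOINC tool) that reports how big the saved reply is, whether it parses as strict XML, and which element names dominate it. The BOINC client's own parser is more forgiving than a strict XML parser, so a parse failure here is only a hint, not proof of the buffer theory, and the default file name below is simply the one Bernd mentions:

# Read-only sanity check of a saved scheduler reply (illustrative sketch only).
import sys
import xml.etree.ElementTree as ET
from collections import Counter

# File name as mentioned above; on a real host it sits in the BOINC data directory.
path = sys.argv[1] if len(sys.argv) > 1 else "sched_reply_einstein.phys.uwm.edu.xml"

with open(path, "rb") as f:
    data = f.read()
print(f"{path}: {len(data)} bytes")

try:
    root = ET.fromstring(data)
except ET.ParseError as err:
    # Strict parsers are pickier than the BOINC client, so treat this as a hint only.
    print(f"Not well-formed as strict XML: {err}")
    sys.exit(1)

# Count element names; a huge number of per-file entries would fit the
# "reply grew too large for an internal buffer" guess.
tags = Counter(elem.tag for elem in root.iter())
for tag, count in tags.most_common(10):
    print(f"{count:6d}  <{tag}>")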
Bernd Machenschalk wrote: I am aware, however, that this doesn't yet address the "resent lost tasks" issue [...]
Copied and sent you my "sched_reply_einstein.phys.uwm.edu.xml" via PM. The error still exists. I have 34 work units on the computer but the server thinks I have 241. I'm going to have to finish out these work units, then set NNW so that each scheduler contact resends 12 of the GPU tasks as CPU tasks, abort those, and repeat until all 241 "lost" work units are cleared.
Can the scheduler normally force a 'resend' to a host, even if the host has been set for 'no new tasks'? Or does that setting have nothing to do with resends?
One host had 1 cpu task + 1 gpu task running and I suspended work fetch. Here's the log then:
27.10.2019 22:15:14 | Einstein@Home | work fetch suspended by user
27.10.2019 22:16:44 | | Resuming GPU computation
27.10.2019 22:31:58 | Einstein@Home | Computation for task h1_0417.90_O2C02Cl1In0__O2MD1G2_G34731_418.10Hz_30_2 finished
27.10.2019 22:31:58 | Einstein@Home | Starting task h1_0418.30_O2C02Cl1In0__O2MD1G2_G34731_418.50Hz_28_2
27.10.2019 22:32:05 | Einstein@Home | Sending scheduler request: To report completed tasks.
27.10.2019 22:32:05 | Einstein@Home | Reporting 1 completed tasks
27.10.2019 22:32:05 | Einstein@Home | Not requesting tasks: "no new tasks" requested via Manager
27.10.2019 22:32:07 | Einstein@Home | Scheduler request completed
27.10.2019 22:32:07 | Einstein@Home | Resent lost task h1_0420.30_O2C02Cl1In0__O2MD1G2_G34731_420.40Hz_4_1
27.10.2019 22:32:10 | Einstein@Home | Starting task h1_0420.30_O2C02Cl1In0__O2MD1G2_G34731_420.40Hz_4_1
So another CPU task started, and there are now 2 CPU tasks + 1 GPU task running.
Richie wrote: the host has been set for 'no new tasks'? Or does that setting have nothing to do with resends?
This has come up before, and I believe the answer is that in BOINC terminology the type of resend you are describing is not a "new task" and thus not blocked by the NNT setting.
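To make the order of operations concrete, the toy sketch below is purely illustrative (it is not BOINC's actual scheduler code): lost-task detection compares what the database thinks the host holds against what the client reports it has, and runs before the size of the work request is even considered, so a request for zero seconds of new work still picks up resends.

# Illustration only, not BOINC's actual scheduler code: why a "no new tasks"
# client can still receive resends of lost tasks.

def build_reply(tasks_assigned_in_db, tasks_reported_by_client, seconds_of_work_requested):
    reply = []

    # Resend pass: anything the database says this host holds, but the client
    # did not report, is treated as "lost" and sent again, regardless of how
    # much (or how little) new work was requested.
    for task in tasks_assigned_in_db:
        if task not in tasks_reported_by_client:
            reply.append(("resend lost task", task))

    # New-work pass: only runs when the client actually asked for work,
    # which it does not do while "no new tasks" is set.
    if seconds_of_work_requested > 0:
        reply.append(("send new work", seconds_of_work_requested))

    return reply

# Example: NNT is set (request = 0 seconds), yet the task missing from the
# client's report still comes back as a resend.
print(build_reply({"task_A", "task_B"}, {"task_A"}, 0))
# -> [('resend lost task', 'task_B')]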
Okay, thanks. That sounds quite logical in a way.
* I got the corresponding server contact log on that event and there was this line:
archae86 wrote: ...the type of resend you are describing is not a "new task" and thus not blocked by the NNT setting.
That is exactly how lost tasks are treated. They are not new tasks but tasks that have already been allocated to you and have become 'lost' (by whatever mechanism).
I actually use this 'feature' to my (and the project's) advantage. Because the performance of all my hosts is closely monitored automatically, I get alerted very quickly when something catastrophic happens. A recent example on one host: just as one task was finishing and a new one was about to start, BOINC suddenly decided that the file JPLEPH.405 was corrupt (MD5 check failed). Of course, the whole work cache of maybe 200 tasks was immediately declared as computation errors, since every task relies on that file.
The nice thing is that for major problems like this, the tasks sit there and don't get reported until the balance of (at max) a 24hr backoff counts down. Once alerted that the machine has stopped crunching, it's about a 5 min job to shut BOINC down and cause all the failed tasks sitting in the state file to become 'lost'. This is achieved by editing the state file so you really need to understand what you are doing.
Whenever I investigate failures like this, the data file that has failed the MD5 check seems OK, but as a precaution (in case a disk sector belonging to the file gave a transient read error) I rename the file to <filename>.BAD and then install a fresh copy of the file to a new spot on the disk.
Then I restart BOINC, which shows no compute errors any more but may show tasks completed prior to the error. The restart automatically clears the backoff and BOINC contacts the server. Immediately, 12 'lost tasks' are resent and crunching resumes. Periodically, I will 'update' the project to get further batches of 12 until all the lost tasks have been fully recovered. Everything is back to normal. It's as if the problem never occurred in the first place.
The machine that I'm referring to in the above description was crunching FGRPB1G and not O2MD1 tasks. Since the O2MD1 GPU tasks are test tasks where only 1 test task per quorum is 'allowed', the above procedure doesn't work. You would think it should, if all the 'lost' tasks were GPU tasks: each quorum would no longer have its one GPU task, so it should be OK for the scheduler to resend the task as a GPU variant. However, I have specifically tested this, and others have reported the same: such a lost task gets replaced by a CPU version, and each affected quorum then no longer contains its allowance of 1 test task.
Hopefully, when Bernd gets to look at all this, something might be done about that as well. In other words, a 'lost' GPU task should be able to be resent as a GPU task and not a CPU task, particularly if a volunteer has deliberately not allowed CPU tasks in the preferences. I know from actual testing that NOT allowing CPU tasks does NOT stop the scheduler from resending a lost GPU task as a CPU task.
Cheers,
Gary.
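As a footnote to Gary's recovery procedure: before any state-file surgery is attempted, it helps to know exactly which <result> entries correspond to the errored tasks. The read-only Python sketch below only lists them from a copy of client_state.xml; it does not perform the edit he describes, the tag layout and state codes are assumptions to verify against your own client version, and BOINC's XML is not always strictly well-formed, so a strict parser may complain.

# Read-only sketch: list the <result> entries in a *copy* of client_state.xml
# so errored tasks can be identified before any manual editing is considered.
# It does not modify anything and does not implement the edit described above.
import sys
import xml.etree.ElementTree as ET

# Always work on a copy, never on the live file the client is using.
path = sys.argv[1] if len(sys.argv) > 1 else "client_state_copy.xml"
root = ET.parse(path).getroot()

for result in root.iter("result"):
    name = result.findtext("name", default="?")
    state = result.findtext("state", default="?")
    exit_status = result.findtext("exit_status", default="0")
    # Assumption to verify for your client version: compute errors typically
    # appear with <state>3</state> and a non-zero exit status.
    print(f"{name}  state={state}  exit_status={exit_status}")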