Enhancement on Gary's Report Work Workaround

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0
Topic 192462

I have been following the backend problems here and at SAH, along with the difficulties that have come with the official rollout of BOINC 5.8.x. Needless to say, the combination of the three has made a pretty good mess of things for a lot of people. ;-)

Things do seem to be improving slowly on the backend, as most of my hosts seem to have worked through the comm problems on their own. However, you may have a few hosts which are "stuck" trying to get completed work reported after the file upload has gone through.

In this case, you should use Gary's workaround of splitting the report from the work request: set EAH to "No New Work", then either force an update manually or let BOINC continue on its own until the report is made.
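For anyone who would rather script this than click through BOINC Manager, here is a rough sketch of the two steps in Python. It assumes the BOINC command-line tool is on your PATH under the name `boinccmd` (older clients may ship it under a different name, so adjust `tool`), and the project URL in `EAH_URL` is my assumption; use whatever URL your client shows for the project. The helper names are made up for illustration. By default it only prints the commands instead of running them.

```python
import subprocess

# Assumption: replace with the project URL exactly as your client lists it.
EAH_URL = "http://einstein.phys.uwm.edu/"


def report_only_commands(project_url, tool="boinccmd"):
    """Build the two commands: stop asking for work, then contact the scheduler."""
    return [
        [tool, "--project", project_url, "nomorework"],  # set "No New Work"
        [tool, "--project", project_url, "update"],      # force a scheduler contact
    ]


def run_workaround(project_url, tool="boinccmd", dry_run=True):
    """Print the commands (dry_run=True) or actually run them in order."""
    cmds = report_only_commands(project_url, tool)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    run_workaround(EAH_URL, dry_run=True)
```

Once the report has cleared, the same tool's `allowmorework` operation turns work fetch back on for the project.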

This is especially important if you have work which is approaching its deadline.

What I discovered is that the report most likely went through on an earlier attempt, but the failure to find new work to send causes the scheduler to erroneously reject the whole request rather than just report back "No Work from Project", leaving the host thinking that nothing succeeded. I'm speculating this is because the processes which actually generate new work to be sent are timing out rather than throwing a specific error or other informational message. This leaves the scheduler with no choice but to reject the whole request, since it's as much in the dark about what happened as the host is.
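To make the speculation concrete, here is a toy model of that failure mode. It is purely illustrative and is not BOINC's scheduler code: `ToyScheduler`, `Reply`, and the message strings are all made up, and the "hang" is just a flag. It shows how a combined request can be rejected after the report has already been recorded, and why a report-only request then comes back as already reported.

```python
from dataclasses import dataclass, field


@dataclass
class Reply:
    accepted: bool   # whether the host sees the request as successful
    message: str


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the project scheduler (not BOINC code)."""
    work_generator_hangs: bool
    reported: set = field(default_factory=set)

    def handle(self, results, wants_work):
        # Step 1: record any results not seen before.
        new = [r for r in results if r not in self.reported]
        self.reported.update(new)

        # Step 2: look for new work only if the host asked for it.
        if wants_work:
            if self.work_generator_hangs:
                # Speculated failure: a timeout with no detail, so the whole
                # request is rejected even though step 1 already succeeded.
                return Reply(False, "request failed")
            return Reply(True, "work sent")

        # Report-only request: the work generator is never touched.
        if results and not new:
            return Reply(True, "already reported")
        return Reply(True, "reported")


if __name__ == "__main__":
    sched = ToyScheduler(work_generator_hangs=True)
    print(sched.handle(["result_1"], wants_work=True))   # combined request
    print(sched.handle(["result_1"], wants_work=False))  # report only
```

In this model the host keeps retrying the combined request forever, because every retry asks for work and trips over the same hang, while the result has been sitting in `reported` since the first attempt.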

This was evidenced by the 2 hosts I had "stuck". A couple of hours ago I used Gary's procedure to get the reports to go through since the work was getting close to deadline, and the reports were immediately refused as having already been made. If you look at these 2 hosts:

734693
228950

Note the last contact time, which was when I forced the update, and then take a look at the date shown for the report (which was about two days ago). Also note that the results validated immediately as well. This would seem to indicate that a large percentage of the pendings people are currently seeing are due to hosts stuck in this state, when in fact they could be cleared since the report was most likely already made.

One issue is that without doing this, I seriously doubt my hosts would have managed to get the report through on their own, since their EAH cache was empty and every new scheduler request asked for new work as well as trying to get rid of the hung report.

The problem is that many users don't pay any attention to BOINC at all, let alone read the forums or attempt troubleshooting on their own, so I'm assuming many of these results are going to expire at deadline and get reissued needlessly.

The only thing I can think of that the project could do to avoid this would be to deliberately shut down new work generation (including reissues). Essentially it would be an enforced NNW for everyone. This should allow the scheduler to let the stuck reports go through and clear, since it would know ahead of time that there is no work to send and would not time out like it currently does.

Yes, this would probably result in a lot of howling over on NC, but at this point it may be wiser to cut losses, both in terms of wasted computational time and potential extra database clutter. Since the bad PR is already a fact (based on some of the posts now appearing), I'm not seeing a whole lot of downside to this at the moment. The project could then start bringing things back up in a more controlled fashion for troubleshooting purposes. It's got to be tough figuring out what's going on when thousands of hosts are knocking on the door with requests which have virtually no chance of succeeding. ;-)

Alinator