For several days now I've had a task that has been stuck in BOINC Manager with a status of "uploading" and that shows in my E@H results with a status of "In progress". Many other tasks have completed and uploaded in the meantime. I've tried aborting the task, restarting BOINC Manager, rebooting the host, and waiting out the recent server maintenance downtime, all to no avail. My current event log doesn't go back far enough to see when it was logged as completed. Having it hang around doesn't seem to be affecting other work, but it is a bit annoying and something of a mystery. Any ideas for how to force an upload of it or otherwise clear it out?
Ideas are not fixed, nor should they be; we live in model-dependent reality.
Copyright © 2024 Einstein@Home. All rights reserved.
I occasionally see this and it is caused by a communications glitch of some sort during the 'result uploading' stage. I don't know the exact mechanism of how this happens but it's something like the client never receiving the confirmation that the server has the uploaded result, so it doesn't change the client status from 'uploading' to 'ready to report'. The server on the other hand does have the result and can't do anything whilst it waits for the 'report' indication from the client.
Here's what I do. Stop BOINC completely and open the state file client_state.xml in a plain text editor. Using a unique part of the name of the task, search for the result block in the state file that has that name. Here is an example from one of mine (doctored a bit to be what I think you may find). You see the start of the <result> block, lots of stuff missing, and then the very end up to the closing </result> tag.
My example is "ready to report" and I've annotated it to show you what I believe you may find. You just need to make the annotated changes and then save the file. If you don't find exactly what I'm suggesting, please let me know what you do find and we'll take it from there.
<result>
    <name>LATeah1061L02_284.0_0_0.0_15815807_1</name>        (<-- my task name -- search for yours)
    <final_cpu_time>179.830000</final_cpu_time>
    <final_elapsed_time>1235.445683</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>5</state>                                          (<-- you will probably see 4 here -- that means 'uploading' -- you need 5)
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>118</version_num>
    <plan_class>FGRPopencl1K-ati</plan_class>
    <stderr_out>
<![CDATA[
<stderr_txt>
41 df1dot: 2.512676418e-15 f1dot_start: -1e-13 f1dot_band: 1e-13
% Filling array of photon pairs
. . .   lots of stuff removed
% Following up candidate number: 8
% Refining in S
% Following-up in P
% Following up candidate number: 9
% Refining in S
% Following-up in P
% Following up candidate number: 10
% Refining in S
% Following-up in P
% Writing follow-up output file.
FPU status flags: PRECISION
19:54:34 (2579): [normal]: done. calling boinc_finish(0).
19:54:34 (2579): called boinc_finish                          (<-- if you have 10 candidates and see this line, then all you need )
                                                              (    to do is add the two marked lines I've pointed at below)
</stderr_txt>
]]>
    </stderr_out>
    <ready_to_report/>                                        (<-- you won't have this line -- you need to add it
    <completed_time>1561888482.509861</completed_time>        (<-- nor this one -- the time is seconds from
    <wu_name>LATeah1061L02_284.0_0_0.0_15815807</wu_name>     (    the unix epoch. This value is my time now
    <report_deadline>1562903403.000000</report_deadline>      (    you probably need something rather less
    <received_time>1561693804.112932</received_time>          (    make sure it's before your deadline.
    <file_ref>                                                (    subtract 86400 for every day earlier yours was.
        <file_name>LATeah1061L02_284.0_0_0.0_15815807_1_0</file_name>
    </file_ref>
    <file_ref>
        <file_name>LATeah1061L02_284.0_0_0.0_15815807_1_1</file_name>
    </file_ref>
</result>
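For anyone unsure about the completed_time arithmetic in the annotations, here is a minimal Python sketch of it. The three timestamps are copied from the example block; the function names are made up for illustration and the "one day earlier" figure is only an example, so substitute however many days ago your own task finished.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400  # the figure quoted in the annotation


def shifted_completed_time(reference_time, days_earlier):
    """Move a Unix timestamp back by the given number of days."""
    return reference_time - days_earlier * SECONDS_PER_DAY


def is_plausible(completed, received, deadline):
    """A completion time should fall after receipt and before the deadline."""
    return received < completed < deadline


def as_utc(timestamp):
    """Readable UTC form of a Unix timestamp, for eyeballing the value."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Values copied from the example <result> block.
received = 1561693804.112932
deadline = 1562903403.000000
example_completed = 1561888482.509861

# Hypothetical case: the stuck task finished one day before the example did.
one_day_earlier = shifted_completed_time(example_completed, 1)
print(as_utc(one_day_earlier), is_plausible(one_day_earlier, received, deadline))
```

The plausibility check is just the "make sure it's before your deadline" advice, plus the assumption that a task cannot finish before it was received.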
Essentially all you are doing (if the task did finish correctly) is changing the state to 5 and making the client aware that it's ready to report. I try to get the completed time to be just before it was uploaded but I don't think it matters too much. As I'm just guessing what the problem might be, make sure you have a good look for any other things that look strange compared to what I've shown.
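If you would rather script the edit than do it by hand, something along these lines might work. It is only a sketch under assumptions: the function name is invented, it assumes <name> is the first element inside the <result> block (as in the example), and it works on the file contents as plain text so that everything else, including the CDATA section, is left alone. Reading and writing client_state.xml, and stopping BOINC first, are left to you.

```python
import re


def mark_ready_to_report(state_text, task_name, completed_time):
    """Return state_text with the named task's <result> block set to 'ready to report'."""
    # Find the <result> block whose <name> matches the task exactly.
    block_re = re.compile(
        r"<result>\s*<name>" + re.escape(task_name) + r"</name>.*?</result>",
        re.DOTALL,
    )
    match = block_re.search(state_text)
    if match is None:
        raise ValueError(f"no <result> block found for {task_name}")
    block = match.group(0)

    # Only go ahead if the app logged that it finished.
    if "called boinc_finish" not in block:
        raise ValueError("task does not appear to have finished; not editing")

    # State 4 ('uploading') becomes 5.
    block = block.replace("<state>4</state>", "<state>5</state>", 1)

    # Add the two missing lines right after </stderr_out>, unless already there.
    if "<ready_to_report/>" not in block:
        extra = (
            "\n<ready_to_report/>"
            f"\n<completed_time>{completed_time:.6f}</completed_time>"
        )
        block = block.replace("</stderr_out>", "</stderr_out>" + extra, 1)

    return state_text[:match.start()] + block + state_text[match.end():]
```

It deliberately refuses to touch a block that lacks the 'called boinc_finish' line, since in that case the task may not have finished correctly and the simple edit is not enough.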
I've seen a few tasks like this over the years get credited so it's worth a try. Once you make the edits, save the file and restart BOINC. The task will probably be gone before you get the manager open to have a look :-).
Cheers,
Gary.
Okay, thanks Gary. Yes, my client_state.xml (in /var/lib/boinc-client/) was as you described. I made the edits, but they don't take. I confirmed the edits after closing the text editor, but then when BOINC starts up, a new client_state.xml is created that reverts to the original text for that task. The edits do initially show up in the updated client_state_prev.xml file. I've tried first renaming the client_state_prev.xml file before restarting BOINC, but my edits still get overwritten, so I don't know where it's getting the original task information from.
Perhaps another complicating factor is that, in my earlier attempts to remove the uploading task from BOINC, I deleted its associated template_[WU].dat file. I noticed that that .dat file is referenced elsewhere in client_state.xml, so is the absence of that file somehow foiling this fix?
Okay, got it! I finally realized that by "Stop BOINC completely...", you meant stop BOINC completely. My previous lack of result was because I had stopped only the BOINC Manager, thinking that was it. Once things sank in, I also stopped (interrupted, via kill) the boinc process from the Terminal, then edited the client_state.xml. To restart the boinc client I did a system reboot (not knowing enough about the boinc command line to just restart the process) and was very happy to see that the waylaid task had been sent on its way. Thanks again for the teaching moment.
OK, did you take care of ownership and permissions of the edited file? I forgot you were using a system where you don't actually own the BOINC stuff :-).
I'm guessing that the edited file was owned by your normal user rather than the special 'boinc' user so when the client re-started it couldn't find a state file that it owned. In those cases, I think it deletes the offending intruder and makes itself a new file by copying the _prev file, so that is probably where it got the original state of affairs from.
If the currently running client still has the original set of tasks with the 'uploading' task that we are trying to fix, you need to repeat the whole procedure but this time change the ownership and permissions of the state file, after editing, to match what they were for the unedited file. The easiest way is to use a terminal session. Are you familiar with that? If not, I could give you instructions.
Assuming you have used one before, all you need to do is make sure the owner and group for the edited file are boinc:boinc and that the permissions (some combination of -rwxrwxrwx) match those that showed for the file before you edited it. You can change directory using 'cd /var/lib/boinc-client' and then use 'ls -l client_state.xml' (from within /var/lib/boinc-client) to see the proper ownership and permissions. You might need to preface those commands with sudo if you don't have permission to be in that directory to start with.
To change ownership it's just 'sudo chown boinc:boinc client_state.xml'. To change permissions it's just 'sudo chmod nnn client_state.xml' where nnn is constructed from the three sets of rwx using r=4, w=2, x=1. For example rwx=7 rw-=6 r-x=5 r--=4 etc. So if the original file showed -rw-rw-r-- then nnn would be 664 but if it was -rw-rw---- then nnn would be 660. Those three permission groups represent read/write/execute permissions for the owner of the file, the group the owner belongs to, and the rest of the world respectively.
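The r=4, w=2, x=1 arithmetic can be sketched in a few lines of Python for anyone who wants to double-check their nnn value. The function name is invented for illustration, and it only understands the plain r, w, x and - characters, not special bits such as s or t.

```python
def octal_mode(ls_permissions):
    """Convert an 'ls -l' permission string such as '-rw-rw-r--' to its 'nnn' digits."""
    bits = ls_permissions[1:10]          # skip the leading file-type character
    if len(bits) != 9:
        raise ValueError("expected something like '-rw-rw-r--'")
    weights = {"r": 4, "w": 2, "x": 1, "-": 0}
    digits = []
    for i in range(0, 9, 3):             # owner, group, rest of the world
        digits.append(str(sum(weights[c] for c in bits[i:i + 3])))
    return "".join(digits)


print(octal_mode("-rw-rw-r--"))  # 664
print(octal_mode("-rw-rw----"))  # 660
```

The two printed values are the same two cases worked through above.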
EDIT: Sorry, thrashed out a reply without checking for further info :-). Yes, if you don't stop the client completely, it will overwrite whatever you do from its own 'in memory' copy of the state file. It ignores what's on disk and at regular intervals overwrites that stuff from what it has in memory. Doesn't do any harm to itself but has been known to create great consternation for the poor 'victim' :-).
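Since the client overwrites the file from memory, it may be worth confirming that the client itself, and not just the Manager, has really gone before editing. Below is a rough Python sketch for a Linux host. It assumes the client shows up under the process name 'boinc' (as in the kill described earlier) and the Manager under 'boincmgr'; both names are assumptions, so check what your own system reports.

```python
import os


def running_process_names(proc_root="/proc"):
    """Collect the short command names of running processes (Linux /proc layout)."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue                      # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                names.add(fh.read().strip())
        except OSError:
            continue                      # process exited while we were looking
    return names


def client_still_running(names, client_name="boinc"):
    """True if the client itself (not just the Manager) is among the names."""
    return client_name in names


if __name__ == "__main__" and os.path.isdir("/proc"):
    if client_still_running(running_process_names()):
        print("The client is still running; edits to client_state.xml will be overwritten.")
    else:
        print("No client process found; it should be safe to edit.")
```

The name match is exact, so a running 'boincmgr' on its own does not count as the client.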
Now all we need to know is if the result was accepted by the server. Do you have a pending or perhaps even a valid outcome?
Cheers,
Gary.
Well, shoot. It's in my Invalid results with a status of "Validate error". It's not recognized as completed. That task was _0; my wingman completed the _1 task early on; the Workunit now lists an _2 task, which has this information:
Will the _2 task be sent back to me, or on to another host? Will that sort itself out, or do I need to do some more editing?
When things are working well, validate errors are uncommon. I commonly saw none in many months when I ran Nvidia cards here, but notice that I have one showing now (along with 29 "completed, marked as invalid"). They are not the same thing as "completed, marked as invalid". The "Marked" type got past an initial sanity check but did not match a quorum partner well enough for both members of the quorum to be considered Valid. An additional task returned by another (typically a third) system matched the other guy better than you, so you are the odd man out in those cases.
The simple "Validate error" sort flunks a basic sanity check which for some reason is only performed when the quorum is fulfilled, but which does not involve a comparison to the quorum partner. The system from then on ignores the erroneous result, sends out a new task to another system (not yours), and things move on from there.
Maybe your system really did have some fault in processing that work, or maybe something corrupted what eventually got sent back to Einstein. In any case, your work is gone, but things are cleaned up and will move on.
Could it be that in this case the task result wasn't actually on the server when, after manual intervention, the task was reported and therefore the outcome was a validate error?
If the result isn't found it can't pass the sanity check.
If this is the case, that raises the question of why the result couldn't be uploaded in the first place.
My take is this. You had the two lines 'calling boinc_finish()' followed by 'called boinc_finish()' -- note past tense -- so with that 2nd line, you know that the app had successfully finished its work and the client was now in charge of uploading the result. You also know that the client never did receive confirmation that the upload had safely arrived. You know that from <state> still being 4 rather than 5.
There are two points where a problem could have arisen. The first is the upload (which is probably many back and forth messages) which may have been damaged/incomplete/never arrived so the server couldn't send an acknowledgement to the client.
The second point would be during further comms being sent by the server to the host confirming the safe arrival. These confirmation messages may have got lost. Now that you've found the 'validate error', this confirms that the problem was probably damage in the upload stage, not the non-arrival of a confirmation message. A received file size of zero (for example) would certainly be recognised as a validate error, as would a partial upload with an incorrect number of lines/entries etc. These are gross discrepancies that don't require a comparison to detect.
Result uploads just sit on the upload server and are not checked until a quorum has formed. It's only then that the members of the quorum undergo the basic sanity checks, followed by the detailed comparisons if both are 'sane'.
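To make that order of events concrete, here is a toy Python model of the flow as described in this thread. It is emphatically not the project's validator: the function name, the status labels, the 'non-empty file' sanity check and the exact-match comparison are all stand-ins chosen for illustration.

```python
def check_quorum(results, quorum_size=2):
    """Toy model: results maps task name -> uploaded payload (bytes)."""
    if len(results) < quorum_size:
        # Uploads just sit there; nothing is checked until a quorum has formed.
        return {name: "pending" for name in results}

    outcome = {}
    sane = {}
    for name, payload in results.items():
        if payload:                       # stand-in sanity check: file is not empty
            sane[name] = payload
        else:
            outcome[name] = "validate error"

    if len(sane) < quorum_size:
        # Not enough sane results: the survivors wait for a replacement task.
        for name in sane:
            outcome[name] = "pending"
        return outcome

    # Detailed comparison happens only between results that passed the sanity check.
    payloads = list(sane.values())
    agree = all(p == payloads[0] for p in payloads)
    for name in sane:
        outcome[name] = "valid" if agree else "no consensus yet"
    return outcome


# Roughly the situation in this thread: _0 arrived damaged, _1 was fine.
print(check_quorum({"wu_0": b"", "wu_1": b"good result"}))
```

In this model the damaged _0 result is only flagged once _1 is also in, and _1 then waits until a further task comes back to complete the quorum.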
As I mentioned, from my experience, sometimes you win and sometimes you don't. As archae86 mentions, you got rid of an annoying entry at your end and someone else gets the job of cleaning up the mess :-). The system is perfectly happy with this arrangement and there's nothing more for you to be concerned about :-).
Cheers,
Gary.
Thank you all for the feedback. My results page today shows that the task has been sent out to another host; a Windows machine with a Radeon VII, so it's in good hands now. All's well that ends well (assuming it passes validation).