More problems with deferring communication...
During the previous 2 days, I ran 3 test jobs on my HPC cluster which failed and, I think, wasted 30 hours of processing. I don't really understand what went wrong.
I submitted the jobs via the scheduler, gave them a wall-time limit of 11 hours (10 hours, according to benchmarking and previous tests, is enough), and continually monitored the progress. To ensure that only 1 work unit gets processed and that the results get uploaded immediately, I used the following switches: -exit_when_idle and -return_results_immediately. I hope that was correct.
The compute nodes sit behind an HTTP proxy. After the CPU benchmarks completed, the project files downloaded successfully via the HTTP proxy and processing began immediately. I checked my account just to see if the compute nodes had registered, and they had. Everything then went smoothly until the very end.
Approximately 10 hours later processing completed, and I am absolutely certain that I saw the message "work unit uploaded" or something like that. However, I checked my statistics and the work unit is still classified as "in progress." I don't understand this. I ran the command-line client manually again, and this time it keeps telling me that there may be a proxy issue and that it is continuing to defer communication. If it could download the work unit, why can't it upload the result?
Does anyone have any ideas about this problem, or further suggestions for running the command-line client across an HPC Linux cluster?
It's a two-step process: first the results get uploaded, then the completion status gets sent to the scheduler. Did you check the log to see if it actually did contact the scheduler? Until that happens, the WU status is "Ready to report".
There's a very good possibility that -exit_when_idle stopped BOINC before it got a chance to report on the WU.
Run BOINC again with -update_prefs http://einstein.phys.uwm.edu/
It should report on the completed workunits.
> Its a two step process - first the results get uploaded, then the completion
> status gets sent to the scheduler. Did you check the log to see it it
> actually did contact the scheduler? Until that happens, the WU status is
> "Ready to report".
>
> Theres a very good possibility that -exit_when_idle stopped BOINC before it
> got a chance to report on the WU.
>
> Run BOINC again with -update_prefs http://einstein.phys.uwm.edu/
>
> It should report on the completed workunits.
>
I tried your suggestion and still no luck. Here is the output:
2005-02-28 08:35:16 [---] Starting BOINC client version 4.19 for i686-pc-linux-gnu
2005-02-28 08:35:16 [Einstein@Home] Project prefs: no separate prefs for home; using your defaults
2005-02-28 08:35:16 [Einstein@Home] Host ID is 51351
2005-02-28 08:35:16 [---] General prefs: from Einstein@Home (last modified 2005-02-23 13:23:27)
2005-02-28 08:35:16 [---] General prefs: no separate prefs for home; using your defaults
2005-02-28 08:35:16 [Einstein@Home] Deferring communication with project for 20 hours, 53 minutes, and 41 seconds
2005-02-28 08:35:16 [Einstein@Home] Deferring communication with project for 20 hours, 53 minutes, and 41 seconds
Any ideas?
Well, I had the same problem. This is what I did.
Stop BOINC, find client_state.xml, find the line with min_rpc_time, replace the value with 0.000000, and start BOINC again.
It works, but I don't know whether it breaks anything.
Before that, you may want to go to your general preferences and set 'Connect to network about every' to something like 1 day.
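For anyone who would rather script that edit than do it by hand, here is a minimal Python sketch. It assumes the value is stored as a simple element of the form `<min_rpc_time>...</min_rpc_time>` (I have not checked this against every client version), and the function names are made up for illustration. Stop BOINC before running it; the file helper keeps a `.bak` copy because, as said above, it is not clear whether this breaks anything.

```python
import re
import shutil
from pathlib import Path

# Assumption: the value sits in a plain element such as
#   <min_rpc_time>1109654321.000000</min_rpc_time>
_MIN_RPC_TIME = re.compile(r"(<min_rpc_time>)[^<]*(</min_rpc_time>)")


def reset_min_rpc_time(state_text: str) -> str:
    """Return the state text with every min_rpc_time value set to 0.000000."""
    return _MIN_RPC_TIME.sub(r"\g<1>0.000000\g<2>", state_text)


def reset_state_file(path: str) -> None:
    """Back up the state file next to itself, then rewrite it in place.

    Hypothetical helper: run it only while the BOINC client is stopped.
    """
    state_path = Path(path)
    backup = state_path.with_name(state_path.name + ".bak")
    shutil.copy2(state_path, backup)  # keep the original in case something breaks
    state_path.write_text(reset_min_rpc_time(state_path.read_text()))
```

Usage would be something like `reset_state_file("client_state.xml")` from the BOINC directory, followed by starting the client again.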
> Well, i had the same. I did this way.
> Stop boinc, find client_state.xml, find line with min_rpc_time, replace the
> value with 0.000000, start boinc again.
>
> It works, but I don't know whether it breaks something or not.
>
> Before that You may want go to Your general preferences and set 'Connect to
> network about every' to something like 1day.
>
>
Worked like a charm. Thanks very much.
I don't understand why such a large integer value was set there in the first place.
> > Well, i had the same. I did this way.
> > Stop boinc, find client_state.xml, find line with min_rpc_time, replace the
> > value with 0.000000, start boinc again.
> >
> > It works, but I don't know whether it breaks something or not.
> >
> > Before that You may want go to Your general preferences and set 'Connect to
> > network about every' to something like 1day.
> >
>
> Worked like a charm. Thanks very much.
>
> I don't understand why such a large integer value was set there in the first
> place.
>
I just set the value for "Connect to network every..." to 0 days. What exactly will this do? Should I have set it to every 0.000001 days or something?
Re "Connect to network every...": I don't know what 0 days means, but 0.5 days means that BOINC downloads enough work for at least 0.5 days. If it sees that the work may be finished earlier than 0.5 days from now, it downloads more.
Re "large integer": the large value 11... looks to me like Unix time for when the next contact may be made (seconds since 1970 or so).
> Ad "large integer". The large value 11... look to me like unixtime when the
> next contact may be done (second since 1970 or so).
Correct! (man 2 time)
Bruce
Director, Einstein@Home
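To make the Unix-time reading concrete, here is a small Python sketch that turns a min_rpc_time value into a calendar time and into the remaining deferral that the client prints. It assumes the value is seconds since 1970-01-01 UTC, as confirmed above; the function names are hypothetical and not part of BOINC.

```python
from datetime import datetime, timedelta, timezone


def next_contact_time(min_rpc_time: float) -> datetime:
    """Interpret min_rpc_time as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(min_rpc_time, tz=timezone.utc)


def remaining_deferral(min_rpc_time: float, now: float) -> timedelta:
    """Time left before the next allowed contact; zero if already past."""
    return timedelta(seconds=max(0.0, min_rpc_time - now))
```

Passing the value from client_state.xml together with `time.time()` as `now` gives the same kind of figure as the "Deferring communication with project for ..." line in the log. A stored value of 0 is always in the past, which fits with the edit above ending the deferral.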