More problems with deferring communication...

zooey_glass

Joined: 23 Feb 05

Posts: 4

Credit: 71327

RAC: 0

28 Feb 2005 0:21:26 UTC

Topic 188109

During the previous 2 days, I ran 3 test jobs on my HPC cluster which failed and, I think, wasted 30 hours of processing. I don't really understand what went wrong.

I submitted the jobs via the scheduler, gave them a wall-time limit of 11 hours (10 hours is enough, according to benchmarking and previous tests), and continually monitored the progress. To ensure that only 1 work unit gets processed and that the results get uploaded immediately, I used the following switches: -exit_when_idle and -return_results_immediately. I hope that was correct.

The compute nodes sit behind an HTTP proxy. After the CPU benchmarks completed, the project files downloaded successfully via the HTTP proxy and processing immediately began. I checked my account just to see if the compute nodes registered, and they did. Everything then went smoothly until the very end.

Approximately 10 hours later processing completed and I am absolutely certain that I saw the message "work unit uploaded" or something like that. However, I checked my statistics and it states that the work unit is still classified as "in progress." I don't understand this. I ran the command-line manually again, and this time, it keeps telling me that there may be a proxy issue and it is continuing to defer communication. If it downloaded the work unit, why can't it upload the work unit?

Does anyone have any ideas about this problem or further suggestions for running the command-line client across an HPC linux cluster?

Walt Gribben

Joined: 20 Feb 05

Posts: 219

Credit: 1645393

RAC: 0

28 Feb 2005 16:36:31 UTC

Message 6054 in response to message 6053

> It's a two-step process - first the results get uploaded, then the completion
> status gets sent to the scheduler. Did you check the log to see if it
> actually did contact the scheduler? Until that happens, the WU status is
> "Ready to report".
>
> There's a very good possibility that -exit_when_idle stopped BOINC before it
> got a chance to report on the WU.
>
> Run BOINC again with -update_prefs http://einstein.phys.uwm.edu/
>
> It should report on the completed workunits.
>

I tried your suggestion and still no luck. Here is the output:

2005-02-28 08:35:16 [---] Starting BOINC client version 4.19 for i686-pc-linux-gnu
2005-02-28 08:35:16 [Einstein@Home] Project prefs: no separate prefs for home; using your defaults
2005-02-28 08:35:16 [Einstein@Home] Host ID is 51351
2005-02-28 08:35:16 [---] General prefs: from Einstein@Home (last modified 2005-02-23 13:23:27)
2005-02-28 08:35:16 [---] General prefs: no separate prefs for home; using your defaults
2005-02-28 08:35:16 [Einstein@Home] Deferring communication with project for 20 hours, 53 minutes, and 41 seconds
2005-02-28 08:35:16 [Einstein@Home] Deferring communication with project for 20 hours, 53 minutes, and 41 seconds

Any ideas?
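
For reference, the deferral in the log above can be turned into a wall-clock time. This is only a rough sketch: it assumes the message always has the exact "N hours, N minutes, and N seconds" wording shown in the log, and the function name `resume_time` is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumption: the deferral message always uses this exact wording,
# as in the log lines quoted in this thread.
DEFER_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[.*?\] "
    r"Deferring communication with project for "
    r"(\d+) hours, (\d+) minutes, and (\d+) seconds"
)


def resume_time(log_line):
    """Return the time at which the client would contact the project again,
    or None if the line is not a deferral message in the assumed format."""
    match = DEFER_LINE.match(log_line)
    if match is None:
        return None
    logged_at = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    hours, minutes, seconds = (int(match.group(i)) for i in (2, 3, 4))
    return logged_at + timedelta(hours=hours, minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    line = ("2005-02-28 08:35:16 [Einstein@Home] Deferring communication "
            "with project for 20 hours, 53 minutes, and 41 seconds")
    print(resume_time(line))
```

For the log line above this gives 2005-03-01 05:28:57, in whatever time zone the client logs in.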

wijata.com

Joined: 11 Feb 05

Posts: 113

Credit: 25495895

RAC: 0

28 Feb 2005 16:47:13 UTC

Message 6055

Well, I had the same problem. I fixed it this way:
Stop BOINC, find client_state.xml, find the line with min_rpc_time, replace the value with 0.000000, and start BOINC again.

It works, but I don't know whether it breaks something or not.

Before that, you may want to go to your general preferences and set 'Connect to network about every' to something like 1 day.
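
The manual edit described above could also be scripted. What follows is only a sketch under assumptions: it assumes the value sits in an element of the form `<min_rpc_time>VALUE</min_rpc_time>`, the function name `reset_min_rpc_time` is invented for illustration, and the sample value is hypothetical. As noted above, it is not known whether zeroing this value has side effects, so keep a backup of the file.

```python
import re

# Assumption: the value appears as <min_rpc_time>VALUE</min_rpc_time>.
MIN_RPC_TIME = re.compile(r"(<min_rpc_time>)[^<]*(</min_rpc_time>)")


def reset_min_rpc_time(state_text):
    """Return state_text with every min_rpc_time value replaced by 0.000000."""
    return MIN_RPC_TIME.sub(r"\g<1>0.000000\g<2>", state_text)


if __name__ == "__main__":
    # Hypothetical fragment, not copied from a real client_state.xml.
    sample = (
        "<project>\n"
        "    <min_rpc_time>1100000000.000000</min_rpc_time>\n"
        "</project>\n"
    )
    print(reset_min_rpc_time(sample))
```

To apply it to a real file, stop BOINC first, read client_state.xml into a string, pass it through the function, and write the result back before starting BOINC again.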

zooey_glass

Joined: 23 Feb 05

Posts: 4

Credit: 71327

RAC: 0

28 Feb 2005 17:47:10 UTC

Message 6056 in response to message 6055

> Well, I had the same problem. I fixed it this way:
> Stop BOINC, find client_state.xml, find the line with min_rpc_time, replace the
> value with 0.000000, and start BOINC again.
>
> It works, but I don't know whether it breaks something or not.
>
> Before that, you may want to go to your general preferences and set 'Connect to
> network about every' to something like 1 day.
>

Worked like a charm. Thanks very much.

I don't understand why such a large integer value was set there in the first place.

zooey_glass

Joined: 23 Feb 05

Posts: 4

Credit: 71327

RAC: 0

28 Feb 2005 17:50:00 UTC

Message 6057 in response to message 6056

> > Well, I had the same problem. I fixed it this way:
> > Stop BOINC, find client_state.xml, find the line with min_rpc_time, replace
> > the value with 0.000000, and start BOINC again.
> >
> > It works, but I don't know whether it breaks something or not.
> >
> > Before that, you may want to go to your general preferences and set 'Connect
> > to network about every' to something like 1 day.
> >
>
> Worked like a charm. Thanks very much.
>
> I don't understand why such a large integer value was set there in the first
> place.
>

I just set the value for "Connect to network every..." to 0 days. What exactly will this do? Should I have set it to every 0.000001 days or something?

wijata.com

Joined: 11 Feb 05

Posts: 113

Credit: 25495895

RAC: 0

28 Feb 2005 18:04:42 UTC

Message 6058

Re "Connect to network every...": I don't know what 0 days means, but 0.5 days means that BOINC downloads enough work for at least 0.5 days. If it sees that the work may be finished earlier than 0.5 days from now, it downloads more.

Re "large integer": the large value 11... looks to me like the Unix time at which the next contact may be made (seconds since 1970 or so).
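
A quick way to check that reading is to convert such a value into a calendar date. This is a minimal sketch: the helper name `unix_to_utc` is made up, and 1100000000 is only a hypothetical value starting with "11", not the one from the file.

```python
from datetime import datetime, timezone


def unix_to_utc(seconds):
    """Interpret seconds since 1970-01-01 00:00:00 UTC as a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Hypothetical value for illustration only.
    print(unix_to_utc(1100000000))
```

A value in that range lands in late 2004 or early 2005, which fits a "next allowed contact" time set by a client running in February 2005.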

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1125

Credit: 172127663

RAC: 0

1 Mar 2005 18:20:08 UTC

Message 6059 in response to message 6058

> Re "large integer": the large value 11... looks to me like the Unix time at which the
> next contact may be made (seconds since 1970 or so).

Correct! (man 2 time)

Bruce

Director, Einstein@Home
