Immediate timeout? Missing deadline?

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0
Topic 198091

Hi!

It seems that I got a WU without a deadline?
http://einsteinathome.org/workunit/218136984

Sent today in the early night and already timed out?

Screenshot of the website, since it will vanish in a month or so ...
http://abload.de/img/nodeadlineq9r0c.png

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7228958230
RAC: 1134786

Quote:
Sent today in the early night and already timed out?


It says no response, not a deadline miss. It seems likely that the server thinks that some expected handshaking in the work assignment/download process was not detected, so it gave up on sending that one to your host and sent a copy to another host.

I recall composing a post similar to this one an hour or so ago, but don't see it, so am risking a double post.

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0

Quote:
I recall composing a post similar to this one an hour or so ago, but don't see it, so am risking a double post.

The entry in column 4 that should show the deadline is missing. The cell is empty.

BOINC Manager shows the same thing: an empty deadline date.
http://abload.de/img/boinco4qy2.png

So I think this is a server-oops.

Compare here some real deadline:
http://einsteinathome.org/workunit/214259915
Same entry "Timed out - no response", but holding a valid deadline date.

floyd
Joined: 12 Sep 11
Posts: 133
Credit: 186610495
RAC: 0

Today I got one of those too. The task is still running but assuming the server won't accept the result I think I'll abort it.

http://einsteinathome.org/workunit/220348353

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2960589356
RAC: 705865

Quote:

Today I got one of those too. The task is still running but assuming the server won't accept the result I think I'll abort it.

http://einsteinathome.org/workunit/220348353


The problem task was h1_0378.00_S6GC1__S6BucketFU2UBb_32310395_1

Unfortunately, the host has contacted the server again since then, and picked up another task:

Quote:
2015-06-11 06:56:18.1403 [PID=981] Request: [USER#xxxxx] [HOST#11711999] [IP xxx.xxx.xxx.150] client 7.4.23
2015-06-11 06:56:18.1415 [PID=981 ] [send] effective_ncpus 7 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2015-06-11 06:56:18.1415 [PID=981 ] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2015-06-11 06:56:18.1415 [PID=981 ] [send] Not using matchmaker scheduling; Not using EDF sim
2015-06-11 06:56:18.1415 [PID=981 ] [send] CPU: req 8723.15 sec, 0.00 instances; est delay 0.00
2015-06-11 06:56:18.1415 [PID=981 ] [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
2015-06-11 06:56:18.1415 [PID=981 ] [send] work_req_seconds: 8723.15 secs
2015-06-11 06:56:18.1415 [PID=981 ] [send] available disk 4.60 GB, work_buf_min 345600
2015-06-11 06:56:18.1415 [PID=981 ] [send] active_frac 0.999977 on_frac 0.639708 DCF 0.678278
2015-06-11 06:56:18.1443 [PID=981 ] [send] [HOST#11711999] not reliable; max_result_day 31
2015-06-11 06:56:18.1444 [PID=981 ] [send] set_trust: random choice for error rate 0.000010: yes
2015-06-11 06:56:18.1444 [PID=981 ] [mixed] sending non-locality work first (0.9847)
2015-06-11 06:56:18.1648 [PID=981 ] [version] Checking plan class 'FGRP4-SSE2'
2015-06-11 06:56:18.1678 [PID=981 ] [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2015-06-11 06:56:18.1678 [PID=981 ] [version] plan class ok
2015-06-11 06:56:18.1678 [PID=981 ] [version] Best version of app hsgamma_FGRP4 is 1.06 ID 736 FGRP4-SSE2 (2.65 GFLOPS)
2015-06-11 06:56:18.1678 [PID=981 ] [send] [HOST#11711999] [WU#220632476 LATeah1056E_1136.0_118694_0.0] using delay bound 1209600 (opt: 1209600 pess: 1209600)
2015-06-11 06:56:18.1692 [PID=981 ] [debug] Sorted list of URLs follows [host timezone: UTC+7200]
2015-06-11 06:56:18.1692 [PID=981 ] [debug] zone=+03600 url=http://einstein2.aei.uni-hannover.de
2015-06-11 06:56:18.1692 [PID=981 ] [debug] zone=-18900 url=http://einstein-dl.syr.edu
2015-06-11 06:56:18.1692 [PID=981 ] [debug] zone=-21600 url=http://einstein-dl2.phys.uwm.edu
2015-06-11 06:56:18.1692 [PID=981 ] [debug] zone=-28800 url=http://einstein.ligo.caltech.edu
2015-06-11 06:56:18.1694 [PID=981 ] [send] [HOST#11711999] Sending app_version 736 hsgamma_FGRP4 7 106 FGRP4-SSE2; 2.65 GFLOPS
2015-06-11 06:56:18.1714 [PID=981 ] [send] est. duration for WU 220632476: unscaled 39655.32 scaled 42047.23
2015-06-11 06:56:18.1715 [PID=981 ] [HOST#11711999] Sending [RESULT#504393489 LATeah1056E_1136.0_118694_0.0_2] (est. dur. 42047.23 seconds, delay 1209600, deadline 1435215378)
2015-06-11 06:56:18.1731 [PID=981 ] [send] don't need more work
2015-06-11 06:56:18.1731 [PID=981 ] [mixed] sending locality work second
2015-06-11 06:56:18.1745 [PID=981 ] [send] don't need more work
2015-06-11 06:56:18.1745 [PID=981 ] [send] don't need more work
2015-06-11 06:56:18.1760 [PID=981 ] Sending reply to [HOST#11711999]: 1 results, delay req 60.00
2015-06-11 06:56:18.1770 [PID=981 ] Scheduler ran 0.040 seconds


It would be really interesting to catch and examine a server log for one of these immediate timeouts sometime, and try to work out what's going wrong. But you'd need to be quick about it.
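For comparison, here is how the deadline in a healthy log entry like the one quoted above can be cross-checked. This is only a minimal sketch: it assumes the log timestamps are UTC and that the deadline is simply the send time plus the delay bound, which the numbers in this log are consistent with. The helper names `log_time` and `delay_and_deadline` are made up for the illustration.

```python
import re
from datetime import datetime, timezone

# The "Sending [RESULT#...]" line copied from the scheduler log quoted above.
SENDING_LINE = (
    "2015-06-11 06:56:18.1715 [PID=981 ] [HOST#11711999] Sending "
    "[RESULT#504393489 LATeah1056E_1136.0_118694_0.0_2] "
    "(est. dur. 42047.23 seconds, delay 1209600, deadline 1435215378)"
)

def log_time(line):
    """Parse the leading timestamp, ASSUMING the server log clock is UTC."""
    stamp = line[:19]  # "YYYY-MM-DD HH:MM:SS", fractional seconds dropped
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)

def delay_and_deadline(line):
    """Extract the delay bound (seconds) and the deadline (Unix time)."""
    match = re.search(r"delay (\d+), deadline (\d+)", line)
    return int(match.group(1)), int(match.group(2))

sent = log_time(SENDING_LINE)
delay, deadline = delay_and_deadline(SENDING_LINE)

# Assumed relation: deadline = send time + delay bound.
expected = int(sent.timestamp()) + delay

print("delay bound in days:", delay / 86400)
print("logged deadline    :", datetime.fromtimestamp(deadline, tz=timezone.utc))
print("send time + delay  :", datetime.fromtimestamp(expected, tz=timezone.utc))
print("consistent         :", expected == deadline)
```

Running the same check against the log of a task that times out immediately would show whether the deadline was already wrong when the task was sent.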

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250647167
RAC: 34202

Thanks for reporting this problem!

So far we have not been aware of it.

Quote:
It would be really interesting to catch and examine a server log for one of these immediate timeouts sometime, and try to work out what's going wrong. But you'd need to be quick about it.

We are looking into it.

Currently we do have >1700 such tasks in the DB (send_time>0 and report_deadline=0), all of which belong to "einstein_S6BucketFU2UB", which makes me think that the reason is in the locality scheduler.
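The selection condition can be tried out on a toy table. This is a rough sketch, not the project's real schema: the table name `result` and the `id` column are assumptions for the example, and only the two column names in the condition are taken from the text above. It uses an in-memory SQLite database so it runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, stripped-down table. Only send_time and report_deadline
# are names taken from the post; everything else is invented here.
conn.execute(
    "CREATE TABLE result ("
    "id INTEGER PRIMARY KEY, send_time INTEGER, report_deadline INTEGER)"
)

rows = [
    (1, 1434005778, 1435215378),  # sent, with a normal deadline
    (2, 1434005778, 0),           # sent, but no deadline recorded: the symptom
    (3, 0, 0),                    # not sent yet, so no deadline is expected
]
conn.executemany("INSERT INTO result VALUES (?, ?, ?)", rows)

# Tasks that were sent out but carry no deadline.
affected = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM result "
        "WHERE send_time > 0 AND report_deadline = 0 ORDER BY id"
    )
]
print(affected)
```

Only the second row matches: an unsent task with a zero deadline is normal, so both conditions are needed.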

BM


dwcsoftware@gmail.com
Joined: 5 Jul 15
Posts: 1
Credit: 23353128
RAC: 0

I am about to miss my deadline on 2 tasks that have been running for 10 and 7 hrs. They have 2 and 2.75 hrs left. Due in 20min. What happens now? Do I miss the credits and end up wasting 18+hrs of CPU time?

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7228958230
RAC: 1134786

Quote:
I am about to miss my deadline on 2 tasks that have been running for 10 and 7 hrs. They have 2 and 2.75 hrs left. Due in 20min. What happens now? Do I miss the credits and end up wasting 18+hrs of CPU time?


It depends.

The system will arm itself to send out another copy to someone else, as you failed to respond in the allotted time. But it may not do so immediately, and the recipient likely will not start working on it immediately, won't finish immediately, and may not report it immediately. If you finish and report before they do, and your result agrees closely enough with that of your first quorum partner to validate, you'll get credit. If the third quorum partner then reports within their own (later) deadline, they also get credit.

But there still is a loss. The project wasted the effort of that third partner (the credit is just symbolic), and that third partner actually wasted their effort also so far as useful science is concerned. The "consolation prize" of getting credit in this case notwithstanding.

So it is a good idea to manage your queue and your participation so that you avoid missing deadlines, and not only to avoid losing credit.

Perhaps people reading this with better knowledge of the system for cancelling already distributed work will comment on the circumstances (if any) in which the third party won't waste time, because the software tells it not to run the already distributed work before that system has started on it. But in at least a fraction of real-world cases that can't possibly happen in time to avoid all waste.
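The credit rule described above can be written down as a toy model. This is a sketch of the description in this post only, not of BOINC's actual validator logic. The function name and parameters are invented for the illustration, and the case where the replacement reports first is treated as "no credit", which is an assumption the post does not state outright.

```python
from typing import Optional

def late_result_gets_credit(
    my_report_time: float,
    replacement_report_time: Optional[float],
    my_result_validates: bool,
) -> bool:
    """Toy model: does a result reported after its deadline still earn credit?

    replacement_report_time is None while the replacement copy has not
    been reported yet.
    """
    if not my_result_validates:
        # A result that does not agree with the quorum partner earns nothing.
        return False
    if replacement_report_time is None:
        # Nobody has beaten the late host to it yet.
        return True
    # ASSUMPTION: if the replacement was reported first, the late result
    # arrives too late to count.
    return my_report_time < replacement_report_time

# Late host reports at t=10, the replacement only at t=20.
print(late_result_gets_credit(10, 20, True))
# Late host reports at t=30, after the replacement at t=20.
print(late_result_gets_credit(30, 20, True))
```

In both cases the replacement host's effort is the real cost, whichever way the credit goes.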

Grubix
Joined: 1 Jul 08
Posts: 19
Credit: 159690452
RAC: 0

Hello.

In the last few days I got some tasks without a deadline.

510204919
512267367
511293311
511314272
511314214

Some were calculated; others were immediately and automatically cancelled by the client. It looks like the problem described in the first post.

Bye, Grubix.

Grubix
Joined: 1 Jul 08
Posts: 19
Credit: 159690452
RAC: 0

Next WU without a deadline: 511817146

Bye, Grubix.

Ray Stone
Joined: 6 Feb 13
Posts: 5
Credit: 185128285
RAC: 0

Quote:
I am about to miss my deadline on 2 tasks that have been running for 10 and 7 hrs. They have 2 and 2.75 hrs left. Due in 20min. What happens now? Do I miss the credits and end up wasting 18+hrs of CPU time?

My question is "what is good form in this case?"

If I know I'll miss a deadline by a couple of hours, should I abort the task and take the wasted-compute hit, or should I just let it complete and report, and "waste" someone else's CPU/GPU cycles? [I've been doing the former for any task scheduled to complete in > 1 hour]

Is there some kind of grace period during which this WU will not be sent to another computer?

Also, I've noticed that sometimes tasks are marked as "missed deadline" but not always. Is the presence of this message some indication that the WU has been sent to a 3rd computer? Is there any such indication for these situations?
