Error message is new to me

Odysseus
Joined: 17 Dec 05
Posts: 372
Credit: 16,762,743
RAC: 6,675

I just got a couple of those, too, on my Mac G4/733: I’d never seen that message before.

[pre]Thu Jan 11 16:35:31 2007|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
Thu Jan 11 16:35:31 2007|Einstein@Home|Reason: To fetch work
Thu Jan 11 16:35:31 2007|Einstein@Home|Requesting 53235 seconds of new work, and reporting 2 completed tasks
[…]
Thu Jan 11 16:35:46 2007|Einstein@Home|Scheduler request succeeded
Thu Jan 11 16:35:46 2007|Einstein@Home|Message from server: Completed result h1_0344.5_S5R1__8196_S5R1a_0 refused: successful result ALREADY reported for this work
Thu Jan 11 16:35:47 2007|Einstein@Home|Message from server: Completed result h1_0344.5_S5R1__8195_S5R1a_0 refused: successful result ALREADY reported for this work
[/pre]

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,625
Credit: 89,661,115,084
RAC: 59,553,480

Lots of people are seeing these "result ALREADY reported" messages. I've probably seen this more than 30-40 times and I haven't really been looking for it :). My guess is that due to the ongoing server flakiness, there are many instances of a server disruption just after the server has received and tried to acknowledge the receipt of results being reported. The server has the report and thinks it has issued the acknowledgement but the client has not actually seen it. Thus the client will not delete the result and will try again later, at which point the server will issue the rather peeved response with the undue emphasis on "ALREADY" just to show how peeved it is that the poor client is not keeping up with the game :).

So my guess is that this is all quite harmless client/server banter that you normally wouldn't see, but for the server flakiness.
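The guess above can be sketched as a tiny simulation. This is only an illustration of the "lost acknowledgement" idea, not BOINC's actual code: the `Server` and `Client` classes, the `reply_lost` flag, and the assumption that a client drops a result once it sees either an acknowledgement or a refusal are all hypothetical.

```python
# Hypothetical sketch of a "lost acknowledgement" exchange; not BOINC's real code.

class Server:
    def __init__(self):
        self.reported = set()  # results the server has already accepted

    def report(self, result_name):
        """Accept a result once; refuse any repeat of the same report."""
        if result_name in self.reported:
            return (f"Completed result {result_name} refused: "
                    "successful result ALREADY reported for this work")
        self.reported.add(result_name)
        return "ack"


class Client:
    def __init__(self, pending):
        self.pending = list(pending)  # finished results not yet confirmed

    def contact(self, server, reply_lost=False):
        """Report every pending result; keep them all if the reply never arrives."""
        replies = [server.report(name) for name in self.pending]
        if reply_lost:
            return []  # the server did its part, but the client saw nothing
        # Assumption: any reply (ack or refusal) lets the client drop the result.
        self.pending = []
        return replies


server = Server()
client = Client(["h1_0344.5_S5R1__8196_S5R1a_0"])
first = client.contact(server, reply_lost=True)  # report accepted, ack lost
second = client.contact(server)                  # retry triggers the refusal
```

In this sketch the work is recorded on the first contact, so the refusal on the second contact does not mean anything was lost.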

There is another message which occurs much less frequently which initially did cause me some concern. I've seen it about 3-4 times now. Here is the latest example which occurred just a couple of hours ago that I've just seen by chance.

Quote:

2007-01-12 10:26:00 [Einstein@Home] Sending scheduler request: Requested by user
2007-01-12 10:26:00 [Einstein@Home] Requesting 14635 seconds of new work, and reporting 6 completed tasks
2007-01-12 10:26:05 [Einstein@Home] Scheduler RPC succeeded [server version 505]
2007-01-12 10:26:05 [Einstein@Home] Message from server: Completed result h1_0391.0_S5R1__19212_S5R1a_1 refused: successful result ALREADY reported for this work
2007-01-12 10:26:05 [Einstein@Home] Message from server: Resent lost result h1_0344.5_S5R1__4604_S5R1a_1

2007-01-12 10:26:05 [Einstein@Home] Deferring communication 1 minutes and 0 seconds, because requested by project
2007-01-12 10:26:07 [Einstein@Home] [file_xfer] Started download of file h1_0344.5_S5R1
2007-01-12 10:26:07 [Einstein@Home] [file_xfer] Started download of file grid_0350_h_T03_S5R1.dat

2007-01-12 10:26:13 [Einstein@Home] [file_xfer] Finished download of file grid_0350_h_T03_S5R1.dat
2007-01-12 10:26:13 [Einstein@Home] [file_xfer] Throughput 31622 bytes/sec
2007-01-12 10:34:24 [Einstein@Home] [file_xfer] Finished download of file h1_0344.5_S5R1
2007-01-12 10:34:24 [Einstein@Home] [file_xfer] Throughput 32700 bytes/sec

Notice that the line containing "refused" is an example of the "ALREADY" message. The "Resent lost result" line that follows it is telling the client that a "lost result" is being resent. This in itself is not that unusual, since the client/server protocol (on EAH anyway) does allow any discrepancy between what the server thinks it has sent and what the client has actually received to be reconciled and resynchronised as necessary.

The thing that initially disturbed me was that the server didn't send just the missing result; it also sent the whole 15.7 MB large data file and the associated grid file (the two "Started download" lines in the log). My immediate thought was: if just one result was missing, why didn't I already have the large data file on my system? Why is the server wasting bandwidth by sending it all again? Then the penny dropped. On any request for work, the server could decide that it was time for a new large data file. Whilst this hopefully occurs relatively infrequently, there is still a chance that a flaky server, answering a request for new work, thinks it has sent you a new large data file although your client has actually not received it. So a subsequent request for work will cause the server to notice the "out-of-sync" condition and then remedy the situation.
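The "out-of-sync" check described above might look roughly like the sketch below. It assumes the server keeps a record of which results it has assigned to a host and which files each result needs, and compares that with what the client says it holds. The function `plan_resend` and its data layout are hypothetical; only the result and file names come from the log above.

```python
# Hypothetical reconciliation sketch; the real scheduler logic may differ.

def plan_resend(server_assigned, client_results, client_files):
    """Return (lost results, files to download) for one host.

    server_assigned maps each result name the server believes it sent
    to the list of data files that result needs.
    """
    lost = sorted(name for name in server_assigned
                  if name not in client_results)
    files = sorted({f for name in lost
                    for f in server_assigned[name]
                    if f not in client_files})
    return lost, files


assigned = {
    "h1_0344.5_S5R1__4604_S5R1a_1": ["h1_0344.5_S5R1",
                                     "grid_0350_h_T03_S5R1.dat"],
}
# The client never received the result or its data files.
lost, files = plan_resend(assigned, client_results=set(), client_files=set())
```

If the client had already held `h1_0344.5_S5R1`, only the grid file would be scheduled for download, which is why a full resend suggests the original large file never arrived.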

I don't know if these explanations are correct or even approximately correct but I'm reasonably happy to ignore this client/server banter and to look forward to the time when all this current strife is a fast-fading dim memory :).

Cheers,
Gary.

mray
Joined: 23 Dec 05
Posts: 5
Credit: 212,796,731
RAC: 195,926

Access to the Einstein site has been up and down wildly for me for the last few days (today was better). I'm not sure if it's due to the file server issues or network issues. I haven't seen any reports of additional file server issues, but one minute I can connect, next "server not found", then a few minutes later I can connect again. I've seen people on other boards make the same observations about Einstein. I lost a few posts here because the server "disappeared" between when I started typing and when I hit send. That's why I made my "yo yo" thread.

The same thing may be happening to you people getting those duplicate result errors, the connection is there for the upload but goes poof for the acknowledgment. Maybe it was just the server being overloaded because of all the connections after it came back up.


Udo
Joined: 19 May 05
Posts: 203
Credit: 8,945,570
RAC: 0

Message 58238 in response to message 58237

Quote:

Access to the Einstein site has been up and down wildly for me for the last few days (today was better). I'm not sure if it's due to the file server issues or network issues. I haven't seen any reports of additional file server issues, but one minute I can connect, next "server not found", then a few minutes later I can connect again. I've seen people on other boards make the same observations about Einstein. I lost a few posts here because the server "disappeared" between when I started typing and when I hit send. That's why I made my "yo yo" thread.

The same thing may be happening to you people getting those duplicate result errors, the connection is there for the upload but goes poof for the acknowledgment. Maybe it was just the server being overloaded because of all the connections after it came back up.

Currently it seems that the Einstein 'BOINC scheduler' can't create WUs fast enough...
See 'Oldest Unsent Result 0 d 0 h 1 m' on the server status page.
As almost all the WUs available are short ones, the Einstein servers are stressed much more...

Udo

gomeyer
Joined: 5 Jan 07
Posts: 3
Credit: 1,901,516
RAC: 0

Message 58239 in response to message 58236

Quote:
Lots of people are seeing these "result ALREADY reported" messages.
-snip-
So my guess is that this is all quite harmless client/server banter that you normally wouldn't see, but for the server flakiness.


Gary-
I hope you're right, but per the following it appears that these represent lost work. Note 8 results reported and 8 results refused.

1/15/2007 11:55:28 AM|Einstein@Home|Requesting 84259 seconds of new work, and reporting 8 completed tasks
1/15/2007 11:55:33 AM|Einstein@Home|Scheduler request succeeded
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__15295_S5R1a_1 refused: successful result ALREADY reported for this work
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__15224_S5R1a_1 refused: successful result ALREADY reported for this work
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__15163_S5R1a_1 refused: successful result ALREADY reported for this work
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__15020_S5R1a_0 refused: successful result ALREADY reported for this work
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__14858_S5R1a_1 refused: successful result ALREADY reported for this work
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__14857_S5R1a_1 refused: successful result ALREADY reported for this work
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__14419_S5R1a_0 refused: successful result ALREADY reported for this work
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__14418_S5R1a_0 refused: successful result ALREADY reported for this work
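For anyone who wants to make the same comparison on their own message log, a minimal sketch follows. It assumes the log lines use the wording shown in this thread; the `summarize` function and the shortened two-result sample are made up for illustration.

```python
import re

def summarize(log_text):
    """Count tasks reported and collect the names of refused results."""
    reported = sum(int(n) for n in
                   re.findall(r"reporting (\d+) completed tasks", log_text))
    refused = re.findall(
        r"Completed result (\S+) refused: successful result ALREADY reported",
        log_text)
    return reported, refused


# Shortened, illustrative sample in the same format as the log above.
sample = """\
1/15/2007 11:55:28 AM|Einstein@Home|Requesting 84259 seconds of new work, and reporting 2 completed tasks
1/15/2007 11:55:33 AM|Einstein@Home|Scheduler request succeeded
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__15295_S5R1a_1 refused: successful result ALREADY reported for this work
1/15/2007 11:55:33 AM|Einstein@Home|Message from server: Completed result h1_0394.5_S5R1__15224_S5R1a_1 refused: successful result ALREADY reported for this work
"""
reported, refused = summarize(sample)
```

Searching earlier parts of the log for the names returned in `refused` shows whether each one appeared in a previous exchange.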

Very frustrating for a newbie exile from SAH due to massive lost work on their site. - Sigh -

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,625
Credit: 89,661,115,084
RAC: 59,553,480

Message 58240 in response to message 58239

Quote:

Gary-
I hope you're right, but per the following it appears that these represent lost work. Note 8 results reported and 8 results refused.
.....

If you look back through the records of your client/server comms, you should find a previous (but apparently not fully completed) exchange for each of the results now being refused. I believe that the server had already accepted the report on this previous exchange. That's how it looks for the sample of my own that I have checked back on.

Quote:

Very frustrating for a newbie exile from SAH due to massive lost work on their site. - Sigh -

Yeah, very frustrating indeed for everybody. I'm sure some more drastic action would have been taken if there was significant data loss going on.

Anyway, welcome to EAH and hopefully things will soon be fixed on the original server and the current temporary replacement can quickly be retired.

Cheers,
Gary.

gomeyer
Joined: 5 Jan 07
Posts: 3
Credit: 1,901,516
RAC: 0

Message 58241 in response to message 58240

Quote:
Quote:

Gary-
I hope you're right, but per the following it appears that these represent lost work. Note 8 results reported and 8 results refused.
.....

If you look back through the records of your client/server comms, you should find a previous (but apparently not fully completed) exchange for each of the results now being refused. I believe that the server had already accepted the report on this previous exchange. That's how it looks for the sample of my own that I have checked back on.


Gary - You are correct sir! These and the other “Already Reported” errors I’ve received had indeed been successfully reported on the previous server contact even tho' that contact had reported an error. These were then reported again, thus the errors.

Thanks much, my OCD and I will sleep just a little better tonight.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,112
Credit: 1,768,587,236
RAC: 3,421,642

Another new one to me:

16/01/2007 10:03:06|einstein@home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
16/01/2007 10:03:06|einstein@home|Message from server: Server can't open database
16/01/2007 10:03:06|einstein@home|Project is down

For a change, a BOINC message which makes perfect sense! But unfortunately, it doesn't hint at a solution.

Has anyone else felt the cycle of - can't access website; website slowly comes up; website runs quick and clean for a few minutes; website access is slow again; can't access website - with a timescale roughly that of a computer continually rebooting itself?

gomeyer
Joined: 5 Jan 07
Posts: 3
Credit: 1,901,516
RAC: 0

A new wrinkle . . .

1/16/2007 5:28:04 AM|Einstein@Home|Requesting 18262 seconds of new work, and reporting 8 completed tasks
1/16/2007 5:28:26 AM||Project communication failed: attempting access to reference site
1/16/2007 5:28:28 AM||Access to reference site succeeded - project servers may be temporarily down.
1/16/2007 5:28:30 AM|Einstein@Home|Scheduler request failed: couldn't connect to server
1/16/2007 5:28:30 AM|Einstein@Home|Deferring scheduler requests for 1 minutes and 0 seconds
1/16/2007 5:29:30 AM|Einstein@Home|Fetching scheduler list
1/16/2007 5:29:52 AM||Project communication failed: attempting access to reference site
1/16/2007 5:29:53 AM||Access to reference site succeeded - project servers may be temporarily down.
1/16/2007 5:29:55 AM|Einstein@Home|Scheduler list fetch failed: http error
1/16/2007 5:29:55 AM|Einstein@Home|4 consecutive failures fetching scheduler list - deferring 604800 seconds
1/16/2007 5:29:55 AM|Einstein@Home|Deferring scheduler requests for 1 weeks, 0 days, 0 hours, 0 minutes and 0 seconds

1 week?? Ya gotta laugh :o)
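The 604800 figure in that log really is exactly one week. A quick way to check any of the "seconds" values quoted in this thread is sketched below; `split_duration` is a hypothetical helper, not part of the BOINC client.

```python
def split_duration(seconds):
    """Break a number of seconds into (weeks, days, hours, minutes, seconds)."""
    weeks, rest = divmod(seconds, 7 * 24 * 3600)
    days, rest = divmod(rest, 24 * 3600)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return weeks, days, hours, minutes, secs


deferral = split_duration(604800)   # the deferral from the log above
work_request = split_duration(84259)  # a work request quoted earlier in the thread
```

Here `deferral` comes out as one week with nothing left over, matching the client's own "1 weeks, 0 days, 0 hours, 0 minutes and 0 seconds" message.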

paul milton
Joined: 16 Sep 05
Posts: 329
Credit: 35,825,044
RAC: 0

Message 58244 in response to message 58242

Quote:


(..snip..)
16/01/2007 10:03:06|einstein@home|Message from server: Server can't open database
16/01/2007 10:03:06|einstein@home|Project is down

For a change, a BOINC message which makes perfect sense! But unfortunately, it doesn't hint at a solution.
(..snip..)
with a timescale roughly that of a computer continually rebooting itself?

Right, now that's a scary thought.

As to the BOINC message, I doubt there's anything we as users can do. It may be that the database server was off when you tried to hit it; I got the same error. Needless to say, all these short WUs are putting one heck of a strain on the hardware, and those 15 - 16 MB downloads aren't helping. Whoever thought the idea of one big data pack for multiple WUs was a great one is probably rethinking that right about now lol.

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.
