new units not downloading

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117773855325
RAC: 34789608

Ahhh OK... On one occasion,

Ahhh OK... On one occasion, the re-download got to about 50K and then stalled. Maybe I was snagging the file just as the server was deleting it :). When I did get the full download (about 8 megs), I'd just stop and delete again, and that seemed to cure it. Bit of an eerie feeling when it's telling you it's getting a file that's not supposed to be there. Hopefully all the servers are synced up now.

The interesting question is how the silent majority will react: the people who don't regularly follow the lists and who are going to be mightily confused by these strange happenings. Is it possible to send a short email to all registered users warning them to check whether they have h1_nnnn style data files and, if so, to check the web page for details? I can just imagine the complaints if someone does a couple of days of h1 work and doesn't immediately notice that there is no credit. I watched one of mine do exactly that, and it spurred me into action :).

Cheers,
Gary.

Divide Overflow
Joined: 9 Feb 05
Posts: 91
Credit: 183220
RAC: 0

What about the "l1_xxx" WUs?

What about the "l1_xxx" WUs? I understand the point that lowercase h WUs are troublesome right now and should be aborted. What about lowercase l WUs?

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: What about the "l1_xxx"

Message 13538 in response to message 13537

Quote:
What about the "l1_xxx" WUs? I understand the point that lowercase h WUs are troublesome right now and should be aborted. What about lowercase l WUs?

Lowercase l workunits (l1_XXXX.X__...) are FINE! This is because we don't have any data sets labeled L1_XXXX.[05] for them to be confused with.

Bruce

Director, Einstein@Home

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: The interesting

Message 13539 in response to message 13536

Quote:
The interesting question is how the silent majority will react: the people who don't regularly follow the lists and who are going to be mightily confused by these strange happenings. Is it possible to send a short email to all registered users warning them to check whether they have h1_nnnn style data files and, if so, to check the web page for details? I can just imagine the complaints if someone does a couple of days of h1 work and doesn't immediately notice that there is no credit. I watched one of mine do exactly that, and it spurred me into action :).

I have thought about doing this. There are about 6000 host machines that got these workunits, and about 5000 users. But it would take me some hours to cobble together and test scripts for mailing the users, and I would rather spend the time making sure (testing!) that the new w1_XXXX workunits are OK.
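
For the curious, a notification script along those lines need not be elaborate. Here is a minimal sketch, assuming the affected addresses have already been exported from the project database to a text file and that a local SMTP relay is available (both are assumptions; a real run would query the project database directly):

```python
# Minimal sketch: warn users who received h1_ workunits.
# Assumptions: affected_users.txt holds one address per line
# (exported from the project database beforehand) and localhost
# runs an SMTP relay. The From address is a placeholder.
import smtplib
from email.message import EmailMessage

BODY = ("If your machine holds h1_nnnn data files, please see the "
        "Einstein@Home front page for instructions on aborting that work.")

with open("affected_users.txt") as f:
    recipients = [line.strip() for line in f if line.strip()]

with smtplib.SMTP("localhost") as smtp:
    for addr in recipients:
        msg = EmailMessage()
        msg["From"] = "admin@example.org"
        msg["To"] = addr
        msg["Subject"] = "Einstein@Home: please check for h1_ workunits"
        msg.set_content(BODY)
        smtp.send_message(msg)
```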

[Edit added 30 min later]
I found a script that I have used before, which I can use to grant credit to users/hosts/teams for workunits that I have cancelled. I am going to use it to grant credit to people who have had the misfortune of getting and doing work and then having it cancelled.
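
In outline, such a credit-granting pass boils down to something like the sketch below. It is illustrative only: the table and column names are modelled on the BOINC server schema, and the cancelled-workunit flag (error_mask bit 16) is an assumption, so treat every identifier as unverified:

```python
# Sketch of a credit-granting pass for cancelled workunits against a
# BOINC-style MySQL database. Table/column names and the value of
# WU_ERROR_CANCELLED are assumptions. The "%s" paramstyle assumes a
# MySQL driver such as MySQLdb.
WU_ERROR_CANCELLED = 16

def grant_credit_for_cancelled(conn):
    """Grant each unlucky result its claimed credit, updating totals."""
    cur = conn.cursor()
    # Results attached to cancelled workunits that claimed credit
    # but were never granted any.
    cur.execute(
        "SELECT r.id, r.userid, r.hostid, r.claimed_credit"
        "  FROM result r JOIN workunit w ON r.workunitid = w.id"
        " WHERE (w.error_mask & %s) != 0"
        "   AND r.granted_credit = 0 AND r.claimed_credit > 0",
        (WU_ERROR_CANCELLED,),
    )
    for rid, userid, hostid, credit in cur.fetchall():
        cur.execute("UPDATE result SET granted_credit = %s WHERE id = %s",
                    (credit, rid))
        cur.execute("UPDATE user SET total_credit = total_credit + %s"
                    " WHERE id = %s", (credit, userid))
        cur.execute("UPDATE host SET total_credit = total_credit + %s"
                    " WHERE id = %s", (credit, hostid))
        # Team totals would follow the same pattern via user.teamid.
    conn.commit()

# Usage (driver is an assumption):
#   grant_credit_for_cancelled(MySQLdb.connect(db="einstein"))
```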

Bruce

Director, Einstein@Home

gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

RE: I found a script that

Message 13540 in response to message 13539

Quote:


I found a script that I have used before, which I can use to grant credit to users/hosts/teams for workunits that I have cancelled. I am going to use it to grant credit to people who have had the misfortune of getting and doing work and then having it cancelled.

That is a nice touch, Bruce.

Fortunately it does not affect me, but I'm pleased to see the swift way the problem has been dealt with.

~~gravywavy

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: RE: I found a script

Message 13541 in response to message 13540

Quote:
Quote:


I found a script that I have used before, which I can use to grant credit to users/hosts/teams for workunits that I have cancelled. I am going to use it to grant credit to people who have had the misfortune of getting and doing work and then having it cancelled.

That is a nice touch, Bruce.

Fortunately it does not affect me, but I'm pleased to see the swift way the problem has been dealt with.

Thank you very much.

Real science is VERY error-prone. In fact, one of the distinguishing characteristics of real research is that (especially the first and second time) one gets it wrong more often than not. The only saving grace in all of this is that with other scientists you get 99.9% forgiveness for being brutally honest about what happened and why. That's the one thing that I can promise Einstein@Home participants they will get 100% of the time.

Director, Einstein@Home

CJOrtega
Joined: 19 Feb 05
Posts: 39
Credit: 1742781
RAC: 0

RE: RE: Yes! I have

Message 13542 in response to message 13534

Quote:
Quote:

Yes!

I have CANCELLED all h1_ workunits. That means that any CPU time spent on them is entirely wasted. No credits, no glory, no good.

Shoot those workunits before they tire out your CPUs.

In anticipation of that answer, I've just finished deleting h1_nnnn work on about a dozen boxes that I can actually get physical access to. Bit of a struggle on V4.19, as it doesn't have the nice abort button that the later CCs have. Here's basically what I had to do.

1. Stop BOINC.
2. Delete the large h1_nnnn file in the einstein subdir of the projects dir.
3. Restart BOINC. It would complain about missing files and would try to re-download them.
4. The current WU would error out, and the re-download would mostly fail, though occasionally it seemed to succeed.
5. Stop BOINC and repeat the procedure. The next h1_nnnn would then seem to error out.
6. I think that on all second passes, BOINC would then get an l1_nnnn data file, and I knew I was winning.
7. I'd throw in the odd "update", which occasionally seemed to help. I also had to stop and start BOINC to get processing started.

The interesting thing was that on at least three occasions BOINC claimed to be able to re-download at least part of the h1_nnnn large file. I thought those were all supposed to have been deleted? Maybe BOINC was kidding itself :).

########
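
Gary's steps above lend themselves to scripting if you have a lot of boxes to clean. Here is a rough Python sketch of the delete step only; the data directory and project subdirectory are assumptions (a default Windows install, with the subdirectory guessed from the project URL), and stopping/restarting BOINC around it is left to you:

```python
# Sketch: remove the bad lowercase h1_ data files from the Einstein
# project directory. Paths are assumptions (default Windows install;
# subdirectory guessed from the project URL).
from pathlib import Path

BOINC_DIR = Path(r"C:\Program Files\BOINC")
EINSTEIN_DIR = BOINC_DIR / "projects" / "einstein.phys.uwm.edu"

for f in EINSTEIN_DIR.glob("h1_*"):
    # Globbing is case-insensitive on Windows, and the legitimate H1_
    # data files differ from the bad h1_ ones only by case, so check
    # the case explicitly before deleting anything.
    if f.name.startswith("h1_"):
        print("deleting", f)
        f.unlink()
```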

I took the easy way out. :-)

Stopped the BOINC service, waited 1/2 min., started the BOINC service.
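
Scripted, that stop/wait/start amounts to no more than this little sketch (the Windows service name "BOINC" is an assumption, so check yours with sc query first):

```python
# Sketch: bounce the BOINC Windows service, with the half-minute pause.
# The service name "BOINC" is an assumption.
import subprocess
import time

subprocess.run(["net", "stop", "BOINC"], check=True)
time.sleep(30)
subprocess.run(["net", "start", "BOINC"], check=True)
```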

Then did an update of the Einstein project via BoincView:
6/28/2005 11:16:55 AM||Starting BOINC client version 4.45 for windows_intelx86
6/28/2005 11:16:55 AM||Executing as a daemon
6/28/2005 11:16:55 AM||Data directory: C:\Program Files\BOINC
6/28/2005 11:16:55 AM|climateprediction.net|Computer ID: 105470; location: home; project prefs: home
6/28/2005 11:16:55 AM|Einstein@Home|Computer ID: 21342; location: home; project prefs: default
6/28/2005 11:16:55 AM|SETI@home|Computer ID: 56801; location: home; project prefs: home
6/28/2005 11:16:55 AM||General prefs: from Einstein@Home (last modified 2005-05-19 16:49:31)
6/28/2005 11:16:55 AM||General prefs: using separate prefs for home
6/28/2005 11:16:55 AM||Remote control allowed
6/28/2005 11:16:55 AM|climateprediction.net|Resuming computation for result 3ive_200186079_0 using hadsm3 version 4.12
6/28/2005 11:16:55 AM|climateprediction.net|Resuming computation for result 3vit_100202636_0 using hadsm3 version 4.12
6/28/2005 11:16:55 AM|SETI@home|Deferring computation for result 29ap04aa.22117.2656.709662.124_2
6/28/2005 11:16:55 AM|Einstein@Home|Deferring communication with project for 7 hours, 44 minutes, and 1 seconds
6/28/2005 11:18:59 AM||request_reschedule_cpus: project op
6/28/2005 11:18:59 AM|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
6/28/2005 11:18:59 AM|Einstein@Home|Requesting 34560 seconds of work, returning 1 results
6/28/2005 11:19:08 AM|Einstein@Home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
6/28/2005 11:19:08 AM|Einstein@Home|Got server request to delete file H1_0592.5
6/28/2005 11:19:10 AM|Einstein@Home|Started download of Config_L_S4lA
6/28/2005 11:19:10 AM|Einstein@Home|Started download of l1_0277.5
6/28/2005 11:19:10 AM|Einstein@Home|Temporarily failed download of Config_L_S4lA: 404
6/28/2005 11:19:11 AM|Einstein@Home|Started download of Config_L_S4lA
6/28/2005 11:19:12 AM|Einstein@Home|Finished download of Config_L_S4lA
6/28/2005 11:19:12 AM|Einstein@Home|Throughput 3059 bytes/sec
6/28/2005 11:19:32 AM|Einstein@Home|Finished download of l1_0277.5
6/28/2005 11:19:32 AM|Einstein@Home|Throughput 287190 bytes/sec
6/28/2005 11:19:32 AM||request_reschedule_cpus: files downloaded

So all is well in the world again. :-)

Claude

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117773855325
RAC: 34789608

RE: I took the easy way

Message 13543 in response to message 13542

Quote:


I took the easy way out. :-)

.......

So all is well in the world again. :-)

Claude

That's all right if you have 4.45, but as I mentioned, I was running 4.19; my notes were for the benefit of those running that version.

Cheers,
Gary.

rbpeake
Joined: 18 Jan 05
Posts: 266
Credit: 1135437797
RAC: 752121

RE: Real science is VERY

Message 13544 in response to message 13541

Quote:
Real science is VERY error-prone. In fact, one of the distinguishing characteristics of real research is that (especially the first and second time) one gets it wrong more often than not. The only saving grace in all of this is that with other scientists you get 99.9% forgiveness for being brutally honest about what happened and why. That's the one thing that I can promise Einstein@Home participants they will get 100% of the time.

I certainly appreciate that sentiment, and thank you!

Of the distributed computing projects I have participated in over the years, you have gotten it right the first time more often than the majority of them!

From my perspective, it is very nice indeed to be associated with such professionals and with such a professionally run project.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117773855325
RAC: 34789608

RE: [Edit added 30 min

Message 13545 in response to message 13539

Quote:


[Edit added 30 min later]
I found a script that I have used before, which I can use to grant credit to users/hosts/teams for workunits that I have cancelled. I am going to use it to grant credit to people who have had the misfortune of getting and doing work and then having it cancelled.

Bruce

I'm very pleased that you have done that; it will be good for the silent majority, who probably aren't even aware of the problem yet.

However, it's not my day today :). I took your advice and cancelled running work that was in many cases 80-90% complete!!! And I'm still not mad at you in the slightest :). I'd rather lose the credits than hold up the science by doing work that will only have to be repeated anyway, so cancelling the partly completed work was still the right thing to do.

It must have been one of those nightmare days (and nights) for you :).

Cheers,
Gary.
