I have CANCELLED all h1_ workunits. That means that any CPU time spent on them is entirely wasted. No credits, no glory, no good.
Shoot those workunits before they tire out your CPUs.
In anticipation of that answer I've just finished deleting h1_nnnn work on about a dozen boxes that I can actually get physical access to. Bit of a struggle for V4.19 as it doesn't have the nice abort button that the later CCs have. Here's basically what I had to do.
1. Stop BOINC.
2. Delete the large h1_nnnn file in the einstein subdir of the projects dir.
3. Restart BOINC. It would complain about missing files and would try to reget them.
4. The current WU would error out and the reget would mostly fail, but occasionally it seemed to succeed.
5. Stop BOINC and repeat the procedure. The next h1_nnnn would then seem to error out.
6. I think on all second passes, BOINC would then get an l1_nnnn data file and I knew I was winning.
7. I'd throw in the odd "update", which occasionally seemed to help. I also had to stop and start BOINC to get processing started.
The interesting thing was that on at least three occasions BOINC claimed to be able to reget at least part of the h1_nnnn large file. I thought they were all supposedly deleted? Maybe BOINC was kidding itself :).
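For anyone facing the same cleanup on several boxes, the file-deletion step above can be sketched as a small script. This is an illustrative simulation in a scratch directory, not an official BOINC tool: the temp directory is a stand-in for the einstein subdir of the projects dir, the filenames are taken from the logs in this thread, and stopping/restarting the client is left as comments because that varies by installation.

```python
# Illustrative simulation of the manual h1_ cleanup (assumption: a scratch
# directory stands in for the real Einstein@Home project directory).
import glob
import os
import tempfile

project_dir = tempfile.mkdtemp()  # stand-in for projects/einstein.phys.uwm.edu

# Create fake copies of the data files named in this thread:
for name in ("h1_0592.5", "l1_0277.5", "Config_L_S4lA"):
    open(os.path.join(project_dir, name), "w").close()

# Step 1: stop BOINC (done outside this sketch).
# Step 2: delete the large h1_nnnn data files:
for path in glob.glob(os.path.join(project_dir, "h1_*")):
    os.remove(path)

# Steps 3-7: restart BOINC; the h1 workunits error out and the client
# eventually fetches fresh l1_nnnn data instead.
print(sorted(os.listdir(project_dir)))  # → ['Config_L_S4lA', 'l1_0277.5']
```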
Ahhh OK... On one occasion, the re-download got to about 50K and then stalled. Maybe I was snagging the file just as the server was deleting it :). When I got the full download (about 8 megs) I'd just stop and delete again and that seemed to cure it. Bit of an eerie feeling when it's telling you it is getting a file that's not supposed to be there. Hopefully all servers are synced up now.
The interesting question is what is going to be the reaction of the silent majority out there who don't regularly follow the lists and are going to be mightily confused by these strange happenings. Is it possible to send a small email to all registered users to warn them to check if they have h1_nnnn-style data files and, if so, check the web page for details? I can just imagine the complaints if someone has a couple of days of h1 work and they don't immediately notice that there is no credit. I watched one of mine do that and that spurred me into action :).
Cheers,
Gary.
What about the "l1_xxx" WU's? I understand the point that lowercase h WU's are troublesome right now and should be aborted. What about lowercase l WU's?
RE: What about the "l1_xxx"
Lowercase l workunits l1_XXXX.X__... are FINE! This is because we don't have any data sets labeled 'L1_XXXX.[05]' for them to get confused with.
Bruce
Director, Einstein@Home
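The clash Bruce describes is a name collision between the lowercase h1_ workunit files and the uppercase H1_ data sets; the l1_ files are safe only because no L1_ data set exists. As a small illustration (a hypothetical helper, not project code), here is how one might scan a file list for names that differ only in letter case:

```python
from collections import defaultdict

def case_collisions(names):
    """Group file names that differ only in letter case (hypothetical
    helper illustrating the h1_/H1_ clash described above)."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return {key: hits for key, hits in groups.items() if len(hits) > 1}

# The h1_/H1_ pair collides; l1_ is safe because there is no L1_ data set.
files = ["h1_0592.5", "H1_0592.5", "l1_0277.5", "Config_L_S4lA"]
print(case_collisions(files))  # → {'h1_0592.5': ['h1_0592.5', 'H1_0592.5']}
```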
RE: The interesting
I have thought about doing this. There are about 6000 host machines that got these workunits, and about 5000 users. But it would take me some hours to cobble together and test scripts for mailing the users, and I would rather spend the time making sure (testing!) that the new w1_XXXX workunits are OK.
[Edit added 30 min later]
I found a script that I have used before, which I can use to grant credit to users/hosts/teams for workunits which I have cancelled. I am going to use this to grant credit to people who have had the misfortune of getting and doing work then having it cancelled.
Bruce
Director, Einstein@Home
RE: I found a script that
That is a nice touch, Bruce.
Fortunately it does not affect me, but I'm pleased to see the swift way the problem has been dealt with.
~~gravywavy
RE: RE: I found a script
Thank you very much.
Real science is VERY error prone. In fact one of the distinguishing characteristics of real research is that (especially the first and second time) one gets it wrong more often than not. The only saving grace in all of this is that with other scientists you get 99.9% forgiveness for being brutally honest about what happened and why. That's the one thing that I can promise Einstein@Home participants that they will get 100% of the time.
Director, Einstein@Home
RE: RE: Yes! I have
I took the easy way out. :-)
Stopped Boinc/service, waited 1/2 min., started Boinc/service.
Then did an update of the Einstein project via BoincView.
6/28/2005 11:16:55 AM||Starting BOINC client version 4.45 for windows_intelx86
6/28/2005 11:16:55 AM||Executing as a daemon
6/28/2005 11:16:55 AM||Data directory: C:\Program Files\BOINC
6/28/2005 11:16:55 AM|climateprediction.net|Computer ID: 105470; location: home; project prefs: home
6/28/2005 11:16:55 AM|Einstein@Home|Computer ID: 21342; location: home; project prefs: default
6/28/2005 11:16:55 AM|SETI@home|Computer ID: 56801; location: home; project prefs: home
6/28/2005 11:16:55 AM||General prefs: from Einstein@Home (last modified 2005-05-19 16:49:31)
6/28/2005 11:16:55 AM||General prefs: using separate prefs for home
6/28/2005 11:16:55 AM||Remote control allowed
6/28/2005 11:16:55 AM|climateprediction.net|Resuming computation for result 3ive_200186079_0 using hadsm3 version 4.12
6/28/2005 11:16:55 AM|climateprediction.net|Resuming computation for result 3vit_100202636_0 using hadsm3 version 4.12
6/28/2005 11:16:55 AM|SETI@home|Deferring computation for result 29ap04aa.22117.2656.709662.124_2
6/28/2005 11:16:55 AM|Einstein@Home|Deferring communication with project for 7 hours, 44 minutes, and 1 seconds
6/28/2005 11:18:59 AM||request_reschedule_cpus: project op
6/28/2005 11:18:59 AM|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
6/28/2005 11:18:59 AM|Einstein@Home|Requesting 34560 seconds of work, returning 1 results
6/28/2005 11:19:08 AM|Einstein@Home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
6/28/2005 11:19:08 AM|Einstein@Home|Got server request to delete file H1_0592.5
6/28/2005 11:19:10 AM|Einstein@Home|Started download of Config_L_S4lA
6/28/2005 11:19:10 AM|Einstein@Home|Started download of l1_0277.5
6/28/2005 11:19:10 AM|Einstein@Home|Temporarily failed download of Config_L_S4lA: 404
6/28/2005 11:19:11 AM|Einstein@Home|Started download of Config_L_S4lA
6/28/2005 11:19:12 AM|Einstein@Home|Finished download of Config_L_S4lA
6/28/2005 11:19:12 AM|Einstein@Home|Throughput 3059 bytes/sec
6/28/2005 11:19:32 AM|Einstein@Home|Finished download of l1_0277.5
6/28/2005 11:19:32 AM|Einstein@Home|Throughput 287190 bytes/sec
6/28/2005 11:19:32 AM||request_reschedule_cpus: files downloaded
So all is well in the world again. :-)
Claude
RE: I took the easy way
That's all right if you have 4.45. As I mentioned, I was running 4.19; my notes were for the benefit of those running that version.
Cheers,
Gary.
RE: Real science is VERY
I certainly appreciate that sentiment, and thank you!
Compared with most distributed computing projects I have participated in over the past several years, you have gotten it right the first time more often than the majority of them!
From my perspective, it is very nice indeed to be associated with such professionals and with such a professionally run project.
RE: [Edit added 30 min
I'm very pleased that you have done that and it will be good for the silent majority who probably aren't even aware of the problem yet.
However, it's not my day today :). I took your advice and cancelled running work that was in many cases 80-90% complete!!! And I'm still not mad at you in the slightest :). I'd rather lose the credits than hold up the science by doing work that will only have to be repeated anyway, so cancelling the partly completed work was still the right thing to do.
It must have been one of those nightmare days (and nights) for you :).
Cheers,
Gary.