It's too late now but the message needs to go out about the possible loss of results during outages if you don't upgrade beyond BOINC 5.10.x.
EDIT: Small correction - the version 6.2.29 mentioned above should of course be 6.2.19 which was the final Windows version. The Linux version was 6.2.15 and since over 80% of my machines run Linux it was that version that I had been experimenting with lately. With Linux the upgrade from 5.10.45 to 6.2.15 is so easy - just stop BOINC, copy the new boinc executables over the top of the existing ones and then restart BOINC.
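For anyone who wants to script the Linux procedure described above (stop BOINC, copy the new executables over the old ones, restart), here is a rough sketch of the copy step only. It assumes BOINC has already been stopped by whatever means you normally use. The function name, the directory layout in the demo and the ".old" backup copies are illustrative assumptions, not part of the procedure as described.

```python
import shutil
import tempfile
from pathlib import Path


def copy_new_executables(new_dir, install_dir, backup_suffix=".old"):
    """Copy every regular file in new_dir over its namesake in install_dir.

    Assumes BOINC is already stopped. A file that is about to be overwritten
    is first copied to <name><backup_suffix> (an added safety step) so the
    old client can be put back. Returns the names of the files copied.
    """
    new_dir, install_dir = Path(new_dir), Path(install_dir)
    copied = []
    for src in sorted(new_dir.iterdir()):
        if not src.is_file():
            continue
        dst = install_dir / src.name
        if dst.exists():
            shutil.copy2(dst, dst.with_name(dst.name + backup_suffix))
        shutil.copy2(src, dst)  # copy2 also carries over the permission bits
        copied.append(src.name)
    return copied


if __name__ == "__main__":
    # Demo in a throwaway directory; real paths would be wherever the new
    # client was unpacked and wherever the existing BOINC directory lives.
    with tempfile.TemporaryDirectory() as tmp:
        new, inst = Path(tmp, "new"), Path(tmp, "BOINC")
        new.mkdir()
        inst.mkdir()
        (new / "boinc").write_text("new client")
        (inst / "boinc").write_text("old client")
        print(copy_new_executables(new, inst))
```

After the copy, BOINC is restarted in the usual way. The backup copies can be deleted once the new client has proved itself.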
Gary, thanks for your explanations on this matter.
I have been running 5.10.45, and lost lots of work on this outage. I recall that when I built a new machine last fall I installed the then latest Windows version, was troubled by something in its behavior, and reverted to the 5.10.45 that had treated me well and was already familiar from my use on three other hosts.
If you are feeling generous, could you summarize the benefits and deficits in comparing Windows (XP Pro in my case) operation under 6.2.19 vs. 5.10.45? And does one need to be specially careful in doing the install? (incompatible directory changes and such?) While I like things that work, I'm always uncomfortable running outdated releases, one is so likely to get caught by an old fixed bug.
That does not seem fair to us volunteers if the programme that has been given to us, just 'gives up' on trying to communicate with a project.
The work we have done is still valid, or would be if BOINC Manager didn't dump a file and then make your valid work unit invalid.
I tried to update to 6.x but had to go back to 5.10.21 (Linux) due to the dismal benchmark scores I then received (well over 40% drop in Whetstone and Dhrystone).
As I still run some projects that rely on benchmarking, I need the best score I can get on my now older computers.
Therefore 6.2 and 6.4 have been disappointing for me.
I have seen excellent benchmarks on some recent Intel machines and some recent Windows based AMD machines but not on my AMD Linux machines.
In future, if it seems that a project will be off air for 3 days or more, I may as well abort all remaining work for that project, as I won't get any credit anyway, so why waste CPU time and power?
There's another consideration as well. . .
Some Linux distros, such as Debian Etch, don't have new enough libraries to support the newer BOINC clients. On my two Etch machines, I can't run anything newer than BOINC 5.8.
You might compile boinc with all libs statically linked.
The strange thing is, I have one machine running v5.10.13 which finished several Einstein WUs during the time the project was off the net, and here they all are: http://einsteinathome.org/host/831490/tasks. 10 reported, 8 credited, 2 waiting on wingmen - no sign of an error anywhere. And I took no special precautions - network comms were enabled throughout, and the logs are full of:
Einstein@Home 30/03/2009 18:39:27 [file_xfer] Started upload of file h1_0753.70_S5R4__547_S5R5a_0_0
Einstein@Home 30/03/2009 18:39:38 [file_xfer] Temporarily failed upload of h1_0753.70_S5R4__548_S5R5a_1_0: system connect
Einstein@Home 30/03/2009 18:39:38 Backing off 2 hr 20 min 7 sec on upload of file h1_0753.70_S5R4__548_S5R5a_1_0
--- 30/03/2009 18:39:38 Project communication failed: attempting access to reference site
--- 30/03/2009 18:39:40 Access to reference site succeeded - project servers may be temporarily down.
Whatever the problem is, it isn't as simple as 'v5.10 bad, v6.2 good'.
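To compare hosts, it may help to pull the "Temporarily failed upload" lines out of a saved message log and see which reason each one gives. The following is a minimal sketch that assumes the log lines look like the ones quoted above. The function names are made up for illustration, and it does not try to match the "giving up" message, because its exact wording is not quoted here.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... [file_xfer] Temporarily failed upload of <file>: <reason>
FAILED_UPLOAD = re.compile(
    r"Temporarily failed upload of (?P<file>\S+): (?P<reason>.+)$"
)


def failed_uploads(lines):
    """Return (file, reason) pairs for every temporarily failed upload."""
    found = []
    for line in lines:
        match = FAILED_UPLOAD.search(line)
        if match:
            found.append((match.group("file"), match.group("reason").strip()))
    return found


def reasons_summary(lines):
    """Count how often each failure reason appears."""
    return Counter(reason for _, reason in failed_uploads(lines))


# The log lines quoted in the post above, used as sample input.
SAMPLE_LOG = """\
Einstein@Home 30/03/2009 18:39:27 [file_xfer] Started upload of file h1_0753.70_S5R4__547_S5R5a_0_0
Einstein@Home 30/03/2009 18:39:38 [file_xfer] Temporarily failed upload of h1_0753.70_S5R4__548_S5R5a_1_0: system connect
Einstein@Home 30/03/2009 18:39:38 Backing off 2 hr 20 min 7 sec on upload of file h1_0753.70_S5R4__548_S5R5a_1_0
--- 30/03/2009 18:39:38 Project communication failed: attempting access to reference site
--- 30/03/2009 18:39:40 Access to reference site succeeded - project servers may be temporarily down.
""".splitlines()

if __name__ == "__main__":
    for name, reason in failed_uploads(SAMPLE_LOG):
        print(name, "->", reason)
```

Running the same function over logs from several hosts would show whether the failure reason differs between the hosts that lost work and the ones that did not.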
Quote:
The strange thing is, I have one machine running v5.10.13 which finished several Einstein WUs during the time the project was off the net,
...
Whatever the problem is, it isn't as simple as 'v5.10 bad, v6.2 good'.
I also had one happy 5.10.45 host (of three). The visible difference was that it was still trying to upload its old work at the end, while the other two had "succeeded" in uploading and showed status as waiting to report.
Maybe there was a window of vulnerability--if a host tried to upload at the wrong time it lost out, but otherwise all was well?
After I typed that, I rechecked, and noticed that my own happy host actually was not running 5.10.45, but rather was running 5.10.20. The portion of the message log I can still see from the 5.10.20 host has lots of messages like:
stoll3 Einstein@Home 3/29/2009 6:25:57 PM [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: file not found
Which may correspond to situations in which the 5.10.45 hosts gave the "giving up" message. It appears not to be just a cosmetic difference either, as the specific task generating the quoted message got credit.
Any chance 5.10.20 is "better" from this point of view?
Quote:
If you are feeling generous, could you summarize the benefits and deficits in comparing Windows (XP Pro in my case) operation under 6.2.19 vs. 5.10.45? And does one need to be specially careful in doing the install? (incompatible directory changes and such?) While I like things that work, I'm always uncomfortable running outdated releases, one is so likely to get caught by an old fixed bug.
You said it in a nutshell, 6.x will fix things that were found to be broken in 5.x...
Cosmetically the 6.x series now has multiple selects in windows so you can select several tasks or projects and then apply a button action to all of them at the same time. In 6.2.something CUDA capability was added and versions that are widely used include 6.4.5 (need configuration files), 6.5.0 and some 6.6.x versions (6.6.18 and 6.6.20) ...
Upgrading BOINC can be an intensely personal task, as some people find one bug or another a show-stopper where it does not bother someone else. I have been running 6.5.0 on all my Windows XP Pro machines for some time now and find it stable and reasonably well behaved, though it does have some issues with keeping the work queues stable on multi-core machines. (I reported that problem years ago, when I was running 4 cores and most people were running one, with a few duals in the mix. Now that almost everyone has 4 cores and some have 8, the problem is being noticed more.)
Anyway, I just installed 6.6.20 on my Mac Pro to see what issues there are, if any, and won't really know for at least a month ... at least that has been my experience, it takes at least that long to know for sure there are no show-stoppers.
Quote:
... could you summarize the benefits and deficits in comparing Windows (XP Pro in my case) operation under 6.2.19 vs. 5.10.45?
I'd probably be one of the worst to give dispassionate advice about this :-)
Until the recent outage I'd never completed a Win install of 6.x since as soon as I found that you had to have two separate directories, I gave up and left all Win machines on (mostly) 5.10.45. I've now (during the outage) completed about four upgrades to 6.2.19 without incident. I didn't go any further because of the time it was taking per host so I just used the "suspend network activity" option instead.
I dislike having to reboot immediately after the installation and (being old and grumpy) I dislike the fact that BOINC Manager seems to be started automatically along with BOINC during that Windows reboot, even though I'm installing as a service. I'm used to having many headless machines where I can touch the power button to shut them down cleanly and the same to start them up cleanly with no peripherals attached. I just want BOINC running as a service and nothing else and 5.10.45 did that perfectly.
I'd probably done more than 20 6.2.15 Linux conversions because they were so simple and no changes to my previous way of doing things were required - no reboots and no extra directories. I have scripts and shortcuts that expect things to be in certain (standardised) places and now I have to change all that for the upgraded Win hosts.
Quote:
And does one need to be specially careful in doing the install?
Not that I've noticed. Some scripts I had in the BOINC dir got moved to the data dir so I had to manually move them back. Apart from that, it all seemed to go OK.
On my Win hosts, the install dir used to be D:\Program Files\BOINC. My standard data dir is now D:\Program Files\BOINC_Data. I like to keep what I install quite separate from Windows so I use a D:\ partition for that purpose. Most of my machines are dual booting so I usually have 5 partitions on a disk - C:, D:, /, /home, and swap - the last three being the Linux partitions.
I expect that once I get used to the new arrangements and fix up my own stuff all will be fine again.
Quote:
Whatever the problem is, it isn't as simple as 'v5.10 bad, v6.2 good'.
The oldest BOINC I could find on my machines was 5.10.28 and it has the problem.
Peter has a 5.10.20 which didn't have the problem so it looks like the bug was introduced somewhere between 20 and 28. I'd be very interested to hear if there is anyone running a BOINC later than 5.10.20 that has avoided the problem.
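The "somewhere between 20 and 28" reasoning can be written down as a small sketch, using only the versions reported in this thread. Treat the table as an assumption: a host that escaped may simply have been lucky with its upload timing, so a "good" report does not prove that a version is free of the bug.

```python
def parse_version(text):
    """Turn '5.10.45' into (5, 10, 45) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Reports taken from this thread only (True = results survived the outage).
REPORTS = {
    "5.10.13": True,
    "5.10.20": True,
    "5.10.28": False,
    "5.10.45": False,
}


def bracket_regression(reports):
    """Return (newest good version, oldest bad version) as strings."""
    good = [v for v, ok in reports.items() if ok]
    bad = [v for v, ok in reports.items() if not ok]
    newest_good = max(good, key=parse_version)
    oldest_bad = min(bad, key=parse_version)
    if parse_version(newest_good) >= parse_version(oldest_bad):
        raise ValueError("reports conflict: no clean good/bad boundary")
    return newest_good, oldest_bad


if __name__ == "__main__":
    print(bracket_regression(REPORTS))
```

Any new report from a version between the two returned values narrows the bracket. A "good" report from a version later than the oldest bad one would raise the error, which is a hint that timing rather than version is what matters.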
There is a validate errors thread over on the problems board which started in early August last year where there were also server outages over a weekend if I remember correctly. I guess this was exactly the same deal and if one was unlucky enough to be trying to upload at the wrong time with the wrong BOINC version, the results got deleted. Pity that someone didn't spot the connection with the "Giving up" message at that time ... :-).
Quote:
... noticed that my own happy host actually was not running 5.10.45, but rather was running 5.10.20. The portion of the message log I can still see from the 5.10.20 host has lots of message like: stoll3 Einstein@Home 3/29/2009 6:25:57 PM [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: file not found
That's very interesting indeed! To have been granted credit, there must be a later message where the upload was attempted yet again and this time, magically, the file was found and the upload was successful!
I wonder what you need to do to make the missing file reappear like that?? Perhaps there might be a clue in the messages when that particular upload eventually succeeded.
Quote:
Which may correspond to situations in which the 5.10.45 hosts gave the "giving up" message. It appears not to be just a cosmetic difference either, as the specific task generating the quoted message got credit.
Can you actually find the "successful upload" confirmation? I guess there'll be no clue as to why the file could suddenly be found this time. I guess we'd need someone familiar with the code to explain why it was a "temporary failure" as opposed to the permanent failure that leads to the "giving up" message. How can a missing file be a "temporary failure"??
Quote:
Any chance 5.10.20 is "better" from this point of view?
Sure looks like it. I wonder if it's a viable option to consider downgrading from 5.10.45 to 5.10.20? Hmmmm....
[Only half serious ... Unless someone comes up with a better plan, my intention is to bite the bullet and upgrade about 160 machines to 6.2.x and then sort out my scripts and shortcuts, etc.]
Quote:
Can you actually find the "successful upload" confirmation? I guess there'll be no clue as to why the file could suddenly be found this time. I guess we'd need someone familiar with the code to explain why it was a "temporary failure" as opposed to the permanent failure that leads to the "giving up" message. How can a missing file be a "temporary failure"?
Perhaps I could have, but time has marched on and the point at which the available messages now begin is too late.
BOINCView did save two files related to that result, a .log.xml file and a .progress.csv file.
However they may not be of any help here. The .progress.csv has a single header line, then hundreds of lines of this style:
3576,906000;0,200819999999999999;18064,367745
The xml file seems slightly more informative, but not much. Although it is several hundreds of lines long, only a very few seem specific to this result. If anyone thinks there might be something interesting there, I can look with guidance, or send along a copy.
You might compile boinc with all libs statically linked.
ldd boinc
linux-gate.so.1 => (0xffffe000)
libz.so.1 => /lib/libz.so.1 (0xb7f77000)
libdl.so.2 => /lib/libdl.so.2 (0xb7f73000)
libc.so.6 => /lib/libc.so.6 (0xb7e40000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb7d52000)
libm.so.6 => /lib/libm.so.6 (0xb7d2d000)
libpthread.so.0 => /lib/libpthread.so.0 (0xb7d16000)
/lib/ld-linux.so.2 (0xb7f9e000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb7d09000)
Could become a pretty big monster. ;-)
cu,
Michael
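To see at a glance which libraries a static build would have to pull in, the ldd output can be turned into a simple name-to-path table. This is only a sketch that assumes the output format shown above. The function name is made up for illustration, and the sample text is the listing from this post.

```python
def shared_libs(ldd_output):
    """Map each library named in `ldd` output to its resolved path (or None)."""
    libs = {}
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the trailing load address, e.g. "(0xb7f77000)".
        if line.endswith(")") and "(" in line:
            line = line[: line.rindex("(")].strip()
        if "=>" in line:
            name, _, path = line.partition("=>")
            libs[name.strip()] = path.strip() or None
        else:
            libs[line] = None  # e.g. the dynamic loader, listed by path only
    return libs


SAMPLE_LDD = """\
linux-gate.so.1 => (0xffffe000)
libz.so.1 => /lib/libz.so.1 (0xb7f77000)
libdl.so.2 => /lib/libdl.so.2 (0xb7f73000)
libc.so.6 => /lib/libc.so.6 (0xb7e40000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb7d52000)
libm.so.6 => /lib/libm.so.6 (0xb7d2d000)
libpthread.so.0 => /lib/libpthread.so.0 (0xb7d16000)
/lib/ld-linux.so.2 (0xb7f9e000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb7d09000)
"""

if __name__ == "__main__":
    for name, path in shared_libs(SAMPLE_LDD).items():
        print(name, "->", path)
```

Entries with no path (linux-gate.so.1 and the loader itself) are not ordinary library files, so they would not be candidates for static linking.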
Interesting idea. . .
But, it may be easier if I just upgrade my machines to Lenny. (Now, if only I had the time. . .)
But I'm afraid the fox has left the valley.