This may be at least partly a
This may be at least partly a screw-up on my side.
The "new" S4 data files are named l1_XXXX.X and h1_XXXX.X, in contrast to the "old" files which are named L1_XXXX.X and H1_XXXX.X.
Unfortunately I had not realized that on Win32, file names are case-insensitive.
So there may be some issues in the next few days if workunits which are supposed to use the file H1_0400.0 (which has a particular size and checksum) try to instead use the file h1_0400.0 (which has a DIFFERENT size and checksum).
Meanwhile, I'll see what I can do on the server side to ameliorate this issue.
Bruce
Director, Einstein@Home
RE: The "download looping"
4.19 here
... and it's still happening, on a different PC now. While it loops, it uses most of the CPU.
RE: This may be at least
==============
Whew, thought I was looking at Boinc Seti for a few minutes. Had 9 errors (7 DL and 2 computing) on ID 11073.
What about deleting all the
What about deleting all the uppercase or lowercase WUs on server side and then later reissuing them with new naming convention?
This should "convert" the temporary download error into a permanent one (with "giving up") so the computers break out of their download loop.
RE: What about deleting all
Would this waste work that has already been done (even work that has been returned) on those WUs?
~~gravywavy
RE: Unfortunately I had
Yes, when writing a cross-platform system, it is safest to use only lower case (or only upper case!?) throughout. Maybe the BOINC developer community should add this requirement to the policy on filenames across all BOINC projects, which would reduce the chances of similar errors in future.
It is not fair to expect developers with a single-OS background to know all the cross-platform pitfalls, and policies can help with that.
All versions of DOS & Win have been case-insensitive, but then so too were many mainframe OSes. Sooner or later someone is going to put BOINC on a platform with some other case-insensitive filing system, so while Win makes the issue urgent here, this is one that would eventually have wanted sorting out anyway.
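A policy like this could be backed by a simple check before files are published. The following is only a rough sketch of the idea, not anything BOINC actually provides: the function name `find_case_collisions` is made up for illustration, and it assumes the project can list its data file names up front.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Group file names that differ only by letter case.

    Hypothetical helper: on a case-insensitive file system, every name
    in one group would refer to the same file on disk.
    """
    groups = defaultdict(list)
    for name in names:
        # casefold() gives a case-independent key for comparison
        groups[name.casefold()].append(name)
    # Keep only keys where two or more distinct spellings exist
    return {
        key: sorted(set(spellings))
        for key, spellings in groups.items()
        if len(set(spellings)) > 1
    }


if __name__ == "__main__":
    files = ["H1_0400.0", "h1_0400.0", "L1_0400.0"]
    for key, spellings in find_case_collisions(files).items():
        print("collision:", ", ".join(spellings))
```

With the old and new names from this thread, H1_0400.0 and h1_0400.0 end up in the same group, which is exactly the clash that Win32 clients are hitting.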
~~gravywavy
RE: RE: What about
The current situation does the same. Some of my team have already reported lost WUs after the restart, and it happened to me too.
Maybe it would help to remove the H1 and h1 ones for some time, later reissue only the h1 ones there and later (much later) reissue the H1 ones.
Those endless loops are very much a waste of CPU cycles too. The CPUs are heavily loaded, mostly with the download. My system had a permanently high load from BOINC (not from the project client), and BOINC does not run at low priority. Not much CPU power is left for any project client and (worst of all) for me.
If that happens on a production system where BOINC should stay in the background, the users and admins of those systems might become really mad.
After some discussions with
After some discussions with David Anderson, I've taken the simple way out. I've cancelled the workunits with names that start "h1_" (NOTE: this is case-sensitive; work starting "H1_" is NOT cancelled).
I've also removed the problematic h1_XXXX.X data files from the download servers. After these changes propagate to the data server mirrors (15 to 30 minutes) this should generate hard download errors for any client that attempts these WUs.
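To make the case-sensitive part concrete, here is a small illustration of that kind of selection. It is not the actual server-side procedure; the function name `select_for_cancel` and the example workunit names are invented for the sketch.

```python
def select_for_cancel(workunit_names, prefix="h1_"):
    """Return the names that start with the prefix, matching case exactly.

    Hypothetical helper: str.startswith is case-sensitive, so "H1_..."
    names are left alone.
    """
    return [name for name in workunit_names if name.startswith(prefix)]


if __name__ == "__main__":
    # Made-up workunit names, for illustration only
    names = ["h1_0400.0_a", "H1_0400.0_b", "l1_0100.0_c"]
    print(select_for_cancel(names))
```

Only the lower-case "h1_" entry is picked up; the "H1_" and "l1_" entries stay untouched.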
I'll rename the workunits and files using "w1" (w for Washington state, where the Hanford detector is located) and reissue them.
Apologies to everyone for this fiasco. It's my fault. Hopefully we can recover quickly.
Please feel free to manually abort any h1_ workunits. My apologies for wasted CPU cycles. Fortunately these workunits have only been out there for a half-day so this shouldn't be too severe.
Bruce
Director, Einstein@Home
RE: ...Please feel free to
Thank you for handling this issue so quickly :)
A project reset (I only have h1_... left) should do the trick, right?
Aloha, Uli
Any chance to reset the
Any chance to reset the "daily quota" things too for today?