It turns out there are a number of issues that lead to these cross-platform validation problems; some have been addressed recently, and some we're still digging for. Solving these problems will probably require both a new validator and a complete set of Apps. I am confident that we will have all these pieces together next week.
BM
RE: It turns out there are
Bernd,
When you get to the point of deploying the new validator and the new set of apps, are you intending to run a (perhaps short) beta test phase first, as you did with the 4.24 Windows app?
If you are, might I make a suggestion about the app_info.xml file that would accompany each test app? As you warn quite clearly on the beta test page, changing the app aborts any work in progress with a client error. However you can easily avoid this with a small modification to the app_info.xml file. If you are already fully aware of this and do not want to allow a change of app in the middle of a result, that is fine - no change is needed.
My thinking is that the beta test period could be kept shorter and the number of potential beta testers could be increased if people were allowed to "re-brand" the results in their caches so that they didn't have to abort or wait for their caches to drain or in any way disrupt their normal crunching patterns in order to participate in the test. I'm sure that people have done this in the past by editing their state files. I think it's much safer to do it through the app_info.xml mechanism.
Cheers,
Gary.
Bernd, Hopefully, whilst
Bernd,
Hopefully, whilst I've got your attention, you might like to review this thread concerning stalled results. I've noticed this behaviour a few times now and I've recorded the result ID of my latest stalled result there.
The result in question was being crunched with the 4.17 Windows app. A little while after I kicked it back to life, I decided to test out my app_info.xml mods in order to speed up the completion of the result as much as possible by using 4.24 instead of 4.17. Even though my result was past the deadline, a third result had not been issued at that point. I hoped that I might be able to beat the system and keep the third result "unsent" :).
Although there was a 25%+ speedup of the final stages of crunching, I still missed out on stopping the third result being issued by just 37 mins.
Cheers,
Gary.
RE: When you get to the
If new Apps are needed, I'll definitely publish them for a public Beta test first.
Currently it looks like upgrading some server-side components (validator and workunit generator) may solve the problem and be the best choice, but we're still looking into this.
Actually, I won't advise people to manually hack their client_state.xml files; they are too fragile.
However, in future the app_info.xml files in the Beta Test packages will include entries for previous App versions (maybe both official and beta). Installing a Beta Test package in the middle of a result will then not lead to a Client Error; the result will simply be finished with the old App version, and new work will be assigned to the new App.
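As a rough sketch of what such a package's app_info.xml could look like, following the general layout BOINC uses for anonymous-platform applications: the app name and the executable file names below are placeholders (assumptions for illustration, not the real Einstein@Home names). Each `<app_version>` points at its own executable, so a result started under 4.17 is finished by 4.17 and newly assigned work goes to 4.24.

```xml
<app_info>
    <!-- Placeholder app name; the real one must match the project's app. -->
    <app>
        <name>einstein_example</name>
    </app>
    <!-- Both executables are shipped; these file names are hypothetical. -->
    <file_info>
        <name>einstein_example_4.17_windows_intelx86.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>einstein_example_4.24_windows_intelx86.exe</name>
        <executable/>
    </file_info>
    <!-- Old version: work already in progress keeps using the 4.17 binary. -->
    <app_version>
        <app_name>einstein_example</app_name>
        <version_num>417</version_num>
        <file_ref>
            <file_name>einstein_example_4.17_windows_intelx86.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
    <!-- New version: freshly assigned work uses the 4.24 binary. -->
    <app_version>
        <app_name>einstein_example</app_name>
        <version_num>424</version_num>
        <file_ref>
            <file_name>einstein_example_4.24_windows_intelx86.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
```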
Furthermore, if you really want to switch the App version halfway through a result, see the sticky post on this subject. I cannot guarantee that it will work at all, since e.g. the syntax of the checkpoint file might change between versions.
BM
RE: Furthermore if you
Hi Bernd,
Thanks for the reply.
I'm fully aware of the sticky you link to, and I'm NOT suggesting any hacking of the state file. My comments were about making some additions to the app_info.xml file so that the state file remains pristine. There would also be no need to rename the new executable so that it pretends to be the old one (as was mentioned in the sticky).
Take the transition from 4.17 to 4.24 as an example. There were desirable bugfixes and apparently no change in output syntax. It would therefore be prudent for any 4.17 "branded" results in a person's cache to be crunched by 4.24 rather than by the old buggy app. This can be achieved very simply by building a bit more intelligence into app_info.xml. No dodgy editing of the state file is required at all.
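To make the idea concrete, here is a minimal sketch of the extra entry, assuming the usual BOINC anonymous-platform layout for app_info.xml. The app name and file name are made-up placeholders, not the real Einstein@Home ones, and the rest of the file (the `<app>` block and the `<file_info>` entry for the 4.24 executable) is assumed to be present already. The only unusual part is that the `<app_version>` carrying the old version number 417 refers to the 4.24 executable:

```xml
<!-- Fragment of app_info.xml; goes inside <app_info> next to the normal 424 entry. -->
<!-- einstein_example and the file name are hypothetical placeholders. -->
<app_version>
    <app_name>einstein_example</app_name>
    <version_num>417</version_num>
    <file_ref>
        <file_name>einstein_example_4.24_windows_intelx86.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>
```

With an entry like this, results in the cache that are tagged 4.17 should be picked up by the 4.24 binary without touching the state file. It only makes sense when the checkpoint and output formats are unchanged between the two versions.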
Cheers,
Gary.
RE: Taking the case of the
I understand.
I guess I have to think about this a little more.
BM
RE: Currently it looks
Wouldn't it be worthwhile to correct the uninitialized data problem in the Linux and Mac apps? As those were detected by compiler runtime checks, to me it sounds as if they were relevant.
CU
BRM
RE: RE: Currently it
On Linux and Mac we haven't seen a single result that has been affected by this bug, i.e. it didn't have an effect on the final outcome of the calculation. With the 4.24 Windows App we have found another problem in the same module (which might have been introduced by the fix to the earlier problem). We're working on this. So we'll definitely release a new generation of Apps anyway with some bugfixes.
However, for the cross-platform validation problem (only), it might be that we only need to deal with it on the server side.
BM
How about the 0xc0000142
How about the 0xc0000142 crash issues? I don't know if you got my email, as you haven't replied... I wish I knew more of what to help with, but that error is a vexing one...
Edit: BTW, SIGABRT still seems to come up for Linux. See this result.
RE: How about the
Is it still happening with the new app? I would have guessed that the majority of these bugs were secondary problems resulting in a failure to initialize the runtime debugger (which should now work).
CU
BRM
RE: RE: How about the
He emailed me the other day asking about it. It is with 4.24. 0xc0000142 is the Windows status code for a DLL that failed to initialize (STATUS_DLL_INIT_FAILED). From what I read through googling it, it could be a science app problem or it could be a graphics subsystem problem. On the graphics side, I found a few mentions of the issue happening with ATI video cards. So, based on what I recall of the initial Linux SIGABRT issue with some OpenGL library, it could be whatever OpenGL software the ATI Catalyst drivers use...
Ultimately, it's way out of my league. I mentioned he should contact Rom Walton...one of the main BOINC developers...
Brian