What's the Cure?

jacklass1
Joined: 18 Jan 05
Posts: 77
Credit: 7421006
RAC: 0
Topic 194834

3/22/2010 10:14:45 AM Einstein@Home Output file p2030_53611_01579_0009_G36.03+00.54.N_2.dm_20_1_0 for task p2030_53611_01579_0009_G36.03+00.54.N_2.dm_20_1 absent
3/22/2010 10:14:45 AM Einstein@Home Output file p2030_53611_01579_0009_G36.03+00.54.N_2.dm_20_1_1 for task p2030_53611_01579_0009_G36.03+00.54.N_2.dm_20_1 absent
3/22/2010 10:14:45 AM Einstein@Home Output file p2030_53611_01579_0009_G36.03+00.54.N_2.dm_20_1_2 for task p2030_53611_01579_0009_G36.03+00.54.N_2.dm_20_1 absent
3/22/2010 10:14:45 AM Einstein@Home Output file p2030_53611_01579_0009_G36.03+00.54.N_2.dm_20_1_3 for task p2030_53611_01579_0009_G36.03+00.54.N_2.dm_20_1 absent

I have been getting this result consistently for the ABP2 work units. I have reset the project and downloaded the BOINC software again, but nothing has helped. I have no idea why I'm having this problem with only these WUs and not the others. Is there some sort of fix I'm unaware of? I am not a BOINC expert, so if anyone has a solution or suggestion, please phrase it so I can understand it.

THE MOTHER OF FOOLS IS ALWAYS PREGNANT

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2753059092
RAC: 1365538

What's the Cure?

Hmmm. 167296238

How do you get 'too many exit(0)s' in 0 seconds?

[Sorry, jacklass1, that's a question for other potential helpers - indicating that you've set us an "interesting", i.e. tough, question. Hopefully the answer will be easier to understand, but it may take us a while to find it.]

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6534
Credit: 284737525
RAC: 103325

RE: Hmmm. 167296238 How do

Message 97464 in response to message 97463

Quote:

Hmmm. 167296238

How do you get 'too many exit(0)s' in 0 seconds?

[Sorry, jacklass1, that's a question for other potential helpers - indicating that you've set us an "interesting", i.e. tough, question. Hopefully the answer will be easier to understand, but it may take us a while to find it.]


OK, I'm game ....

- exit() is a language call for program termination with a status code.

- exit(0) is a termination returning a code of zero.

- traditionally zero means 'no problem' or 'success', and that will be read ( probably ) by whatever called the program in the first place.

- it looks like the BOINC client ( version 6.10.18 in this case ) was the program invoking the one that exited ( evidently an E@H application - STSP )

- so this is reported as happening too many times in no time at all !?!?

- there must be a counter reflecting that ( number of times that is excessive )

- someone has used/nominated an integer type for that count

- but has mixed up a signed and an unsigned comparison. E.g. the bit pattern that is 255 as an unsigned integer byte is -1 as a signed integer byte.

- and/or hasn't initialised the counter prior to use, hence it didn't start at zero but rather any old value ( depending on memory contents prior to load ).

- tested that value ( prior to application program invocation actually ) in a conditional construct ( test before body/block is executed ), so that it errors out quick slick.

Thus I hypothesise a programming boner in the BOINC client of that version, possibly also an issue with compiler switches for a given target system. In C/C++ for instance ( my guess at the BOINC source code language ) the type 'char' without other qualification can be deemed as signed or unsigned, depending on the compiler and target ( plain 'int' is always signed ).

[ Always initialise your variables. If you want a certain data type then say so, don't assume. ]
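To illustrate only the signed/unsigned point (this is not BOINC code, and Python is used here purely for convenience), here is a minimal sketch. The helper `as_signed_byte` and the limit `MAX_EXITS` are hypothetical names made up for the example. It shows how the same eight bits give opposite answers in a "too many?" test depending on how they are read.

```python
import struct

def as_signed_byte(value: int) -> int:
    """Reinterpret the low 8 bits of value as a two's-complement signed byte."""
    return struct.unpack("b", struct.pack("B", value & 0xFF))[0]

MAX_EXITS = 100  # hypothetical limit, for illustration only

unsigned_count = 255                           # bit pattern 0xFF read as unsigned
signed_count = as_signed_byte(unsigned_count)  # the same bits read as signed

# The same bits pass or fail the limit test depending on the interpretation.
unsigned_over_limit = unsigned_count > MAX_EXITS
signed_over_limit = signed_count > MAX_EXITS

print(signed_count, unsigned_over_limit, signed_over_limit)
```

Read as unsigned, the counter is already over the limit; read as signed, it is nowhere near it.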

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

RE: - so this is reported

Message 97465 in response to message 97464

Quote:
- so this is reported as happening too many times in no time at all !?!?


You have to see it in the context of error -226

Quote:

ERR_TOO_MANY_EXITS -226

An application has exited prematurely (unexpectedly) more than 99 times without generating a checkpoint, so giving up on that task.

If the app can't even reach its first checkpoint, and that happens 99 times before the client gives up on the task, the CPU time is effectively zero.

This error usually happens when something external is locking the BOINC Data directory and sub-directories, like an anti-virus program or anti-spyware program. Advice is to exclude the BOINC Data directory completely, or only do active scans on the system when BOINC isn't running.
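As a rough illustration of that description (a sketch under stated assumptions, not the BOINC client's actual code), the bookkeeping might look like the following. The `Task` class and `on_premature_exit` function are hypothetical. The sketch also assumes that CPU time is only kept once a checkpoint has been written, and that a checkpoint resets the counter.

```python
ERR_TOO_MANY_EXITS = -226   # error code as quoted above
MAX_PREMATURE_EXITS = 99    # threshold as quoted above

class Task:
    """Hypothetical per-task bookkeeping, for illustration only."""
    def __init__(self) -> None:
        self.premature_exits = 0
        self.cpu_seconds = 0.0
        self.error = None

def on_premature_exit(task: Task, reached_checkpoint: bool, cpu_seconds: float) -> None:
    if reached_checkpoint:
        # Assumption: checkpointed work is kept and the counter starts over.
        task.cpu_seconds += cpu_seconds
        task.premature_exits = 0
        return
    # No checkpoint: the work from this run is lost and the exit is counted.
    task.premature_exits += 1
    if task.premature_exits > MAX_PREMATURE_EXITS:
        task.error = ERR_TOO_MANY_EXITS

# An app that dies every time before its first checkpoint.
task = Task()
restarts = 0
while task.error is None:
    restarts += 1
    on_premature_exit(task, reached_checkpoint=False, cpu_seconds=0.0)

print(restarts, task.error, task.cpu_seconds)
```

Under these assumptions the task is abandoned on the 100th start with no CPU time recorded, which is how "too many exits" and "0 seconds" can appear together.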

John Clark
Joined: 4 May 07
Posts: 1087
Credit: 3143193
RAC: 0

The inevitable Windows OS answer is to shut down the PC and then reboot, probably after dealing with the point Jord made.

Shih-Tzu are clever, cuddly, playful and rule!! Jack Russell are feisty!

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6534
Credit: 284737525
RAC: 103325

RE: RE: - so this is

Message 97467 in response to message 97465

Quote:
Quote:
- so this is reported as happening too many times in no time at all !?!?

You have to see it in the context of error -226

Quote:

ERR_TOO_MANY_EXITS -226

An application has exited prematurely (unexpectedly) more than 99 times without generating a checkpoint, so giving up on that task.

If the app can't even reach its first checkpoint, and that happens 99 times before the client gives up on the task, the CPU time is effectively zero.

This error usually happens when something external is locking the BOINC Data directory and sub-directories, like an anti-virus program or anti-spyware program. Advice is to exclude the BOINC Data directory completely, or only do active scans on the system when BOINC isn't running.


Well done! I had a nice theory for the ten minutes it lasted! :-) :-)

Cheers, Mike.


Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

RE: Well done! I had a nice

Message 97468 in response to message 97467

Quote:
Well done! I had a nice theory for the ten minutes it lasted! :-) :-)


Yes, but the only problem I see with my answer is that it only happens with his ABP2, not with GCE/S5R6.

I didn't follow everything here, but does ABP2 come in ATI flavor as well?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109411554484
RAC: 34892858

RE: Yes, but the only

Message 97469 in response to message 97468

Quote:
Yes, but the only problem I see with my answer is that it only happens with his ABP2, not with GCE/S5R6.


I don't think that's a problem at all. His security software is upset only with something in ABP2. I don't know how AV/Security stuff works but could it be that an ABP2 file is being permanently locked, rather than just drive-by scanning inserting a temp lock? The file can still be seen but never can be opened?

Quote:
I didn't follow everything here, but does ABP2 come in ATI flavor as well?


No.

@ Jack Lass - can you temporarily disable your security software to see if ABP2 tasks can then complete? If so, and they do complete, you will need to investigate how to get your security software to stop interfering with ABP2 stuff. It might be to reconfigure your security software to advise it that ABP2 files are OK. There should be logs somewhere in your security system that tell you what particular file it is unhappy with. So rather than disabling anything, first see if you can find that log information (try reading the docs that came with your software) and once you find the offending file listed in the logs, read up on how to tell your security software that the 'problem' file is actually OK.

EDIT: On re-reading your original post, the error is that the output file is missing. Perhaps your security software is deleting the output file as soon as it is created and so the science app is continually being restarted right from the beginning until there are too many of these restarts.

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2753059092
RAC: 1365538

RE: EDIT: On re-reading

Message 97470 in response to message 97469

Quote:
EDIT: On re-reading your original post, the error is that the output file is missing. Perhaps your security software is deleting the output file as soon as it is created and so the science app is continually being restarted right from the beginning until there are too many of these restarts.


Unlikely. If you look at host 2262468, where I got the example task from, the time interval between tasks isn't enough to iterate the full run 100 times with a file deletion between each run.....

It's a problem - like many others - with recent BOINC versions: they report the consequences of an error (the expected output files didn't exist), but they're too coy to actually say there was an error in the first place. Ticket #985 relates.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2753059092
RAC: 1365538

Just got a nudge from ZZUBYTTIHS in the next thread. Could this be our old friend the clunky thermal throttling back again?

To be serious for a moment, the OP's actual problem seems to be that the task started 100 times, but never got far enough into the task to (a) start the CPU time counter, or (b) post any application startup stderr_out messages. That could be AV locking - though I would be surprised, and worried, if that came back as an exit(0) - or it could be BOINC's own stop/start.

jacklass1, if you look at your Computing preferences page, what does it say for the very bottom item in the first section: Use at most (Can be used to reduce CPU heat)?

If it's anything less than 100 percent of CPU time, try turning it up to 100 and see if that makes any difference.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6534
Credit: 284737525
RAC: 103325

And yet exit(0) ought to mean a happy exit .... or is the general 'non-zero values are true' boolean rule being applied, so that zero reads as a 'false' message here? Are we sure of the exit(0) semantics in this case? Maybe it's just too many exits per se, regardless of the return code.
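On the general convention (not on what BOINC does with it), a small sketch: a parent process launches a child and reads back its exit status. The helper `run_child` is a hypothetical name for this example, and Python stands in for whatever language the real client uses. The point is that status 0 means success, the reverse of the usual 'zero is false' boolean reading.

```python
import subprocess
import sys

def run_child(code: int) -> int:
    """Start a child interpreter that exits with the given status; return that status."""
    result = subprocess.run([sys.executable, "-c", f"import sys; sys.exit({code})"])
    return result.returncode

ok = run_child(0)      # child reports success
failed = run_child(3)  # child reports a failure of its own choosing

# By convention the parent treats 0 as success and anything else as failure.
for status in (ok, failed):
    print(status, "success" if status == 0 else "failure")
```

So a parent that counts exits can see a run of zero statuses and still decide something is wrong, simply because the child keeps stopping before the work is done.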

Cheers, Mike.

