Client Errors when throttling

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3783
Credit: 168820609
RAC: 59706
Topic 194243

Apparently some people get Client Errors when they turn on CPU throttling ("Use at most ... of CPU time"). These Client Errors have exit status -226 (TOO MANY EXITS) and lots of "Can't acquire lockfile - exiting" messages in stderr_out.
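The combination of symptoms can be illustrated with a minimal sketch. It assumes POSIX `flock()` semantics, so it runs on Linux or macOS, not on the Windows systems the test application targets. An application instance that cannot get an exclusive lock on its slot lockfile gives up with a "Can't acquire lockfile - exiting" message, and if something keeps holding the lock (one possibility being a previous instance that never fully went away), every restart fails the same way until a restart limit is hit. The file name `boinc_lockfile`, the function names and the limit are hypothetical. This is not the BOINC API's actual code, and it does not show why throttling would leave a lock held.

```python
import fcntl
import os
import tempfile

MAX_EXITS = 5  # hypothetical restart limit, not BOINC's real value


def try_acquire_lock(path):
    """Try to take an exclusive, non-blocking flock() on `path`.

    Returns the open file object on success (it must stay open to keep
    the lock) or None if another open file already holds the lock.
    """
    f = open(path, "w")
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f


def start_instance(lock_path):
    """Hypothetical app start-up: report and give up if the lock is taken."""
    handle = try_acquire_lock(lock_path)
    if handle is None:
        print("Can't acquire lockfile - exiting")
    return handle


slot_dir = tempfile.mkdtemp()
lock_path = os.path.join(slot_dir, "boinc_lockfile")  # assumed file name

# A previous instance that never fully went away still holds the lock.
stale = start_instance(lock_path)

# Every restart now fails immediately, until the restart limit is reached.
exits = 0
while exits < MAX_EXITS:
    if start_instance(lock_path) is not None:
        break
    exits += 1
print("exits:", exits)

# Once the stale holder closes its file, the lock can be taken again.
stale.close()
fresh = start_instance(lock_path)
```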

To track down this problem further, we made a Windows application that writes additional diagnostic output to stderr_out (actually a lot of it when throttling is turned on).

If you encounter this problem, please run this application first without changing your settings (anonymous platform: download the archive, open it, drop all the files into the project folder and restart BOINC), so that we can figure out what's happening. For fast feedback, please report tasks finished with this application (in particular any that end in a Client Error) to the server manually.

Thanks in advance,

BM

Rudy
Joined: 12 Dec 05
Posts: 33
Credit: 1539312
RAC: 1

Client Errors when throttling

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3783
Credit: 168820609
RAC: 59706

Thank you very much! I'm discussing this with the BOINC developers.

BM

Rudy
Joined: 12 Dec 05
Posts: 33
Credit: 1539312
RAC: 1

ok resultid=122521832 R

nils
Joined: 6 Sep 09
Posts: 2
Credit: 68489
RAC: 0

Good day!

I am working for Einstein@Home as a developer and currently investigating the problem described above.

To get some more debugging information, we kindly ask you to upgrade your BOINC client to the latest development version, 6.10.29, which can be found here for numerous platforms. This version includes a new feature that lets us see the first 64 kB of the application's stderr log instead of its last 64 kB. To activate that feature, please create a file named cc_config.xml in your BOINC data folder (on Windows this is usually C:\Documents and Settings\All Users\Application Data\BOINC) with the following contents:

<cc_config>
  <options>
    <stderr_head>1</stderr_head>
  </options>
</cc_config>

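For anyone who prefers to script this step, here is a minimal Python sketch that writes such a file with the standard library. It assumes the option is a `stderr_head` element inside `<options>` of the usual `<cc_config>` layout; that option name and the helper `write_cc_config` are assumptions for illustration, so please check the BOINC client configuration documentation for your version. Note that it overwrites any existing cc_config.xml.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_cc_config(data_dir, stderr_head=True):
    """Write a minimal cc_config.xml into `data_dir` and return its path.

    Overwrites any existing cc_config.xml. The option name `stderr_head`
    is an assumption about how the 6.10.29 feature is spelled.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "stderr_head").text = "1" if stderr_head else "0"
    path = os.path.join(data_dir, "cc_config.xml")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=False)
    return path


# Demo against a temporary folder; on Windows the real target would usually be
# C:\Documents and Settings\All Users\Application Data\BOINC
demo_dir = tempfile.mkdtemp()
config_path = write_cc_config(demo_dir)
with open(config_path, encoding="utf-8") as f:
    print(f.read())
```

Writing the file with an XML library rather than by hand avoids unbalanced tags, which would otherwise make the whole configuration file unreadable.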

If you encounter the problem (the symptom is numerous application restarts in your message log), we also kindly ask you to shut down the BOINC client completely and check in a task manager whether any Einstein@Home applications are still running. If so, please report that, and whether the process in question is consuming any CPU time.

Thanks in advance.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3783
Credit: 168820609
RAC: 59706

Message 90928 in response to message 90927

Quote:
If you encounter the problem (symptoms are numerous application restarts in your message log), we also kindly ask you to shut down the BOINC client completely and look into a task manager, whether there are still Einstein@Home applications running. If so, please report that and whether the process in question is consuming any CPU time.


Apparently someone did this; see the thread "Too many exits (was: No finished file)". He didn't stop BOINC itself, but he did at least stop computing activity.

BM

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 217512382
RAC: 206723

I have gotten a rash of these errors since the site came back up after the downtime on 1/26/11. I don't know whether the timing is coincidental.

Most stderr output looks like:
too many exit(0)s

With additional errors like:

[17:59:12][26008][ERROR] Error allocating power spectrum device memory: 25166336 bytes (error: 2)
[17:59:12][26008][ERROR] Demodulation failed (error: 1006)!
[17:59:12][26008][WARN ] CUDA memory allocation problem encountered!
------> Returning control to BOINC, delaying restart for at least five minutes...
------> If this problem persists you should consider aborting this task.
[17:59:13][26010][INFO ] Starting data processing...
[17:59:14][26010][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 318 MB (194 MB free / 512 MB total) -> Used by this application: 0 MB

or

[16:50:59][23992][ERROR] Error creating CUDA FFT plan (error code: 2)
[16:50:59][23992][ERROR] Demodulation failed (error: 1011)!
[16:50:59][23992][WARN ] CUDA memory allocation problem encountered!

Here are a few of them:
http://einsteinathome.org/task/217030348
http://einsteinathome.org/task/217029719
http://einsteinathome.org/task/217029182
http://einsteinathome.org/task/217029179
http://einsteinathome.org/task/217028962
http://einsteinathome.org/task/217028916

I have a total of 40, most of them on Binary Radio Pulsar Search v1.06 (BRP3cuda32fullCPU).

I also had a few that look like:
WU download error: couldn't get input files:

h1_1329.00_S5R4
-119
MD5 check failed

But I think those were transfers that were interrupted when the server went down; they were clustered around 26 Jan 2011 14:41:19 UTC.
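That explanation fits what an MD5 check does: the client compares a digest of the file it received with the checksum it expected, so a file cut off midway fails the check. A minimal Python sketch with `hashlib`, using made-up file names and random bytes in place of a real input file such as h1_1329.00_S5R4:

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_check(path, expected_hex):
    """True if the file's MD5 matches the expected checksum."""
    return file_md5(path) == expected_hex.lower()


# Hypothetical example: a complete download versus one cut off midway.
payload = os.urandom(200_000)
expected = hashlib.md5(payload).hexdigest()  # stands in for the advertised checksum

tmp = tempfile.mkdtemp()
complete = os.path.join(tmp, "complete.dat")
truncated = os.path.join(tmp, "truncated.dat")
with open(complete, "wb") as f:
    f.write(payload)
with open(truncated, "wb") as f:
    f.write(payload[:120_000])

print(md5_check(complete, expected))   # True
print(md5_check(truncated, expected))  # False
```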

Joe

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

Message 90930 in response to message 90929


I made the links active, but where are the (TOO MANY EXITS) and "Can't acquire lockfile - exiting" messages for those tasks?

Gundolf

Computers aren't everything in life. (Just kidding)

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 217512382
RAC: 206723

Message 90931 in response to message 90930

Quote:

I made the links active, but where are the (TOO MANY EXITS) and "Can't acquire lockfile - exiting" messages for those tasks?

Gundolf


Thank you for dealing with my laziness.

I'm not sure this is exactly the same problem, but the tasks that I've examined report:

Exit status -226 (0xffffffffffffff1e)

and the stderr output starts with:

6.10.58

too many exit(0)s
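The two numbers in that exit status line are the same value: 0xffffffffffffff1e is the 64-bit two's-complement bit pattern of -226. A minimal Python sketch of the conversion in both directions (the helper names are made up for illustration):

```python
MASK64 = (1 << 64) - 1


def to_unsigned64(value):
    """Bit pattern of a signed integer, read as an unsigned 64-bit number."""
    return value & MASK64


def to_signed64(value):
    """Interpret an unsigned 64-bit number as a two's-complement signed one."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


print(hex(to_unsigned64(-226)))         # 0xffffffffffffff1e
print(to_signed64(0xffffffffffffff1e))  # -226
```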

I am a novice at looking at this kind of output, and mine do seem to be CUDA memory allocation errors. I guess I should have read more closely to check whether the others were problems with obtaining locks.

I apologize if I'm asking about this problem in the wrong place, but I would appreciate any suggestions on how to proceed. Should I leave it alone, disable the GPU again, or something else?

Thanks for looking into it.

Joe

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 217512382
RAC: 206723

Oh yes, and I have recently throttled back in an attempt to deal with temperature issues. I'm surprised how effective it is: on one system, running at 50% CPU is 20 °C cooler than running at 80%.

Joe

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 217512382
RAC: 206723

My problem seems to have been resolved by rebooting. This is an Ubuntu 10.04 laptop. I now have a few CUDA BRP tasks completed and validated.
