GNU/Linux S5R3 "power users" App 4.27 available

KSMarksPsych
Moderator
Joined: 15 Oct 05
Posts: 2702
Credit: 4090227
RAC: 0


Message 77469 in response to message 77468

Quote:
Kathryn, I do know this kind of behaviour, but I'm not convinced it has much to do with Rosetta. On my box it has happened with Einstein/PrimeGrid as well as with two Einstein tasks running in parallel. Maybe some projects are not affected, or some are more likely to develop this problem than others, but it is not, afaik, a Rosetta problem.
For info: What kind of box and OS are you talking about? Since this is the Linux app thread, you obviously have a Linux box ;-) but which distro and kernel version? And what kind of CPU is it? Your description makes me think of a dual core but there are quite a few different ones out there...
The box I experienced this problem with is running Kubuntu 7.10, 2.6.22.14 kernel and an ancient BOINC core client running from command line with normal user privileges (no daemon involved).
CPU is a Core Duo Mobile "Yonah".

Well... I can only say that I've only seen it with Ralph running. It never happened until I attached there. It never happened with Rosetta Beta, only with Rosetta Mini (their newly rewritten code).

But it is the same processor you have: a Yonah (in a Gateway laptop with 2 GB of RAM). It's running Fedora 7/KDE. Here's the output of uname -a:

[kathryn@Galaxy ~]$ uname -a
Linux Galaxy.Fedora 2.6.23.14-64.fc7 #1 SMP Sun Jan 20 23:54:08 EST 2008 i686 i686 i386 GNU/Linux

It's been nearly a week since I updated. I usually do that Friday evening or Saturday morning.

BOINC is installed via rpm. It's version 5.10.21 packaged up by Eric Myers. It does run as a system daemon out of its own user account. Now that Eric and I have the log rotation sorted, I can try running it out of my user account if anyone thinks that would be helpful. I can also try an older version from Berkeley or compile the current 5.10 branch and try that.

Kathryn :o)

Einstein@Home Moderator

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0


Interesting that it is the same CPU. It makes me wonder whether that one maybe tends to misinterpret a certain kind of instruction within the Einstein app... this is just wild guessing, of course. Bernd, Akos, do you copy?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245282196
RAC: 11983


A new SSE Linux App is out: 4.35.

BM


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245282196
RAC: 11983


Kathryn, Annika,

thanks for your report.

I do have a possible explanation, though it's very technical. Maybe someone can translate it:

The code in the BOINC library (which gets linked into the Apps) that is meant to determine the CPU time has been changed several times in recent months. The reason is that old Linux kernels (i.e. the pthread library there) behave in a non-standard way, so a non-standard "trick" is used to get this information anyway.

Using this trick on all (non-Windows) systems led to deadlocks on MacOS, with the App apparently being "stuck" (no progress for hours until it is restarted). So currently the BOINC library uses the "standard" method on MacOS and the non-standard "trick" on Linux.

It might be that a deadlock similar to the one we previously saw on MacOS could now happen with certain newer Linux kernels, too. If it's really a deadlock, it might depend on the version of the BOINC library that's in other Apps running at the same time, so this may well be limited to certain project pairs.

If this is what's happening, the only way around it would be to choose the method at run-time based on the current kernel/pthread version on the system, which would take quite some programming effort, I guess (in the BOINC library, though).
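
To make the run-time idea a bit more concrete, here is a minimal sketch in Python (not the BOINC library's actual code, which is C++). It assumes that the "old" non-standard behaviour corresponds to the LinuxThreads pthread implementation or a pre-2.6 kernel; that threshold is a guess for illustration only. The function names and the "standard"/"workaround" labels are hypothetical. The only real system interface used is the glibc confstr value CS_GNU_LIBPTHREAD_VERSION, which Python exposes through os.confstr.

```python
import os
import platform
import re


def pthread_flavour():
    """Return the glibc threading library string (e.g. 'NPTL 2.7'), or None."""
    try:
        return os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (ValueError, OSError, AttributeError):
        # Not glibc, or a platform that does not know this confstr name.
        return None


def kernel_version(release=None):
    """Parse (major, minor, patch) from a string like '2.6.23.14-64.fc7'."""
    release = platform.release() if release is None else release
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        return None
    return tuple(int(part or 0) for part in match.groups())


def choose_cpu_time_method(pthread=None, kernel=None):
    """Hypothetical policy: pick the CPU-time method for this system.

    ASSUMPTION: only LinuxThreads or a pre-2.6 kernel needs the
    non-standard workaround; everything else uses the standard call.
    """
    if pthread is not None and pthread.lower().startswith("linuxthreads"):
        return "workaround"
    if kernel is not None and kernel < (2, 6, 0):
        return "workaround"
    return "standard"


if __name__ == "__main__":
    print(pthread_flavour(), kernel_version(),
          choose_cpu_time_method(pthread_flavour(), kernel_version()))
```

With a policy like this, both machines reported in this thread (2.6.22 and 2.6.23 kernels) would end up on the standard method, but whether that would really avoid the hang is untested.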

Is it possible to explicitly trigger this problem? The only way to find out what's wrong is to attach a debugger or profiler to the App and see what it actually does or where it is stuck (on MacOS I found it with Shark).
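
Short of attaching a debugger, one rough way to tell whether an App is really blocked is to sample its accumulated CPU time per thread from /proc and see whether it moves. The following is only a sketch under assumptions: it is Linux-only, the helper names are made up for illustration, and the one-minute interval is arbitrary. It relies on the documented layout of /proc/<pid>/stat, where fields 14 and 15 are utime and stime in clock ticks.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from one /proc/<pid>/stat line."""
    # The command name is in parentheses and may itself contain spaces
    # or ')', so split after the LAST closing parenthesis.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def thread_cpu_ticks(pid):
    """Map each thread ID of `pid` to its accumulated CPU ticks."""
    ticks = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as handle:
                ticks[int(tid)] = parse_cpu_ticks(handle.read())
        except OSError:
            pass  # thread exited between listdir() and open()
    return ticks


def looks_stuck(pid, interval=60.0):
    """True if no thread of `pid` gained CPU time during `interval` seconds."""
    before = thread_cpu_ticks(pid)
    time.sleep(interval)
    return thread_cpu_ticks(pid) == before
```

A task blocked on a lock would normally show no CPU gain at all, whereas a busy loop would keep gaining CPU time without making progress, so this only separates the two cases; it does not say where the App is stuck.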

BM


KSMarksPsych
Moderator
Joined: 15 Oct 05
Posts: 2702
Credit: 4090227
RAC: 0


Message 77473 in response to message 77472

Quote:

Is it possible to explicitly trigger this problem? The only way to find out what's wrong is to attach a debugger or profiler to the App and see what it actually does or where it is stuck (on MacOS I found it with Shark).

BM

I've yet to see anything that might trigger it. It seems to happen randomly, at least for me.

I can see if I still have the logs from that work I reported at Ralph. Eric's init.d script captures the daemon output, so maybe something is of interest in there...

Kathryn :o)

Einstein@Home Moderator

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0


Well, I've only experienced this three times so far, so the chances of catching it with the debugger are probably not so great. I could try, though; DDD is still on this system and maybe I'll get lucky. At the moment I can't figure out a way to trigger it, either; to me it appeared just as random as it did to Kathryn. The first time it happened I actually posted on the message board asking what I should do... I could look for that thread; maybe the guy who helped me (can't think of who exactly it was just now) could give you some valuable information, Bernd.
Your explanation sounds very logical to me. Not too "technical" at all ;-) but then, I think I'm already developing a bit of a "background"...

EDIT: It was actually this very thread... starting here:
http://einsteinathome.org/node/193452&nowrap=true#80340

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 7181418
RAC: 1270


For some reason I have just had 11 workunits fail in a row, all with this error:

Input file missing or invalid

They were at frequency 921.00 and ran from sequence 431 to 427, then 421 to 416. All are now lost to me. I had a great sequence run going from the very peak and was following it down the hill, so I'm a bit annoyed about that.

This is one of the Work units

No previous problems with this host and all other projects are running fine.
Nothing in the messages indicates a problem was encountered.

Also, none of my other Linux hosts has displayed this problem.

I am using an AMD Opteron 275 with Fedora Core 3 Linux and power user App 4.27.

KSMarksPsych
Moderator
Joined: 15 Oct 05
Posts: 2702
Credit: 4090227
RAC: 0


Message 77476 in response to message 77472

Quote:

I do have a possible explanation, though it's very technical. Maybe someone can translate it:

I do remember that discussion on the mailing list.

If I'm understanding it (which is highly unlikely), then information on which version of the API the different projects link against may be helpful. If so, I can try to get that information from the Baker Lab folks and Rytis (Annika, what PG app did you have problems with?).

Kathryn :o)

Einstein@Home Moderator

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0


Goodness, Kathryn... I'm sorry, but I couldn't possibly tell. It's too long ago and I didn't make a screenshot like you did... I'm running everything the project has to offer, which means four or five different apps, and I have no idea which version they were on when I had trouble... The only thing I could do is look at the exact time and date of my answer to Gary's posting, try to translate it to my local timezone and check whether my BOINC logs go back to that point... okay, just noticed the message board is in UTC, so it shouldn't be so terribly difficult ;-)

EDIT: No luck so far with the logs... it looks as if, in contrast to the daemon, the shell script version doesn't keep any logs by default...

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0


I had this same error with two Einstein WUs today, both of which got stuck. Unfortunately I'm not running the debugger atm. Core client version 5.3.31 running with user privileges under my account (in a screen), Einstein science app 4.38 on Ubuntu Linux 7.10, 2.6.22-14 kernel, on a Core Duo machine (laptop). It seems this problem gets more frequent with longer uptime of my box, but that might be coincidence.
