GNU/Linux S5R3 "power users" App 4.27 available

tullio
Joined: 22 Jan 05
Posts: 2,022
Credit: 32,525,382
RAC: 4,319

After installing Linux on my Opteron box I reattached to the project via the BOINC 5.10.28 manager. Now I have two Einstein jobs running, plus others. But I see my application is the standard 4.20. Should I switch to 4.27?
Tullio

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73,516,529
RAC: 0

Message 77440 in response to message 77439

Quote:
After installing Linux on my Opteron box I reattached to the project via the BOINC 5.10.28 manager. Now I have two Einstein jobs running, plus others. But I see my application is the standard 4.20. Should I switch to 4.27?
Tullio

Most definitely. You will see a speed increase.

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

No problems on my box so far and apparently good performance, but now I have a WU that is acting strangely... I'm talking about this WU:
http://einsteinathome.org/workunit/36744754
By acting strangely I mean that the WU seems to be "stuck" at 66% finished... it is shown in BOINC manager as running, but there is no progress and CPU time doesn't increase. Top shows that the corresponding CPU core has more than half of its resources free (it is shown as idle) and the Einstein task takes up 0% CPU power. My wingman has already reported a client error on the WU.
What should I do? Should I abort the task? I have suspended it for the time being to avoid wasting more CPU time.
Advice on how to react and on possible causes would be appreciated.
Greetings, Annika

tullio
Joined: 22 Jan 05
Posts: 2,022
Credit: 32,525,382
RAC: 4,319

Message 77442 in response to message 77440

Quote:
Quote:
After installing Linux on my Opteron box I reattached to the project via the BOINC 5.10.28 manager. Now I have two Einstein jobs running, plus others. But I see my application is the standard 4.20. Should I switch to 4.27?
Tullio

Most definitely. You will see a speed increase.


My first Opteron run completed with 4.27. It is six times faster than my PII. Good work, AMD! It is running two Einstein jobs, two SETI jobs and a QMC job. My CPU is an Opteron 1210 at 1.8 GHz (the cheapest I found), my RAM is 512 MB, my disk is 160 GB. I shall not mention the WS maker but you can guess it.
Tullio

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,130
Credit: 36,997,772,148
RAC: 38,597,350

Message 77443 in response to message 77441

Quote:
... the WU seems to be "stuck" at 66% finished... it is shown in BOINC manager as running, but there is no progress and CPU time doesn't increase.

I've seen exactly the same from time to time on several different machines and the way to kickstart things is to stop BOINC completely and then restart it. Each time I've done that, the task starts from the last saved checkpoint and continues to completion and then validates. I have no idea what causes it to become stuck.

Cheers,
Gary.

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

Tried that... worked just fine, thanks a lot :-D

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,130
Credit: 36,997,772,148
RAC: 38,597,350

A couple of errors I've noticed with this 4.27 app. Both of them were on the same host, which is overclocked, but has been at an unchanged level of overclocking for the last 3 years (24/7 operation) with no prior history of problems. The HSF seems OK but I guess I should clean the fins again to be sure, seeing as it is the middle of summer here at the moment.

This result had a "process exited with code 99" error. Later on in the output it said, "Required frequency-bins [-7, 8] not covered by SFT-interval [1412793, 1413317]"

This result had a "process exited with code 41" - which on inspection is a signal 11. I thought they were supposed to be solved now :).

Cheers,
Gary.

tullio
Joined: 22 Jan 05
Posts: 2,022
Credit: 32,525,382
RAC: 4,319

I completed three tasks on my new box, all asking for 236.29 credits (pending). The first two, done with app. 4.20, took 52901.11 s and 53336.03 s. The third one, done with 4.27, took 44241.41 s. There is a good speed increase.
Tullio
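
The three run times quoted above can be turned into a rough speed-up figure. The following is only a minimal sketch: it assumes the three tasks are comparable workunits (the thread does not confirm this), and the helper name `compare` is made up for illustration. With a single 4.27 sample, the result (about 1.2x) is an indication rather than a benchmark.

```python
from statistics import mean

# Run times in seconds, as reported in the post above.
runtimes_4_20 = [52901.11, 53336.03]
runtimes_4_27 = [44241.41]


def compare(old, new):
    """Return (speed-up factor, fractional reduction in run time).

    Hypothetical helper: compares the mean run time of the old app
    with the mean run time of the new app.
    """
    old_avg = mean(old)
    new_avg = mean(new)
    return old_avg / new_avg, 1.0 - new_avg / old_avg


speedup, reduction = compare(runtimes_4_20, runtimes_4_27)
print(f"speed-up: {speedup:.2f}x, run time reduced by {reduction:.1%}")
```

More 4.27 results on the same host would be needed before reading much into the exact percentage.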

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,130
Credit: 36,997,772,148
RAC: 38,597,350

Bernd,

Earlier today I noticed this results list for another host of mine that I switched from 4.14 to 4.27 several days ago. A string of signal 11 errors under 4.27.

The irony is that the machine was crunching OK under 4.14. The oldest two successful results still visible were done with 4.14. The very first task that was switched midstream from 4.14 to 4.27 errored out with a signal 11 not long after the transition was made. Until today, every task since the transition has errored out.

By chance (as I wasn't aware of the problem) I hooked up keyboard, mouse and monitor to this box and found that BOINC was complaining about no network. It's a long story but this is not entirely unusual for this particular hardware - I have about 25 similar boxes split between Linux and Windows. It's a Gigabyte Tualatin PIII motherboard and the onboard ethernet locks up when loaded under Linux but is fine under Windows. I solve the problem by throwing in an external ethernet card for Linux and usually no more problems. Funnily enough, this particular box was "different" in that it always worked fine under Linux with the onboard ethernet. I think I've actually got two like this that don't seem to need an external card so I'm guessing there was perhaps an undocumented hardware revision done on these two as the motherboard revisions seem identical. Also the latest BIOS doesn't make any difference.

So when BOINC was complaining today about no network, I pinged the router, which was fine, and ifconfig said eth0 was up with a valid IP address, so I decided to restart BOINC. There was a finished result stuck in upload. The restart didn't help: BOINC was still complaining, so I decided to reboot.

Well, the machine now has a problem trying to start X and drops to a console login screen, and it still has problems with very long delays trying to bring up the network during boot. In desperation after many failed reboots, I used a long patch cable to hook the machine up to a different network hub. X still won't start but I can log in at the console screen and see that I now have a network. Using BOINC Manager on another machine I attached to the BOINC client on the problem machine, which immediately complained that there was no network. Using boinc_cmd on the problem machine I stopped and restarted BOINC and then the remote BOINC Manager found everything in a happy state. The machine had already downloaded new work and was crunching away and an update soon fixed the stuck upload. To my surprise, the uploaded result validated.

EDIT:
The previous sentence should say "...the uploaded result didn't show a client error."

My guess is that the original network hub and the onboard ethernet have some issues with each other causing network instability and a bad reaction in the 4.27 app. However other machines (different hardware) on the original hub seem to be working OK. In other words these are specific hardware/driver issues that cause a problem for the 4.27 app that didn't seem to be a problem for 4.14. My concern is that I thought the 4.27 app wasn't supposed to crash with a signal 11 if the network was unstable? Does the version of BOINC in use have any impact on this? Should people still experiencing signal 11s upgrade both the app and BOINC itself?

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,915
Credit: 193,361,792
RAC: 21,440

Message 77448 in response to message 77445

Quote:

A couple of errors I've noticed with this 4.27 app. Both of them were on the same host, which is overclocked, but has been at an unchanged level of overclocking for the last 3 years (24/7 operation) with no prior history of problems. The HSF seems OK but I guess I should clean the fins again to be sure, seeing as it is the middle of summer here at the moment.

This result had a "process exited with code 99" error. Later on in the output it said, "Required frequency-bins [-7, 8] not covered by SFT-interval [1412793, 1413317]"

This result had a "process exited with code 41" - which on inspection is a signal 11. I thought they were supposed to be solved now :).


I'd consider it rather normal for a computer to run for a certain time w/o problems and then suddenly all kinds of weird things happen. Three years might look a bit short, but I'd suspect this to be within normal tolerance. Our old Merlin cluster machines ran under perfectly controlled conditions for almost precisely three years; now one or two machines die per week. In addition, on average I'd expect overclocked machines to age faster than ones that are run within specifications.

The "frequency-bins [-7, 8]" points to a problem in the FPU (a variable that lives only in an FPU register became NaN). A "signal 11" is the Linux equivalent of a "General access violation" on Windows - the number of possible reasons is infinite, and they could be in hardware as well as in software.

I'd think the machine has reached EOL, at least for number crunching.

One reason for a 'signal 11' in the App was in software (the BOINC library) and has been fixed. It occurred when the Core Client became unresponsive, e.g. when waiting for a DNS lookup. You could simulate this by sending the client a STOP signal ("killall -STOP boinc"), waiting for ~30 seconds and then sending a CONT signal ("killall -CONT boinc"). Old Apps that were running would crash with a "signal 11", new ones should "exit with 0 status and no 'finished' file" and should be restarted by the Client.
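
For anyone who would rather script the STOP/CONT experiment than type the two killall commands by hand, here is a minimal sketch of the same idea. It is an illustration, not a tested reproduction recipe: the function name `freeze_process` is made up, the PID of the Core Client has to be supplied by you (for example from `pidof boinc`), and the 30-second default simply mirrors the figure mentioned above.

```python
import os
import signal
import time


def freeze_process(pid, seconds=30.0):
    """Suspend process `pid` with SIGSTOP, wait, then resume it with SIGCONT.

    Hypothetical helper mirroring "killall -STOP boinc" followed by
    "killall -CONT boinc", but for a single PID chosen by the caller.
    """
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the wait is interrupted.
        os.kill(pid, signal.SIGCONT)
```

Calling `freeze_process(client_pid)` while a task is running should show whether the App crashes with a "signal 11" or exits cleanly and gets restarted. The `finally` clause is there so that a Ctrl-C during the wait does not leave the client frozen.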

According to Charlie Fenton the BOINC developers are currently implementing asynchronous DNS requests, which should fix the same problem from the Client side, i.e. it will respond to the Apps even when waiting for a DNS lookup.

An exit status of 41 is a "signal 11" that was caught by the signal handler. In contrast to a "signal 11" with exit status 11 it means that there should be a stacktrace in stderr that tells us at least where the error happened. And indeed it does! I'll take a look as soon as I have time.

BM
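
The exit statuses mentioned in this thread can be collected into a small lookup for quick reference when reading a results list. This is only a sketch based on the statements made in the posts above; it is not an official BOINC or Einstein@Home table, and the names `EXIT_STATUS_NOTES` and `describe_exit_status` are made up for illustration.

```python
# Meanings as described in this thread only; not an official list.
EXIT_STATUS_NOTES = {
    0: "clean exit; with no 'finished' file the Client should restart the task",
    11: "signal 11 reported directly, without the signal handler's stack trace",
    41: "signal 11 caught by the signal handler; a stack trace should be in stderr",
    99: "seen together with 'Required frequency-bins ... not covered by SFT-interval'",
}


def describe_exit_status(code):
    """Return the note for an exit status, or a fallback if it was not discussed."""
    return EXIT_STATUS_NOTES.get(code, "not discussed in this thread")


for status in (41, 99, 7):
    print(status, "->", describe_exit_status(status))
```

Anything outside these four values falls through to the fallback string, since the thread gives no basis for interpreting it.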

