New (Albert) application and workunits

DanNeely
Joined: 4 Sep 05
Posts: 1,364
Credit: 3,562,358,667
RAC: 0

RE: no problems here

Message 22673 in response to message 22672

Quote:

no problems here downloading/crunching alberts. they take less time than the other WUs so i claim less credit.

BUT the computer I am paired with (well, it has been sent most of the alberts i have had) hasn't returned their alberts yet, so i must wait for credit.... grr

Normally I get almost instant credit since I've got a 4-day queue: 3 days to cover my ISP going down Friday evening and not being fixed until Monday (happened twice in the last 6 months), and one more day in case their sysadmin needs to overnight a spare part. It looks like the person you're waiting on has a similarly long queue.

It could be worse, after all. I've got 5 results waiting on a noob who appears to have quit after returning 6 errors the last week of December, and a 6th on another noob who only did a single work unit.

Professor Ray
Joined: 22 Feb 05
Posts: 46
Credit: 12,510,998
RAC: 924

RE: RE: And on my older

Message 22674 in response to message 22653

Quote:
Quote:


And on my older P3:

5.2.8

2005-12-31 12:45:36.1250 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2005-12-31 12:45:36.1250 [normal]: Started search at lalDebugLevel = 0
2005-12-31 12:45:36.8125 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2005-12-31 12:45:36.8125 [normal]: No usable checkpoint found, starting from beginning.
2005-12-31 12:50:57.9843 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
2005-12-31 16:24:34.0937 [normal]: Search finished successfully.

Looks like normal operations to me. That is, I think the "No usable checkpoint found . . ." messages are indicative of the first time Albert tried to write a checkpoint for those particular WU's. Every Albert WU I have looked at has one of these messages. In other words, it is only a problem if a WU gets more than one of these messages.

Man, I don't know about that. The last three WU's I've processed have failed on me due to excessive CPU times. And these times are way out in space: 55 hours to completion? And the CPU time indicated at abort is a bunch of jive with respect to actual elapsed time. There's no way I could've processed a WU as long as is indicated at abort time.

Ananas
Joined: 22 Jan 05
Posts: 272
Credit: 2,500,681
RAC: 0

An idea for the reduced

An idea for the reduced "initial replication" part, I'm not sure if that is possible without a lot of work though:

Maybe fresh results for workunits that already have result entries marked "Over/No reply" could be sent preferentially to hosts with host.avg_turnaround < 3 days.
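
Roughly what I have in mind, as a sketch only. This is not the real scheduler code: the Host class, is_resend and may_send are names made up for illustration, and I'm assuming avg_turnaround is stored in seconds.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86400
FAST_TURNAROUND = 3 * SECONDS_PER_DAY  # the "< 3 days" threshold suggested above


@dataclass
class Host:
    # Hypothetical stand-in for a scheduler host record.
    host_id: int
    avg_turnaround: float  # assumed to be in seconds


def may_send(host: Host, is_resend: bool) -> bool:
    """Return True if a result may be handed to this host.

    First-time results go to any host; results that replace an
    "Over/No reply" entry are held back for fast hosts.
    """
    if not is_resend:
        return True
    return host.avg_turnaround < FAST_TURNAROUND


fast = Host(host_id=1, avg_turnaround=1.5 * SECONDS_PER_DAY)
slow = Host(host_id=2, avg_turnaround=4 * SECONDS_PER_DAY)

print(may_send(fast, is_resend=True))
print(may_send(slow, is_resend=True))
print(may_send(slow, is_resend=False))
```

The sketch is stricter than "preferably": a real version would probably need a fallback that releases the result to any host if no fast host asks for work within some time.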

Michael Roycraft
Joined: 10 Mar 05
Posts: 846
Credit: 157,718
RAC: 0

RE: Man, I don't know about

Message 22676 in response to message 22674

Quote:
Man, I don't know about that. The last three WU's I've processed have failed on me due to excessive CPU times. And these times are way out in space: 55 hours to completion? And the CPU time indicated at abort is a bunch of jive with respect to actual elapsed time. There's no way I could've processed a WU as long as is indicated at abort time.

Ray,

70-80 hours is way too long for your machine, especially considering the WUs weren't even completed in that time, unless there's some incompatibility with Win98/albert that I don't know about. I'd suspect either thermal throttling or something very CPU-intensive running alongside it. Anything you know of that might qualify?

Regards,

Michael

microcraft
"The arc of history is long, but it bends toward justice" - MLK

Ananas
Joined: 22 Jan 05
Posts: 272
Credit: 2,500,681
RAC: 0

@Professor Ray : Your

@Professor Ray :

Your results really do not look good; the messages indicate a problem.

- No heartbeat from core client for 31 sec - exiting
- Corrupted Fstat-file '...': has 2697271 bytes instead of 2700598

This is what I would do in this case :

- exclude the BOINC directory from being scanned by antivirus software
- while BOINC is not running, do a scandisk
- check the message board for known incompatibilities with Win9x

The plain "Maximum CPU time exceeded" error without additional messages might also be caused by an "over-optimized" BOINC client that reports too high a benchmark value. I think the maximum allowed CPU time isn't a constant but is calculated from the benchmark values.
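
To illustrate that last point: as far as I know, the client gets the limit by dividing the workunit's rsc_fpops_bound by the host's benchmarked floating-point speed. The sketch below assumes exactly that formula, and the numbers are invented, not taken from a real Albert workunit.

```python
def max_cpu_time_seconds(rsc_fpops_bound: float, benchmark_flops: float) -> float:
    """CPU time limit, assuming limit = fpops bound / benchmarked speed."""
    if benchmark_flops <= 0:
        raise ValueError("benchmark must be positive")
    return rsc_fpops_bound / benchmark_flops


# Invented values, for illustration only.
FPOPS_BOUND = 4.0e14        # hypothetical workunit bound
honest_benchmark = 5.0e8    # CPU benchmarked at 500 MFLOPS
inflated_benchmark = 2.0e9  # same CPU, but the client claims 2 GFLOPS

honest_limit = max_cpu_time_seconds(FPOPS_BOUND, honest_benchmark)
inflated_limit = max_cpu_time_seconds(FPOPS_BOUND, inflated_benchmark)

print(f"honest limit:   {honest_limit / 3600:.1f} h")
print(f"inflated limit: {inflated_limit / 3600:.1f} h")
```

If the formula is right, a benchmark that is four times too high cuts the allowed CPU time to a quarter, while the real crunching speed stays the same, so a slow box can run into the limit before the WU is done.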

Professor Ray
Joined: 22 Feb 05
Posts: 46
Credit: 12,510,998
RAC: 924

Nope, doesn't make any

Nope, doesn't make any sense.

As is evident from my profile I've accum'd almost 4K credits w/EAH. I'm engaged in three other BOINC science applications, and except for a recent Rosetta hiccup there are no other problems. Rosetta completed the last two WU's w/out issue. Concurrently with BOINC applications, I'm processing UD Agent (Rosetta and/or LigandFit). I'm getting a mean time between UD Agent checkpoints of about 59 minutes with 1 STD being 1:21:00 over a period of 300 checkpoints. This is reasonable performance for UD Agent (and is why I bowed out of WCG processing, i.e., checkpointing for that BOINC application is non-deterministic).

Task switching between BOINC applications occurs about every 3:20:00, and write to disk is every 0:01:00. That should ensure at least one iteration of each application once per CPU wake period.

As far as CPU intensive processing: there's nothing going on. When I desire to launch one of my sims (Falcon4.0 or F1 2002), I wait for UD Stats to show a recent checkpoint, and then I suspend/snooze both BOINC and UD Agent. The rest is just normal IE browsing/Outlook Express.

I'm perceiving a problem with Albert (and this appears to have just started around New Year's).

I'm running default 5.13 BOINC, albeit with an optimized SETI application (that shouldn't affect EAH though). I am OC'd at 112 FSB running PC133 ECC SDRAM async at 4/3. But that hasn't changed either. What HAS changed is Albert.

It could be that my box is dying, i.e., I'm running a slot 1 P3 on a P3V4X, and HD00 (which NEVER spins down because of SpyBot's Tea Timer) is getting long in the tooth at 5 years. The CPU is cooled w/Vantec P35030 dual-fan CPU cooler (shimmed w/Arctic Silver). The P3V4X clock generator has an Arctic Silver shimmed passive (486) heat-sink (as does the Northbridge). If my system is dying, it's dying selectively (only w/respect to EAH).

Jord
Joined: 26 Jan 05
Posts: 2,952
Credit: 5,893,653
RAC: 450

My latest Albert went out

My latest Albert went out with an error, and no, before Stick probes me, the exit code -1073741819 (0xc0000005) wasn't caused by me using my screensaver. ;)

(I never use screen saver or graphics).

Reason: Access Violation (0xc0000005) at address 0x0045CB31 read attempt to address 0x00000000

I never knew the application could read the top part of my memory. I thought it was in use by Windows. :)
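
For anyone wondering how the two numbers relate: the negative exit code is the same 32-bit value as the hex one, just read as a signed integer. A quick sketch (the helper name is mine):

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


print(exit_code_to_hex(-1073741819))  # 0xc0000005
```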

Michael Roycraft
Joined: 10 Mar 05
Posts: 846
Credit: 157,718
RAC: 0

RE: My latest Albert went

Message 22680 in response to message 22679

Quote:

My latest Albert went out with an error, and no, before Stick probes me, the exit code -1073741819 (0xc0000005) wasn't caused by me using my screensaver. ;)

(I never use screen saver or graphics).

Reason: Access Violation (0xc0000005) at address 0x0045CB31 read attempt to address 0x00000000

I never knew the application could read the top part of my memory. I thought it was in use by Windows. :)

Jord,

Hey, mate! As the other half of the "Graphics Bug" tag-team, I guess that leaves me off the case, too, since it's equally unlikely to be a graphics adaptor driver issue. :-)

Michael

microcraft
"The arc of history is long, but it bends toward justice" - MLK

Stick
Joined: 24 Feb 05
Posts: 790
Credit: 33,115,076
RAC: 2,651

RE: My latest Albert went

Message 22681 in response to message 22679

Quote:

My latest Albert went out with an error, and no, before Stick probes me, the exit code -1073741819 (0xc0000005) wasn't caused by me using my screensaver. ;)

(I never use screen saver or graphics).

Reason: Access Violation (0xc0000005) at address 0x0045CB31 read attempt to address 0x00000000

I never knew the application could read the top part of my memory. I thought it was in use by Windows. :)

Maybe you should try the Beta application! ;-)

Actually, I happened to find a similar result last week and posted this message on the NEW: WINDOWS TEST APPLICATION FOR EINSTEIN@HOME board.

I have to admit that Jord used the more appropriate venue.

Edited - to improve the humor (maybe).

Jord
Joined: 26 Jan 05
Posts: 2,952
Credit: 5,893,653
RAC: 450

Wow... 6 in a row?? All with

Wow... 6 in a row?? All with the same error. Anyone?

I stopped BOINC already, restarted it, did a reboot. Or am I getting the bad batch on purpose? ;)

edit: 8 in a row now. Einstein is at No New Work until I figure out what's happening here. No need to blow through the other 8 units.
