New (Albert) application and workunits

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

RE: no problems here

8 Jan 2006 23:02:45 UTC

Message 22673 in response to message 22672

Quote:

no problems here downloading/crunching alberts. they take less time than the other WUs so i claim less credit.

BUT the computer I am paired with (well, it has been sent most of the alberts i have had) hasn't returned their alberts yet, so i must wait for credit.... grr

Normally I get almost instant credit since I've got a 4 day queue: 3 days to cover my ISP going down Friday evening and not being fixed until Monday (happened twice in the last 6 months), and one more day in case their sysadmin needs to overnight a spare part. It looks like the person you're waiting on has a similarly long queue.

It could be worse, after all. I've got 5 results waiting on a noob who appears to have quit after returning 6 errors the last week of Dec, and a 6th on another noob that only did a single work unit.

Professor Ray

Joined: 22 Feb 05

Posts: 46

Credit: 12464567

RAC: 1210

RE: RE: And on my older

11 Jan 2006 23:02:48 UTC

Message 22674 in response to message 22653

Quote:

Quote:

And on my older P3:

5.2.8

2005-12-31 12:45:36.1250 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2005-12-31 12:45:36.1250 [normal]: Started search at lalDebugLevel = 0
2005-12-31 12:45:36.8125 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2005-12-31 12:45:36.8125 [normal]: No usable checkpoint found, starting from beginning.
2005-12-31 12:50:57.9843 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
2005-12-31 16:24:34.0937 [normal]: Search finished successfully.

Looks like normal operations to me. That is, I think the "No usable checkpoint found . . ." messages are indicative of the first time Albert tried to write a checkpoint for those particular WU's. Every Albert WU I have looked at has one of these messages. In other words, it is only a problem if a WU gets more than one of these messages.

Man, I don't know about that. The last three WU's I've processed have failed on me due to excessive CPU times. And these times are way out in space: 55 hours to completion? And the CPU time indicated at abort is a bunch of jive with respect to actual elapsed time. There's no way I could've processed a WU as long as is indicated at abort time.

Ananas

Joined: 22 Jan 05

Posts: 272

Credit: 2500681

RAC: 0

An idea for the reduced

11 Jan 2006 23:23:47 UTC

Message 22675

An idea for the reduced "initial replication" part, though I'm not sure it is possible without a lot of work:

Maybe fresh results for workunits that already have result entries marked "Over/No reply" could be delivered preferentially to hosts with host.avg_turnaround < 3 days.
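
A minimal Python sketch of that preference, purely as an illustration: the `Host` class, the `order_candidates` helper, and the assumption that `avg_turnaround` is stored in seconds are all hypothetical and not taken from the actual scheduler code.

```python
from dataclasses import dataclass

THREE_DAYS = 3 * 24 * 3600  # threshold in seconds (assumed unit)


@dataclass
class Host:
    host_id: int
    avg_turnaround: float  # average turnaround in seconds (assumed unit)


def order_candidates(outcomes, hosts):
    """Order candidate hosts for a fresh result of one workunit.

    If the workunit already has an "Over/No reply" result, hosts with an
    average turnaround under three days are moved to the front. The sort is
    stable, so the original order is kept within each group.
    """
    if "Over/No reply" in outcomes:
        return sorted(hosts, key=lambda h: h.avg_turnaround >= THREE_DAYS)
    return list(hosts)


hosts = [Host(1, 5 * 24 * 3600), Host(2, 1 * 24 * 3600)]
resend_order = order_candidates(["Success", "Over/No reply"], hosts)
normal_order = order_candidates(["Success"], hosts)
```

Slow hosts stay eligible here and are only tried later, which matches "preferably" rather than a hard cutoff.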

Michael Roycraft

Joined: 10 Mar 05

Posts: 846

Credit: 157718

RAC: 0

RE: Man, I don't know about

11 Jan 2006 23:38:05 UTC

Message 22676 in response to message 22674

Quote:

Man, I don't know about that. The last three WU's I've processed have failed on me due to excessive CPU times. And these times are way out in space: 55 hours to completion? And the CPU time indicated at abort is a bunch of jive with respect to actual elapsed time. There's no way I could've processed a WU as long as is indicated at abort time.

Ray,

70-80 hours is way too long for your machine, especially considering the WUs weren't even completed in that time, unless there's some incompatibility with Win98/albert that I don't know about. I'd suspect either thermal throttling or something very CPU-intensive running alongside it. Anything you know of that might qualify?

Regards,

Michael

microcraft
"The arc of history is long, but it bends toward justice" - MLK

Ananas

Joined: 22 Jan 05

Posts: 272

Credit: 2500681

RAC: 0

@Professor Ray : Your

11 Jan 2006 23:41:40 UTC

Message 22677

@Professor Ray :

Your results really do not look good, the messages indicate a problem.

- No heartbeat from core client for 31 sec - exiting
- Corrupted Fstat-file '...': has 2697271 bytes instead of 2700598

This is what I would do in this case :

- exclude the BOINC directory from being scanned by antivirus software
- while BOINC is not running, do a scandisk
- check the message board for known incompatibilities with Win9x

The plain "Maximum CPU time exceeded" error without additional messages might also be caused by an "over-optimized" BOINC client that reports a benchmark value that is too high. The maximum allowed CPU time isn't a constant but is calculated from the benchmark values, I think.
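
A rough Python sketch of why an inflated benchmark would shrink the limit, assuming the limit is an operations bound for the workunit divided by the host's benchmarked speed. The formula, the function name, and all numbers below are illustrative assumptions, not values taken from BOINC or from Albert workunits.

```python
def max_cpu_seconds(fpops_bound: float, benchmark_flops: float) -> float:
    """Assumed rule: allowed CPU time = operations bound / benchmarked speed."""
    return fpops_bound / benchmark_flops


bound = 4.0e14  # hypothetical per-workunit operations bound

# Same workunit, same real hardware, two different reported benchmarks.
honest = max_cpu_seconds(bound, 5.0e8)    # benchmark matches real speed
inflated = max_cpu_seconds(bound, 2.0e9)  # benchmark reported 4x too high

print(f"honest limit:   {honest / 3600:.1f} h")
print(f"inflated limit: {inflated / 3600:.1f} h")
```

Under that assumption, a benchmark reported four times too high cuts the allowed CPU time to a quarter, while the real crunch time stays the same.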

Professor Ray

Joined: 22 Feb 05

Posts: 46

Credit: 12464567

RAC: 1210

Nope, doesn't make any

11 Jan 2006 23:55:27 UTC

Message 22678

Nope, doesn't make any sense.

As is evident from my profile I've accum'd almost 4K credits w/EAH. I'm engaged in three other BOINC science applications, and except for a recent Rosetta hiccup there are no other problems. Rosetta completed the last two WU's w/out issue. Concurrently with BOINC applications, I'm processing UD Agent (Rosetta and/or LigandFit). I'm getting a mean time between UD Agent checkpoints of about 59 minutes with 1 STD being 1:21:00 over a period of 300 checkpoints. This is reasonable performance for UD Agent (and is why I bowed out of WCG processing, i.e., checkpointing for that BOINC application is non-deterministic).

Task switching between BOINC applications occurs about every 3:20:00, and write to disk is every 0:01:00. That should ensure at least one iteration of each application once per CPU wake period.

As far as CPU intensive processing: there's nothing going on. When I desire to launch one of my sims (Falcon4.0 or F1 2002), I wait for UD Stats to show a recent checkpoint, and then I suspend/snooze both BOINC and UD Agent. The rest is just normal IE browsing/Outlook Express.

I'm perceiving a problem with Albert (and this appears to have just started around New Year's).

I'm running default 5.13 BOINC, albeit with an optimized SETI application (that shouldn't affect EAH though). I am OC'd at 112 FSB running PC133 ECC SDRAM async at 4/3. But that hasn't changed either. What HAS changed is Albert.

It could be that my box is dying, i.e., I'm running a slot 1 P3 on a P3V4X, and HD00 (which NEVER spins down because of SpyBot's Tea Timer) is getting long in the tooth at 5 years. The CPU is cooled w/Vantec P35030 dual-fan CPU cooler (shimmed w/Arctic Silver). The P3V4X clock generator has an Arctic Silver shimmed passive (486) heat-sink (as does the Northbridge). If my system is dying, it's dying selectively (only w/respect to EAH).

Jord

Joined: 26 Jan 05

Posts: 2952

Credit: 5877416

RAC: 7778

My latest Albert went out

12 Jan 2006 2:48:58 UTC

Message 22679

My latest Albert went out with an error, and no, before Stick probes me, the exit code -1073741819 (0xc0000005) wasn't caused by me using my screensaver. ;)

(I never use screen saver or graphics).

Reason: Access Violation (0xc0000005) at address 0x0045CB31 read attempt to address 0x00000000

I never knew the application could read the top part of my memory. I thought it was in use by Windows. :)
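
For reference, the two numbers in that exit code are the same 32-bit value: -1073741819 is the signed reading of 0xc0000005, the Windows status code for an access violation. A minimal Python sketch of the conversion (the helper name is made up for illustration):

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


exit_code = -1073741819
as_hex = hex(to_unsigned32(exit_code))
print(as_hex)  # 0xc0000005
```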

Michael Roycraft

Joined: 10 Mar 05

Posts: 846

Credit: 157718

RAC: 0

RE: My latest Albert went

12 Jan 2006 3:06:31 UTC

Message 22680 in response to message 22679

Quote:

My latest Albert went out with an error, and no, before Stick probes me, the exit code -1073741819 (0xc0000005) wasn't caused by me using my screensaver. ;)

(I never use screen saver or graphics).

Reason: Access Violation (0xc0000005) at address 0x0045CB31 read attempt to address 0x00000000

I never knew the application could read the top part of my memory. I thought it was in use by Windows. :)

Jord,

Eej, maat! As the other half of the "Graphics Bug" tag-team, I guess that leaves me off the case, too, since it's equally unlikely to be a graphics adaptor driver issue. :-)

Michael

microcraft
"The arc of history is long, but it bends toward justice" - MLK

Stick

Joined: 24 Feb 05

Posts: 790

Credit: 33004196

RAC: 22131

RE: My latest Albert went

12 Jan 2006 3:09:49 UTC

Message 22681 in response to message 22679

Quote:

My latest Albert went out with an error, and no, before Stick probes me, the exit code -1073741819 (0xc0000005) wasn't caused by me using my screensaver. ;)

(I never use screen saver or graphics).

Reason: Access Violation (0xc0000005) at address 0x0045CB31 read attempt to address 0x00000000

I never knew the application could read the top part of my memory. I thought it was in use by Windows. :)

Maybe you should try the Beta application! ;-)

Actually, I happened to find a similar result last week and posted this message on the NEW: WINDOWS TEST APPLICATION FOR EINSTEIN@HOME board.

I have to admit that Jord used the more appropriate venue.

Edited - to improve the humor (maybe).

Jord

Joined: 26 Jan 05

Posts: 2952

Credit: 5877416

RAC: 7778

Wow... 6 in a row?? All with

12 Jan 2006 13:42:33 UTC

Message 22682

Wow... 6 in a row?? All with the same error. Anyone?

I stopped BOINC already, restarted it, did a reboot. Or am I getting the bad batch on purpose? ;)

edit: 8 in a row now. Einstein is at No New Work until I figure out what's happening here. No need to blow through the other 8 units.
