how often does einstein checkpoint workunits?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245287009
RAC: 11974

1. NTFS5 should be journaling, shouldn't it?

2. The checkpoint files are separate from the client_state files. They are also written according to your settings, so once a minute by default.

3. The temp files can grow large - several MBs. I think for most users it's not a good idea to keep and write two copies of them. Most would prefer to deal with BOINC like with any other program - prevent the machine from crashing, as it may trash work. At least with BOINC it's not your own work you lose, just a bit of CPU time.
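
To make points 2 and 3 concrete, here is a minimal Python sketch, not the actual BOINC or Einstein@Home code. A hypothetical `Checkpointer` class writes at most once per interval (60 seconds is assumed, matching the default mentioned above) and swaps each new checkpoint in via a temporary file, so a second copy exists only briefly during the write. All names are illustrative assumptions.

```python
import os
import tempfile
import time


class Checkpointer:
    """Hypothetical helper: writes a checkpoint at most once per interval."""

    def __init__(self, path, interval_s=60.0):
        self.path = path
        self.interval_s = interval_s
        self._last_write = None  # time of the last successful checkpoint

    def maybe_write(self, state: bytes, now=None) -> bool:
        """Write `state` if the interval has elapsed; return True if written."""
        now = time.monotonic() if now is None else now
        if self._last_write is not None and now - self._last_write < self.interval_s:
            return False  # too soon, skip this checkpoint
        self._write_via_temp_file(state)
        self._last_write = now
        return True

    def _write_via_temp_file(self, state: bytes) -> None:
        # Write to a temporary file in the same directory, then swap it in,
        # so the old checkpoint stays intact until the new one is complete.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(state)
                tmp.flush()
                os.fsync(tmp.fileno())  # push the data to disk before the swap
            os.replace(tmp_path, self.path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
```

`os.replace` is atomic on POSIX systems; other platforms and file systems may not give the same guarantee, so treat this as an illustration of the idea rather than a guaranteed fix.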

BM

Ned
Joined: 22 Jan 05
Posts: 18
Credit: 24493621
RAC: 0

Message 3971 in response to message 3970

> 1. NTFS5 should be journaling, shouldn't it?

Did not know that NTFS5 journaled... But I've had bad experiences with using NTFS as C: drive file system, so I avoid it. I would redirect Einstein to put its files on another drive with NTFS if BOINC allowed it...
>
> 2. The checkpoint files are separate from the client_state files. They are also
> written according to your settings, so once a minute by default.

That brings up another challenge with Windows systems that have "deleted file recovery"... HUNDREDS of old copies of "client_state_prev.xml"... Can you just reuse the alternate instead of deleting the old copy and creating a new one??

>
> 3. The temp files can grow large - several MBs. I think for most users it's
> not a good idea to keep and write two copies of them. Most would prefer to
> deal with BOINC like with any other program - prevent the machine from
> crashing, as it may trash work. At least with BOINC it's not your own work you
> lose, just a bit of CPU time.

Perhaps, but that's what started this thread. Ned
>
> BM
>

Ol' Retired IT Geezer

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0

Message 3972 in response to message 3971

> > 1. NTFS5 should be journaling, shouldn't it?
>
> Did not know that NTFS5 journaled... But I've had bad experiences with using
> NTFS as C: drive file system, so I avoid it. I would redirect Einstein to put
> its files on another drive with NTFS if BOINC allowed it...
>
> > 2. The checkpoint files are separate from the client_state files. They are
> > also written according to your settings, so once a minute by default.
>
> That brings up another challenge with Windows systems that have "deleted file
> recovery"... HUNDREDS of old copies of "client_state_prev.xml"... Can you just
> reuse the alternate instead of deleting the old copy and creating a new one??
>
> > 3. The temp files can grow large - several MBs. I think for most users it's
> > not a good idea to keep and write two copies of them. Most would prefer to
> > deal with BOINC like with any other program - prevent the machine from
> > crashing, as it may trash work. At least with BOINC it's not your own work
> > you lose, just a bit of CPU time.
>
> Perhaps, but that's what started this thread. Ned
>
> > BM

Good news: my system crashed again, but this time einstein recovered well and no processing time was lost (as a workunit takes 13 hours on my system, losing 6 hours of CPU would be very annoying). From this I infer that the previous failure to resume was a timing problem, with the crash occurring during some critical disk write operation.

Bad news: both crashes locked the system hard but left the HD running, forcing a power reset. Twice may be a coincidence, but this has only happened since einstein was installed two days ago.

einstein 4.79
boinc 4.19
Win98SE
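
If the earlier failure to resume really was a crash in the middle of a checkpoint write, one common defence is to store a checksum with the checkpoint and reject any file that does not match. The sketch below is a hypothetical Python illustration, not how einstein 4.79 actually handles it; `pack_checkpoint` and `load_checkpoint` are made-up names.

```python
import hashlib
import os

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def pack_checkpoint(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload


def load_checkpoint(path):
    """Return the payload, or None if the file is missing, truncated, or corrupt."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        blob = f.read()
    if len(blob) < DIGEST_LEN:
        return None  # too short to even hold the digest
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        return None  # a partial or damaged write is rejected here
    return payload
```

A caller that gets `None` back would restart the workunit (or fall back to an older copy, if one is kept) instead of resuming from damaged state.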

--
searching for gravitational waves since 2005

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245287009
RAC: 11974

Quite a lot of people (literally thousands by now) are running E@H without problems, so it's unlikely that BOINC/E@H causes the crashes by itself. It may, however, trigger other problems on your system that had gone unnoticed before. Frequent issues include the graphics driver (E@H makes much more use of OpenGL than most other programs) and, as it is mainly CPU-bound, overheating / cooling problems.

BM

genes
Joined: 10 Nov 04
Posts: 41
Credit: 1440267
RAC: 1479

Message 3974 in response to message 3973

> Quite a lot of people (literally thousands by now) are running E@H without
> problems, so it's unlikely that BOINC/E@H causes the crashes by itself. It
> may, however, trigger other problems on your system that had gone
> unnoticed before. Frequent issues include the graphics driver (E@H makes much
> more use of OpenGL than most other programs) and, as it is mainly CPU-bound,
> overheating / cooling problems.
>
> BM
>

I agree totally. I have had episodes of crashing in the past which seemed to coincide with a new version of Einstein, but it turned out to be a (relatively new) graphics card that was drawing too much power. I had borrowed it from work, to play with some of its fancy new features, so I just put back the card I had before. Once I swapped that out, everything started running smoothly again. Both were nVidia cards, using the same driver version, so I can't blame the driver.


Mikie Tim T
Joined: 22 Jan 05
Posts: 105
Credit: 263777741
RAC: 0

Message 3975 in response to message 3970

> 1. NTFS5 should be journaling, shouldn't it?
>
> 2. The checkpoint files are separate from the client_state files. They are also
> written according to your settings, so once a minute by default.
>
> 3. The temp files can grow large - several MBs. I think for most users it's
> not a good idea to keep and write two copies of them. Most would prefer to
> deal with BOINC like with any other program - prevent the machine from
> crashing, as it may trash work. At least with BOINC it's not your own work you
> lose, just a bit of CPU time.
>
> BM
>

Win98SE wouldn't be running NTFS of any variety, but FAT32 as I recall. I'm not aware of FAT32 having journalling capabilities, but please correct me if I'm wrong. Anyway, the real fix is to prevent the crashing in the first place, which alleviates the need to come up with workarounds for the current checkpointing scheme.

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0

Message 3976 in response to message 3973

> Quite a lot of people (literally thousands by now) are running E@H without
> problems, so it's unlikely that BOINC/E@H causes the crashes by itself. It
> may, however, trigger other problems on your system that had gone
> unnoticed before. Frequent issues include the graphics driver (E@H makes much
> more use of OpenGL than most other programs) and, as it is mainly CPU-bound,
> overheating / cooling problems.

Thanks. I'm investigating some possible causes; one of which may be an interaction with the Java 2 runtime environment.

Does einstein save all the checkpoint files when a normal exit is performed? On restart it continues with almost no progress loss; hopefully, if another crash occurs, it will resume from the last exit state.

--
searching for gravitational waves since 2005

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245287009
RAC: 11974

> Does einstein save all the checkpoint files when a normal exit is performed?

Sure. That's what checkpointing is for.

BM
