I just discovered that one of my hosts had stopped crunching for several weeks because it had run out of disk space. The Einstein scheduler wasn't smart enough to delete some of the 16 GB of data files that it might need again some day in order to make room for data files that it could work on today.
Since the computer itself remained online and was still communicating with e@h, the failure slipped through my primary and secondary monitoring systems.
To prevent this from happening again, is there something I could use that would send an alarm (preferably via email) if one of my hosts stopped reporting completed tasks for a day or so?
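To make the idea concrete, here is a minimal sketch of the check I have in mind. It assumes the time of each host's last reported task can be obtained somehow, which is exactly the part I don't know how to do. The host names and the `find_stale_hosts` function are made up for illustration, and the email step is only marked with a comment.

```python
from datetime import datetime, timedelta, timezone


def find_stale_hosts(last_reported, now, max_age=timedelta(days=1)):
    """Return the names of hosts whose last reported task is older than max_age."""
    return sorted(
        host
        for host, reported_at in last_reported.items()
        if now - reported_at > max_age
    )


# Hypothetical data: host name -> time of the last completed task it reported.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last_reported = {
    "host-a": now - timedelta(hours=3),  # still reporting
    "host-b": now - timedelta(days=20),  # silent for weeks
}

stale = find_stale_hosts(last_reported, now)
if stale:
    # An email alert (for example via smtplib) would go here instead of print.
    print("No completed tasks in the last day from: " + ", ".join(stale))
```

Run from a daily cron job or scheduled task, something like this would catch a host that is online and talking to the project but no longer returning work.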