A few days ago, in another thread, I described sudden computation errors. The cause could not be determined at the time, so it was chalked up to any number of issues. Today I repeated what I had done a few days ago and immediately received "Computation Errors".
The machine is a Win 7 box with an NVIDIA GTX 650 Ti crunching E@H only. Since the earlier cleanup (removing BOINC, manually cleaning BOINC's data directory, and removing the E@H project), this machine had been back online crunching these E@H data types: Binary Radio Pulsar Search (Perseus Arm Survey ... BRP5-cuda32-nv301), CasA and Gamma-ray pulsar search #3.
Here is what I did to induce "Computation Error":
I want to process Perseus Arm Survey only. I noticed that the "location" for this PC was "default". I had other locations defined which processed other combinations of WUs. I chose the "school" location, which is set to give me Perseus Arm only: no other job types and NO CPU work, with "Use NVIDIA GPU" checked. I clicked "update" in BOINC Manager. OK, nothing much happened. Did a Win 7 software update. Suspended E@H. Rebooted the system.
Went into BOINC Manager, "unsuspended" E@H and immediately got 3 computation errors for the Perseus Arm Survey jobs that had been running. Two other Perseus Arm Survey jobs that had been in a wait state entered the run state, but I have a feeling they are going to error out. Not sure but....
Maybe my procedure is wrong, but if all Perseus Arm jobs now constantly produce error conditions, then I would think that this is something everyone could induce by doing what I did, and that would not be desirable.
I also just received a Gravitational Wave job that is NOT part of the "school" location. It seems that BOINC is ignoring the settings on the E@H website.
Copyright © 2024 Einstein@Home. All rights reserved.
how to create a computation error scenario on E@H
Something I could imagine happening is BOINC not shutting down cleanly. When you hit "restart" after the update, Windows tells programs to shut down. Some programs may just do this, but BOINC will first write the current task state to disk (checkpoint). If your HDD is being hammered with such requests, this will take some time to finish. After a few seconds Windows will show you a list of programs which have not yet closed and a "force restart" button. If you hit this, the outstanding I/O operations will be aborted, which could cause all sorts of trouble (unless the program in question was really just hanging).
Another thought: was it the same update in both cases, one which might not have been installed correctly or might have been rolled back? An nVidia driver update?
MrS
Scanning for our furry friends since Jan 2002
Hi!
I don't think this is related to the venues at all.
The tasks that errored out (as far as I can see) were terminated because BOINC was detecting an excessive runtime, suspecting they were stuck, so to speak. Indeed the tasks are reported as having run for almost 2 days (and it's clear from previously submitted tasks that your card can handle tasks much, much faster).
So either the tasks were actually stuck (and time in suspend state should not be counted as elapsed time!), or something happened to your PC's real time/date settings (e.g. date/time was wrong and then corrected during the update process???) which made BOINC think that the tasks had been running much longer than you would expect.
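To make the "excessive runtime" idea concrete, here is a rough sketch of the kind of check involved. As far as I understand it, the client derives a time limit from the workunit's `rsc_fpops_bound` and the estimated speed of the app version, and aborts a task whose elapsed time goes past that limit. The function names and all numbers below are made up for illustration; this is not BOINC's actual code.

```python
def max_elapsed_seconds(rsc_fpops_bound: float, flops: float) -> float:
    """Time limit implied by the workunit's operation bound and the
    estimated speed (operations per second) of the app version."""
    return rsc_fpops_bound / flops


def exceeded_time_limit(elapsed_seconds: float,
                        rsc_fpops_bound: float,
                        flops: float) -> bool:
    """True if a task has run longer than its implied limit."""
    return elapsed_seconds > max_elapsed_seconds(rsc_fpops_bound, flops)


# Illustrative values only (assumptions, not taken from any real task):
BOUND = 3.6e15   # hypothetical rsc_fpops_bound
SPEED = 1.0e11   # hypothetical flops estimate, giving a 10 hour limit

two_days = 2 * 24 * 3600
print(exceeded_time_limit(two_days, BOUND, SPEED))  # reported ~2 days
print(exceeded_time_limit(5, BOUND, SPEED))         # a 5 second run
```

With these invented numbers a task reported as running for about two days is well past the limit, while a 5 second run is nowhere near it, which is why the elapsed time actually recorded for the failed tasks matters here.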
Cheers
HB
RE: Hi! I don't think this
Hmm. HB, I am not arguing with you, but this node had been working flawlessly up until a few days ago when I modified a "location profile". Then it started having comp errors. I had to delete the E@H project, remove BOINC and manually clean up BOINC's data directory because some of the slots contained references to E@H. I then reinstalled BOINC and added E@H back in. Things progressed normally. This morning it was fine until I changed from one "location profile" to another, and again I started to notice errors. I am really trying to understand what is causing this problem.
I had placed E@H into suspend prior to rebooting for the windows update to take effect. After reboot I took E@H out of suspend. I noticed at this point there was a long delay before BOINC manager updated and then when it did update I noticed the errors. Also this node is time sync'd so I know that its time is correct.
The computer in question is #10373677 and the "error listing" is showing run times of ~5 secs for all failed tasks. I am not sure where you see two days. I am probably looking in the wrong place.
RE: Something I could
I too had concerns about the "force restart" so I did not do that. I had placed E@H into suspend prior to the reboot. I wonder if that could have an effect when E@H is restarted after the reboot.
Not the same update. I just checked the list of updates for today's date in Windows Update and there were no new NVIDIA drivers installed. I had not considered that possibility, so I am happy you called that out.
[EDIT] if a new driver were installed would it be flagged as NVIDIA or as some windows KBxxxxxxxxx type of generic fix?
Did you only suspend Einstein or did you fully exit Boinc prior to the restart?
I always do a File -> Exit from BOINC Manager and opt to "Stop running tasks when exiting the BOINC Manager" before any shutdown or restart of my computer, to make sure BOINC has enough time to finish all shutdown procedures before the computer actually shuts down. It's an old habit, but this discussion reinforces my decision to keep doing it the "safe" way.
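For anyone who prefers to script this, a minimal sketch using `boinccmd`, the command-line tool that ships with BOINC, might look like the following. It suspends one project and then asks the client to quit, which is roughly what the manual steps above do. The project URL is a placeholder you would have to replace with the one shown for your project, the helper names are my own, and I am assuming `boinccmd` is on the PATH and can reach the local client; by default the script only prints what it would run.

```python
import shutil
import subprocess


def build_commands(project_url: str) -> list[list[str]]:
    """Commands to suspend one project and then stop the BOINC client."""
    return [
        ["boinccmd", "--project", project_url, "suspend"],
        ["boinccmd", "--quit"],
    ]


def shut_down_boinc(project_url: str, dry_run: bool = True) -> list[list[str]]:
    """Run (or, in dry-run mode, just print) the shutdown commands."""
    commands = build_commands(project_url)
    for cmd in commands:
        if dry_run or shutil.which(cmd[0]) is None:
            # Nothing is executed in dry-run mode or if boinccmd is missing.
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder URL (assumption): use your project's actual master URL.
    shut_down_boinc("https://project.example/", dry_run=True)
```

Even after the quit request, it is worth waiting until the client process has actually gone away before restarting Windows, since the point of the exercise is to let the checkpoint writes finish.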
RE: Did you only suspend
I suspended Einstein and feel fairly certain that I exited BOINC Manager prior to doing a Windows restart.
RE: [EDIT] if a new driver
You'd recognize them as nVidia drivers, no problem.
Well, this box of yours is really acting strangely. If more (maybe other) errors appear, some hardware might be broken or breaking.
MrS
Scanning for our furry friends since Jan 2002
RE: RE: [EDIT] if a new
It never recovered, so I had to uninstall BOINC and E@H and manually delete the data directory again. It seems OK now. But you might be right. This box has been around for ~10 years, idling/working 24/7, so it might be having disk or other issues. At present I am considering freeing up a relatively new Linux box whose work I have offloaded to a VM. I could then install Win 7 on this new box, and if that resolves the problems I would retire the older box and scrub the OS drive(s) before destroying them.
Doing a Win 7 rebuild/restore is just time consuming. At least for me. I am not "that" comfortable with Windows from a migration point of view (email, etc.).
MrS
I took this machine offline and re-purposed a Linux box running a GTX 760 to take its place. I let the initial download of various types of WUs complete after having set "no more work". I then assigned a new location for E@H with GPU WUs only, because I will share this node with Rosetta, which is strictly CPU oriented. I re-enabled more work and received only the GPU WUs requested by the profile/location change.
All seems well so maybe there was some issue with the hardware on the older machine as you pointed out.
I will replace the drives in the older machine and return it to crunching after a suitable period of time. I still want to keep it as an offline Windows box in case I discover stuff I require but neglected to transfer.