I've been wondering a lot about the causes of "error in computing", specifically on E@H, and whether the cause is one of the following:
- Using the computer for other things like surfing, emails, or multi-media while your computer is running BOINC in the background.
- Having an 'older' computer CPU &/or GPU that doesn't have the necessary processor features required for newer project files.
- Errors in how the tasks were prepared on the project's end.
- Not enough memory either available as a whole or as allocated to the BOINC project.
- Or... what else could you think of as a factor in "error in computing"?
Obviously I'm curious about this because I have more than an abundance of errors.
Thoughts?
George
Proud member of the Old Farts Association
George wrote:1. Using the
I'm pretty sure the answer is 'yes' for that one. The mild version just produces a computation error silently in the background while you're loading the system with other activities (not e-mail, but something heavy and video-intensive). The more severe version may freeze your screen, and after a reboot you'd find that a task crashed with a computation error.
I don't think CPUs have been the reason for computation errors in that way. A CPU is either compatible with the app or not, and the server has successfully sent CPUs only appropriate tasks. But with GPUs we've seen older GPUs receive a task and then run it endlessly until the time limit caused a computation error.
^^
One additional thing that comes to mind is Windows deciding to update GPU drivers in the background. I'm pretty sure I got errors from that kind of intrusion a couple of times.
^^
One thing that comes to mind is heat and power stability on that i7-990X. It has a TDP of 130 watts, and the max operating temp is relatively low on the i7-9xx series (68C). The i7-920 at the other end of the line had the same thermal specs, but your top-of-the-line version gets hotter more easily because it runs at a much higher clock speed at stock settings. So, in case you are putting that system under heavy load, have you checked the CPU temps?
What's the model of the motherboard and PSU on that computer? Initial quality of the motherboard could have some role in the game at this point as the original platform is somewhat old (but fantastic).
Have you run stress tests on that computer (mem, cpu)?
On the Ryzen host the components are newer, but as always a well-performing PSU and rock-solid memory settings are important factors.
Looking at stderr of your results (for example https://einsteinathome.org/task/1047231990 ):
So it looks like a server issue, not a client one.
BOINC should check such files automatically and re-download them if they are damaged (CRC check).
So the file has a correct CRC from BOINC's point of view but wrong data from the E@H science app's point of view.
It may be worth resetting the project. That forces a re-download of all files (including that one) and may solve the issue. It won't help if the corrupted file itself was used to compute the CRC; in that case only project staff can fix it.
I have a similar bunch of errors on one of my hosts, too.
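If anyone wants to test the "correct checksum but wrong data" idea by hand, one rough way is to hash the local JPLEPH.405 and compare it with a copy from a host that computes fine. This is only a sketch: the choice of MD5 and the helper names `md5_of_file` and `same_content` are my assumptions for illustration, and it does not reproduce whatever check the BOINC client performs internally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same digest (hypothetical helper)."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

For example, `same_content("JPLEPH.405", "/mnt/other_host/JPLEPH.405")` (both paths are placeholders) would tell you whether the failing host and a healthy host actually hold identical files. If they do, the file itself is probably not what differs between the hosts.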
All your errors are from a very weird error I have never heard of, on both your Windows and Linux hosts.
on the Windows host and
on the Linux host.
Googling seems to point at a problem with network ports. Either the Windows firewall is misconfigured or you are running a VM such as VirtualBox.
You will have to enlist somebody with a lot more knowledge than I can provide.
The root error occurs much earlier in the stderr.txt output before the message about the damaged file.
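To make it easier to see which message comes first in a long stderr.txt, here is a rough sketch that prints the earliest lines that look like trouble. The keyword list and the function name `first_suspicious_lines` are just my guesses at what is worth flagging, not anything defined by BOINC or the science app.

```python
import re

# Keywords are an assumption: adjust to whatever your stderr actually contains.
SUSPICIOUS = re.compile(r"damaged|\[CRITICAL\]|ERROR|cannot stat", re.IGNORECASE)


def first_suspicious_lines(stderr_text, limit=3):
    """Return up to `limit` (line_number, text) pairs, earliest first."""
    hits = []
    for number, line in enumerate(stderr_text.splitlines(), start=1):
        if SUSPICIOUS.search(line):
            hits.append((number, line.strip()))
            if len(hits) == limit:
                break
    return hits
```

Paste the stderr text from the task page into a string and call `first_suspicious_lines(text)`. The first entry is the earliest flagged line, which is usually the best place to start reading.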
Keith, the final outcome (68) is weird indeed, but stderr gives a direct clue:
Stderr output
14:33:06 (55951): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-gnu__FGRPSSE'.
14:33:06 (55951): [debug]: 2.1e+15 fp, 4e+09 fp/s, 523566 s, 145h26m06s32
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-gnu__FGRPSSE --inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 --alpha 2.1039176188 --delta -0.9808959836 --skyRadius 0.001361356817 --ldiBins 15 --f0start 1064 --f0Band 16 --firstSkyPoint 706092 --numSkyPoints 58 --f1dot -1.0e-13 --f1dotBand 1.0e-13 --df1dot 1.344493449e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 56757.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah1075F_1080.0_706092_0.0_0_0.out
output files: 'LATeah1075F_1080.0_706092_0.0_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1080.0_706092_0.0_0_0' 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1080.0_706092_0.0_0_1'
14:33:06 (55951): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
14:33:06 (55951): [debug]: glibc version/release: 2.31/stable
14:33:06 (55951): [debug]: Set up communication with graphics process.
Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.
14:33:06 (55951): [CRITICAL]: ERROR: MAIN() returned with error '4'
FPU status flags:
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
14:33:17 (55951): [normal]: done. calling boinc_finish(68).
14:33:17 (55951): called boinc_finish
</stderr_txt>
]]>
Keith Myers wrote: The root
Could you highlight it, please? It seems I'm missing that line.
I think the message about the file being damaged appears because the port used to read the file is blocked or has run out of resources.
That is what Googling the error about
message seems to indicate in both Linux and Windows environments.
He has the same error on both hosts, one in Windows and one in Linux.
So common component.
https://einsteinathome.org/task/1047141326
06:27:47 (2764): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe'.
06:27:47 (2764): [debug]: 2.1e+015 fp, 5.4e+009 fp/s, 389725 s, 108h15m24s67
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe --inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 --alpha 2.1039176188 --delta -0.9808959836 --skyRadius 0.001361356817 --ldiBins 15 --f0start 1048 --f0Band 16 --firstSkyPoint 959552 --numSkyPoints 58 --f1dot -1.0e-13 --f1dotBand 1.0e-13 --df1dot 1.344493449e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 56757.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah1075F_1064.0_959552_0.0_0_0.out
output files: 'LATeah1075F_1064.0_959552_0.0_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1064.0_959552_0.0_0_0' 'LATeah1075F_1064.0_959552_0.0_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1064.0_959552_0.0_0_1'
06:27:47 (2764): [debug]: Flags: i386 SSE GNUC X86 GNUX86
06:27:47 (2764): [debug]: Set up communication with graphics process.
Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.
06:27:47 (2764): [CRITICAL]: ERROR: MAIN() returned with error '4'
FPU status flags: PRECISION
06:27:58 (2764): [normal]: done. calling boinc_finish(68).
06:27:58 (2764): called boinc_finish
</stderr_txt>
]]>
In my understanding, 68 comes from the science app itself:
14:33:17 (55951): [normal]: done. calling boinc_finish(68).
So boinc_finish was called and the return value is 68.
BOINC then interprets it through its own list of error codes.
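To compare hosts quickly, the two numbers in question can be pulled out of a stderr dump with a couple of regular expressions. This is a minimal sketch based only on the log lines quoted in this thread; the function name `exit_codes` is made up, and it assumes the messages keep exactly this wording.

```python
import re

# Patterns copied from the wording of the stderr lines quoted in this thread.
FINISH = re.compile(r"calling boinc_finish\((-?\d+)\)")
MAIN_ERR = re.compile(r"MAIN\(\) returned with error '(-?\d+)'")


def exit_codes(stderr_text):
    """Extract the science app's MAIN() error and the boinc_finish() code."""
    finish = FINISH.search(stderr_text)
    main = MAIN_ERR.search(stderr_text)
    return {
        "main_error": int(main.group(1)) if main else None,
        "boinc_finish": int(finish.group(1)) if finish else None,
    }
```

Running it on the Linux and Windows logs above should give the same pair on both, which is consistent with the code coming from the science app rather than from the OS.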
Keith Myers wrote:I think
Well, don't forget I got a similar bunch of errors too, on a host that routinely returned correct results before and does nothing but compute E@H and warm a pretty cold room now ;)
Here is the link:
https://einsteinathome.org/host/12826851/tasks/6/0
Same code, same file... But quite different host and its location...
So, I would suspect server itself...
But having the same error 68 on both hosts is interesting. Different OSes.
I have never seen this error in any of my tasks.
Even ones that have had the message about the JPLEPH file being damaged.
Not normal.
A reboot is all that is needed to fix that. But he has had the same errors repeatedly over many days and months now.
I am certain he has rebooted the hosts at least once in this time.