What could cause "error in computing"?

GWGeorge007
Joined: 8 Jan 18
Posts: 3020
Credit: 4931697727
RAC: 473155
Topic 224835

I've been wondering a lot about what causes "error in computing", specifically on E@H, and whether it comes down to one of these:

  1. Using the computer for other things like surfing, emails, or multi-media while your computer is running BOINC in the background.
  2. Having an 'older' computer CPU &/or GPU that doesn't have the necessary processor features required for newer project files.
  3. Errors in the making of the computing process from the project's end.
  4. Not enough memory either available as a whole or as allocated to the BOINC project.
  5. Or... what else could you think of as a factor in "error in computing"?

Obviously I'm curious about this because I have more than an abundance of errors.

Thoughts?

George

Proud member of the Old Farts Association

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

George wrote:
1. Using the computer for other things like surfing, emails, or multi-media while your computer is running BOINC in the background.

I'm pretty sure the answer is 'yes' for that one. In the mild version, a task just fails silently with a computation error in the background while you're loading the system with other activities (not e-mail, but something heavy and video-intensive). In the more severe version, your screen may freeze, and after a reboot you'd find that a task crashed with a computation error.

Quote:
2. Having an 'older' computer CPU &/or GPU that doesn't have the necessary processor features required for newer project files.

I don't think CPUs have caused computation errors in that way. A CPU is either compatible with the app or not, and the server has successfully sent CPUs only appropriate tasks. With GPUs, though, we've seen older GPUs receive a task and then run it endlessly until the time limit caused a computation error.

^^

One additional thing that comes to my mind is Windows deciding to update GPU drivers in the background. I'm pretty sure I got errors from that kind of intrusion a couple of times.

^^

One thing that comes to my mind is heat and power stability on that i7-990X. It has a TDP of 130 watts, and the max operating temperature is relatively low on the i7-9xx series (68C). The i7-920 at the other end of the line had the same thermal specs, but your top-of-the-line version gets hot more easily because it runs at a much higher clock speed at stock settings. So, if you are putting that system under heavy load, have you checked the CPU temps?

What's the model of the motherboard and PSU on that computer? Initial quality of the motherboard could have some role in the game at this point as the original platform is somewhat old (but fantastic).

Have you run stress tests on that computer (mem, cpu)?

On the Ryzen host the components are newer, but as always, a well-performing PSU and rock-solid memory settings are important factors.

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181215041
RAC: 8828

Looking at stderr of your results (for example https://einsteinathome.org/task/1047231990 ):

Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.

So it looks like a server-side issue, not a client one.

BOINC should check such files automatically and download them again if they are damaged (CRC control).

So the file has the correct CRC from BOINC's point of view but the wrong data from the E@h science app's point of view.

 

Perhaps it's worth resetting the project. This will cause all files (including that one) to be downloaded again and may solve the issue. It may not, if the corrupted file was the one used for the CRC computation in the first place; in that case only project staff can fix it.
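Before resetting, one way to test the "same file, different hosts" idea is to compare the file's checksum and first line across the affected machines. Below is a minimal sketch, assuming the integrity check is an MD5 digest (which is what BOINC records in client_state.xml, as far as I know) and that you pass in the path to JPLEPH.405 in your own BOINC data directory. The helper names `file_md5`, `first_line`, and `report` are made up for this illustration.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_line(path, limit=200):
    """Return at most `limit` bytes of the first line, as raw bytes."""
    with open(path, "rb") as handle:
        return handle.readline(limit)


def report(path):
    """Print the digest and first line so two hosts can be compared by eye."""
    print(path)
    print("  md5       :", file_md5(path))
    print("  first line:", first_line(path))
```

Calling `report()` with the path to JPLEPH.405 on each host and comparing the output would show whether the hosts hold identical copies. Identical digests on hosts that both fail would point away from local disk corruption, though they would not by themselves prove where the bad data came from.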

 

And I have a similar bunch of errors on one of my hosts.

 

Keith Myers
Joined: 11 Feb 11
Posts: 4941
Credit: 18573381471
RAC: 5678668

All your errors come from a very weird error I have never heard of, on both your Windows and Linux hosts.

The name limit for the local computer network adapter card was exceeded.
 (0x44) - exit code 68 (0x44)</message>

on the Windows host and 

<message>
process exited with code 68 (0x44, -188)</message>

on the Linux host.

Googling seems to point at a problem with network ports. Either the Windows firewall is misconfigured or you are running a VM like VirtualBox or something.

You will have to enlist somebody with a lot more knowledge than I have.

 

 

Keith Myers
Joined: 11 Feb 11
Posts: 4941
Credit: 18573381471
RAC: 5678668

The root error occurs much earlier in the stderr.txt output, before the message about the damaged file.

 

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181215041
RAC: 8828

Keith, the final outcome (68) is weird indeed, but stderr gives a direct clue:

 

Stderr output

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 68 (0x44, -188)</message>
<stderr_txt>
14:33:06 (55951): [normal]: This Einstein@home App was built at: Jul 26 2017 11:32:40

14:33:06 (55951): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-gnu__FGRPSSE'.
14:33:06 (55951): [debug]: 2.1e+15 fp, 4e+09 fp/s, 523566 s, 145h26m06s32
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-gnu__FGRPSSE --inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 --alpha 2.1039176188 --delta -0.9808959836 --skyRadius 0.001361356817 --ldiBins 15 --f0start 1064 --f0Band 16 --firstSkyPoint 706092 --numSkyPoints 58 --f1dot -1.0e-13 --f1dotBand 1.0e-13 --df1dot 1.344493449e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 56757.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah1075F_1080.0_706092_0.0_0_0.out
output files: 'LATeah1075F_1080.0_706092_0.0_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1080.0_706092_0.0_0_0' 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1080.0_706092_0.0_0_1'
14:33:06 (55951): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
14:33:06 (55951): [debug]: glibc version/release: 2.31/stable
14:33:06 (55951): [debug]: Set up communication with graphics process.
Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.
14:33:06 (55951): [CRITICAL]: ERROR: MAIN() returned with error '4'
FPU status flags:
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah1075F_1080.0_706092_0.0_0_0.out.cohfu': No such file or directory
14:33:17 (55951): [normal]: done. calling boinc_finish(68).
14:33:17 (55951): called boinc_finish

</stderr_txt>
]]>
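To settle which line in a dump like this is the earliest sign of trouble, the log can be scanned mechanically. Here is a minimal sketch, assuming the timestamped format seen above (`HH:MM:SS (pid): [level]: text`) and treating any `[CRITICAL]` line, or any untimestamped line containing "damaged" or "cannot", as a failure. Those keywords and the name `first_problem` are my own guesses for illustration, not anything defined by the science app.

```python
import re

# Matches lines such as: 14:33:06 (55951): [debug]: Set up communication ...
TIMESTAMPED = re.compile(r"^\d\d:\d\d:\d\d \(\d+\): \[(\w+)\]: (.*)$")


def first_problem(stderr_text):
    """Return (line_number, line) for the earliest suspicious line, or None."""
    for number, line in enumerate(stderr_text.splitlines(), start=1):
        match = TIMESTAMPED.match(line)
        if match:
            # Timestamped lines carry a level; only CRITICAL counts here.
            if match.group(1) == "CRITICAL":
                return number, line
        elif "damaged" in line.lower() or "cannot" in line.lower():
            # Untimestamped lines come from the app or shell directly.
            return number, line
    return None
```

Run against the output above, a scan like this would stop at the "seems to be damaged" line, since nothing before it is marked critical or matches the keywords.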


Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181215041
RAC: 8828

Keith Myers wrote:

The root error occurs much earlier in the stderr.txt output before the message about the damaged file.

 

Could you highlight it, please? It seems I'm missing that line.

Keith Myers
Joined: 11 Feb 11
Posts: 4941
Credit: 18573381471
RAC: 5678668

I think the message about the file being damaged appears because the port used to read the file is blocked or has run out of resources.

That is what Googling the error about 

"The name limit for the local computer network adapter card was exceeded."

message seems to indicate in both Linux and Windows environments.

He has the same error on both hosts, one in Windows and one in Linux.

So common component.

https://einsteinathome.org/task/1047141326

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
The name limit for the local computer network adapter card was exceeded.
 (0x44) - exit code 68 (0x44)</message>
<stderr_txt>
06:27:47 (2764): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43

06:27:47 (2764): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe'.
06:27:47 (2764): [debug]: 2.1e+015 fp, 5.4e+009 fp/s, 389725 s, 108h15m24s67
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe --inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 --alpha 2.1039176188 --delta -0.9808959836 --skyRadius 0.001361356817 --ldiBins 15 --f0start 1048 --f0Band 16 --firstSkyPoint 959552 --numSkyPoints 58 --f1dot -1.0e-13 --f1dotBand 1.0e-13 --df1dot 1.344493449e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 56757.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah1075F_1064.0_959552_0.0_0_0.out
output files: 'LATeah1075F_1064.0_959552_0.0_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1064.0_959552_0.0_0_0' 'LATeah1075F_1064.0_959552_0.0_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1064.0_959552_0.0_0_1'
06:27:47 (2764): [debug]: Flags: i386 SSE GNUC X86 GNUX86
06:27:47 (2764): [debug]: Set up communication with graphics process.
Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.
06:27:47 (2764): [CRITICAL]: ERROR: MAIN() returned with error '4'
FPU status flags: PRECISION
06:27:58 (2764): [normal]: done. calling boinc_finish(68).
06:27:58 (2764): called boinc_finish

</stderr_txt>
]]>


 

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181215041
RAC: 8828

In my understanding, 68 comes from the science app itself:

14:33:17 (55951): [normal]: done. calling boinc_finish(68).

So boinc_finish was called and the return value is 68.

And then BOINC interprets it through its own list of errors.
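The numbers in the two `<message>` blocks fit that reading. Here is a minimal sketch, assuming the Windows client simply looks up the operating system's text for error number 68 (which in the Windows system error table is ERROR_TOO_MANY_NAMES, the "name limit" message) and the Linux client prints the same code with the upper bits set. The one-entry table and the name `describe_exit_code` are only for illustration.

```python
# Excerpt of the Windows system error table; only the entry relevant here.
WINDOWS_SYSTEM_ERRORS = {
    68: "The name limit for the local computer network adapter card was exceeded.",
}


def describe_exit_code(code):
    """Format an exit code the way the messages in this thread show it."""
    low_byte = code & 0xFF            # process exit status: 68 == 0x44
    sign_extended = low_byte | ~0xFF  # same bits with the high bits set: -188
    text = WINDOWS_SYSTEM_ERRORS.get(low_byte, "(no entry in this excerpt)")
    return f"exit code {low_byte} (0x{low_byte:x}, {sign_extended}): {text}"


print(describe_exit_code(68))
```

If that assumption holds, the network adapter wording is just the stock text for the number 68 and need not mean anything is wrong with the network on either host.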

 

Raistmer*
Joined: 20 Feb 05
Posts: 208
Credit: 181215041
RAC: 8828

Keith Myers wrote:

I think the message about the file being damaged is because of the port to read the file is blocked or run out of resources.

That is what Googling the error about 

"The name limit for the local computer network adapter card was exceeded."

message seems to indicate in both Linux and Windows environments.

He has the same error on both hosts, one in Windows and one in Linux.

So common component.

 

 

Well, don't forget that I got a similar bunch of errors too, on a host that routinely returned correct results before and does nothing but compute E@h and warm a pretty cold room now ;)

 

Here is the link:

https://einsteinathome.org/host/12826851/tasks/6/0

01:49:20 (924): [debug]: Flags: i386 SSE GNUC X86 GNUX86
01:49:20 (924): [debug]: Set up communication with graphics process.
Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.
01:49:20 (924): [CRITICAL]: ERROR: MAIN() returned with error '4'
FPU status flags:  PRECISION
01:49:30 (924): [normal]: done. calling boinc_finish(68).
01:49:30 (924): called boinc_finish

 

Same code, same file... But a quite different host in a different location...

So I would suspect the server itself...

 

Keith Myers
Joined: 11 Feb 11
Posts: 4941
Credit: 18573381471
RAC: 5678668

But to have the same error 68 on both hosts is of interest. Different OSes.

I have never seen this error in any of my tasks.

Even ones that have had the message about the JPLEPH file being damaged.

Not normal.

A reboot is all that is needed to fix that. But he has had the same errors repeatedly over many days and months now.

I am certain he has rebooted the hosts at least once in this time.

 
