CPU tasks error out after 12 seconds.

halfempty
Joined: 3 Apr 20
Posts: 14
Credit: 37595576
RAC: 0
Topic 224263

In addition to the congestion issues everybody has been having, I can't seem to get any CPU tasks because all of mine started to error out. The error message seems to be:

"The name limit for the local computer network adapter card was exceeded.
 (0x44) - exit code 68 (0x44)"

This has me totally confused. Would appreciate any suggestions.

Here's a link to the error task list:

https://einsteinathome.org/host/12820614/tasks/6/0

Richard de Lhorbe
Joined: 15 Dec 05
Posts: 43
Credit: 9273102268
RAC: 708115

I am getting similar problems: all CPU tasks for the Gamma Ray Pulsar search now fail after about 12 seconds, but with a different error message from the original poster's. Here is a partial cut-and-paste:

13:56:56 (23269): [debug]: Set up communication with graphics process.
Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.
13:56:56 (23269): [CRITICAL]: ERROR: MAIN() returned with error '4'

Of course, I now can't get any more WUs because I can't upload anything, but I have confidence this will gradually work itself out, as it always does.


San-Fernando-Valley
Joined: 16 Mar 16
Posts: 264
Credit: 7181214928
RAC: 14614401

He has the same error as you; it can be seen further down in the output. I guess he did not see that.

halfempty
Joined: 3 Apr 20
Posts: 14
Credit: 37595576
RAC: 0

You're right, same error further down. Guess I just have to wait for them to work it out. Thanks.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109973019658
RAC: 29826736

halfempty wrote:
"The name limit for the local computer network adapter card was exceeded."

Stupidly, Windows intervenes and uses the error code which is specific to the app as if it were a Windows error code - which it's not.  You need to look elsewhere for the real problem.

Pick any one of your failed tasks and click on its Task ID link.  Scroll down and look through what was returned to the project.  In this case it is:

Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.

The nearby lines give some context.
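As a rough illustration of reading the returned output rather than the Windows message, here is a minimal Python sketch that pulls out each [CRITICAL] line plus the line just before it. The function name `find_critical_context` is made up for this example, and the sample text is copied from the stderr posted in this thread; it assumes you have saved a task's stderr as plain text.

```python
# Tag that marks fatal errors in the stderr output posted in this thread.
CRITICAL_TAG = "[CRITICAL]"


def find_critical_context(stderr_text, before=1):
    """Return each [CRITICAL] line together with `before` non-blank lines preceding it."""
    lines = [ln for ln in stderr_text.splitlines() if ln.strip()]
    hits = []
    for i, line in enumerate(lines):
        if CRITICAL_TAG in line:
            start = max(0, i - before)
            hits.append(lines[start:i + 1])
    return hits


# Excerpt taken from the failed-task output quoted in this thread.
sample = """\
02:45:19 (6924): [debug]: Set up communication with graphics process.
Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.
02:45:19 (6924): [CRITICAL]: ERROR: MAIN() returned with error '4'
FPU status flags: PRECISION
02:45:30 (6924): [normal]: done. calling boinc_finish(68).
"""

for block in find_critical_context(sample):
    print("\n".join(block))
```

For the sample above this prints the "Line 1 in inputfile ... seems to be damaged." line followed by the [CRITICAL] line, which is the real clue; the exit code 68 (0x44) on its own says nothing about network adapters.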

The file JPLEPH.405 appears to be the problem so that's the first thing to investigate.  It's a static data file needed for all GRP tasks which is why they all quickly fail if the file is corrupt.  That there is another person reporting the same issue is a concern.  However, the best you can do is to see if your copy is corrupt in some way.

I've experienced something similar in the distant past.  The first thing I used to do was replace the file. I would rename it to JPLEPH.BAD (so it remained covering the same disk sectors) and replace it with a fresh copy from another machine (or download it afresh).  That seemed to work - for a while - but the problem returned.  Eventually, by running a memory testing app, I found one of the RAM sticks had a bad location.  Replacing that stick permanently fixed the problem.

The three things I would try are: (1) replace the file with a fresh copy, (2) check your disk for bad sectors, and (3) test your RAM.  If more people start reporting problems with the same file, maybe it will be something else.
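For step (1), one way to tell whether a local copy differs from a copy on another machine is to compare checksums. The following is only a sketch: the helper names `file_sha256` and `same_content` are invented here, and no reference checksum for JPLEPH.405 is given in this thread, so the comparison has to be against a copy you believe to be good.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical contents (compared by SHA-256 digest)."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Run `file_sha256` on JPLEPH.405 in the project directory on each machine and compare the two hex strings. If they differ, at least one copy is not what the other machine has; if they match, the file itself is probably not the culprit.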

Cheers,
Gary.

mohavewolfpup
Joined: 8 Mar 20
Posts: 9
Credit: 5768052
RAC: 0

I'm up to 69 failed tasks and counting, so I'm shutting down the client until it is fixed, lest I get banned.


<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
The name limit for the local computer network adapter card was exceeded.
 (0x44) - exit code 68 (0x44)</message>
<stderr_txt>
02:45:19 (6924): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43

02:45:19 (6924): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe'.
02:45:19 (6924): [debug]: 2.1e+015 fp, 4.2e+009 fp/s, 495478 s, 137h37m58s42
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe --inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 --alpha 2.1039176188 --delta -0.9808959836 --skyRadius 0.001361356817 --ldiBins 15 --f0start 1080 --f0Band 16 --firstSkyPoint 586670 --numSkyPoints 58 --f1dot -1.0e-13 --f1dotBand 1.0e-13 --df1dot 1.344493449e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 56757.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah1075F_1096.0_586670_0.0_2_0.out
output files: 'LATeah1075F_1096.0_586670_0.0_2_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1096.0_586670_0.0_2_0' 'LATeah1075F_1096.0_586670_0.0_2_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1075F_1096.0_586670_0.0_2_1'
02:45:19 (6924): [debug]: Flags: i386 SSE GNUC X86 GNUX86
02:45:19 (6924): [debug]: Set up communication with graphics process.
Line 1 in inputfile ../../projects/einstein.phys.uwm.edu/JPLEPH.405 seems to be damaged.
02:45:19 (6924): [CRITICAL]: ERROR: MAIN() returned with error '4'
FPU status flags: PRECISION
02:45:30 (6924): [normal]: done. calling boinc_finish(68).
02:45:30 (6924): called boinc_finish

</stderr_txt>
]]>


halfempty
Joined: 3 Apr 20
Posts: 14
Credit: 37595576
RAC: 0

Thanks for the suggestions. I'm at work right now, but I'll play around with it when I get home. 

By the other people having the same problem I'm thinking it could be a server side update gone awry. I'll see what the time stamp on the file is before I do anything. 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109973019658
RAC: 29826736

halfempty wrote:
By the other people having the same problem I'm thinking it could be a server side update gone awry.

Yes, I tend to agree that the problem is server-side or occurs while downloading.

My understanding is that this file (certainly the same name) gets used for both CPU and GPU tasks.  I'm not seeing the problem for GPUs but I'm also not getting a new copy so I don't think it's an updated file being sent to everyone.

Perhaps it's just those who ask for the file because they don't have it, i.e. those who have just joined this search or are starting up a new machine.  Maybe some sort of corruption is happening during the download, in which case replacing the file with a known good copy from another machine might work for now.

Cheers,
Gary.

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 15812119341
RAC: 7165486

I only found this report just now (since previously I was only checking Technical News and Cruncher's Corner), but I have observed this problem since about 19:24 UTC on 19th December. It looks like there's something wrong with tasks with ID starting with LATeah1075F.

For me the problem is across multiple hosts, and only for CPU FGRP5 work units. I'm still getting repeat work units (e.g. ending in _5, indicating the sixth attempt), so I'm quite certain it's the units themselves that are bad, not necessarily the machines processing them or the downloaded data.

Edit: Checking one host's copy of JPLEPH.405, it is dated as 2020-05-27, more than half a year ago. So quite strange that this problem only manifests now.

Edit: Forced a re-download of JPLEPH.405 at 2020-12-21 03:19 UTC - confirmed there are still FGRP5 CPU tasks that are failing with the same error message. I can only conclude thus far that there is something really bad with this batch of work units.

Soli Deo Gloria

Eugene Stemple
Joined: 9 Feb 11
Posts: 58
Credit: 270624310
RAC: 353757

I'm seeing some of these also.  In my account error tasks there are six instances of LATeah1075F with computation error after 13 seconds.  Exit code 68.  The stdout log file has them tagged with "output file absent."  These are all CPU tasks, via v1.08 FGRPSSE (Linux).  However, not all work units of that LATeah1075F series are failing, at least not recently.  I do see successful work units from December 18 and earlier that completed and validated.  Their (CPU) run times are on the order of 10,000 seconds.  Browsing backwards through the stdoutdae.txt log it's the last six that failed.  They have work unit IDs - after the LATeah1075F common initial string - of 1096.0_220748 / 1096.0_303978 / 1064.0_1143644 / 1080.0_358000 / 1096.0_1261558 / 1096.0_850792 .  All the earlier work units (that finished normally) had IDs of 920.0 / 872.0 / 856.0 / etc. all LESS THAN 1000.  Probably a coincidence that the 1000 threshold is a boundary between good and bad... but odd anyway.  I don't seem to have any of these in my cache, so nothing to monitor more closely.
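The good/bad split around 1000 can be checked mechanically. This is just a sketch over the handful of IDs listed above; the helper `leading_field` is a made-up name, and the check says nothing about the cause, only that these few samples fall on either side of the threshold.

```python
def leading_field(id_fragment):
    """Return the first numeric field of a fragment such as '1096.0_220748'."""
    return float(id_fragment.split("_")[0])


# ID fragments of the six failed work units listed in the post.
failed = [
    "1096.0_220748", "1096.0_303978", "1064.0_1143644",
    "1080.0_358000", "1096.0_1261558", "1096.0_850792",
]
# Leading fields of earlier work units that finished normally.
succeeded = ["920.0", "872.0", "856.0"]

THRESHOLD = 1000.0  # the apparent boundary noted in the post, not an official value

failed_above = all(leading_field(f) >= THRESHOLD for f in failed)
succeeded_below = all(leading_field(s) < THRESHOLD for s in succeeded)
print(failed_above, succeeded_below)  # prints: True True
```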

:^) Maybe it's the Jupiter/Saturn conjunction messing things up...


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245218601
RAC: 12923

Nope, I'm afraid that was us. The idea was to move the FGRP5 workunit generator away from the overloaded upload server, but something went wrong there and went unnoticed. Sorry for that.

BM
