download trouble September 11

archae86
Joined: 6 Dec 05
Posts: 3165
Credit: 7398861687
RAC: 1958699
Topic 201028

All four of my hosts currently have download trouble, starting perhaps two hours ago.  The general pattern is exceedingly slow and intermittent transfer rates, with downloads timing out before completion and going into pending.

archae86
Joined: 6 Dec 05
Posts: 3165
Credit: 7398861687
RAC: 1958699

The download trouble persisted in similar behavior to my original report, up to a very few minutes ago.  But in the last five minutes all my machines cleared all their download backlog very rapidly after a single retry request.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 197729422
RAC: 16213

I would say either a clogged Internet connection or a clogged download mirror. Here at AEI everything was nominal.

archae86
Joined: 6 Dec 05
Posts: 3165
Credit: 7398861687
RAC: 1958699

The problem resumed not very long after I made that post, and has now persisted for some hours.

As to my local connection, other activities on my hosts appear to behave normally (including the connection check to Google made by BOINC itself).  A quick check of speedtest.net during the earlier incident gave about 60 Mbps down/6 Mbps up to a nearby test location.  I tried a test location many hundreds of miles away (Singtel in Palo Alto, CA) just now and got 38 msec ping time, 59.24 Mbps down, 6.13 Mbps up.  I don't know a way to diagnose a download mirror problem, nor how to get my Einstein work to try a different mirror.

The message log contains many instances of "transient HTTP error" and "project communication failed", and "connection nnnn seems to be dead".

So far as I can tell with my limited ability to read the message logs, uploads and scheduler interactions have been working fine throughout.

The lack of other user reports suggests this problem is, if not entirely specific to me, then limited in scope.  My location is Albuquerque NM, and my ISP is Comcast.

A traceroute to the project server shows several timeouts in the Hannover area, but I don't know whether this is normal.

Here is the European side of a tracert to einstein1.aei.uni-hannover.de

  9   149 ms   148 ms   149 ms  ae-4-90.edge5.Frankfurt1.Level3.net [4.69.154.201]
 10   154 ms   149 ms   147 ms  ae-4-90.edge5.Frankfurt1.Level3.net [4.69.154.201]
 11   168 ms   150 ms   149 ms  212.162.4.6
 12   157 ms   153 ms   153 ms  cr-han2-be6-7.x-win.dfn.de [188.1.144.222]
 13   153 ms   153 ms   153 ms  kr-han68-0.x-win.dfn.de [188.1.232.206]
 14   158 ms   153 ms   153 ms  fwgw-2-ext.connect.uni-hannover.de [130.75.78.232]
 15     *        *        *     Request timed out.
 16     *        *        *     Request timed out.
 17     *        *        *     Request timed out.
 18   153 ms   155 ms   155 ms  einstein1.aei.uni-hannover.de [130.75.116.31]

And here is the (similar) tracert to a specific address mentioned in the message log as "trying"

  9   148 ms   147 ms   147 ms  ae-4-90.edge5.Frankfurt1.Level3.net [4.69.154.201]
 10   152 ms   149 ms   147 ms  ae-4-90.edge5.Frankfurt1.Level3.net [4.69.154.201]
 11   149 ms   149 ms   149 ms  212.162.4.6
 12   152 ms   153 ms   154 ms  cr-han2-be6-7.x-win.dfn.de [188.1.144.222]
 13     *      154 ms     *     kr-han68-0.x-win.dfn.de [188.1.232.206]
 14   155 ms   155 ms   168 ms  fwgw-2-ext.connect.uni-hannover.de [130.75.78.232]
 15     *        *        *     Request timed out.
 16     *        *        *     Request timed out.
 17     *        *        *     Request timed out.
 18   155 ms   154 ms   155 ms  einstein1.aei.uni-hannover.de [130.75.116.31]

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 197729422
RAC: 16213

The timeouts are normal since some nodes on the way don't respond to direct ICMP requests. What matters is that you can reach the target with a good ping time.

Since you are in the US you should test your connectivity to our US download mirrors too. They are:

einstein-dl.syr.edu
einstein-dl2.phys.uwm.edu
einstein-dl3.phys.uwm.edu
einstein.ligo.caltech.edu

There are also some more servers in Europe where data files are downloaded from:

einstein1.aei.uni-hannover.de
einstein2.aei.uni-hannover.de
einstein6.aei.uni-hannover.de

tracert shows you the round-trip time to each station on the way to the target by sending only a few packets (three by default) to each station. On Windows you should also have the pathping tool, which sends many more packets and reports a packet loss value for each station. A station with a high packet loss value would indicate something is wrong with that station.

Edit: I just checked from several different networks that there is no packet loss between Germany and the download servers in the US.
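
Since some nodes ignore ICMP, a ping result does not always tell you whether an actual download connection can be opened. As a rough cross-check, here is a minimal Python sketch that times a plain TCP connection to each server named above. The port (80, plain HTTP) and the helper names are assumptions for illustration, not something taken from the BOINC client or the project configuration.

```python
import socket
import time

# Hostnames as listed in this thread; port 80 (plain HTTP) is an assumption.
MIRRORS = [
    "einstein-dl.syr.edu",
    "einstein-dl2.phys.uwm.edu",
    "einstein-dl3.phys.uwm.edu",
    "einstein.ligo.caltech.edu",
    "einstein1.aei.uni-hannover.de",
    "einstein2.aei.uni-hannover.de",
    "einstein6.aei.uni-hannover.de",
]


def tcp_connect_ms(host, port=80, timeout=5.0):
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return None


def format_result(host, ms):
    """Render one result line for printing."""
    if ms is None:
        return f"{host:32s} FAILED"
    return f"{host:32s} {ms:7.1f} ms"


def main():
    """Try every listed server once and print one line per server."""
    for host in MIRRORS:
        print(format_result(host, tcp_connect_ms(host)))
```

Calling main() prints one line per server. A server that answers here but still stalls during downloads would point toward loss or filtering along the path rather than an unreachable host.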

archae86
Joined: 6 Dec 05
Posts: 3165
Credit: 7398861687
RAC: 1958699

Christian Beer wrote:

The timeouts are normal since some nodes on the way don't respond to direct ICMP requests. What matters is that you can reach the target with a good ping time.

Since you are in the US you should test your connectivity to our US download mirrors too. They are:

einstein-dl.syr.edu
einstein-dl2.phys.uwm.edu
einstein-dl3.phys.uwm.edu
einstein.ligo.caltech.edu

Pings (with the default 4 trials) to all of these worked fine just now, with response times around 50 ms.

Quote:
There are also some more servers in Europe where data files are downloaded from:

einstein1.aei.uni-hannover.de
einstein2.aei.uni-hannover.de
einstein6.aei.uni-hannover.de

Pings to all of these timed out just now, while a repeat of the tracert gave the same behavior as I posted.

I found old forum traffic from about November 20, 2015 regarding a problem we described as a Kaspersky interaction with the current Hannover configuration after a change.  That problem had significantly different symptoms than this one.  Nevertheless, I tried making the Kaspersky adjustment documented there to tell Kaspersky not to look at network traffic from the boinc client just now.  In this case, on two different machines, that adjustment, which was effective for the previous problem, had no effect.

While the details differ, possibly there is again a Kaspersky-specific interaction with the current configuration, which, if true, would make the low rate of user difficulty unsurprising.  I'll try more drastic Kaspersky-defeating measures in an hour or so if nothing else turns up.  I've already done a full power down and restart of all my hosts, cable modem, and router, with no helpful effect.

As I post, the problem continues on all my hosts.  Requesting a retry on the pending transfers gets a very few bytes of transfer, soon coasting to a stop of traffic, and later a timeout.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 197729422
RAC: 16213

That makes no sense. If the tracert is showing a connection then the ping should show a connection too. I just tried to reach the comcast router (route-server.newyork.ny.ibone.comcast.net) from the AEI network and I get a ping of 90ms without packet loss.

It's possible that this is the same (or a similar) Kaspersky issue but since we didn't change anything on our download servers or the redirects in a long time I doubt that.

There is another tool you can use to diagnose the problem. It combines tracert and ping in one tool. We used this during a network outage last month to pinpoint the faulty node. On Linux you need to install the mtr package; for Windows I found this: http://winmtr.net/ but can't test it since I don't have a Windows machine handy.

archae86
Joined: 6 Dec 05
Posts: 3165
Credit: 7398861687
RAC: 1958699

Christian Beer wrote:

It's possible that this is the same (or a similar) Kaspersky issue but since we didn't change anything on our download servers or the redirects in a long time I doubt that.

I tried the full Kaspersky disable remedy that successfully took it out of the way in two previous puzzling episodes (right-clicking the K system tray icon, selecting settings, unclicking the autorun option, then rebooting).  Neither the primary download rate problem nor the puzzling ping behavior changed.  I think perhaps we can consider Kaspersky not likely part of the picture on this one.

Quote:
There is another tool you can use to diagnose the problem. It combines tracert and ping in one tool. We used this during a network outage last month to pinpoint the faulty node. On Linux you need to install the mtr package; for Windows I found this: http://winmtr.net/ but can't test it since I don't have a Windows machine handy.

While it is mildly disconcerting that the source does not mention compatibility with any Windows version later than 7, I ran the 64-bit version.

Setting the host to einstein1.aei.uni-hannover.de, I see results similar to those I get from tracert.  This includes four no-response lines: one at the first hop beyond my house network, and three at the last stage in Germany.

|                              Host              -   %  | Sent | Recv | Best | Avrg | Wrst | Last |
|                                       TP-SHARE -    0 |  148 |  148 |    0 |    0 |    2 |    0 |
|                          No response from host -  100 |   29 |    0 |    0 |    0 |    0 |    0 |
|  te-0-3-0-12-sur02.sandia.nm.albuq.comcast.net -    0 |  148 |  148 |    7 |   10 |   26 |   17 |
|     be-3-ar02.albuquerque.nm.albuq.comcast.net -    0 |  148 |  148 |    8 |   11 |   27 |   12 |
|   be-100-ar01.albuquerque.nm.albuq.comcast.net -    0 |  148 |  148 |    8 |   10 |   21 |   10 |
|be-33654-cr01.1601milehigh.co.ibone.comcast.net -    0 |  148 |  148 |   16 |   19 |   33 |   25 |
|      be-11719-cr02.denver.co.ibone.comcast.net -    0 |  148 |  148 |   17 |   20 |   40 |   20 |
|                  ae14.edge3.Denver1.Level3.net -    0 |  148 |  148 |   16 |   20 |   62 |   26 |
|            ae-4-90.edge5.Frankfurt1.Level3.net -    1 |  144 |  143 |    0 |  149 |  163 |  147 |
|            ae-4-90.edge5.Frankfurt1.Level3.net -    0 |  148 |  148 |  147 |  150 |  185 |  149 |
|                                    212.162.4.6 -    1 |  144 |  143 |    0 |  150 |  161 |  150 |
|                     cr-han2-be6-7.x-win.dfn.de -    0 |  148 |  148 |  152 |  154 |  167 |  153 |
|                        kr-han68-0.x-win.dfn.de -    1 |  144 |  143 |    0 |  170 |  371 |  154 |
|             fwgw-2-ext.connect.uni-hannover.de -    2 |  140 |  138 |    0 |  155 |  167 |  154 |
|                          No response from host -  100 |   29 |    0 |    0 |    0 |    0 |    0 |
|                          No response from host -  100 |   29 |    0 |    0 |    0 |    0 |    0 |
|                          No response from host -  100 |   29 |    0 |    0 |    0 |    0 |    0 |
|                  einstein1.aei.uni-hannover.de -    7 |  117 |  109 |  152 |  155 |  166 |  157 |

archae86
Joined: 6 Dec 05
Posts: 3165
Credit: 7398861687
RAC: 1958699

Sorry for the long and sloppy winmtr result post.

Things I notice are that it logs a non-zero failure rate for several levels besides the four "no response" levels, but the highest loss rate in this sample is at the final einstein1.aei.uni-hannover.de level.  I have no idea whether a 7% loss rate is abnormal, nor whether it might somehow relate to my trouble.  It is at least somewhat repeatable: another five-minute run after the one I posted (so 300 trials at 1/second) again showed 7% loss at that level.
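
For anyone who wants to recompute the loss figures from a pasted WinMTR table instead of reading the % column by eye, here is a small Python sketch. The function names are made up for illustration, and the three sample rows are copied from the output posted in this thread. The parsing assumes the pipe-separated text layout shown there, and the % the tool prints may be rounded differently from the ratio computed here.

```python
def parse_winmtr_row(line):
    """Split one WinMTR text row into (host, sent, recv)."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # The first cell looks like "hostname -    7"; keep what is left of the last " -".
    host = cells[0].rsplit(" -", 1)[0].strip()
    sent, recv = int(cells[1]), int(cells[2])
    return host, sent, recv


def loss_percent(sent, recv):
    """Packet loss as a percentage of packets sent (0.0 if nothing was sent)."""
    return 100.0 * (sent - recv) / sent if sent else 0.0


# Sample rows copied from the WinMTR output posted in this thread.
rows = [
    "|      fwgw-2-ext.connect.uni-hannover.de -    2 |  140 |  138 |    0 |  155 |  167 |  154 |",
    "|                   No response from host -  100 |   29 |    0 |    0 |    0 |    0 |    0 |",
    "|           einstein1.aei.uni-hannover.de -    7 |  117 |  109 |  152 |  155 |  166 |  157 |",
]

for row in rows:
    host, sent, recv = parse_winmtr_row(row)
    print(f"{host:40s} {loss_percent(sent, recv):5.1f}% loss")
```

For the final hop this gives 8 lost out of 117 sent, about 6.8%, in line with the 7% shown in the table.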

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I ran that same test from northern Europe and got those "no response from host" lines too.

|------------------------------------------------------------------------------------------|
|                                      WinMTR statistics                                   |
|                       Host              -   %  | Sent | Recv | Best | Avrg | Wrst | Last |
|------------------------------------------------|------|------|------|------|------|------|
...
|                        de-hmb.nordu.net -    0 |  300 |  300 |   30 |   30 |   44 |   31 |
|     nordunet-bckp2.mx1.ham.de.geant.net -    0 |  300 |  300 |   33 |   33 |   56 |   33 |
|                    cr-tub1.x-win.dfn.de -    0 |  300 |  300 |   38 |   38 |   45 |   39 |
|              cr-han2-be7-7.x-win.dfn.de -    0 |  300 |  300 |   45 |   45 |   49 |   45 |
|                 kr-han68-0.x-win.dfn.de -    0 |  300 |  300 |   42 |   48 |  132 |   43 |
|      fwgw-2-ext.connect.uni-hannover.de -    0 |  300 |  300 |   42 |   42 |   47 |   42 |
|                   No response from host -  100 |   61 |    0 |    0 |    0 |    0 |    0 |
|                   No response from host -  100 |   61 |    0 |    0 |    0 |    0 |    0 |
|                   No response from host -  100 |   61 |    0 |    0 |    0 |    0 |    0 |
|           einstein1.aei.uni-hannover.de -    0 |  300 |  300 |   43 |   43 |   48 |   43 |
|________________________________________________|______|______|______|______|______|______|

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

archae86 wrote:

I have no idea whether a 7% loss rate is abnormal, nor whether it might somehow relate to my trouble.

Yes it is abnormal, and yes that will cause some of the issues.

If you are seeing differences between traceroute and ping, this is often caused by firewall or router ACL filters that block outbound or inbound traffic or are slow to inspect it; there could be many firewalls en route.

On Linux you can traceroute using ICMP, UDP and TCP, and you do see differences when you run a traceroute using the different protocols.
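
As a rough illustration of that comparison, the Python sketch below builds one traceroute command per probe type and runs them in turn. It assumes the common Linux traceroute, where UDP is the default, -I selects ICMP echo and -T selects TCP SYN (with -p for the port). Other builds may use different flags, and the ICMP and TCP modes usually need root. The function names and the choice of port 80 are only for illustration.

```python
import shutil
import subprocess

# Probe-type flags of the common Linux traceroute; other builds may differ.
# UDP is the default there, so it needs no flag. Port 80 is an assumption.
PROBE_FLAGS = {"udp": [], "icmp": ["-I"], "tcp": ["-T", "-p", "80"]}


def traceroute_command(host, protocol):
    """Build the argument list for one traceroute run."""
    return ["traceroute", *PROBE_FLAGS[protocol], host]


def run_all(host):
    """Run traceroute once per probe type and print each raw output."""
    if shutil.which("traceroute") is None:
        print("traceroute not found on this system")
        return
    for protocol in PROBE_FLAGS:
        cmd = traceroute_command(host, protocol)
        print("$", " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True)
        # Permission errors for -I / -T show up on stderr.
        print(result.stdout or result.stderr)
```

Calling run_all("einstein1.aei.uni-hannover.de") then lets you compare at which hop each probe type stops getting answers.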

I see no network issues with einstein1.aei.uni-hannover.de from home.

I think I would try disabling network traffic inspection on the host with the problem, and if all hosts have the problem, look closely at the next hop (router / firewall), then take it up with the ISP.

Good luck.

edit: to be clear, firewalls include software versions such as Windows Firewall, Kaspersky etc.

edit++: the no response from those three hosts is normal.
