ssh suddenly stops communicating with some hosts

Here’s a puzzle I’m having trouble figuring out. This afternoon, ssh from my workstation or laptop stopped working to any of my servers (at OVH). The servers are all running wheezy, the local machines jessie. This happens both on my DSL and when tethered to my mobile phone. None of the machines had applied any updates since the last time ssh worked. When looking at it with ssh -v, the connections were all hanging after:

debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr umac-64@openssh.com none
debug1: kex: client->server aes128-ctr umac-64@openssh.com none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

Now, I noticed that a server on my LAN — running wheezy — could successfully connect. It was a little different:

debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

And indeed, if I run ssh -o MACs=hmac-md5, it works fine.
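At this point in the story the MAC looked like the culprit (the real cause turns out differently below), so for reference, that workaround could be pinned per-host in ~/.ssh/config; the host pattern here is a placeholder:

```
# ~/.ssh/config -- force the MAC that happened to work.
# *.example.com is a placeholder for the affected servers.
Host *.example.com
    MACs hmac-md5
```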

Now, I tried rebooting machines at both ends of this. No change. I tried connecting from multiple networks. No change. And then, as I was writing this blog post, all of a sudden it started working normally again. Supremely weird! Any ideas what I can blame here?

20 thoughts on “ssh suddenly stops communicating with some hosts”

  1. Is the debug output from the system that was not working, and now is, the same as it was before, or has it changed?

    I usually blame these kinds of things on “sun spots”.


  2. What does tcpdump say is happening at L3? Did we stop making progress in communication, or just stop interpreting that communication?


    John Goerzen Reply:

    Got a lot of resends going on from the client. The MTU is 1500 everywhere so I can’t imagine that’s the problem…

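A sketch of the capture that shows those resends; the interface name and host are placeholders, not from the post:

```shell
# Build a capture command to watch an SSH session plus any ICMP errors.
# eth0 and target.example.com are placeholders for the real setup.
HOST=target.example.com
CMD="tcpdump -ni eth0 host $HOST and (tcp port 22 or icmp)"
echo "$CMD"
# In the output, repeated TCP segments with the same sequence number are
# the client's resends; an 'unreachable - need to frag' ICMP line is the
# path MTU discovery signal discussed further down the thread.
```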

  3. I wonder if the packet size is different, and exceeds the path MTU when UMAC is used? (tcpdump as Chuckie mentioned would help check this, though a successful login would also hang at some point if this was really happening)

    Or another wild guess: maybe the UMAC algo requires more entropy and blocks waiting for it?


  4. I’ve often seen this hang over tunnels when the MTU is set too high. See if lowering the MTU fixes it.


  5. I’ve seen lots of randomly broken connections recently (I use ssh to tunnel VNC). I can connect and the tunnel works, but it randomly breaks with an ssh “broken pipe” error.


  6. Tracked down the problem. One of my ISP’s new upstream links appears to have an MTU set at 1486.


    Paul Reply:

    And what fix did you find most appropriate?


    John Goerzen Reply:

    Well, I’ve notified them of the issue and pointed out which link of theirs is causing the problem. A workaround is setting a lower MTU on the interface, but ultimately they have to fix it.

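On a Debian box using ifupdown, that interface-level workaround might look like the following; the interface name, address method, and the 1486 value are assumptions for illustration:

```
# /etc/network/interfaces -- clamp the interface MTU at or below the
# broken link's limit. eth0 and 1486 are placeholders.
iface eth0 inet dhcp
    mtu 1486
```

For a one-off test, `ip link set dev eth0 mtu 1486` does the same thing until the next reboot or ifdown/ifup.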

    Marius Gedminas Reply:

    Are they blocking ICMP Fragmentation Needed packets? Path MTU discovery needs those to be able to deal with arbitrary MTUs on routers between you and the destination.

    John Goerzen Reply:

    That’s a funny one. I see ICMP Fragmentation Needed arrive — sometimes, and about 2 minutes late. It is completely unreliable. This explains the intermittent behavior: the OS had cached the MTU from that Fragmentation Needed packet for a spell.
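That caching is visible on Linux: `ip route get <destination>` shows any learned per-destination `mtu` while it is cached, and the kernel forgets it after `net.ipv4.route.mtu_expires` seconds. A small sketch, assuming the stock default of 600 seconds applies on the boxes in question:

```shell
# A learned path MTU is kept per destination; while cached, e.g.
#   ip route get 192.0.2.1      shows an "mtu 1486" attribute.
# The entry expires after net.ipv4.route.mtu_expires seconds.
DEFAULT_MTU_EXPIRES=600              # assumed Linux default, in seconds
echo $((DEFAULT_MTU_EXPIRES / 60))   # minutes of "it works" per ICMP: 10
```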

    Christoph Egger Reply:

    I’ve seen this on a *BSD firewall when the ICMP quota was too low for the network traffic (here, >200 users and the BSD with a lower-MTU upstream). It’s probably a bug, but the path MTU ICMP messages got the same rate limiting applied as the rest of the ICMP traffic.

    Emmanuel Kasper Reply:

    You said: “I’ve notified them of the issue and pointed out which link of theirs is causing the problem.”
    How did you find the problematic link?

    I am having exactly the same problem (ssh logins hanging), which I solved with the same workaround (lowering the MTU) on Debian 7 and Debian 8 machines.
    My ISP is UPC / Austria.
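One way to pin down a value like the 1486 above is to binary-search the largest payload that survives with DF set, using ping as the probe. A sketch with the probe injected so the search logic can be shown against a simulated path; the real ping invocation is in the comments, and all names are placeholders:

```shell
#!/bin/sh
# Binary-search the largest DF-set ICMP payload that gets through.
# "$PROBE <size>" must succeed when a payload of <size> bytes passes;
# against a real host you would use something like:
#   PROBE='ping -c1 -W2 -M do target.example.com -s'
mtu_search() {
    lo=0; hi=$1
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if $PROBE "$mid" >/dev/null 2>&1; then lo=$mid; else hi=$((mid - 1)); fi
    done
    echo "$lo"
}

# Simulated run: pretend the path drops DF payloads above 1458 bytes
# (1458 payload + 28 header bytes = the 1486-byte MTU from this thread).
fake_probe() { [ "$1" -le 1458 ]; }
PROBE=fake_probe
mtu_search 1472   # prints 1458; add 28 to get the path MTU, 1486
```

Running the search once per traceroute hop would narrow down which link is the tight one.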

  7. Saw the post earlier in my mobile feed reader and couldn’t comment.
    My first guess would’ve been MTU as well; we had that with OpenVPN (and maybe also SSH) on Macs at work a while ago, and it took a while to find the culprit…


    John Goerzen Reply:

    Ping and traceroute to the rescue. For instance, I used a command like this to try to isolate both which machines I couldn’t talk to, and what the relevant MTU was:

    $ ping -M do target.example.com -s 1470
    PING target.example.com 1470(1498) bytes of data.

    Then a traceroute:

    traceroute -F target.example.com 1498

    (Note that ping and traceroute calculate packet sizes differently; see their manpages.)
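The size bookkeeping in that parenthetical can be made concrete. Assuming IPv4 with no options (20-byte IP header) plus an 8-byte ICMP header, the two tools relate like this:

```shell
# ping -s takes the ICMP *payload* size; the on-wire size adds
# 8 bytes of ICMP header and 20 bytes of IPv4 header.
# traceroute -F takes the full packet length directly.
MTU=1498                      # the on-wire size being probed (from above)
PING_S=$((MTU - 20 - 8))      # payload for: ping -M do -s $PING_S
echo "$PING_S"                # 1470, matching "1470(1498)" in the output
TRACEROUTE_LEN=$MTU           # length for: traceroute -F host $TRACEROUTE_LEN
echo "$TRACEROUTE_LEN"        # 1498
```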


  8. Thanks, I also found tracepath along the way, but ping was simply the most versatile.
    I cannot find the problem on my ISP anymore, so it might have been fixed in the meantime, but I am going to double-check my router as it is BSD-based.

