ssh suddenly stops communicating with some hosts

Here’s a puzzle I’m having trouble figuring out. This afternoon, ssh from my workstation or laptop stopped working to any of my servers (at OVH). The servers are all running wheezy, the local machines jessie. This happens both over my DSL line and when tethered to my mobile phone. None of the machines had applied any updates since the last time ssh worked. With ssh -v, every connection hung after:

debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr umac-64@openssh.com none
debug1: kex: client->server aes128-ctr umac-64@openssh.com none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

Now, I noticed that a server on my LAN (running wheezy) could successfully connect. Its debug output was a little different:

debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

And indeed, if I run ssh -o MACs=hmac-md5, it works fine.

Now, I tried rebooting machines at multiple ends of this. No change. I tried connecting from multiple networks. No change. And then, as I was writing this blog post, all of a sudden it works normally again. Supremely weird! Any ideas what I can blame here?

22 thoughts on “ssh suddenly stops communicating with some hosts”

  1. Steve says:

    Is the debug output from the system that was not working and now is, the same as it was or has it now changed?

    I usually blame these kinds of things on “sun spots”.

  2. John Goerzen says:

    Aaaaannnd… now it stopped working again. Sigh.

    Yeah, the debug output was the same; it just progressed beyond that point.

  3. Joey Hess says:

    Maybe a DPI firewall that sometimes handles new ssh protocol features and other times falls over on them?

  4. Joey Hess says:

    Hmm, or a DPI firewall that is only sometimes in the route to the server..

  5. Chuckie says:

    what does tcpdump say is happening at L3, did we stop making progress in communications or just interpreting that communication?

    1. John Goerzen says:

      Got a lot of resends going on from the client. The MTU is 1500 everywhere so I can’t imagine that’s the problem…

  6. Steven C. says:

    I wonder if the packet size is different, and exceeds the path MTU when UMAC is used? (tcpdump as Chuckie mentioned would help check this, though a successful login would also hang at some point if this was really happening)

    Or another wild guess: maybe the UMAC algo requires more entropy and blocks waiting for it?

  7. brian m. carlson says:

    I’ve often seen this hang over tunnels when the MTU is set too high. See if lowering the MTU fixes it.

  8. spotter says:

    I’ve seen lots of random broken connections recently (use ssh to enable vnc tunneling). I can connect, tunnel works but it randomly breaks with an ssh “broken pipe” error.

  9. John Goerzen says:

    Tracked down the problem. One of my ISP’s new upstream links appears to have a MTU set at 1486.

    1. Paul says:

      And what fix did you find most appropriate?

      1. John Goerzen says:

        Well, I’ve notified them of the issue and pointed out which link of theirs is causing the problem. A workaround is setting a lower MTU on the interface, but ultimately they have to fix it.

        1. Are they blocking ICMP Fragmentation Needed packets? Path MTU discovery needs those to be able to deal with arbitrary MTUs on routers between you and the destination.

        2. John Goerzen says:

          That’s a funny one. I see ICMP Fragmentation Needed arrive, but only sometimes, and about 2 minutes late. It is completely unreliable. This explains why ssh worked intermittently: the OS had cached the MTU from that Fragmentation Needed packet for a while.

        3. Christoph Egger says:

          Seen this on a *BSD firewall when the ICMP quota was too low for the network traffic (here >200 users and the BSD with lower MTU upstream). It’s probably a bug, but the PathMTU ICMP stuff got the same rate limiting applied as the rest.

        4. You said: “I’ve notified them of the issue and pointed out which link of theirs is causing the problem.”
          How did you find the problematic link?

          I am having exactly the same problem (ssh login hanging), which I solved the same way (lowering the MTU), on Debian 7 and Debian 8 machines.
          My ISP is UPC / Austria.

  10. Florian says:

    Saw the post earlier in my mobile feedreader and couldn’t comment.
    My first guess would’ve been MTU as well, we had that with OpenVPN (and maybe also SSH) on Macs at work a while ago, took a while to find the culprit…

    1. John Goerzen says:

      Ping and traceroute to the rescue. For instance, I used a command like this to try to isolate both which machines I couldn’t talk to, and what the relevant MTU was:

      $ ping -M do target.example.com -s 1470
      PING target.example.com 1470(1498) bytes of data.

      Then a traceroute:

      traceroute -F target.example.com 1498

      (Note that ping and traceroute calculate packet sizes differently; see their manpages)
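
The size arithmetic and the trial-and-error search described in the comment above can be sketched in shell. This is only an illustration, not the commenter's actual procedure: the function names (ping_payload_for_packet, probe, find_path_mtu), the PROBE override variable, and the 576 to 1500 search range are all made up for this example, and it assumes IPv4 without IP options (20-byte IP header plus 8-byte ICMP header) and the iputils ping that provides -M do.

```shell
# Header overheads, assuming IPv4 with no IP options.
IP_HEADER=20
ICMP_HEADER=8

# Payload size to give `ping -s` so that the whole IPv4 packet is $1 bytes.
# Example: a 1498-byte packet needs a 1470-byte payload, matching the
# "1470(1498) bytes of data" line printed by ping.
ping_payload_for_packet() {
    echo $(( $1 - IP_HEADER - ICMP_HEADER ))
}

# Hypothetical probe: succeeds if one unfragmentable packet of $2 bytes
# gets an echo reply from host $1 within 2 seconds.
probe() {
    ping -M do -c 1 -W 2 -s "$(ping_payload_for_packet "$2")" "$1" >/dev/null 2>&1
}

# Binary search for the largest packet size in [lo, hi] that the probe
# accepts. Assumes the lower bound itself gets through.
# Usage: find_path_mtu host [lo] [hi]
find_path_mtu() {
    host=$1 lo=${2:-576} hi=${3:-1500}
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if "${PROBE:-probe}" "$host" "$mid"; then
            lo=$mid            # mid fits: search above it
        else
            hi=$(( mid - 1 ))  # mid is too big: search below it
        fi
    done
    echo "$lo"
}
```

Calling find_path_mtu target.example.com would then print the largest packet size that got a reply. Because each size is probed only once, ordinary packet loss can make the result too low, so it is worth repeating the run or confirming the number with a manual ping.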

  11. Thanks, I also found tracepath along the way, but ping was simply the most versatile.
    I cannot find the problem on my ISP anymore, so it might have been fixed in between, but I am going to double-check my router as it is BSD-based.

  12. Neticis says:

    One problem I found is related to transparent proxies.
    After running the command
    ssh -vvv user@host
    it returns:
    debug2: channel 0: open confirm rwindow 0 rmax 32768
    Try connecting with this command instead:
    ssh -o "ProxyCommand nc %h %p" user@host
    Link: http://askubuntu.com/questions/344863/ssh-new-connection-begins-to-hang-not-reject-or-terminate-after-a-day-or-so-on

  13. neb0t says:

    sudo ip link set $iface mtu 1400

    1. bttcld says:

      This is fine for me (at the moment). Thanks
