Here’s a puzzle I’m having trouble figuring out. This afternoon, ssh from my workstation or laptop stopped working to any of my servers (at OVH). The servers are all running wheezy, the local machines jessie. This happens both on my DSL and when tethered to my mobile phone. None of the machines had applied any updates since the last time ssh worked. Looking at it with ssh -v, they were all hanging after:
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr umac-64@openssh.com none
debug1: kex: client->server aes128-ctr umac-64@openssh.com none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
Now, I noticed that a server on my LAN — running wheezy — could successfully connect. It was a little different:
debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
And indeed, if I run ssh -o MACs=hmac-md5, it works fine.
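For reference, the workaround spelled out (hostname is just a placeholder), plus a way to list which MACs a reasonably recent OpenSSH client supports:

$ ssh -o MACs=hmac-md5 user@server.example.com
$ ssh -Q mac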
Now, I tried rebooting machines at multiple ends of this. No change. I tried connecting from multiple networks. No change. And then, as I was writing this blog post, all of a sudden it works normally again. Supremely weird! Any ideas what I can blame here?
Is the debug output from the system that was not working, and now is, the same as it was before, or has it changed?
I usually blame these kinds of things on “sun spots”.
Aaaaannnd… now it stopped working again. Sigh.
Yeah, the debug output was the same; it just progressed beyond that point.
Maybe a DPI firewall that sometimes handles new ssh protocol features and other times falls over on them?
Hmm, or a DPI firewall that is only sometimes in the route to the server.
What does tcpdump say is happening at L3? Did we stop making progress in the communication, or just in interpreting it?
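Something along these lines would show both the stalled TCP stream and any ICMP coming back (interface name is just an example):

$ sudo tcpdump -ni eth0 'tcp port 22 or icmp'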
Got a lot of resends going on from the client. The MTU is 1500 everywhere so I can’t imagine that’s the problem…
I wonder if the packet size is different, and exceeds the path MTU when UMAC is used? (tcpdump as Chuckie mentioned would help check this, though a successful login would also hang at some point if this was really happening)
Or another wild guess: maybe the UMAC algo requires more entropy and blocks waiting for it?
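A quick way to sanity-check that guess on Linux would be to watch the kernel’s entropy pool while the connection hangs:

$ cat /proc/sys/kernel/random/entropy_avail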
I’ve often seen this hang over tunnels when the MTU is set too high. See if lowering the MTU fixes it.
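On Linux, checking and temporarily lowering the MTU looks something like this (interface name is just an example; the change doesn’t survive a reboot):

$ ip link show eth0 | grep -o 'mtu [0-9]*'
$ sudo ip link set eth0 mtu 1400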
I’ve seen lots of random broken connections recently (I use ssh for VNC tunneling). I can connect and the tunnel works, but it randomly breaks with an ssh “broken pipe” error.
Tracked down the problem. One of my ISP’s new upstream links appears to have an MTU set at 1486.
And what fix did you find most appropriate?
Well, I’ve notified them of the issue and pointed out which link of theirs is causing the problem. A workaround is setting a lower MTU on the interface, but ultimately they have to fix it.
Are they blocking ICMP Fragmentation Needed packets? Path MTU discovery needs those to be able to deal with arbitrary MTUs on routers between you and the destination.
That’s a funny one. I see ICMP Fragmentation Needed arrive — sometimes, and about 2 minutes late. It is completely non-reliable. This explains the intermittent working — the OS had cached the MTU from that Fragmentation Needed packet for a spell.
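On Linux you can check whether such a learned path MTU is currently cached for a given destination; if it is, an mtu value should show up in the output (destination address is just a placeholder):

$ ip route get 203.0.113.10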
I’ve seen this on a *BSD firewall when the ICMP quota was too low for the network traffic (here, >200 users and the BSD box with a lower MTU upstream). It’s probably a bug, but the path-MTU ICMP traffic got the same rate limiting applied as everything else.
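On FreeBSD, for instance, that quota is the net.inet.icmp.icmplim sysctl; other BSDs use different knobs, but inspecting and raising it is a one-liner:

$ sysctl net.inet.icmp.icmplim
$ sudo sysctl net.inet.icmp.icmplim=500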
You said: “I’ve notified them of the issue and pointed out which link of theirs is causing the problem.”
How did you find the problematic link?
I am having exactly the same problem (ssh login hanging), which I solved with the same solution (lowering the MTU) on Debian 7 and Debian 8 machines.
My ISP is UPC / Austria.
Saw the post earlier in my mobile feedreader and couldn’t comment.
My first guess would’ve been MTU as well; we had that with OpenVPN (and maybe also SSH) on Macs at work a while ago, and it took a while to find the culprit…
Ping and traceroute to the rescue. For instance, I used a command like this to try to isolate both which machines I couldn’t talk to, and what the relevant MTU was:
$ ping -M do target.example.com -s 1470
PING target.example.com 1470(1498) bytes of data.
Then a traceroute:
traceroute -F target.example.com 1498
(Note that ping and traceroute calculate packet sizes differently; see their manpages)
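To make the arithmetic concrete for the 1486-byte link: ping adds 28 bytes of IP + ICMP headers on top of the payload you give it with -s, so something like this marks the boundary:

$ ping -M do -s 1458 target.example.com   # 1458 + 28 = 1486, should get through
$ ping -M do -s 1459 target.example.com   # 1459 + 28 = 1487, too big for that link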
Thanks, I also found tracepath along the way, but ping was simply the most versatile.
I can’t find the problem on my ISP anymore, so it might have been fixed in the meantime, but I am going to double-check my router as it is BSD-based.
One problem I found is related to transparent proxies.
After running the command:
ssh -vvv user@host
it returns:
debug2: channel 0: open confirm rwindow 0 rmax 32768
Try connecting using the command:
ssh -o "ProxyCommand nc %h %p" user@host
Link: http://askubuntu.com/questions/344863/ssh-new-connection-begins-to-hang-not-reject-or-terminate-after-a-day-or-so-on
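If that workaround helps, the same thing can live in ~/.ssh/config (using the same placeholder host as above):

Host host
    ProxyCommand nc %h %p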
sudo ip link set $iface mtu 1400
This is fine for me (at the moment). Thanks
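For what it’s worth, the ip link change is lost on reboot; on Debian with ifupdown it can be made persistent by adding an mtu line to the interface stanza in /etc/network/interfaces, along these lines (interface name and addressing are just examples):

iface eth0 inet static
    address 192.0.2.10
    netmask 255.255.255.0
    mtu 1400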