Category Archives: Software

Easily Accessing All Your Stuff with a Zero-Trust Mesh VPN

Probably everyone is familiar with a regular VPN. The traditional use case is to connect to a corporate or home network from a remote location, and access services as if you were there.

But these days, the notion of “corporate network” and “home network” are less based around physical location. For instance, a company may have no particular office at all, may have a number of offices plus a number of people working remotely, and so forth. A home network might have, say, a PVR and file server, while highly portable devices such as laptops, tablets, and phones may want to talk to each other regardless of location. For instance, a family member might be traveling with a laptop, another at a coffee shop, and those two devices might want to communicate, in addition to talking to the devices at home.

And, in both scenarios, there might be questions about giving limited access to friends. Perhaps you’d like to give a friend access to part of your file server, or as a company, you might have contractors working on a limited project.

Pretty soon you wind up with a mess of VPNs, forwarded ports, and tricks to make it all work. With the increasing prevalence of CGNAT, a lot of times you can’t even open a port to the public Internet. Each application or device probably has its own gateway just to make it visible on the Internet, some of which you pay for.

Then you add on the question of: should you really trust your LAN anyhow? With possibilities of guests using it, rogue access points, etc., the answer is probably “no”.

We can move the responsibility for dealing with NAT, fluctuating IPs, encryption, and authentication, from the application layer further down into the network stack. We then arrive at a much simpler picture for all.

So this page is fundamentally about making the network work, simply and effectively.

How do we make the Internet work in these scenarios?

We’re going to combine three concepts:

  1. A VPN, providing fully encrypted and authenticated communication and stable IPs
  2. Mesh Networking, in which devices automatically discover optimal paths to reach each other
  3. Zero-trust networking, in which we do not need to trust anything about the underlying LAN, because all our traffic uses the secure systems in points 1 and 2.

By combining these concepts, we arrive at some nice results:

  • You can ssh hostname, where hostname is one of your machines (server, laptop, whatever), and as long as hostname is up, you can reach it, wherever it is, wherever you are.
    • Combined with mosh, these sessions will be durable even across moving to other host networks.
    • You could just as well use telnet, because the underlying network should be secure.
  • You don’t have to mess with encryption keys, certs, etc., for every internal-only service. Since IPs are now trustworthy, that’s all you need. hosts.allow could make a comeback!
  • You have a way of transiting out of extremely restrictive networks. Every tool discussed here has a way of falling back on routing things via a broker (relay) on TCP port 443 if all else fails.

There might sometimes be tradeoffs. For instance:

  • On LANs faster than 1Gbps, performance may degrade due to encryption and encapsulation overhead. However, these tools should let hosts discover the locality of each other and not send traffic over the Internet if the devices are local.
  • With some of these tools, hosts local to each other (on the same LAN) may be unable to find each other if they can’t reach the control plane over the Internet (Internet is down or provider is down)

Some other features that some of the tools provide include:

  • Easy sharing of limited access with friends/guests
  • Taking care of everything you need, including SSL certs, for exposing a certain on-net service to the public Internet
  • Optional routing of your outbound Internet traffic via an exit node on your network. Useful, for instance, if your local network is blocking tons of stuff.

Let’s dive in.

Types of Mesh VPNs

I’ll go over several types of meshes in this article:

  1. Fully decentralized with automatic hop routing

    This model has no special central control plane. Nodes discover each other in various ways, and establish routes to each other. These routes can be direct connections over the Internet, or via other nodes. This approach offers the greatest resilience. Examples I’ll cover include Yggdrasil and tinc.

  2. Automatic peer-to-peer with centralized control

    In this model, nodes, by default, communicate by establishing direct links between them. A regular node never carries traffic on behalf of other nodes. Special-purpose relays are used to handle cases in which NAT traversal is impossible. This approach tends to offer simple setup. Examples I’ll cover include Tailscale, Zerotier, Nebula, and Netmaker.

  3. Roll your own and hybrid approaches

    This is a “grab bag” of other ideas; for instance, running Yggdrasil over Tailscale.

Terminology

For the sake of consistency, I’m going to use common language to discuss things that have different terms in different ecosystems:

  • Every tool discussed here has a way of dealing with NAT traversal. It may assist with establishing direct connections (eg, STUN), and if that fails, it may simply relay traffic between nodes. I’ll call such a relay a “broker”. This may or may not be the same system that is a control plane for a tool.
  • All of these systems operate over lower layers that are unencrypted. Those lower layers may be a LAN (wired or wireless, which may or may not have Internet access), or the public Internet (IPv4 and/or IPv6). I’m going to call the unencrypted lower layer, whatever it is, the “clearnet”.

Evaluation Criteria

Here are the things I want to see from a solution:

  • Secure, with all communications end-to-end encrypted and authenticated, and prevention of traffic from untrusted devices.
  • Flexible, adapting to changes in network topology quickly and automatically.
  • Resilient, without single points of failure, and with devices local to each other able to communicate even if cut off from the Internet or other parts of the network.
  • Private, minimizing leakage of information or metadata about me and my systems
  • Able to traverse CGNAT without having to use a broker whenever possible
  • A lesser requirement for me, but still a nice to have, is the ability to include others via something like Internet publishing or inviting guests.
  • Fully or nearly fully Open Source
  • Free or very cheap for personal use
  • Wide operating system support, including headless Linux on x86_64 and ARM.

Fully Decentralized VPNs with Automatic Hop Routing

Two systems fit this description: Yggdrasil and Tinc. Let’s dive in.

Yggdrasil

I’ll start with Yggdrasil because I’ve written so much about it already. It featured in prior posts such as:

Yggdrasil can be a private mesh VPN, or something more

Yggdrasil can be a private mesh VPN, just like the other tools covered here. It’s unique, however, in that a key goal of the project is to also make it useful as a planet-scale global mesh network. As such, Yggdrasil is a testbed of new ideas in distributed routing designed to scale up to massive sizes and all sorts of connection conditions. As of 2023-04-10, the main global Yggdrasil mesh has over 5000 nodes in it. You can choose whether or not to participate.

Every node in a Yggdrasil mesh has a public/private keypair. Each node then has an IPv6 address (in a private address space) derived from its public key. Using these IPv6 addresses, you can communicate right away.

Yggdrasil differs from most of the other tools here in that it does not necessarily seek to establish a direct link on the clearnet between, say, host A and host G for them to communicate. It will prefer such a direct link if it exists, but it is perfectly happy if it doesn’t.

The reason is that every Yggdrasil node is also a router in the Yggdrasil mesh. Let’s sit with that concept for a moment. Consider:

  • If you have a bunch of machines on your LAN, but only one of them can peer over the clearnet, that’s fine; all the other machines will discover this route to the world and use it when necessary.
  • All you need to run a broker is just a regular node with a public IP address. If you are participating in the global mesh, you can use one (or more) of the free public peers for this purpose.
  • It is not necessary for every node to know about the clearnet IP address of every other node (improving privacy). In fact, it’s not even necessary for every node to know about the existence of all the other nodes, so long as it can find a route to a given node when it’s asked to.
  • Yggdrasil can find one or more routes between nodes, and it can use this knowledge of multiple routes to aggressively optimize for varying network conditions, including combinations of, say, downloads and low-latency ssh sessions.

Behind the scenes, Yggdrasil calculates optimal routes between nodes as necessary, using a mesh-wide DHT for initial contact and then deriving more optimal paths. (You can also read more details about the routing algorithm.)

One final way that Yggdrasil is different from most of the other tools is that there is no separate control server. No node is “special”, in charge, the sole keeper of metadata, or anything like that. The entire system is completely distributed and auto-assembling.

Meeting neighbors

There are two ways that Yggdrasil knows about peers:

  • By broadcast discovery on the local LAN
  • By listening on a specific port (or being told to connect to a specific host/port)

Sometimes this might lead to multiple ways to connect to a node; Yggdrasil prefers the connection auto-discovered by broadcast first, then the lowest-latency of the defined path. In other words, when your laptops are in the same room as each other on your local LAN, your packets will flow directly between them without traversing the Internet.

Unique uses

Yggdrasil is uniquely suited to network-challenged situations. As an example, in a post-disaster situation, Internet access may be unavailable or flaky, yet there may be many local devices – perhaps ones that had never known of each other before – that could share information. Yggdrasil meets this situation perfectly. The combination of broadcast auto-detection, distributed routing, and so forth, basically means that if there is any physical path between two nodes, Yggdrasil will find and enable it.

Ad-hoc wifi is rarely used because it is a real pain. Yggdrasil actually makes it useful! Its broadcast discovery doesn’t require any IP address provisioned on the interface at all (it just uses the IPv6 link-local address), so you don’t need to figure out a DHCP server or some such. And, Yggdrasil will tend to perform routing along the contours of the RF path. So you could have a laptop in the middle of a long distance relaying communications from people farther out, because it could see both. Or even a chain of such things.

Yggdrasil: Security and Privacy

Yggdrasil’s mesh is aggressively greedy. It will peer with any node it can find (unless told otherwise) and will find a route to anywhere it can. There are two main ways to make sure you keep unauthorized traffic out: by restricting who can talk to your mesh, and by firewalling the Yggdrasil interface. Both can be used, and they can be used simultaneously.

I’ll discuss firewalling more at the end of this article. Basically, you’ll almost certainly want to do this if you participate in the public mesh, because doing so is akin to having a globally-routable public IP address direct to your device.

If you want to restrict who can talk to your mesh, you just disable the broadcast feature on all your nodes (empty MulticastInterfaces section in the config), and avoid telling any of your nodes to connect to a public peer. You can set a list of authorized public keys that can connect to your nodes’ listening interfaces, which you’ll probably want to do. You will probably want to either open up some inbound ports (if you can) or set up a node with a known clearnet IP on a place like a $5/mo VPS to help with NAT traversal (again, setting AllowedPublicKeys as appropriate). Yggdrasil doesn’t allow filtering multicast clients by public key, only by network interface, so that’s why we disable broadcast discovery. You can easily enough teach Yggdrasil about static internal LAN IPs of your nodes and have things work that way. (Or, set up an internal “gateway” node or two, that the clients just connect to when they’re local). But fundamentally, you need to put a bit more thought into this with Yggdrasil than with the other tools here, which are closed-only.

Compared to some of the other tools here, Yggdrasil is better about information leakage; nodes only know details, such as clearnet IPs, of directly-connected peers. You can obtain the list of directly-connected peers of any known node in the mesh – but that list is the public keys of the directly-connected peers, not the clearnet IPs.

Some of the other tools contain a limited integrated firewall of sorts (with limited ACLs and such). Yggdrasil does not, but is fully compatible with on-host firewalls. I recommend these anyway even with many other tools.

Yggdrasil: Connectivity and NAT traversal

Compared to the other tools, Yggdrasil is an interesting mix. It provides a fully functional mesh and facilitates connectivity in situations in which no other tool can. Yet its NAT traversal, while it exists and does work, results in using a broker under some of the more challenging CGNAT situations more often than some of the other tools, which can impede performance.

Yggdrasil’s underlying protocol is TCP-based. Before you run away screaming that it must be slow and unreliable like OpenVPN over TCP – it’s not, and it is even surprisingly good around bufferbloat. I’ve found its performance to be on par with the other tools here, and it works as well as I’d expect even on flaky 4G links.

Overall, the NAT traversal story is mixed. On the one hand, you can run a node that listens on port 443 – and Yggdrasil can even make it speak TLS (even though that’s unnecessary from a security standpoint), so you can likely get out of most restrictive firewalls you will ever encounter. If you join the public mesh, know that plenty of public peers do listen on port 443 (and other well-known ports like 53, plus random high-numbered ones).

If you connect your system to multiple public peers, there is a chance – though a very small one – that some public transit traffic might be routed via it. In practice, public peers hopefully are already peered with each other, preventing this from happening (you can verify this with yggdrasilctl debug_remotegetpeers key=ABC...). I have never experienced a problem with this. Also, since latency is a factor in routing for Yggdrasil, it is highly unlikely that random connections we use are going to be competitive with datacenter peers.

Yggdrasil: Sharing with friends

If you’re open to participating in the public mesh, this is one of the easiest things of all. Have your friend install Yggdrasil, point them to a public peer, give them your Yggdrasil IP, and that’s it. (Well, presumably you also open up your firewall – you did follow my advice to set one up, right?)

If your friend is visiting at your location, they can just hop on your wifi, install Yggdrasil, and it will automatically discover a route to you. Yggdrasil even has a zero-config mode for ephemeral nodes such as certain Docker containers.

Yggdrasil doesn’t directly support publishing to the clearnet, but it is certainly possible to proxy (or even NAT) to/from the clearnet, and people do.

Yggdrasil: DNS

There is no particular extra DNS in Yggdrasil. You can, of course, run a DNS server within Yggdrasil, just as you can anywhere else. Personally I just add relevant hosts to /etc/hosts and leave it at that, but it’s up to you.

Yggdrasil: Source code, pricing, and portability

Yggdrasil is fully open source (LGPLv3 plus additional permissions in an exception) and highly portable. It is written in Go, and has prebuilt binaries for all major platforms (including a Debian package which I made).

There is no charge for anything with Yggdrasil. Listed public peers are free and run by volunteers. You can run your own peers if you like; they can be public and unlisted, public and listed (just submit a PR to get it listed), or private (accepting connections only from certain nodes’ keys). A “peer” in this case is just a node with a known clearnet IP address.

Yggdrasil encourages use in other projects. For instance, NNCP integrates a Yggdrasil node for easy communication with other NNCP nodes.

Yggdrasil conclusions

Yggdrasil is tops in reliability (having no single point of failure) and flexibility. It will maintain opportunistic connections between peers even if the Internet is down. The unique added feature of being able to be part of a global mesh is a nice one. The tradeoffs include being more prone to need to use a broker in restrictive CGNAT environments. Some other tools have clients that override the OS DNS resolver to also provide resolution of hostnames of member nodes; Yggdrasil doesn’t, though you can certainly run your own DNS infrastructure over Yggdrasil (or, for that matter, let public DNS servers provide Yggdrasil answers if you wish).

There is also a need to pay more attention to firewalling or maintaining separation from the public mesh. However, as I explain below, many other options have potential impacts if the control plane, or your account for it, are compromised, meaning you ought to firewall those, too. Still, it may be a more immediate concern with Yggdrasil.

Although Yggdrasil is listed as experimental, I have been using it for over a year and have found it to be rock-solid. They did change how mesh IPs were calculated when moving from 0.3 to 0.4, causing a global renumbering, so just be aware that this is a possibility while it is experimental.

tinc

tinc is the oldest tool on this list; version 1.0 came out in 2003! You can think of tinc as something akin to “an older Yggdrasil without the public option.”

I will be discussing tinc 1.0.36, the latest stable version, which came out in 2019. The development branch, 1.1, has been going since 2011 and had its latest release in 2021. The last commit to the Github repo was in June 2022.

Tinc is the only tool here to support both tun and tap style interfaces. I go into the difference more in the Zerotier review below. Tinc actually provides a better tap implementation than Zerotier, with various sane options for broadcasts, but I still think the call for an Ethernet, as opposed to IP, VPN is small.

To configure tinc, you generate a per-host configuration and then distribute it to every tinc node. It contains a host’s public key. Therefore, adding a host to the mesh means distributing its key everywhere; de-authorizing it means removing its key everywhere. This makes it rather unwieldy.

tinc can do LAN broadcast discovery and mesh routing, but generally speaking you must manually teach it where to connect initially. Somewhat confusingly, the examples all mention listing a public address for a node. This doesn’t make sense for a laptop, and I suspect you’d just omit it. I think that address is used for something akin to a Yggdrasil peer with a clearnet IP.

Unlike all of the other tools described here, tinc has no tool to inspect the running state of the mesh.

Some of the properties of tinc made it clear I was unlikely to adopt it, so this review wasn’t as thorough as that of Yggdrasil.

tinc: Security and Privacy

As mentioned above, every host in the tinc mesh is authenticated based on its public key. However, to be more precise, this key is validated only at the point it connects to its next hop peer. (To be sure, this is also the same as how the list of allowed pubkeys works in Yggdrasil.) Since IPs in tinc are not derived from their key, and any host can assign itself whatever mesh IP it likes, this implies that a compromised host could impersonate another.

It is unclear whether packets are end-to-end encrypted when using a tinc node as a router. The fact that they can be routed at the kernel level by the tun interface implies that they may not be.

tinc: Connectivity and NAT traversal

I was unable to find much information about NAT traversal in tinc, other than that it does support it. tinc can run over UDP or TCP and auto-detects which to use, preferring UDP.

tinc: Sharing with friends

tinc has no special support for this, and the difficulty of configuration makes it unlikely you’d do this with tinc.

tinc: Source code, pricing, and portability

tinc is fully open source (GPLv2). It is written in C and generally portable. It supports some very old operating systems. Mobile support is iffy.

tinc does not seem to be very actively maintained.

tinc conclusions

I haven’t mentioned performance in my other reviews (see the section at the end of this post). But, it is so poor as to only run about 300Mbps on my 2.5Gbps network. That’s 1/3 the speed of Yggdrasil or Tailscale. Combine that with the unwieldiness of adding hosts and some uncertainties in security, and I’m not going to be using tinc.

Automatic Peer-to-Peer Mesh VPNs with centralized control

These tend to be the options that are frequently discussed. Let’s talk about the options.

Tailscale

Tailscale is a popular choice in this type of VPN. To use Tailscale, you first sign up on tailscale.com. Then, you install the tailscale client on each machine. On first run, it prints a URL for you to click on to authorize the client to your mesh (“tailnet”). Tailscale assigns a mesh IP to each system. The Tailscale client lets the Tailscale control plane gather IP information about each node, including all detectable public and private clearnet IPs.

When you attempt to contact a node via Tailscale, the client will fetch the known contact information from the control plane and attempt to establish a link. If it can contact over the local LAN, it will (it doesn’t have broadcast autodetection like Yggdrasil; the information must come from the control plane). Otherwise, it will try various NAT traversal options. If all else fails, it will use a broker to relay traffic; Tailscale calls a broker a DERP relay server. Unlike Yggdrasil, a Tailscale node never relays traffic for another; all connections are either direct P2P or via a broker.

Tailscale, like several others, is based around Wireguard; though wireguard-go rather than the in-kernel Wireguard.

Tailscale has a number of somewhat unique features in this space:

  • Funnel, which lets you expose ports on your system to the public Internet via the VPN.
  • Exit nodes, which automate the process of routing your public Internet traffic over some other node in the network. This is possible with every tool mentioned here, but Tailscale makes switching it on or off a couple of quick commands away.
  • Node sharing, which lets you share a subset of your network with guests
  • A fantastic set of documentation, easily the best of the bunch.

Funnel, in particular, is interesting. With a couple of “tailscale serve”-style commands, you can expose a directory tree (or a development webserver) to the world. Tailscale gives you a public hostname, obtains a cert for it, and proxies inbound traffic to you. This is subject to some unspecified bandwidth limits, and you can only choose from three public ports, so it’s not really a production solution – but as a quick and easy way to demonstrate something cool to a friend, it’s a neat feature.

Tailscale: Security and Privacy

With Tailscale, as with the other tools in this category, one of the main threats to consider is the control plane. What are the consequences of a compromise of Tailscale’s control plane, or of the credentials you use to access it?

Let’s begin with the credentials used to access it. Tailscale operates no identity system itself, instead relying on third parties. For individuals, this means Google, Github, or Microsoft accounts; Okta and other SAML and similar identity providers are also supported, but this runs into complexity and expense that most individuals aren’t wanting to take on. Unfortunately, all three of those types of accounts often have saved auth tokens in a browser. Personally I would rather have a separate, very secure, login.

If a person does compromise your account or the Tailscale servers themselves, they can’t directly eavesdrop on your traffic because it is end-to-end encrypted. However, assuming an attacker obtains access to your account, they could:

  • Tamper with your Tailscale ACLs, permitting new actions
  • Add new nodes to the network
  • Forcibly remove nodes from the network
  • Enable or disable optional features

Of note is that they cannot just commandeer an existing IP. I would say the riskiest possibility here is that could add new nodes to the mesh. Because they could also tamper with your ACLs, they could then proceed to attempt to access all your internal services. They could even turn on service collection and have Tailscale tell them what and where all the services are.

Therefore, as with other tools, I recommend a local firewall on each machine with Tailscale. More on that below.

Tailscale has a new alpha feature called tailnet lock which helps with this problem. It requires existing nodes in the mesh to sign a request for a new node to join. Although this doesn’t address ACL tampering and some of the other things, it does represent a significant help with the most significant concern. However, tailnet lock is in alpha, only available on the Enterprise plan, and has a waitlist, so I have been unable to test it.

Any Tailscale node can request the IP addresses belonging to any other Tailscale node. The Tailscale control plane captures, and exposes to you, this information about every node in your network: the OS hostname, IP addresses and port numbers, operating system, creation date, last seen timestamp, and NAT traversal parameters. You can optionally enable service data capture as well, which sends data about open ports on each node to the control plane.

Tailscale likes to highlight their key expiry and rotation feature. By default, all keys expire after 180 days, and traffic to and from the expired node will be interrupted until they are renewed (basically, you re-login with your provider and do a renew operation). Unfortunately, the only mention I can see of warning of impeding expiration is in the Windows client, and even there you need to edit a registry key to get the warning more than the default 24 hours in advance. In short, it seems likely to cut off communications when it’s most important. You can disable key expiry on a per-node basis in the admin console web interface, and I mostly do, due to not wanting to lose connectivity at an inopportune time.

Tailscale: Connectivity and NAT traversal

When thinking about reliability, the primary consideration here is being able to reach the Tailscale control plane. While it is possible in limited circumstances to reach nodes without the Tailscale control plane, it is “a fairly brittle setup” and notably will not survive a client restart. So if you use Tailscale to reach other nodes on your LAN, that won’t work unless your Internet is up and the control plane is reachable.

Assuming your Internet is up and Tailscale’s infrastructure is up, there is little to be concerned with. Your own comfort level with cloud providers and your Internet should guide you here.

Tailscale wrote a fantastic article about NAT traversal and they, predictably, do very well with it. Tailscale prefers UDP but falls back to TCP if needed. Broker (DERP) servers step in as a last resort, and Tailscale clients automatically select the best ones. I’m not aware of anything that is more successful with NAT traversal than Tailscale. This maximizes the situations in which a direct P2P connection can be used without a broker.

I have found Tailscale to be a bit slow to notice changes in network topography compared to Yggdrasil, and sometimes needs a kick in the form of restarting the client process to re-establish communications after a network change. However, it’s possible (maybe even probable) that if I’d waited a bit longer, it would have sorted this all out.

Tailscale: Sharing with friends

I touched on the funnel feature earlier. The sharing feature lets you give an invite to an outsider. By default, a person accepting a share can make only outgoing connections to the network they’re invited to, and cannot receive incoming connections from that network – this makes sense. When sharing an exit node, you get a checkbox that lets you share access to the exit node as well. Of course, the person accepting the share needs to install the Tailnet client. The combination of funnel and sharing make Tailscale the best for ad-hoc sharing.

Tailscale: DNS

Tailscale’s DNS is called MagicDNS. It runs as a layer atop your standard DNS – taking over /etc/resolv.conf on Linux – and provides resolution of mesh hostnames and some other features. This is a concept that is pretty slick.

It also is a bit flaky on Linux; dueling programs want to write to /etc/resolv.conf. I can’t really say this is entirely Tailscale’s fault; they document the problem and some workarounds.

I would love to be able to add custom records to this service; for instance, to override the public IP for a service to use the in-mesh IP. Unfortunately, that’s not yet possible. However, MagicDNS can query existing nameservers for certain domains in a split DNS setup.

Tailscale: Source code, pricing, and portability

Tailscale is almost fully open source and the client is highly portable. The client is open source (BSD 3-clause) on open source platforms, and closed source on closed source platforms. The DERP servers are open source. The coordination server is closed source, although there is an open source coordination server called Headscale (also BSD 3-clause) made available with Tailscale’s blessing and informal support. It supports most, but not all, features in the Tailscale coordination server.

Tailscale’s pricing (which does not apply when using Headscale) provides a free plan for 1 user with up to 20 devices. A Personal Pro plan expands that to 100 devices for $48 per year - not a bad deal at $4/mo. A “Community on Github” plan also exists, and then there are more business-oriented plans as well. See the pricing page for details.

As a small note, I appreciated Tailscale’s install script. It properly added Tailscale’s apt key in a way that it can only be used to authenticate the Tailscale repo, rather than as a systemwide authenticator. This is a nice touch and speaks well of their developers.

Tailscale conclusions

Tailscale is tops in sharing and has a broad feature set and excellent documentation. Like other solutions with a centralized control plane, device communications can stop working if the control plane is unreachable, and the threat model of the control plane should be carefully considered.

Zerotier

Zerotier is a close competitor to Tailscale, and is similar to it in a lot of ways. So rather than duplicate all of the Tailscale information here, I’m mainly going to describe how it differs from Tailscale.

The primary difference between the two is that Zerotier emulates an Ethernet network via a Linux tap interface, while Tailscale emulates a TCP/IP network via a Linux tun interface.

However, Zerotier has a number of things that make it be a somewhat imperfect Ethernet emulator. For one, it has a problem with broadcast amplification; the machine sending the broadcast sends it to all the other nodes that should receive it (up to a set maximum). I wouldn’t want to have a lot of programs broadcasting on a slow link. While in theory this could let you run Netware or DECNet across Zerotier, I’m not really convinced there’s much call for that these days, and Zerotier is clearly IP-focused as it allocates IP addresses and such anyhow. Zerotier provides special support for emulated ARP (IPv4) and NDP (IPv6). While you could theoretically run Zerotier as a bridge, this eliminates the zero trust principle, and Tailscale supports subnet routers, which provide much of the same feature set anyhow.

A somewhat obscure feature, but possibly useful, is Zerotier’s built-in support for multipath WAN for the public interface. This actually lets you do a somewhat basic kind of channel bonding for WAN.

Zerotier: Security and Privacy

The picture here is similar to Tailscale, with the difference that you can create a Zerotier-local account rather than relying on cloud authentication. I was unable to find as much detail about Zerotier as I could about Tailscale - notably I couldn’t find anything about how “sticky” an IP address is. However, the configuration screen lets me delete a node and assign additional arbitrary IPs within a subnet to other nodes, so I think the assumption here is that if your Zerotier account (or the Zerotier control plane) is compromised, an attacker could remove a legit device, add a malicious one, and assign the previous IP of the legit device to the malicious one. I’m not sure how to mitigate against that risk, as firewalling specific IPs is ineffective if an attacker can simply take them over. Zerotier also lacks anything akin to Tailnet Lock.

For this reason, I didn’t proceed much further in my Zerotier evaluation.

Zerotier: Connectivity and NAT traversal

Like Tailscale, Zerotier has NAT traversal with STUN. However, it looks like it’s more limited than Tailscale’s, and in particular is incompatible with double NAT that is often seen these days. Zerotier operates brokers (“root servers”) that can do relaying, including TCP relaying. So you should be able to connect even from hostile networks, but you are less likely to form a P2P connection than with Tailscale.

Zerotier: Sharing with friends

I was unable to find any special features relating to this in the Zerotier documentation. Therefore, it would be at the same level as Yggdrasil: possible, maybe even not too difficult, but without any specific help.

Zerotier: DNS

Unlike Tailscale, Zerotier does not support automatically adding DNS entries for your hosts. Therefore, your options are approximately the same as Yggdrasil, though with the added option of pushing configuration pointing to your own non-Zerotier DNS servers to the client.

Zerotier: Source code, pricing, and portability

The client ZeroTier One is available on Github under a custom “business source license” which prevents you from using it in certain settings. This license would preclude it being included in Debian. Their library, libzt, is available under the same license. The pricing page mentions a community edition for self hosting, but the documentation is sparse and it was difficult to understand what its feature set really is.

The free plan lets you have 1 user with up to 25 devices. Paid plans are also available.

Zerotier conclusions

Frankly I don’t see much reason to use Zerotier. The “virtual Ethernet” model seems to be a weird hybrid that doesn’t bring much value. I’m concerned about the implications of a compromise of a user account or the control plane, and it lacks a lot of Tailscale features (MagicDNS and sharing). The only thing it may offer in particular is multipath WAN, but that’s esoteric enough – and also solvable at other layers – that it doesn’t seem all that compelling to me. Add to that the strange license and, to me anyhow, I don’t see much reason to bother with it.

Netmaker

Netmaker is one of the projects that is making noise these days. Netmaker is the only one here that is a wrapper around in-kernel Wireguard, which can make a performance difference when talking to peers on a 1Gbps or faster link. Also, unlike other tools, it has an ingress gateway feature that lets people that don’t have the Netmaker client, but do have Wireguard, participate in the VPN. I believe I also saw a reference somewhere to nodes as routers as with Yggdrasil, but I’m failing to dig it up now.

The project is in a bit of an early state; you can sign up for an “upcoming closed beta” with a SaaS host, but really you are generally pointed to self-hosting using the code in the github repo. There are community and enterprise editions, but it’s not clear how to actually choose. The server has a bunch of components: binary, CoreDNS, database, and web server. It also requires elevated privileges on the host, in addition to a container engine. Contrast that to the single binary that some others provide.

It looks like releases are frequent, but sometimes break things, and have a somewhat more laborious upgrade processes than most.

I don’t want to spend a lot of time managing my mesh. So because of the heavy needs of the server, the upgrades being labor-intensive, it taking over iptables and such on the server, I didn’t proceed with a more in-depth evaluation of Netmaker. It has a lot of promise, but for me, it doesn’t seem to be in a state that will meet my needs yet.

Nebula

Nebula is an interesting mesh project that originated within Slack, seems to still be primarily sponsored by Slack, but is also being developed by Defined Networking (though their product looks early right now). Unlike the other tools in this section, Nebula doesn’t have a web interface at all. Defined Networking looks likely to provide something of a SaaS service, but for now, you will need to run a broker (“lighthouse”) yourself; perhaps on a $5/mo VPS.

Due to the poor firewall traversal properties, I didn’t do a full evaluation of Nebula, but it still has a very interesting design.

Nebula: Security and Privacy

Since Nebula lacks a traditional control plane, the root of trust in Nebula is a CA (certificate authority). The documentation gives this example of setting it up:

./nebula-cert sign -name "lighthouse1" -ip "192.168.100.1/24"
./nebula-cert sign -name "laptop" -ip "192.168.100.2/24" -groups "laptop,home,ssh"
./nebula-cert sign -name "server1" -ip "192.168.100.9/24" -groups "servers"
./nebula-cert sign -name "host3" -ip "192.168.100.10/24"

So the cert contains your IP, hostname, and group allocation. Each host in the mesh gets your CA certificate, and the per-host cert and key generated from each of these steps.

This leads to a really nice security model. Your CA is the gatekeeper to what is trusted in your mesh. You can even have it airgapped or something to make it exceptionally difficult to breach the perimeter.

Nebula contains an integrated firewall. Because the ability to keep out unwanted nodes is so strong, I would say this may be the one mesh VPN you might consider using without bothering with an additional on-host firewall.

You can define static mappings from a Nebula mesh IP to a clearnet IP. I haven’t found information on this, but theoretically if NAT traversal isn’t required, these static mappings may allow Nebula nodes to reach each other even if Internet is down. I don’t know if this is truly the case, however.

Nebula: Connectivity and NAT traversal

This is a weak point of Nebula. Nebula sends all traffic over a single UDP port; there is no provision for using TCP. This is an issue at certain hotel and other public networks which open only TCP egress ports 80 and 443.

I couldn’t find a lot of detail on what Nebula’s NAT traversal is capable of, but according to a certain Github issue, this has been a sore spot for years and isn’t as capable as Tailscale.

You can designate nodes in Nebula as brokers (relays). The concept is the same as Yggdrasil, but it’s less versatile. You have to manually designate what relay to use. It’s unclear to me what happens if different nodes designate different relays. Keep in mind that this always happens over a UDP port.

Nebula: Sharing with friends

There is no particular support here.

Nebula: DNS

Nebula has experimental DNS support. In contrast with Tailscale, which has an internal DNS server on every node, Nebula only runs a DNS server on a lighthouse. This means that it can’t forward requests to a DNS server that’s upstream for your laptop’s particular current location. Actually, Nebula’s DNS server doesn’t forward at all. It also doesn’t resolve its own name.

The Nebula documentation makes reference to using multiple lighthouses, which you may want to do for DNS redundancy or performance, but it’s unclear to me if this would make each lighthouse form a complete picture of the network.

Nebula: Source code, pricing, and portability

Nebula is fully open source (MIT). It consists of a single Go binary and configuration. It is fairly portable.

Nebula conclusions

I am attracted to Nebula’s unique security model. I would probably be more seriously considering it if not for the lack of support for TCP and poor general NAT traversal properties. Its datacenter connectivity heritage does show through.

Roll your own and hybrid

Here is a grab bag of ideas:

Running Yggdrasil over Tailscale

One possibility would be to use Tailscale for its superior NAT traversal, then allow Yggdrasil to run over it. (You will need a firewall to prevent Tailscale from trying to run over Yggdrasil at the same time!) This creates a closed network with all the benefits of Yggdrasil, yet getting the NAT traversal from Tailscale.

Drawbacks might be the overhead of the double encryption and double encapsulation. A good Yggdrasil peer may wind up being faster than this anyhow.

Public VPN provider for NAT traversal

A public VPN provider such as Mullvad will often offer incoming port forwarding and nodes in many cities. This could be an attractive way to solve a bunch of NAT traversal problems: just use one of those services to get you an incoming port, and run whatever you like over that.

Be aware that a number of public VPN clients have a “kill switch” to prevent any traffic from egressing without using the VPN; see, for instance, Mullvad’s. You’ll need to disable this if you are running a mesh atop it.

Other

Combining with local firewalls

For most of these tools, I recommend using a local firewal in conjunction with them. I have been using firehol and find it to be quite nice. This means you don’t have to trust the mesh, the control plane, or whatever. The catch is that you do need your mesh VPN to provide strong association between IP address and node. Most, but not all, do.

Performance

I tested some of these for performance using iperf3 on a 2.5Gbps LAN. Here are the results. All speeds are in Mbps.

Tool iperf3 (default) iperf3 -P 10 iperf3 -R
Direct (no VPN) 2406 2406 2764
Wireguard (kernel) 1515 1566 2027
Yggdrasil 892 1126 1105
Tailscale 950 1034 1085
Tinc 296 300 277

You can see that Wireguard was significantly faster than the other options. Tailscale and Yggdrasil were roughly comparable, and Tinc was terrible.

IP collisions

When you are communicating over a network such as these, you need to trust that the IP address you are communicating with belongs to the system you think it does. This protects against two malicious actor scenarios:

  1. Someone compromises one machine on your mesh and reconfigures it to impersonate a more important one
  2. Someone connects an unauthorized system to the mesh, taking over a trusted IP, and uses the privileges of the trusted IP to access resources

To summarize the state of play as highlighted in the reviews above:

  • Yggdrasil derives IPv6 addresses from a public key
  • tinc allows any node to set any IP
  • Tailscale IPs aren’t user-assignable, but the assignment algorithm is unknown
  • Zerotier allows any IP to be allocated to any node at the control plane
  • I don’t know what Netmaker does
  • Nebula IPs are baked into the cert and signed by the CA, but I haven’t verified the enforcement algorithm

So this discussion really only applies to Yggdrasil and Tailscale. tinc and Zerotier lack detailed IP security, while Nebula expects IP allocations to be handled outside of the tool and baked into the certs (therefore enforcing rigidity at that level).

So the question for Yggdrasil and Tailscale is: how easy is it to commandeer a trusted IP?

Yggdrasil has a brief discussion of this. In short, Yggdrasil offers you both a dedicated IP and a rarely-used /64 prefix which you can delegate to other machines on your LAN. Obviously by taking the dedicated IP, a lot more bits are available for the hash of the node’s public key, making “collisions technically impractical, if not outright impossible.” However, if you use the /64 prefix, a collision may be more possible. Yggdrasil’s hashing algorithm includes some optimizations to make this more difficult. Yggdrasil includes a genkeys tool that uses more CPU cycles to generate keys that are maximally difficult to collide with.

Tailscale doesn’t document their IP assignment algorithm, but I think it is safe to say that the larger subnet you use, the better. If you try to use a /24 for your mesh, it is certainly conceivable that an attacker could remove your trusted node, then just manually add the 240 or so machines it would take to get that IP reassigned. It might be a good idea to use a purely IPv6 mesh with Tailscale to minimize this problem as well.

So, I think the risk is low in the default configurations of both Yggdrasil and Tailscale (certainly lower than with tinc or Zerotier). You can drive the risk even lower with both.

Final thoughts

For my own purposes, I suspect I will remain with Yggdrasil in some fashion. Maybe I will just take the small performance hit that using a relay node implies. Or perhaps I will get clever and use an incoming VPN port forward or go over Tailscale.

Tailscale was the other option that seemed most interesting. However, living in a region with Internet that goes down more often than I’d like, I would like to just be able to send as much traffic over a mesh as possible, trusting that if the LAN is up, the mesh is up.

I have one thing that really benefits from performance in excess of Yggdrasil or Tailscale: NFS. That’s between two machines that never leave my LAN, so I will probably just set up a direct Wireguard link between them. Heck of a lot easier than trying to do Kerberos!

Finally, I wrote this intending to be useful. I dealt with a lot of complexity and under-documentation, so it’s possible I got something wrong somewhere. Please let me know if you find any errors.


This blog post is a copy of a page on my website. That page may be periodically updated.

Music Playing: Both Whole-House and Mobile

It’s been nearly 8 years since I last made choices about music playing. At the time, I picked Logitech Media Server (LMS, aka Slimserver and Squeezebox server) for whole-house audio and Ampache with the DSub Android app.

It’s time to revisit that approach. Here are the things I’m looking for:

  • Whole-house audio: a single control point for all the speakers in the house, which are all connected to some form of Linux (Raspberry Pi or x86). The speakers should be reasonably in sync with each other, and the control point should be able to adjust volume on them centrally. I should be able to play albums, playlists, etc. on them, and skip tracks or seek within a track.
  • The ability to stream to an Android mobile device, ideally with downloading capabilities for offline use.
  • If multiple solutions are used, playlist syncing between them.
  • Ideally, bookmark support to resume playing a long track where it was left off.
  • Ideally, podcast support.

The current setup

Here are the current components:

  • Logitech Media Server, which serves the music library for whole-house synchronized audio
  • Squeezelite is the LMS client running on my Raspberry Pi and x86 systems
  • Squeezer is a nice Android client for LMS to control playback, adjust volume, etc. It doesn’t do any playback on the Android device, of course.
  • Ampache provides the server for streaming clients, both browser-based and mobile
  • DSub (F-Droid, Play Store) is a nice Android client for Ampache providing streaming and offline playback

LMS makes an excellent whole-house audio system. I can pull up the webpage (or use an Android app like Squeezer) to browse my music library, queue things up to play, and so forth. I can also create playlists, which it saves as m3u files.

This whole setup is boringly reliable. It just works, year in, year out.

The main problem with this is that LMS has no real streaming/offline mobile support. It is also a rather dated system, with a painful UI for playlist management, and in general doesn’t feel very modern. (It’s written largely in Perl also!)

So, I paired with it is Ampache. As a streaming player, Ampache is fantastic; I can access it from a web browser, and it will transcode my FLAC files to the quality I’ve set in my user prefs. The DSub app for Android is fantastic and remembers my last-play locations and such.

The problem is that Ampache doesn’t write its playlists back to m3u format, so I can’t use them with LMS. I have to therefore maintain all the playlists in LMS, and it has a smallish limit on the number of tracks per playlist. Ampache also doesn’t auto-update from LMS playlists, so I have to delete and recreate the playlists catalog periodically to get updates into Ampache. Not fun.

The new experiment

I’m trying out a new system based on these components:

  • Jellyfin is a media player. It supports not just music, but also video (in fact, the emphasis is more on video). Notably it supports controlling various devices. Its normal frontend is a web browser; Jellyfin’s server won’t output audio to a device itself.
  • Mopidy is a media player with a web interface that does output audio to a local device. In normal use, it displays an interface to your music, letting you select, queue up, etc.
  • Mopidy-Jellyfin (docs) is a plugin for Mopidy that enables two things: 1) Browsing the Jellyfin library within Mopidy, and 2) controlling Mopidy from within Jellyfin. Mode 1 barely works, but mode 2 works perfectly. Within Jellyfin, I can “cast to Mopidy” and queue up things, seek, skip tracks, etc.
  • Snapcast is a generic solution to take audio from some sort of source and distribute it throughout the house, syncing each device (and with better syncing than LMS, too!). The source can be just about anything, and the docs include an example of how to set it up with Mopidy.
  • Mopidy has selectable web interfaces, and the Mopidy-Muse interface has the added benefit of having integrated Snapcast control. (Mopidy-Iris does as well, though it wasn’t documented there.) Within it, I can adjust volume on devices, mute devices, etc. I could also use the Snapcast web interface for this purpose.
  • The default Jellyfin Android app lets me stream media to the mobile device, as well as control the Mopidy player.
  • Finamp (F-Droid, Play Store) is a very nice Android Jellyfin music playing client, which notably supports downloads for offline playing, a feature the stock app lacks.
  • The Snapcast Android app (F-Droid, Play Store) isn’t strictly necessary, since the Snapcast web app is so simple to use. But it provides near-instant control of speakers and volumes.

This looks a lot more complicated than what I had before, but in reality it only has one additional layer. Since Snapcast is a general audio syncing tool, and Jellyfin doesn’t itself output audio, Mopidy and its extensions is the “glue”.

There’s a lot to like about this setup. There is one single canonical source for music and playlists. Jellyfin can do a lot more besides music, and its mobile app gives me video access also. The setup, in general, works pretty well.

There are a few minor glitches, but nothing huge. For instance, Jellyfin fails to clear the play queue on the mopidy side.

But there is one problem, though: when playing a playlist, it is played out of order. Jellyfin itself has the same issue internally, so I’m unsure where the bug lies.

Rejected option: Jellyfin with jellycli

This could be a nice option; instead of mopidy with a plugin, just run jellycli in headless mode as a more “native” client. It also has the playlist ordering bug, and in addition, fails to play a couple of my albums which Mopidy-Jellyfin handles fine. But, if those bugs were addressed, it has a ton of promise as a simpler glue between Jellyfin and Snapcast than Mopidy.

Rejected option: Mopidy-Subidy Plugin with Ampache

Mopidy has a Subsonic plugin, and Ampache implements the Subsonic API. This would theoretically let me use a Mopidy client to play things on the whole-house system, coming from the same Ampache system.

Although I did get this connected with some trial and error (legacy auth on, API version 1.13.0), it was extremely slow. Loading the list of playlists took minutes, the list of albums and artists many seconds. It didn’t cache any answers either, so it was unusably slow.

Rejected option: Ampache localplay with mpd

Ampache has a feature called localplay which allows it to control a mpd server. I tested this out with mpd and snapcast. It works, but is highly limited. Basically, it causes Ampache to send a playlist — a literal list of URLs — to the mpd server. Unfortunately, seeking within a track is impossible from within the Ampache interface.

I will note that once a person is using mpd, snapcast makes a much easier whole-house solution than the streaming option I was trying to get working 8 years ago.

Dead USB Drives Are Fine: Building a Reliable Sneakernet

“OK,” you’re probably thinking. “John, you talk a lot about things like Gopher and personal radios, and now you want to talk about building a reliable network out of… USB drives?”

Well, yes. In fact, I’ve already done it.

What is sneakernet?

Normally, “sneakernet” is a sort of tongue-in-cheek reference to using disconnected storage to transport data or messages. By “disconnect storage” I mean anything like CD-ROMs, hard drives, SD cards, USB drives, and so forth. There are times when loading up 12TB on a device and driving it across town is just faster and easier than using the Internet for the same. And, sometimes you need to get data to places that have no Internet at all.

Another reason for sneakernet is security. For instance, if your backup system is online, and your systems being backed up are online, then it could become possible for an attacker to destroy both your primary copy of data and your backups. Or, you might use a dedicated computer with no network connection to do GnuPG (GPG) signing.

What about “reliable” sneakernet, then?

TCP is often considered a “reliable” protocol. That means that the sending side is generally able to tell if its message was properly received. As with most reliable protocols, we have these components:

  1. After transmitting a piece of data, the sender retains it.
  2. After receiving a piece of data, the receiver sends an acknowledgment (ACK) back to the sender.
  3. Upon receiving the acknowledgment, the sender removes its buffered copy of the data.
  4. If no acknowledgment is received at the sender, it retransmits the data, in case it gets lost in transit.
  5. It reorders any packets that arrive out of order, so that the recipient’s data stream is ordered correctly.

Now, a lot of the things I just mentioned for sneakernet are legendarily unreliable. USB drives fail, CD-ROMs get scratched, hard drives get banged up. Think about putting these things in a bicycle bag or airline luggage. Some of them are going to fail.

You might think, “well, I’ll just copy files to a USB drive instead of move them, and once I get them onto the destination machine, I’ll delete them from the source.” Congratulations! You are a human retransmit algorithm! We should be able to automate this!

And we can.

Enter NNCP

NNCP is one of those things that almost defies explanation. It is a toolkit for building asynchronous networks. It can use as a carrier: a pipe, TCP network connection, a mounted filesystem (specifically intended for cases like this), and much more. It also supports multi-hop asynchronous routing and asynchronous meshing, but these are beyond the scope of this particular article.

NNCP’s transports that involve live communication between two hops already had all the hallmarks of being reliable; there was a positive ACK and retransmit. As of version 8.7.0, NNCP’s ACKs themselves can also be asynchronous – meaning that every NNCP transport can now be reliable.

Yes, that’s right. Your ACKs can flow over tapes and USB drives if you want them to.

I use this for archiving and backups.

If you aren’t already familiar with NNCP, you might take a look at my NNCP page. I also have a lot of blog posts about NNCP.

Those pages describe the basics of NNCP: the “packet” (the unit of transmission in NNCP, which can be tiny or many TB), the end-to-end encryption, and so forth. The new command we will now be interested in is nncp-ack.

The Basic Idea

Here are the basic steps to processing this stuff with NNCP:

  1. First, we use nncp-xfer -rx to process incoming packets from the USB (or other media) device. This moves them into the NNCP inbound queue, deleting them from the media device, and verifies the packet integrity.
  2. We use nncp-ack -node $NODE to create ACK packets responding to the packets we just loaded into the rx queue. It writes a list of generated ACKs onto fd 4, which we save off for later use.
  3. We run nncp-toss -seen to process the incoming queue. The use of -seen causes NNCP to remember the hashes of packets seen before, so a duplicate of an already-seen packet will not be processed twice. This command also processes incoming ACKs for packets we’ve sent out previously; if they pass verification, the relevant packets are removed from the local machine’s tx queue.
  4. Now, we use nncp-xfer -keep -tx -mkdir -node $NODE to send outgoing packets to a given node by writing them to a given directory on the media device. -keep causes them to remain in the outgoing queue.
  5. Finally, we use the list of generated ACK packets saved off in step 2 above. That list is passed to nncp-rm -node $NODE -pkt < $FILE to remove those specific packets from the outbound queue. The reason is that there will never be an ACK of ACK packet (that would create an infinite loop), so if we don’t delete them in this manner, they would hang around forever.

You can see these steps follow the same basic outline on upstream’s nncp-ack page.

One thing to keep in mind: if anything else is running nncp-toss, there is a chance of a race condition between steps 1 and 2 (if nncp-toss gets to it first, it might not get an ack generated). This would sort itself out eventually, presumably, as the sender would retransmit and it would be ACKed later.

Further ideas

NNCP guarantees the integrity of packets, but not ordering between packets; if you need that, you might look into my Filespooler program. It is designed to work with NNCP and can provide ordered processing.

An example script

Here is a script you might try for this sort of thing. It may have more logic than you need – really, you just need the steps above – but hopefully it is clear.

#!/bin/bash

set -eo pipefail

MEDIABASE="/media/$USER"

# The local node name
NODENAME="`hostname`"

# All nodes.  NODENAME should be in this list.
ALLNODES="node1 node2 node3"

RUNNNCP=""
# If you need to sudo, use something like RUNNNCP="sudo -Hu nncp"
NNCPPATH="/usr/local/nncp/bin"

ACKPATH="`mktemp -d`"

# Process incoming packets.
#
# Parameters: $1 - the path to scan.  Must contain a directory
# named "nncp".
procrxpath () {
    while [ -n "$1" ]; do
        BASEPATH="$1/nncp"
        shift
        if ! [ -d "$BASEPATH" ]; then
            echo "$BASEPATH doesn't exist; skipping"
            continue
        fi

        echo " *** Incoming: processing $BASEPATH"
        TMPDIR="`mktemp -d`"

        # This rsync and the one below can help with
        # certain permission issues from weird foreign
        # media.  You could just eliminate it and
        # always use $BASEPATH instead of $TMPDIR below.
        rsync -rt "$BASEPATH/" "$TMPDIR/"

        # You may need these next two lines if using sudo as above.
        # chgrp -R nncp "$TMPDIR"
        # chmod -R g+rwX "$TMPDIR"
        echo "     Running nncp-xfer -rx"
        $RUNNNCP $NNCPPATH/nncp-xfer -progress -rx "$TMPDIR"

        for NODE in $ALLNODES; do
                if [ "$NODE" != "$NODENAME" ]; then
                        echo "     Running nncp-ack for $NODE"

                        # Now, we generate ACK packets for each node we will
                        # process.  nncp-ack writes a list of the created
                        # ACK packets to fd 4.  We'll use them later.
                        # If using sudo, add -C 5 after $RUNNNCP.
                        $RUNNNCP $NNCPPATH/nncp-ack -progress -node "$NODE" \
                           4>> "$ACKPATH/$NODE"
                fi
        done

        rsync --delete -rt "$TMPDIR/" "$BASEPATH/"
        rm -fr "$TMPDIR"
    done
}


proctxpath () {
    while [ -n "$1" ]; do
        BASEPATH="$1/nncp"
        shift
        if ! [ -d "$BASEPATH" ]; then
            echo "$BASEPATH doesn't exist; skipping"
            continue
        fi

        echo " *** Outgoing: processing $BASEPATH"
        TMPDIR="`mktemp -d`"
        rsync -rt "$BASEPATH/" "$TMPDIR/"
        # You may need these two lines if using sudo:
        # chgrp -R nncp "$TMPDIR"
        # chmod -R g+rwX "$TMPDIR"

        for DESTHOST in $ALLNODES; do
            if [ "$DESTHOST" = "$NODENAME" ]; then
                continue
            fi

            # Copy outgoing packets to this node, but keep them in the outgoing
            # queue with -keep.
            $RUNNNCP $NNCPPATH/nncp-xfer -keep -tx -mkdir -node "$DESTHOST" -progress "$TMPDIR"

            # Here is the key: that list of ACK packets we made above - now we delete them.
            # There will never be an ACK for an ACK, so they'd keep sending forever
            # if we didn't do this.
            if [ -f "$ACKPATH/$DESTHOST" ]; then
                echo "nncp-rm for node $DESTHOST"
                $RUNNNCP $NNCPPATH/nncp-rm -debug -node "$DESTHOST" -pkt < "$ACKPATH/$DESTHOST"
            fi

        done

        rsync --delete -rt "$TMPDIR/" "$BASEPATH/"
        rm -rf "$TMPDIR"

        # We only want to write stuff once.
        return 0
    done
}

procrxpath "$MEDIABASE"/*

echo " *** Initial tossing..."

# We make sure to use -seen to rule out duplicates.
$RUNNNCP $NNCPPATH/nncp-toss -progress -seen

proctxpath "$MEDIABASE"/*

echo "You can unmount devices now."

echo "Done."

This post is also available on my webiste, where it may be periodically updated.

The PC & Internet Revolution in Rural America

Inspired by several others (such as Alex Schroeder’s post and Szczeżuja’s prompt), as well as a desire to get this down for my kids, I figure it’s time to write a bit about living through the PC and Internet revolution where I did: outside a tiny town in rural Kansas. And, as I’ve been back in that same area for the past 15 years, I reflect some on the challenges that continue to play out.

Although the stories from the others were primarily about getting online, I want to start by setting some background. Those of you that didn’t grow up in the same era as I did probably never realized that a typical business PC setup might cost $10,000 in today’s dollars, for instance. So let me start with the background.

Nothing was easy

This story begins in the 1980s. Somewhere around my Kindergarten year of school, around 1985, my parents bought a TRS-80 Color Computer 2 (aka CoCo II). It had 64K of RAM and used a TV for display and sound.

This got you the computer. It didn’t get you any disk drive or anything, no joysticks (required by a number of games). So whenever the system powered down, or it hung and you had to power cycle it – a frequent event – you’d lose whatever you were doing and would have to re-enter the program, literally by typing it in.

The floppy drive for the CoCo II cost more than the computer, and it was quite common for people to buy the computer first and then the floppy drive later when they’d saved up the money for that.

I particularly want to mention that computers then didn’t come with a modem. What would be like buying a laptop or a tablet without wifi today. A modem, which I’ll talk about in a bit, was another expensive accessory. To cobble together a system in the 80s that was capable of talking to others – with persistent storage (floppy, or hard drive), screen, keyboard, and modem – would be quite expensive. Adjusted for inflation, if you’re talking a PC-style device (a clone of the IBM PC that ran DOS), this would easily be more expensive than the Macbook Pros of today.

Few people back in the 80s had a computer at home. And the portion of those that had even the capability to get online in a meaningful way was even smaller.

Eventually my parents bought a PC clone with 640K RAM and dual floppy drives. This was primarily used for my mom’s work, but I did my best to take it over whenever possible. It ran DOS and, despite its monochrome screen, was generally a more capable machine than the CoCo II. For instance, it supported lowercase. (I’m not even kidding; the CoCo II pretty much didn’t.) A while later, they purchased a 32MB hard drive for it – what luxury!

Just getting a machine to work wasn’t easy. Say you’d bought a PC, and then bought a hard drive, and a modem. You didn’t just plug in the hard drive and it would work. You would have to fight it every step of the way. The BIOS and DOS partition tables of the day used a cylinder/head/sector method of addressing the drive, and various parts of that those addresses had too few bits to work with the “big” drives of the day above 20MB. So you would have to lie to the BIOS and fdisk in various ways, and sort of work out how to do it for each drive. For each peripheral – serial port, sound card (in later years), etc., you’d have to set jumpers for DMA and IRQs, hoping not to conflict with anything already in the system. Perhaps you can now start to see why USB and PCI were so welcomed.

Sharing and finding resources

Despite the two computers in our home, it wasn’t as if software written on one machine just ran on another. A lot of software for PC clones assumed a CGA color display. The monochrome HGC in our PC wasn’t particularly compatible. You could find a TSR program to emulate the CGA on the HGC, but it wasn’t particularly stable, and there’s only so much you can do when a program that assumes color displays on a monitor that can only show black, dark amber, or light amber.

So I’d periodically get to use other computers – most commonly at an office in the evening when it wasn’t being used.

There were some local computer clubs that my dad took me to periodically. Software was swapped back then; disks copied, shareware exchanged, and so forth. For me, at least, there was no “online” to download software from, and selling software over the Internet wasn’t a thing at all.

Three Different Worlds

There were sort of three different worlds of computing experience in the 80s:

  1. Home users. Initially using a wide variety of software from Apple, Commodore, Tandy/RadioShack, etc., but eventually coming to be mostly dominated by IBM PC clones
  2. Small and mid-sized business users. Some of them had larger minicomputers or small mainframes, but most that I had contact with by the early 90s were standardized on DOS-based PCs. More advanced ones had a network running Netware, most commonly. Networking hardware and software was generally too expensive for home users to use in the early days.
  3. Universities and large institutions. These are the places that had the mainframes, the earliest implementations of TCP/IP, the earliest users of UUCP, and so forth.

The difference between the home computing experience and the large institution experience were vast. Not only in terms of dollars – the large institution hardware could easily cost anywhere from tens of thousands to millions of dollars – but also in terms of sheer resources required (large rooms, enormous power circuits, support staff, etc). Nothing was in common between them; not operating systems, not software, not experience. I was never much aware of the third category until the differences started to collapse in the mid-90s, and even then I only was exposed to it once the collapse was well underway.

You might say to me, “Well, Google certainly isn’t running what I’m running at home!” And, yes of course, it’s different. But fundamentally, most large datacenters are running on x86_64 hardware, with Linux as the operating system, and a TCP/IP network. It’s a different scale, obviously, but at a fundamental level, the hardware and operating system stack are pretty similar to what you can readily run at home. Back in the 80s and 90s, this wasn’t the case. TCP/IP wasn’t even available for DOS or Windows until much later, and when it was, it was a clunky beast that was difficult.

One of the things Kevin Driscoll highlights in his book called Modem World – see my short post about it – is that the history of the Internet we usually receive is focused on case 3: the large institutions. In reality, the Internet was and is literally a network of networks. Gateways to and from Internet existed from all three kinds of users for years, and while TCP/IP ultimately won the battle of the internetworking protocol, the other two streams of users also shaped the Internet as we now know it. Like many, I had no access to the large institution networks, but as I’ve been reflecting on my experiences, I’ve found a new appreciation for the way that those of us that grew up with primarily home PCs shaped the evolution of today’s online world also.

An Era of Scarcity

I should take a moment to comment about the cost of software back then. A newspaper article from 1985 comments that WordPerfect, then the most powerful word processing program, sold for $495 (or $219 if you could score a mail order discount). That’s $1360/$600 in 2022 money. Other popular software, such as Lotus 1-2-3, was up there as well. If you were to buy a new PC clone in the mid to late 80s, it would often cost $2000 in 1980s dollars. Now add a printer – a low-end dot matrix for $300 or a laser for $1500 or even more. A modem: another $300. So the basic system would be $3600, or $9900 in 2022 dollars. If you wanted a nice printer, you’re now pushing well over $10,000 in 2022 dollars.

You start to see one barrier here, and also why things like shareware and piracy – if it was indeed even recognized as such – were common in those days.

So you can see, from a home computer setup (TRS-80, Commodore C64, Apple ][, etc) to a business-class PC setup was an order of magnitude increase in cost. From there to the high-end minis/mainframes was another order of magnitude (at least!) increase. Eventually there was price pressure on the higher end and things all got better, which is probably why the non-DOS PCs lasted until the early 90s.

Increasing Capabilities

My first exposure to computers in school was in the 4th grade, when I would have been about 9. There was a single Apple ][ machine in that room. I primarily remember playing Oregon Trail on it. The next year, the school added a computer lab. Remember, this is a small rural area, so each graduating class might have about 25 people in it; this lab was shared by everyone in the K-8 building. It was full of some flavor of IBM PS/2 machines running DOS and Netware. There was a dedicated computer teacher too, though I think she was a regular teacher that was given somewhat minimal training on computers. We were going to learn typing that year, but I did so well on the very first typing program that we soon worked out that I could do programming instead. I started going to school early – these machines were far more powerful than the XT at home – and worked on programming projects there.

Eventually my parents bought me a Gateway 486SX/25 with a VGA monitor and hard drive. Wow! This was a whole different world. It may have come with Windows 3.0 or 3.1 on it, but I mainly remember running OS/2 on that machine. More on that below.

Programming

That CoCo II came with a BASIC interpreter in ROM. It came with a large manual, which served as a BASIC tutorial as well. The BASIC interpreter was also the shell, so literally you could not use the computer without at least a bit of BASIC.

Once I had access to a DOS machine, it also had a basic interpreter: GW-BASIC. There was a fair bit of software written in BASIC at the time, but most of the more advanced software wasn’t. I wondered how these .EXE and .COM programs were written. I could find vague references to DEBUG.EXE, assemblers, and such. But it wasn’t until I got a copy of Turbo Pascal that I was able to do that sort of thing myself. Eventually I got Borland C++ and taught myself C as well. A few years later, I wanted to try writing GUI programs for Windows, and bought Watcom C++ – much cheaper than the competition, and it could target Windows, DOS (and I think even OS/2).

Notice that, aside from BASIC, none of this was free, and none of it was bundled. You couldn’t just download a C compiler, or Python interpreter, or whatnot back then. You had to pay for the ability to write any kind of serious code on the computer you already owned.

The Microsoft Domination

Microsoft came to dominate the PC landscape, and then even the computing landscape as a whole. IBM very quickly lost control over the hardware side of PCs as Compaq and others made clones, but Microsoft has managed – in varying degrees even to this day – to keep a stranglehold on the software, and especially the operating system, side. Yes, there was occasional talk of things like DR-DOS, but by and large the dominant platform came to be the PC, and if you had a PC, you ran DOS (and later Windows) from Microsoft.

For awhile, it looked like IBM was going to challenge Microsoft on the operating system front; they had OS/2, and when I switched to it sometime around the version 2.1 era in 1993, it was unquestionably more advanced technically than the consumer-grade Windows from Microsoft at the time. It had Internet support baked in, could run most DOS and Windows programs, and had introduced a replacement for the by-then terrible FAT filesystem: HPFS, in 1988. Microsoft wouldn’t introduce a better filesystem for its consumer operating systems until Windows XP in 2001, 13 years later. But more on that story later.

Free Software, Shareware, and Commercial Software

I’ve covered the high cost of software already. Obviously $500 software wasn’t going to sell in the home market. So what did we have?

Mainly, these things:

  1. Public domain software. It was free to use, and if implemented in BASIC, probably had source code with it too.
  2. Shareware
  3. Commercial software (some of it from small publishers was a lot cheaper than $500)

Let’s talk about shareware. The idea with shareware was that a company would release a useful program, sometimes limited. You were encouraged to “register”, or pay for, it if you liked it and used it. And, regardless of whether you registered it or not, were told “please copy!” Sometimes shareware was fully functional, and registering it got you nothing more than printed manuals and an easy conscience (guilt trips for not registering weren’t necessarily very subtle). Sometimes unregistered shareware would have a “nag screen” – a delay of a few seconds while they told you to register. Sometimes they’d be limited in some way; you’d get more features if you registered. With games, it was popular to have a trilogy, and release the first episode – inevitably ending with a cliffhanger – as shareware, and the subsequent episodes would require registration. In any event, a lot of software people used in the 80s and 90s was shareware. Also pirated commercial software, though in the earlier days of computing, I think some people didn’t even know the difference.

Notice what’s missing: Free Software / FLOSS in the Richard Stallman sense of the word. Stallman lived in the big institution world – after all, he worked at MIT – and what he was doing with the Free Software Foundation and GNU project beginning in 1983 never really filtered into the DOS/Windows world at the time. I had no awareness of it even existing until into the 90s, when I first started getting some hints of it as a port of gcc became available for OS/2. The Internet was what really brought this home, but I’m getting ahead of myself.

I want to say again: FLOSS never really entered the DOS and Windows 3.x ecosystems. You’d see it make a few inroads here and there in later versions of Windows, and moreso now that Microsoft has been sort of forced to accept it, but still, reflect on its legacy. What is the software market like in Windows compared to Linux, even today?

Now it is, finally, time to talk about connectivity!

Getting On-Line

What does it even mean to get on line? Certainly not connecting to a wifi access point. The answer is, unsurprisingly, complex. But for everyone except the large institutional users, it begins with a telephone.

The telephone system

By the 80s, there was one communication network that already reached into nearly every home in America: the phone system. Virtually every household (note I don’t say every person) was uniquely identified by a 10-digit phone number. You could, at least in theory, call up virtually any other phone in the country and be connected in less than a minute.

But I’ve got to talk about cost. The way things worked in the USA, you paid a monthly fee for a phone line. Included in that monthly fee was unlimited “local” calling. What is a local call? That was an extremely complex question. Generally it meant, roughly, calling within your city. But of course, as you deal with things like suburbs and cities growing into each other (eg, the Dallas-Ft. Worth metroplex), things got complicated fast. But let’s just say for simplicity you could call others in your city.

What about calling people not in your city? That was “long distance”, and you paid – often hugely – by the minute for it. Long distance rates were difficult to figure out, but were generally most expensive during business hours and cheapest at night or on weekends. Prices eventually started to come down when competition was introduced for long distance carriers, but even then you often were stuck with a single carrier for long distance calls outside your city but within your state. Anyhow, let’s just leave it at this: local calls were virtually free, and long distance calls were extremely expensive.

Getting a modem

I remember getting a modem that ran at either 1200bps or 2400bps. Either way, quite slow; you could often read even plain text faster than the modem could display it. But what was a modem?

A modem hooked up to a computer with a serial cable, and to the phone system. By the time I got one, modems could automatically dial and answer. You would send a command like ATDT5551212 and it would dial 555-1212. Modems had speakers, because often things wouldn’t work right, and the telephone system was oriented around speech, so you could hear what was happening. You’d hear it wait for dial tone, then dial, then – hopefully – the remote end would ring, a modem there would answer, you’d hear the screeching of a handshake, and eventually your terminal would say CONNECT 2400. Now your computer was bridged to the other; anything going out your serial port was encoded as sound by your modem and decoded at the other end, and vice-versa.

But what, exactly, was “the other end?”

It might have been another person at their computer. Turn on local echo, and you can see what they did. Maybe you’d send files to each other. But in my case, the answer was different: PC Magazine.

PC Magazine and CompuServe

Starting around 1986 (so I would have been about 6 years old), I got to read PC Magazine. My dad would bring copies that were being discarded at his office home for me to read, and I think eventually bought me a subscription directly. This was not just a standard magazine; it ran something like 350-400 pages an issue, and came out every other week. This thing was a monster. It had reviews of hardware and software, descriptions of upcoming technologies, pages and pages of ads (that often had some degree of being informative to them). And they had sections on programming. Many issues would talk about BASIC or Pascal programming, and there’d be a utility in most issues. What do I mean by a “utility in most issues”? Did they include a floppy disk with software?

No, of course not. There was a literal program listing printed in the magazine. If you wanted the utility, you had to type it in. And a lot of them were written in assembler, so you had to have an assembler. An assembler, of course, was not free and I didn’t have one. Or maybe they wrote it in Microsoft C, and I had Borland C, and (of course) they weren’t compatible. Sometimes they would list the program sort of in binary: line after line of a BASIC program, with lines like “64, 193, 253, 0, 53, 0, 87” that you would type in for hours, hopefully correctly. Running the BASIC program would, if you got it correct, emit a .COM file that you could then run. They did have a rudimentary checksum system built in, but it wasn’t even a CRC, so something like swapping two numbers you’d never notice except when the program would mysteriously hang.

Eventually they teamed up with CompuServe to offer a limited slice of CompuServe for the purpose of downloading PC Magazine utilities. This was called PC MagNet. I am foggy on the details, but I believe that for a time you could connect to the limited PC MagNet part of CompuServe “for free” (after the cost of the long-distance call, that is) rather than paying for CompuServe itself (because, OF COURSE, that also charged you per the minute.) So in the early days, I would get special permission from my parents to place a long distance call, and after some nerve-wracking minutes in which we were aware every minute was racking up charges, I could navigate the menus, download what I wanted, and log off immediately.

I still, incidentally, mourn what PC Magazine became. As with computing generally, it followed the mass market. It lost its deep technical chops, cut its programming columns, stopped talking about things like how SCSI worked, and so forth. By the time it stopped printing in 2009, it was no longer a square-bound 400-page beheamoth, but rather looked more like a copy of Newsweek, but with less depth.

Continuing with CompuServe

CompuServe was a much larger service than just PC MagNet. Eventually, our family got a subscription. It was still an expensive and scarce resource; I’d call it only after hours when the long-distance rates were cheapest. Everyone had a numerical username separated by commas; mine was 71510,1421. CompuServe had forums, and files. Eventually I would use TapCIS to queue up things I wanted to do offline, to minimize phone usage online.

CompuServe eventually added a gateway to the Internet. For the sum of somewhere around $1 a message, you could send or receive an email from someone with an Internet email address! I remember the thrill of one time, as a kid of probably 11 years, sending a message to one of the editors of PC Magazine and getting a kind, if brief, reply back!

But inevitably I had…

The Godzilla Phone Bill

Yes, one month I became lax in tracking my time online. I ran up my parents’ phone bill. I don’t remember how high, but I remember it was hundreds of dollars, a hefty sum at the time. As I watched Jason Scott’s BBS Documentary, I realized how common an experience this was. I think this was the end of CompuServe for me for awhile.

Toll-Free Numbers

I lived near a town with a population of 500. Not even IN town, but near town. The calling area included another town with a population of maybe 1500, so all told, there were maybe 2000 people total I could talk to with a local call – though far fewer numbers, because remember, telephones were allocated by the household. There was, as far as I know, zero modems that were a local call (aside from one that belonged to a friend I met in around 1992). So basically everything was long-distance.

But there was a special feature of the telephone network: toll-free numbers. Normally when calling long-distance, you, the caller, paid the bill. But with a toll-free number, beginning with 1-800, the recipient paid the bill. These numbers almost inevitably belonged to corporations that wanted to make it easy for people to call. Sales and ordering lines, for instance. Some of these companies started to set up modems on toll-free numbers. There were few of these, but they existed, so of course I had to try them!

One of them was a company called PennyWise that sold office supplies. They had a toll-free line you could call with a modem to order stuff. Yes, online ordering before the web! I loved office supplies. And, because I lived far from a big city, if the local K-Mart didn’t have it, I probably couldn’t get it. Of course, the interface was entirely text, but you could search for products and place orders with the modem. I had loads of fun exploring the system, and actually ordered things from them – and probably actually saved money doing so. With the first order they shipped a monster full-color catalog. That thing must have been 500 pages, like the Sears catalogs of the day. Every item had a part number, which streamlined ordering through the modem.

Inbound FAXes

By the 90s, a number of modems became able to send and receive FAXes as well. For those that don’t know, a FAX machine was essentially a special modem. It would scan a page and digitally transmit it over the phone system, where it would – at least in the early days – be printed out in real time (because the machines didn’t have the memory to store an entire page as an image). Eventually, PC modems integrated FAX capabilities.

There still wasn’t anything useful I could do locally, but there were ways I could get other companies to FAX something to me. I remember two of them.

One was for US Robotics. They had an “on demand” FAX system. You’d call up a toll-free number, which was an automated IVR system. You could navigate through it and select various documents of interest to you: spec sheets and the like. You’d key in your FAX number, hang up, and US Robotics would call YOU and FAX you the documents you wanted. Yes! I was talking to a computer (of a sorts) at no cost to me!

The New York Times also ran a service for awhile called TimesFax. Every day, they would FAX out a page or two of summaries of the day’s top stories. This was pretty cool in an era in which I had no other way to access anything from the New York Times. I managed to sign up for TimesFax – I have no idea how, anymore – and for awhile I would get a daily FAX of their top stories. When my family got its first laser printer, I could them even print these FAXes complete with the gothic New York Times masthead. Wow! (OK, so technically I could print it on a dot-matrix printer also, but graphics on a 9-pin dot matrix is a kind of pain that is a whole other article.)

My own phone line

Remember how I discussed that phone lines were allocated per household? This was a problem for a lot of reasons:

  1. Anybody that tried to call my family while I was using my modem would get a busy signal (unable to complete the call)
  2. If anybody in the house picked up the phone while I was using it, that would degrade the quality of the ongoing call and either mess up or disconnect the call in progress. In many cases, that could cancel a file transfer (which wasn’t necessarily easy or possible to resume), prompting howls of annoyance from me.
  3. Generally we all had to work around each other

So eventually I found various small jobs and used the money I made to pay for my own phone line and my own long distance costs. Eventually I upgraded to a 28.8Kbps US Robotics Courier modem even! Yes, you heard it right: I got a job and a bank account so I could have a phone line and a faster modem. Uh, isn’t that why every teenager gets a job?

Now my local friend and I could call each other freely – at least on my end (I can’t remember if he had his own phone line too). We could exchange files using HS/Link, which had the added benefit of allowing split-screen chat even while a file transfer is in progress. I’m sure we spent hours chatting to each other keyboard-to-keyboard while sharing files with each other.

Technology in Schools

By this point in the story, we’re in the late 80s and early 90s. I’m still using PC-style OSs at home; OS/2 in the later years of this period, DOS or maybe a bit of Windows in the earlier years. I mentioned that they let me work on programming at school starting in 5th grade. It was soon apparent that I knew more about computers than anybody on staff, and I started getting pulled out of class to help teachers or administrators with vexing school problems. This continued until I graduated from high school, incidentally – often to my enjoyment, and the annoyance of one particular teacher who, I must say, I was fine with annoying in this way.

That’s not to say that there was institutional support for what I was doing. It was, after all, a small school. Larger schools might have introduced BASIC or maybe Logo in high school. But I had already taught myself BASIC, Pascal, and C by the time I was somewhere around 12 years old. So I wouldn’t have had any use for that anyhow.

There were programming contests occasionally held in the area. Schools would send teams. My school didn’t really “send” anybody, but I went as an individual. One of them was run by a local college but for jr. high or high school students. Years later, I met one of the professors that ran it. He remembered me, and that day, better than I remembered him. The programming contest had problems one could solve in BASIC or Logo. I knew nothing about what to expect going into it, but I had lugged my computer and screen along, and asked him, “Can I write my solutions in C?” He was, apparently, stunned, but said sure, go for it. I took first place that day, leading to some rather confused teams from much larger schools.

The Netware network that the school had was, as these generally were, itself isolated. There was no link to the Internet or anything like it. Several schools across three local counties eventually invested in a fiber-optic network linking them together. This built a larger, but still closed, network. Its primary purpose was to allow students to be exposed to a wider variety of classes at high schools. Participating schools had an “ITV room”, outfitted with cameras and mics. So students at any school could take classes offered over ITV at other schools. For instance, only my school taught German classes, so people at any of those participating schools could take German. It was an early “Zoom room.” But alongside the TV signal, there was enough bandwidth to run some Netware frames. By about 1995 or so, this let one of the schools purchase some CD-ROM software that was made available on a file server and could be accessed by any participating school. Nice! But Netware was mainly about file and printer sharing; there wasn’t even a facility like email, at least not on our deployment.

BBSs

My last hop before the Internet was the BBS. A BBS was a computer program, usually ran by a hobbyist like me, on a computer with a modem connected. Callers would call it up, and they’d interact with the BBS. Most BBSs had discussion groups like forums and file areas. Some also had games. I, of course, continued to have that most vexing of problems: they were all long-distance.

There were some ways to help with that, chiefly QWK and BlueWave. These, somewhat like TapCIS in the CompuServe days, let me download new message posts for reading offline, and queue up my own messages to send later. QWK and BlueWave didn’t help with file downloading, though.

BBSs get networked

BBSs were an interesting thing. You’d call up one, and inevitably somewhere in the file area would be a BBS list. Download the BBS list and you’ve suddenly got a list of phone numbers to try calling. All of them were long distance, of course. You’d try calling them at random and have a success rate of maybe 20%. The other 80% would be defunct; you might get the dreaded “this number is no longer in service” or the even more dreaded angry human answering the phone (and of course a modem can’t talk to a human, so they’d just get silence for probably the nth time that week). The phone company cared nothing about BBSs and recycled their numbers just as fast as any others.

To talk to various people, or participate in certain discussion groups, you’d have to call specific BBSs. That’s annoying enough in the general case, but even more so for someone paying long distance for it all, because it takes a few minutes to establish a connection to a BBS: handshaking, logging in, menu navigation, etc.

But BBSs started talking to each other. The earliest successful such effort was FidoNet, and for the duration of the BBS era, it remained by far the largest. FidoNet was analogous to the UUCP that the institutional users had, but ran on the much cheaper PC hardware. Basically, BBSs that participated in FidoNet would relay email, forum posts, and files between themselves overnight. Eventually, as with UUCP, by hopping through this network, messages could reach around the globe, and forums could have worldwide participation – asynchronously, long before they could link to each other directly via the Internet. It was almost entirely volunteer-run.

Running my own BBS

At age 13, I eventually chose to set up my own BBS. It ran on my single phone line, so of course when I was dialing up something else, nobody could dial up me. Not that this was a huge problem; in my town of 500, I probably had a good 1 or 2 regular callers in the beginning.

In the PC era, there was a big difference between a server and a client. Server-class software was expensive and rare. Maybe in later years you had an email client, but an email server would be completely unavailable to you as a home user. But with a BBS, I could effectively run a server. I even ran serial lines in our house so that the BBS could be connected from other rooms! Since I was running OS/2, the BBS didn’t tie up the computer; I could continue using it for other things.

FidoNet had an Internet email gateway. This one, unlike CompuServe’s, was free. Once I had a BBS on FidoNet, you could reach me from the Internet using the FidoNet address. This didn’t support attachments, but then email of the day didn’t really, either.

Various others outside Kansas ran FidoNet distribution points. I believe one of them was mgmtsys; my memory is quite vague, but I think they offered a direct gateway and I would call them to pick up Internet mail via FidoNet protocols, but I’m not at all certain of this.

Pros and Cons of the Non-Microsoft World

As mentioned, Microsoft was and is the dominant operating system vendor for PCs. But I left that world in 1993, and here, nearly 30 years later, have never really returned. I got an operating system with more technical capabilities than the DOS and Windows of the day, but the tradeoff was a much smaller software ecosystem. OS/2 could run DOS programs, but it ran OS/2 programs a lot better. So if I were to run a BBS, I wanted one that had a native OS/2 version – limiting me to a small fraction of available BBS server software. On the other hand, as a fully 32-bit operating system, there started to be OS/2 ports of certain software with a Unix heritage; most notably for me at the time, gcc. At some point, I eventually came across the RMS essays and started to be hooked.

Internet: The Hunt Begins

I certainly was aware that the Internet was out there and interesting. But the first problem was: how the heck do I get connected to the Internet?

ISPs weren’t really a thing; the first one in my area (though still a long-distance call) started in, I think, 1994. One service that one of my teachers got me hooked up with was Learning Link. Learning Link was a nationwide collaboration of PBS stations and schools, designed to build on the educational mission of PBS. The nearest Learning Link station was more than a 3-hour drive away… but critically, they had a toll-free access number, and my teacher convinced them to let me use it. I connected via a terminal program and a modem, like with most other things. I don’t remember much about it, but I do remember a very important thing it had: Gopher. That was my first experience with Gopher.

Learning Link was hosted by a Unix derivative (Xenix), but it didn’t exactly give everyone a shell. I seem to recall it didn’t have open FTP access either. The Gopher client had FTP access at some point; I don’t recall for sure if it did then. If it did, then when a Gopher server referred to an FTP server, I could get to it. (I am unclear at this point if I could key in an arbitrary FTP location, or knew how, at that time.) I also had email access there, but I don’t recall exactly how; probably Pine. If that’s correct, that would have dated my Learning Link access as no earlier than 1992.

I think my access time to Learning Link was limited. And, since the only way to get out on the Internet from there was Gopher and Pine, I was somewhat limited in terms of technology as well. I believe that telnet services, for instance, weren’t available to me.

Computer labs

There was one place that tended to have Internet access: colleges and universities. In 7th grade, I participated in a program that resulted in me being invited to visit Duke University, and in 8th grade, I participated in National History Day, resulting in a trip to visit the University of Maryland. I probably sought out computer labs at both of those. My most distinct memory was finding my way into a computer lab at one of those universities, and it was full of NeXT workstations. I had never seen or used NeXT before, and had no idea how to operate it. I had brought a box of floppy disks, unaware that the DOS disks probably weren’t compatible with NeXT.

Closer to home, a small college had a computer lab that I could also visit. I would go there in summer or when it wasn’t used with my stack of floppies. I remember downloading disk images of FLOSS operating systems: FreeBSD, Slackware, or Debian, at the time. The hash marks from the DOS-based FTP client would creep across the screen as the 1.44MB disk images would slowly download. telnet was also available on those machines, so I could telnet to things like public-access Archie servers and libraries – though not Gopher. Still, FTP and telnet access opened up a lot, and I learned quite a bit in those years.

Continuing the Journey

At some point, I got a copy of the Whole Internet User’s Guide and Catalog, published in 1994. I still have it. If it hadn’t already figured it out by then, I certainly became aware from it that Unix was the dominant operating system on the Internet. The examples in Whole Internet covered FTP, telnet, gopher – all assuming the user somehow got to a Unix prompt. The web was introduced about 300 pages in; clearly viewed as something that wasn’t page 1 material. And it covered the command-line www client before introducing the graphical Mosaic. Even then, though, the book highlighted Mosaic’s utility as a front-end for Gopher and FTP, and even the ability to launch telnet sessions by clicking on links. But having a copy of the book didn’t equate to having any way to run Mosaic. The machines in the computer lab I mentioned above all ran DOS and were incapable of running a graphical browser. I had no SLIP or PPP (both ways to run Internet traffic over a modem) connectivity at home. In short, the Web was something for the large institutional users at the time.

CD-ROMs

As CD-ROMs came out, with their huge (for the day) 650MB capacity, various companies started collecting software that could be downloaded on the Internet and selling it on CD-ROM. The two most popular ones were Walnut Creek CD-ROM and Infomagic. One could buy extensive Shareware and gaming collections, and then even entire Linux and BSD distributions. Although not exactly an Internet service per se, it was a way of bringing what may ordinarily only be accessible to institutional users into the home computer realm.

Free Software Jumps In

As I mentioned, by the mid 90s, I had come across RMS’s writings about free software – most probably his 1992 essay Why Software Should Be Free. (Please note, this is not a commentary on the more recently-revealed issues surrounding RMS, but rather his writings and work as I encountered them in the 90s.) The notion of a Free operating system – not just in cost but in openness – was incredibly appealing. Not only could I tinker with it to a much greater extent due to having source for everything, but it included so much software that I’d otherwise have to pay for. Compilers! Interpreters! Editors! Terminal emulators! And, especially, server software of all sorts. There’d be no way I could afford or run Netware, but with a Free Unixy operating system, I could do all that. My interest was obviously piqued. Add to that the fact that I could actually participate and contribute – I was about to become hooked on something that I’ve stayed hooked on for decades.

But then the question was: which Free operating system? Eventually I chose FreeBSD to begin with; that would have been sometime in 1995. I don’t recall the exact reasons for that. I remember downloading Slackware install floppies, and probably the fact that Debian wasn’t yet at 1.0 scared me off for a time. FreeBSD’s fantastic Handbook – far better than anything I could find for Linux at the time – was no doubt also a factor.

The de Raadt Factor

Why not NetBSD or OpenBSD? The short answer is Theo de Raadt. Somewhere in this time, when I was somewhere between 14 and 16 years old, I asked some questions comparing NetBSD to the other two free BSDs. This was on a NetBSD mailing list, but for some reason Theo saw it and got a flame war going, which CC’d me. Now keep in mind that even if NetBSD had a web presence at the time, it would have been minimal, and I would have – not all that unusually for the time – had no way to access it. I was certainly not aware of the, shall we say, acrimony between Theo and NetBSD. While I had certainly seen an online flamewar before, this took on a different and more disturbing tone; months later, Theo randomly emailed me under the subject “SLIME” saying that I was, well, “SLIME”. I seem to recall periodic emails from him thereafter reminding me that he hates me and that he had blocked me. (Disclaimer: I have poor email archives from this period, so the full details are lost to me, but I believe I am accurately conveying these events from over 25 years ago)

This was a surprise, and an unpleasant one. I was trying to learn, and while it is possible I didn’t understand some aspect or other of netiquette (or Theo’s personal hatred of NetBSD) at the time, still that is not a reason to flame a 16-year-old (though he would have had no way to know my age). This didn’t leave any kind of scar, but did leave a lasting impression; to this day, I am particularly concerned with how FLOSS projects handle poisonous people. Debian, for instance, has come a long way in this over the years, and even Linus Torvalds has turned over a new leaf. I don’t know if Theo has.

In any case, I didn’t use NetBSD then. I did try it periodically in the years since, but never found it compelling enough to justify a large switch from Debian. I never tried OpenBSD for various reasons, but one of them was that I didn’t want to join a community that tolerates behavior such as Theo’s from its leader.

Moving to FreeBSD

Moving from OS/2 to FreeBSD was final. That is, I didn’t have enough hard drive space to keep both. I also didn’t have the backup capacity to back up OS/2 completely. My BBS, which ran Virtual BBS (and at some point also AdeptXBBS) was deleted and reincarnated in a different form. My BBS was a member of both FidoNet and VirtualNet; the latter was specific to VBBS, and had to be dropped. I believe I may have also had to drop the FidoNet link for a time. This was the biggest change of computing in my life to that point. The earlier experiences hadn’t literally destroyed what came before. OS/2 could still run my DOS programs. Its command shell was quite DOS-like. It ran Windows programs. I was going to throw all that away and leap into the unknown.

I wish I had saved a copy of my BBS; I would love to see the messages I exchanged back then, or see its menu screens again. I have little memory of what it looked like. But other than that, I have no regrets. Pursuing Free, Unixy operating systems brought me a lot of enjoyment and a good career.

That’s not to say it was easy. All the problems of not being in the Microsoft ecosystem were magnified under FreeBSD and Linux. In a day before EDID, monitor timings had to be calculated manually – and you risked destroying your monitor if you got them wrong. Word processing and spreadsheet software was pretty much not there for FreeBSD or Linux at the time; I was therefore forced to learn LaTeX and actually appreciated that. Software like PageMaker or CorelDraw was certainly nowhere to be found for those free operating systems either. But I got a ton of new capabilities.

I mentioned the BBS didn’t shut down, and indeed it didn’t. I ran what was surely a supremely unique oddity: a free, dialin Unix shell server in the middle of a small town in Kansas. I’m sure I provided things such as pine for email and some help text and maybe even printouts for how to use it. The set of callers slowly grew over the time period, in fact.

And then I got UUCP.

Enter UUCP

Even throughout all this, there was no local Internet provider and things were still long distance. I had Internet Email access via assorted strange routes, but they were all… strange. And, I wanted access to Usenet. In 1995, it happened.

The local ISP I mentioned offered UUCP access. Though I couldn’t afford the dialup shell (or later, SLIP/PPP) that they offered due to long-distance costs, UUCP’s very efficient batched processes looked doable. I believe I established that link when I was 15, so in 1995.

I worked to register my domain, complete.org, as well. At the time, the process was a bit lengthy and involved downloading a text file form, filling it out in a precise way, sending it to InterNIC, and probably mailing them a check. Well I did that, and in September of 1995, complete.org became mine. I set up sendmail on my local system, as well as INN to handle the limited Usenet newsfeed I requested from the ISP. I even ran Majordomo to host some mailing lists, including some that were surprisingly high-traffic for a few-times-a-day long-distance modem UUCP link!

The modem client programs for FreeBSD were somewhat less advanced than for OS/2, but I believe I wound up using Minicom or Seyon to continue to dial out to BBSs and, I believe, continue to use Learning Link. So all the while I was setting up my local BBS, I continued to have access to the text Internet, consisting of chiefly Gopher for me.

Switching to Debian

I switched to Debian sometime in 1995 or 1996, and have been using Debian as my primary OS ever since. I continued to offer shell access, but added the WorldVU Atlantis menuing BBS system. This provided a return of a more BBS-like interface (by default; shell was still an uption) as well as some BBS door games such as LoRD and TradeWars 2002, running under DOS emulation.

I also continued to run INN, and ran ifgate to allow FidoNet echomail to be presented into INN Usenet-like newsgroups, and netmail to be gated to Unix email. This worked pretty well. The BBS continued to grow in these days, peaking at about two dozen total user accounts, and maybe a dozen regular users.

Dial-up access availability

I believe it was in 1996 that dial up PPP access finally became available in my small town. What a thrill! FINALLY! I could now FTP, use Gopher, telnet, and the web all from home. Of course, it was at modem speeds, but still.

(Strangely, I have a memory of accessing the Web using WebExplorer from OS/2. I don’t know exactly why; it’s possible that by this time, I had upgraded to a 486 DX2/66 and was able to reinstall OS/2 on the old 25MHz 486, or maybe something was wrong with the timeline from my memories from 25 years ago above. Or perhaps I made the occasional long-distance call somewhere before I ditched OS/2.)

Gopher sites still existed at this point, and I could access them using Netscape Navigator – which likely became my standard Gopher client at that point. I don’t recall using UMN text-mode gopher client locally at that time, though it’s certainly possible I did.

The city

Starting when I was 15, I took computer science classes at Wichita State University. The first one was a class in the summer of 1995 on C++. I remember being worried about being good enough for it – I was, after all, just after my HS freshman year and had never taken the prerequisite C class. I loved it and got an A! By 1996, I was taking more classes.

In 1996 or 1997 I stayed in Wichita during the day due to having more than one class. So, what would I do then but… enjoy the computer lab? The CS dept. had two of them: one that had NCD X terminals connected to a pair of SunOS servers, and another one running Windows. I spent most of the time in the Unix lab with the NCDs; I’d use Netscape or pine, write code, enjoy the University’s fast Internet connection, and so forth.

In 1997 I had graduated high school and that summer I moved to Wichita to attend college. As was so often the case, I shut down the BBS at that time. It would be 5 years until I again dealt with Internet at home in a rural community.

By the time I moved to my apartment in Wichita, I had stopped using OS/2 entirely. I have no memory of ever having OS/2 there. Along the way, I had bought a Pentium 166, and then the most expensive piece of computing equipment I have ever owned: a DEC Alpha, which, of course, ran Linux.

ISDN

I must have used dialup PPP for a time, but I eventually got a job working for the ISP I had used for UUCP, and then PPP. While there, I got a 128Kbps ISDN line installed in my apartment, and they gave me a discount on the service for it. That was around 3x the speed of a modem, and crucially was always on and gave me a public IP. No longer did I have to use UUCP; now I got to host my own things! By at least 1998, I was running a web server on www.complete.org, and I had an FTP server going as well.

Even Bigger Cities

In 1999 I moved to Dallas, and there got my first broadband connection: an ADSL link at, I think, 1.5Mbps! Now that was something! But it had some reliability problems. I eventually put together a server and had it hosted at an acquantaince’s place who had SDSL in his apartment. Within a couple of years, I had switched to various kinds of proper hosting for it, but that is a whole other article.

In Indianapolis, I got a cable modem for the first time, with even tighter speeds but prohibitions on running “servers” on it. Yuck.

Challenges

Being non-Microsoft continued to have challenges. Until the advent of Firefox, a web browser was one of the biggest. While Netscape supported Linux on i386, it didn’t support Linux on Alpha. I hobbled along with various attempts at emulators, old versions of Mosaic, and so forth. And, until StarOffice was open-sourced as Open Office, reading Microsoft file formats was also a challenge, though WordPerfect was briefly available for Linux.

Over the years, I have become used to the Linux ecosystem. Perhaps I use Gimp instead of Photoshop and digikam instead of – well, whatever somebody would use on Windows. But I get ZFS, and containers, and so much that isn’t available there.

Yes, I know Apple never went away and is a thing, but for most of the time period I discuss in this article, at least after the rise of DOS, it was niche compared to the PC market.

Back to Kansas

In 2002, I moved back to Kansas, to a rural home near a different small town in the county next to where I grew up. Over there, it was back to dialup at home, but I had faster access at work. I didn’t much care for this, and thus began a 20+-year effort to get broadband in the country. At first, I got a wireless link, which worked well enough in the winter, but had serious problems in the summer when the trees leafed out. Eventually DSL became available locally – highly unreliable, but still, it was something. Then I moved back to the community I grew up in, a few miles from where I grew up. Again I got DSL – a bit better. But after some years, being at the end of the run of DSL meant I had poor speeds and reliability problems. I eventually switched to various wireless ISPs, which continues to the present day; while people in cities can get Gbps service, I can get, at best, about 50Mbps. Long-distance fees are gone, but the speed disparity remains.

Concluding Reflections

I am glad I grew up where I did; the strong community has a lot of advantages I don’t have room to discuss here. In a number of very real senses, having no local services made things a lot more difficult than they otherwise would have been. However, perhaps I could say that I also learned a lot through the need to come up with inventive solutions to those challenges. To this day, I think a lot about computing in remote environments: partially because I live in one, and partially because I enjoy visiting places that are remote enough that they have no Internet, phone, or cell service whatsoever. I have written articles like Tools for Communicating Offline and in Difficult Circumstances based on my own personal experience. I instinctively think about making protocols robust in the face of various kinds of connectivity failures because I experience various kinds of connectivity failures myself.

(Almost) Everything Lives On

In 2002, Gopher turned 10 years old. It had probably been about 9 or 10 years since I had first used Gopher, which was the first way I got on live Internet from my house. It was hard to believe. By that point, I had an always-on Internet link at home and at work. I had my Alpha, and probably also at least PCMCIA Ethernet for a laptop (many laptops had modems by the 90s also). Despite its popularity in the early 90s, less than 10 years after it came on the scene and started to unify the Internet, it was mostly forgotten.

And it was at that moment that I decided to try to resurrect it. The University of Minnesota finally released it under an Open Source license. I wrote the first new gopher server in years, pygopherd, and introduced gopher to Debian. Gopher lives on; there are now quite a few Gopher clients and servers out there, newly started post-2002. The Gemini protocol can be thought of as something akin to Gopher 2.0, and it too has a small but blossoming ecosystem.

Archie, the old FTP search tool, is dead though. Same for WAIS and a number of the other pre-web search tools. But still, even FTP lives on today.

And BBSs? Well, they didn’t go away either. Jason Scott’s fabulous BBS documentary looks back at the history of the BBS, while Back to the BBS from last year talks about the modern BBS scene. FidoNet somehow is still alive and kicking. UUCP still has its place and has inspired a whole string of successors. Some, like NNCP, are clearly direct descendents of UUCP. Filespooler lives in that ecosystem, and you can even see UUCP concepts in projects as far afield as Syncthing and Meshtastic. Usenet still exists, and you can now run Usenet over NNCP just as I ran Usenet over UUCP back in the day (which you can still do as well). Telnet, of course, has been largely supplanted by ssh, but the concept is more popular now than ever, as Linux has made ssh be available on everything from Raspberry Pi to Android.

And I still run a Gopher server, looking pretty much like it did in 2002.

This post also has a permanent home on my website, where it may be periodically updated.

I Finally Found a Solid Debian Tablet: The Surface Go 2

I have been looking for a good tablet for Debian for… well, years. I want thin, light, portable, excellent battery life, and a servicable keyboard.

For a while, I tried a Lenovo Chromebook Duet. It meets the hardware requirements, well sort of. The problem is with performance and the OS. I can run Debian inside the ChromeOS Linux environment. That works, actually pretty well. But it is slow. Terribly, terribly, terribly slow. Emacs takes minutes to launch. apt-gets also do. It has barely enough RAM to keep its Chrome foundation happy, let alone a Linux environment also. But basically it is too slow to be servicable. Not just that, but I ran into assorted issues with having it tied to a Google account – particularly being unable to login unless I had Internet access after an update. That and my growing concern over Google’s privacy practices led me sort of write it off.

I have a wonderful System76 Lemur Pro that I’m very happy with. Plenty of RAM, a good compromise size between portability and screen size at 14.1″, and so forth. But a 10″ goes-anywhere it’s not.

I spent quite a lot of time looking at thin-and-light convertible laptops of various configurations. Many of them were quite expensive, not as small as I wanted, or had dubious Linux support. To my surprise, I wound up buying a Surface Go 2 from the Microsoft store, along with the Type Cover. They had a pretty good deal on it since the Surface Go 3 is out; the highest-processor model of the Go 2 is roughly similar to the Go 3 in terms of performance.

There is an excellent linux-surface project out there that provides very good support for most Surface devices, including the Go 2 and 3.

I put Debian on it. I had a fair bit of hassle with EFI, and wound up putting rEFInd on it, which mostly solved those problems. (I did keep a Windows partition, and if it comes up for some reason, the easiest way to get it back to Debian is to use the Windows settings tool to reboot into advanced mode, and then select the appropriate EFI entry to boot from there.)

Researching on-screen keyboards, it seemed like Gnome had the most mature. So I wound up with Gnome (my other systems are using KDE with tiling, but I figured I’d try Gnome on it.) Almost everything worked without additional tweaking, the one exception being the cameras. The cameras on the Surfaces are a known point of trouble and I didn’t bother to go to all the effort to get them working.

With 8GB of RAM, I didn’t put ZFS on it like I do on other systems. Performance is quite satisfactory, including for Rust development. Battery life runs about 10 hours with light use; less when running a lot of cargo builds, of course.

The 1920×1280 screen is nice at 10.5″. Gnome with Wayland does a decent job of adjusting to this hi-res configuration.

I took this as my only computer for a trip from the USA to Germany. It was a little small at times; though that was to be expected. It let me take a nicely small bag as a carryon, and being light, it was pleasant to carry around in airports. It served its purpose quite well.

One downside is that it can’t be powered by a phone charger like my Chromebook Duet can. However, I found a nice slim 65W Anker charger that could charge it and phones simultaneously that did the job well enough (I left the Microsoft charger with the proprietary connector at home).

The Surface Go 2 maxes out at a 128GB SSD. That feels a bit constraining, especially since I kept Windows around. However, it also has a micro SD slot, so you can put LUKS and ext4 on that and use it as another filesystem. I popped a micro SD I had lying around into there and that felt a lot better storage-wise. I could also completely zap Windows, but that would leave no way to get firmware updates and I didn’t really want to do that. Still, I don’t use Windows and that could be an option also.

All in all, I’m pretty pleased with it. Around $600 for a fully-functional Debian tablet, with a keyboard is pretty nice.

I had been hoping for months that the Pinetab would come back into stock, because I’d much rather support a Linux hardware vendor, but for now I think the Surface Go series is the most solid option for a Linux tablet.

Pipe Issue Likely a Kernel Bug

Saturday, I wrote in Pipes, deadlocks, and strace annoyingly fixing them about an issue where a certain pipeline seems to have a deadlock. I described tracing it into kernel code. Indeed, it appears to be kernel bug 212295, which has had a patch for over a year that has never been merged.

After continuing to dig into the issue, I eventually reported it as a bug in ZFS. One of the ZFS people connected this to an older issue my searching hadn’t uncovered.

rincebrain summarized:

I believe, if I understand the bug correctly, it only triggers if you F_SETPIPE_SZ when the writer has put nonzero but not a full unit’s worth in yet, which is why the world isn’t on fire screaming about this – you need to either have a very slow but nonzero or otherwise very strange write pattern to hit it, which is why it doesn’t come up in, say, the CI or most of my testbeds, but my poor little SPARC (440 MHz, 1c1t) and Raspberry Pis were not so fortunate.

You might recall in Saturday’s post that I explained that Filespooler reads a few bytes from the gpg/zstdcat pipeline before spawning and connecting it to zfs receive. I think this is the critical piece of the puzzle; it makes it much more likely to encounter the kernel bug. zfs receive is calls F_SETPIPE_SZ when it starts. Let’s look at how this could be triggered:

In the pre-Filespooler days, the gpg|zstdcat|zfs pipeline was all being set up at once. There would be no data sent to zfs receive until gpg had initialized and begun to decrypt the data, and then zstdcat had begun to decompress it. Those things almost certainly took longer than zfs receive’s initialization, meaning that usually F_SETPIPE_SZ would have been invoked before any data entered the pipe.

After switching to Filespooler, the particular situation here has Filespooler reading somewhere around 100 bytes from the gpg|zstdcat part of the pipeline before ever invoking zfs receive. zstdcat generally emits more than 100 bytes at a time. Therefore, when Filespooler invokes zfs receive and hooks the pipeline up to it, it has a very high chance of there already being data in the pipeline when zfs receive uses F_SETPIPE_SZ. This means that the chances of encountering the conditions that trigger the particular kernel bug are also elevated.

ZFS is integrating a patch to no longer use F_SETPIPE_SZ in zfs receive. I have applied that on my local end to see what happens, and hopefully in a day or two will know for sure if it resolves things.

In the meantime, I hope you enjoyed this little exploration. It resulted in a new bug report to Rust as well as digging up an existing kernel bug. And, interestingly, no bugs in filespooler. Sometimes the thing that changed isn’t the source of the bug!

Pipes, deadlocks, and strace annoyingly fixing them

This is a complex tale I will attempt to make simple(ish). I’ve (re)learned more than I cared to about the details of pipes, signals, and certain system calls – and the solution is still elusive.

For some time now, I have been using NNCP to back up my files. These backups are sent to my backup system, which effectively does this to process them (each ZFS send is piped to a shell script that winds up running this):

gpg -q -d | zstdcat -T0 | zfs receive -u -o readonly=on "$STORE/$DEST"

This processes tens of thousands of zfs sends per week. Recently, having written Filespooler, I switched to sending the backups using Filespooler over NNCP. Now fspl (the Filespooler executable) opens the file for each stream and then connects it to what amounts to this pipeline:

bash -c 'gpg -q -d 2>/dev/null | zstdcat -T0' | zfs receive -u -o readonly=on "$STORE/$DEST"

Actually, to be more precise, it spins up the bash part of it, reads a few bytes from it, and then connects it to the zfs receive.

And this works well — almost always. In something like 1/1000 of the cases, it deadlocks, and I still don’t know why. But I can talk about the journey of trying to figure it out (and maybe some of you will have some ideas).

Filespooler is written in Rust, and uses Rust’s Command system. Effectively what happens is this:

  1. The fspl process has a File handle, which after forking but before invoking bash, it dup2’s to stdin.
  2. The connection between bash and zfs receive is a standard Unix pipe.

I cannot get the problem to duplicate when I run the entire thing under strace -f. So I am left trying to peek at it from the outside. What happens if I try to attach to each component with strace -p?

  • bash is blocking in wait4(), which is expected.
  • gpg is blocking in write().
  • If I attach to zstdcat with strace -p, then all of a sudden the deadlock is cleared and everything resumes and completes normally.
  • Attaching to zfs receive with strace -p causes no output at all from strace for a few seconds, then zfs just writes “cannot receive incremental stream: incomplete stream” and exits with error code 1.

So the plot thickens! Why would connecting to zstdcat and zfs receive cause them to actually change behavior? strace works by using the ptrace system call, and ptrace in a number of cases requires sending SIGSTOP to a process. In a complicated set of circumstances, a system call may return EINTR when a SIGSTOP is received, with the idea that the system call should be retried. I can’t see, from either zstdcat or zfs, if this is happening, though.

So I thought, “how about having Filespooler manually copy data from bash to zfs receive in a read/write loop instead of having them connected directly via a pipe?” That is, there would be two pipes going there: one where Filespooler reads from the bash command, and one where it writes to zfs. If nothing else, I could instrument it with debugging.

And so I did, and I found that when it deadlocked, it was deadlocking on write — but with no discernible pattern as to where or when. So I went back to directly connected.

In analyzing straces, I found a Rust bug which I reported in which it is failing to close the read end of a pipe in the parent post-fork. However, having implemented a workaround for this, it doesn’t prevent the deadlock so this is orthogonal to the issue at hand.

Among the two strange things here are things returning to normal when I attach strace to zstdcat, and things crashing when I attach strace to zfs. I decided to investigate the latter.

It turns out that the ZFS code that is reading from stdin during zfs receive is in the kernel module, not userland. Here is the part that is triggering the “imcomplete stream” error:

                int err = zfs_file_read(fp, (char *)buf + done,
                    len - done, &resid);
                if (resid == len - done) {
                        /*
                         * Note: ECKSUM or ZFS_ERR_STREAM_TRUNCATED indicates
                         * that the receive was interrupted and can
                         * potentially be resumed.
                         */
                        err = SET_ERROR(ZFS_ERR_STREAM_TRUNCATED);
                }

resid is an output parameter with the number of bytes remaining from a short read, so in this case, if the read produced zero bytes, then it sets that error. What’s zfs_file_read then?

It boils down to a thin wrapper around kernel_read(). This winds up calling __kernel_read(), which calls read_iter on the pipe, which is pipe_read(). That’s where I don’t have the knowledge to get into the weeds right now.

So it seems likely to me that the problem has something to do with zfs receive. But, what, and why does it only not work in this one very specific situation, and only so rarely? And why does attaching strace to zstdcat make it all work again? I’m indeed puzzled!

Update 2022-06-20: See the followup post which identifies this as likely a kernel bug and explains why this particular use of Filespooler made it easier to trigger.

Fast, Ordered Unixy Queues over NNCP and Syncthing with Filespooler

It seems that lately I’ve written several shell implementations of a simple queue that enforces ordered execution of jobs that may arrive out of order. After writing this for the nth time in bash, I decided it was time to do it properly. But first, a word on the why of it all.

Why did I bother?

My needs arose primarily from handling Backups over Asynchronous Communication methods – in this case, NNCP. When backups contain incrementals that are unpacked on the destination, they must be applied in the correct order.

In some cases, like ZFS, the receiving side will detect an out-of-order backup file and exit with an error. In those cases, processing in random order is acceptable but can be slow if, say, hundreds or thousands of hourly backups have stacked up over a period of time. The same goes for using gitsync-nncp to synchronize git repositories. In both cases, a best effort based on creation date is sufficient to produce a significant performance improvement.

With other cases, such as tar or dar backups, the receiving cannot detect out of order incrementals. In those situations, the incrementals absolutely must be applied with strict ordering. There are many other situations that arise with these needs also. Filespooler is the answer to these.

Existing Work

Before writing my own program, I of course looked at what was out there already. I looked at celeary, gearman, nq, rq, cctools work queue, ts/tsp (task spooler), filequeue, dramatiq, GNU parallel, and so forth.

Unfortunately, none of these met my needs at all. They all tended to have properties like:

  • An extremely complicated client/server system that was incompatible with piping data over existing asynchronous tools
  • A large bias to processing of small web requests, resulting in terrible inefficiency or outright incompatibility with jobs in the TB range
  • An inability to enforce strict ordering of jobs, especially if they arrive in a different order from how they were queued

Many also lacked some nice-to-haves that I implemented for Filespooler:

  • Support for the encryption and cryptographic authentication of jobs, including metadata
  • First-class support for arbitrary compressors
  • Ability to use both stream transports (pipes) and filesystem-like transports (eg, rclone mount, S3, Syncthing, or Dropbox)

Introducing Filespooler

Filespooler is a tool in the Unix tradition: that is, do one thing well, and integrate nicely with other tools using the fundamental Unix building blocks of files and pipes. Filespooler itself doesn’t provide transport for jobs, but instead is designed to cooperate extremely easily with transports that can be written to as a filesystem or piped to – which is to say, almost anything of interest.

Filespooler is written in Rust and has an extensive Filespooler Reference as well as many tutorials on its homepage. To give you a few examples, here are some links:

Basics of How it Works

Filespooler is intentionally simple:

  • The sender maintains a sequence file that includes a number for the next job packet to be created.
  • The receiver also maintains a sequence file that includes a number for the next job to be processed.
  • fspl prepare creates a Filespooler job packet and emits it to stdout. It includes a small header (<100 bytes in most cases) that includes the sequence number, creation timestamp, and some other useful metadata.
  • You get to transport this job packet to the receiver in any of many simple ways, which may or may not involve Filespooler’s assistance.
  • On the receiver, Filespooler (when running in the default strict ordering mode) will simply look at the sequence file and process jobs in incremental order until it runs out of jobs to process.

The name of job files on-disk matches a pattern for identification, but the content of them is not significant; only the header matters.

You can send job data in three ways:

  1. By piping it to fspl prepare
  2. By setting certain environment variables when calling fspl prepare
  3. By passing additional command-line arguments to fspl prepare, which can optionally be passed to the processing command at the receiver.

Data piped in is added to the job “payload”, while environment variables and command-line parameters are encoded in the header.

Basic usage

Here I will excerpt part of the Using Filespooler over Syncthing tutorial; consult it for further detail. As a bit of background, Syncthing is a FLOSS decentralized directory synchronization tool akin to Dropbox (but with a much richer feature set in many ways).

Preparation

First, on the receiver, you create the queue (passing the directory name to -q):

sender$ fspl queue-init -q ~/sync/b64queue

Now, we can send a job like this:

sender$ echo Hi | fspl prepare -s ~/b64seq -i - | fspl queue-write -q ~/sync/b64queue

Let’s break that down:

  • First, we pipe “Hi” to fspl prepare.
  • fspl prepare takes two parameters:
    • -s seqfile gives the path to a sequence file used on the sender side. This file has a simple number in it that increments a unique counter for every generated job file. It is matched with the nextseq file within the queue to make sure that the receiver processes jobs in the correct order. It MUST be separate from the file that is in the queue and should NOT be placed within the queue. There is no need to sync this file, and it would be ideal to not sync it.
    • The -i option tells fspl prepare to read a file for the packet payload. -i - tells it to read stdin for this purpose. So, the payload will consist of three bytes: “Hi\n” (that is, including the terminating newline that echo wrote)
  • Now, fspl prepare writes the packet to its stdout. We pipe that into fspl queue-write:
    • fspl queue-write reads stdin and writes it to a file in the queue directory in a safe manner. The file will ultimately match the fspl-*.fspl pattern and have a random string in the middle.

At this point, wait a few seconds (or however long it takes) for the queue files to be synced over to the recipient.

On the receiver, we can see if any jobs have arrived yet:

receiver$ fspl queue-ls -q ~/sync/b64queue
ID                   creation timestamp          filename
1                    2022-05-16T20:29:32-05:00   fspl-7b85df4e-4df9-448d-9437-5a24b92904a4.fspl

Let’s say we’d like some information about the job. Try this:

receiver$ $ fspl queue-info -q ~/sync/b64queue -j 1
FSPL_SEQ=1
FSPL_CTIME_SECS=1652940172
FSPL_CTIME_NANOS=94106744
FSPL_CTIME_RFC3339_UTC=2022-05-17T01:29:32Z
FSPL_CTIME_RFC3339_LOCAL=2022-05-16T20:29:32-05:00
FSPL_JOB_FILENAME=fspl-7b85df4e-4df9-448d-9437-5a24b92904a4.fspl
FSPL_JOB_QUEUEDIR=/home/jgoerzen/sync/b64queue
FSPL_JOB_FULLPATH=/home/jgoerzen/sync/b64queue/jobs/fspl-7b85df4e-4df9-448d-9437-5a24b92904a4.fspl

This information is intentionally emitted in a format convenient for parsing.

Now let’s run the job!

receiver$ fspl queue-process -q ~/sync/b64queue --allow-job-params base64
SGkK

There are two new parameters here:

  • --allow-job-params says that the sender is trusted to supply additional parameters for the command we will be running.
  • base64 is the name of the command that we will run for every job. It will:
    • Have environment variables set as we just saw in queue-info
    • Have the text we previously prepared – “Hi\n” – piped to it

By default, fspl queue-process doesn’t do anything special with the output; see Handling Filespooler Command Output for details on other options. So, the base64-encoded version of our string is “SGkK”. We successfully sent a packet using Syncthing as a transport mechanism!

At this point, if you do a fspl queue-ls again, you’ll see the queue is empty. By default, fspl queue-process deletes jobs that have been successfully processed.

For more

See the Filespooler homepage.


This blog post is also available as a permanent, periodically-updated page.

KDE: A Nice Tiling Envieonment and a Surprisingly Awesome DE

I recently wrote that managing an external display on Linux shouldn’t be this hard. I went down a path of trying out some different options before finally landing at an unexpected place: KDE. I say “unexpected” because I find tiling window managers are just about a necessity.

Background: xmonad

Until a few months ago, I’d been using xmonad for well over a decade. Configurable, minimal, and very nice; it suited me well.

However, xmonad is getting somewhat long in the tooth. xmobar, which is commonly used with it, barely supports many modern desktop environments. I prefer DEs for the useful integrations they bring: everything from handling mount of USB sticks to display auto-switching and sound switching. xmonad itself can’t run with modern Gnome (whether or not it runs well under KDE 5 seems to be a complicated question, according to wikis, but in any case, there is no log applet for KDE 5). So I was left with XFCE and such, but the isues I identified in the “shouldn’t be this hard” article were bad enough that I just could not keep going that way.

An attempt: Gnome and PaperWM

So I tried Gnome under Wayland, reasoning that Wayland might stand a chance of doing things well where X couldn’t. There are several tiling window extensions available for the Gnome 3 shell. Most seemed to be rather low-quality, but an exception was PaperWM and I eventually decided on it. I never quite decided if I liked its horizontal tape of windows or not; it certainly is unique in any case.

I was willing to tolerate my usual list of Gnome problems for the sake of things working. For instance:

  • The Windows-like “settings are spread out in three different programs and some of them require editing the registry[dconf]”. Finding all the options for keybindings and power settings was a real chore, but done.
  • Some file dialog boxes (such as with the screenshot-taking tool) just do not let me type in a path to save a file, insisting that I first navigate to a directory and then type in a name.
  • General lack of available settings or hiding settings from people.
  • True focus-follows-mouse was incompatible with keyboard window switching (PaperWM or no); with any focus-follows-mouse enabled, using Alt-Tab or any other method to switch to other windows would instantly have focus returned to whatever the mouse was over.

Under Wayland, I found a disturbing lack of logs. There was nothing like /var/log/Xorg.0.log, nothing like ~/.xsession-errors, just nothing. Searching for answers on this revealed a lot of Wayland people saying “it’s a Gnome issue” and the trail going cold at that point.

And there was a weird problem that I just could not solve. After the laptop was suspended and we-awakened, I would be at a lock screen. I could type in my password, but when hitting Enter, the thing would then tend to freeze. Why, I don’t know. It seemed related to Gnome shell; when I switched Gnome from Wayland to X11, it would freeze but eventually return to the unlock screen, at which point I’d type in my password and it would freeze again. I spent a long time tracking down logs to see what was happening, but I couldn’t figure it out. All those hard resets were getting annoying.

Enter KDE

So I tried KDE. I had seen mentions of kwin-tiling, a KDE extension for tiling windows. I thought I’d try this setup.

I was really impressed by KDE’s quality. Not only did it handle absolutely every display-related interaction correctly by default, with no hangs ever, all relevant settings were clearly presented in one place. The KDE settings screens were a breath of fresh air – lots of settings available, all at one place, and tons of features I hadn’t seen elsewhere.

Here are some of the things I was pleasantly surprised by with KDE:

  • Applications can declare classes of notifications. These can be managed Android-style in settings. Moreover, you can associate a shell command to run with a notification in any class. People use this to do things like run commands when a display locks and so forth.
  • KDE Connect is a seriously impressive piece of software. It integrates desktops with Android devices in a way that’s reminiscent of non-free operating systems – and with 100% Free Software (the phone app is even in F-Droid!). Notifications from the phone can appear on the desktop, and their state is synchronized; dismiss it on the desktop and it dismisses on the phone, too. Get a SMS or Signal message on the phone? You can reply directly from the desktop. Share files in both directions, mount a directory tree from the phone on the desktop, “find my phone”, use the phone as a presentation remote for the desktop, shared clipboard, sending links between devices, control the phone media player from the desktop… Really, really impressive.
  • The shortcut settings in KDE really work and are impressive. Unlike Gnome, if you try to assign the same shortcut to multiple things, you are warned and prevented from doing this. As with Gnome, you can also bind shorcuts to arbitrary actions.
  • This shouldn’t be exciting, but I was just using Gnome, so… The panel! I can put things wherever I want them! I can put it at the top of the screen, the bottom, or even the sides! It lets all my regular programs (eg, Nextcloud) put their icons up there without having to install two different extensions, each of which handles a different set of apps! I shouldn’t be excited about all this, because Gnome actually used to have these features years ago… [gripe gripe]
  • Initially I was annoyed that Firefox notifications weren’t showing up in the notification history as they did in Gnome… but that was, of course, a setting, easily fixed!
  • There is a Plasma Integration plugin for Firefox (and other browsers including Chrome). It integrates audio and video playback, download status, etc. with the rest of KDE and KDE Connect. Result: if you like, when a call comes in to your phone, Youtube is paused. Or, you can right-click to share a link to your phone via KDE Connect, and so forth. You can right click on a link, and share via Bluetooth, Nextcloud (it must have somehow registered with KDE), KDE Connect, email, etc.

Tiling

So how about the tiling system, kwin-tiling? The out of the box experience is pretty nice. There are fewer built-in layouts than with xmonad, but the ones that are there are doing a decent job for me, and in some cases are more configurable (those that have a large window pane are configurable on its location, not forcing it to be on the left as with many systems.) What’s more, thanks to the flexibility in the KDE shortcut settings, I can configure it to be nearly keystroke-identical to xmonad!

Issues Encountered

I encountered a few minor issues:

  • There appears to be no way to tell it to “power down the display immediately after it is locked, every time” instead of waiting for some timeout to elapse. This is useful when I want to switch monitor inputs to something else.
  • Firefox ESR seems to have some rendering issues under KDE for some reason, but switching to the latest stable release direct from Mozilla seems to fix that.

In short, I’m very impressed.

Managing an External Display on Linux Shouldn’t Be This Hard

I first started using Linux and FreeBSD on laptops in the late 1990s. Back then, there were all sorts of hassles and problems, from hangs on suspend to pure failure to boot. I still worry a bit about suspend on unknown hardware, but by and large, the picture of Linux on laptops has dramatically improved over the last years. So much so that now I can complain about what would once have been a minor nit: dealing with external monitors.

I have a USB-C dock that provides both power and a Thunderbolt display output over the single cable to the laptop. I think I am similar to most people in wanting the following behavior from the laptop:

  • When the lid is closed, suspend if no external monitor is connected. If an external monitor is connected, shut off the built-in display and use the external one exclusively, but do not suspend.
  • Lock the screen automatically after a period of inactivity.
  • While locked, all connected displays should be powered down.
  • When an external display is connected, begin using it automatically.
  • When an external display is disconnected, stop using it. If the lid is closed when the external display is disconnected, go into suspend mode.

This sounds so simple. But somehow on Linux we’ve split up these things into a dozen tiny bits:

  • In /etc/systemd/logind.conf, there are settings about what to do when the lid is opened or closed.
  • Various desktop environments have overlapping settings covering the same things.
  • Then there are the display managers (gdm3, lightdm, etc) that also get in on the act, and frequently have DIFFERENT settings, set in different places, from the desktop environments. And, what’s more, they tend to be involved with locking these days.
  • Then there are screensavers (gnome-screensaver, xscreensaver, etc.) that also enter the picture, and also have settings in these areas.

Problems I’ve Seen

My problems don’t even begin with laptops, but with my desktop, running XFCE with xmonad and lightdm. My desktop is hooked to a display that has multiple inputs. This scenario (reproducible in both buster and bullseye) causes the display to be unusable until a reboot on the desktop:

  1. Be logged in and using the desktop
  2. Without locking the desktop screen, switch the display input to another device
  3. Keep the display input on another device long enough for the desktop screen to auto-lock
  4. At this point, it is impossible to re-awaken the desktop screen.

I should not here that the problems aren’t limited to Debian, but also extend to Ubuntu and various hardware.

Lightdm: which greeter?

At some point while troubleshooting things after upgrading my laptop to bullseye, I noticed that while both were running lightdm, I had different settings and a different appearance between the two. Upon further investigation, I realized that one hat slick-greeter and lightdm-settings installed, while the other had lightdm-gtk-greeter and lightdm-gtk-greeter-settings installed. Very strange.

XFCE: giving up

I eventually gave up on making lightdm work. No combination of settings or greeters would make things work reliably when changing screen configurations. I installed xscreensaver. It doesn’t hang, but it does sometimes take a few tries before it figures out what device to display on.

Worse, since updating from buster to bullseye, XFCE no longer automatically switches audio output when the docking station is plugged in, and there seems to be no easy way to convince Pulseaudio to do this.

X-Based Gnome and derivatives… sigh.

I also tried Gnome, Mate, and Cinnamon, and all of them had various inabilities to configure things to act the way I laid out above.

I’ve long not been a fan of Gnome’s way of hiding things from the user. It now has a Windows-like situation of three distinct settings programs (settings, tweaks, and dconf editor), which overlap in strange ways and interact with systemd in even stranger ways. Gnome 3 make it quite non-intuitive to make app icons from various programs work, and so forth.

Trying Wayland

I recently decided to set up an older laptop that I hadn’t used in awhile. After reading up on Wayland, I decided to try Gnome 3 under Wayland. Both the Debian and Arch wikis note that KDE is buggy on Wayland. Gnome is the only desktop environment that supports it then, unless I want to go with Sway. There’s some appeal to Sway to this xmonad user, but I’ve read of incompatibilities of Wayland software when Gnome’s not available, so I opted to try Gnome.

Well, it’s better. Not perfect, but better. After finding settings buried in a ton of different Settings and Tweaks boxes, I had it mostly working, except gdm3 would never shut off power to the external display. Eventually I found /etc/gdm3/greeter.dconf-defaults, and aadded:

sleep-inactive-ac-timeout=60
sleep-inactive-ac-type='blank'
sleep-inactive-battery-timeout=120
sleep-inactive-battery-type='suspend'

Of course, these overlap with but are distinct from the same kinds of things in Gnome settings.

Sway?

Running without Gnome seems like a challenge; Gnome is switching audio output appropriately, for instance. I am looking at some of the Gnome Shell tiling window manager extensions and hope that some of them may work for me.