Tag Archives: kernel

Pipes, deadlocks, and strace annoyingly fixing them

This is a complex tale I will attempt to make simple(ish). I’ve (re)learned more than I cared to about the details of pipes, signals, and certain system calls – and the solution is still elusive.

For some time now, I have been using NNCP to back up my files. These backups are sent to my backup system, which effectively does this to process them (each ZFS send is piped to a shell script that winds up running this):

gpg -q -d | zstdcat -T0 | zfs receive -u -o readonly=on "$STORE/$DEST"

This processes tens of thousands of zfs sends per week. Recently, having written Filespooler, I switched to sending the backups using Filespooler over NNCP. Now fspl (the Filespooler executable) opens the file for each stream and then connects it to what amounts to this pipeline:

bash -c 'gpg -q -d 2>/dev/null | zstdcat -T0' | zfs receive -u -o readonly=on "$STORE/$DEST"

Actually, to be more precise, it spins up the bash part of it, reads a few bytes from it, and then connects it to the zfs receive.

And this works well — almost always. In something like 1/1000 of the cases, it deadlocks, and I still don’t know why. But I can talk about the journey of trying to figure it out (and maybe some of you will have some ideas).

Filespooler is written in Rust, and uses Rust’s Command system. Effectively what happens is this:

  1. The fspl process has a File handle, which after forking but before invoking bash, it dup2’s to stdin.
  2. The connection between bash and zfs receive is a standard Unix pipe.

I cannot get the problem to duplicate when I run the entire thing under strace -f. So I am left trying to peek at it from the outside. What happens if I try to attach to each component with strace -p?

  • bash is blocking in wait4(), which is expected.
  • gpg is blocking in write().
  • If I attach to zstdcat with strace -p, then all of a sudden the deadlock is cleared and everything resumes and completes normally.
  • Attaching to zfs receive with strace -p causes no output at all from strace for a few seconds, then zfs just writes “cannot receive incremental stream: incomplete stream” and exits with error code 1.

So the plot thickens! Why would connecting to zstdcat and zfs receive cause them to actually change behavior? strace works by using the ptrace system call, and ptrace in a number of cases requires sending SIGSTOP to a process. In a complicated set of circumstances, a system call may return EINTR when a SIGSTOP is received, with the idea that the system call should be retried. I can’t see, from either zstdcat or zfs, if this is happening, though.

So I thought, “how about having Filespooler manually copy data from bash to zfs receive in a read/write loop instead of having them connected directly via a pipe?” That is, there would be two pipes going there: one where Filespooler reads from the bash command, and one where it writes to zfs. If nothing else, I could instrument it with debugging.

And so I did, and I found that when it deadlocked, it was deadlocking on write — but with no discernible pattern as to where or when. So I went back to directly connected.

In analyzing straces, I found a Rust bug which I reported in which it is failing to close the read end of a pipe in the parent post-fork. However, having implemented a workaround for this, it doesn’t prevent the deadlock so this is orthogonal to the issue at hand.

Among the two strange things here are things returning to normal when I attach strace to zstdcat, and things crashing when I attach strace to zfs. I decided to investigate the latter.

It turns out that the ZFS code that is reading from stdin during zfs receive is in the kernel module, not userland. Here is the part that is triggering the “imcomplete stream” error:

                int err = zfs_file_read(fp, (char *)buf + done,
                    len - done, &resid);
                if (resid == len - done) {
                        /*
                         * Note: ECKSUM or ZFS_ERR_STREAM_TRUNCATED indicates
                         * that the receive was interrupted and can
                         * potentially be resumed.
                         */
                        err = SET_ERROR(ZFS_ERR_STREAM_TRUNCATED);
                }

resid is an output parameter with the number of bytes remaining from a short read, so in this case, if the read produced zero bytes, then it sets that error. What’s zfs_file_read then?

It boils down to a thin wrapper around kernel_read(). This winds up calling __kernel_read(), which calls read_iter on the pipe, which is pipe_read(). That’s where I don’t have the knowledge to get into the weeds right now.

So it seems likely to me that the problem has something to do with zfs receive. But, what, and why does it only not work in this one very specific situation, and only so rarely? And why does attaching strace to zstdcat make it all work again? I’m indeed puzzled!

Update 2022-06-20: See the followup post which identifies this as likely a kernel bug and explains why this particular use of Filespooler made it easier to trigger.

Now I’m a little annoyed at HP

So, a little while ago, I wrote about why I like HP. This week, I’m starting to be annoyed at them.

My employer just bought nearly $100,000 worth of HP hardware. We get a new MSA1500cs Fibre Channel SAN (with redundant controllers, FC switches, disks, etc), a new blade enclosure system, three blades to start with (all of them, at minimum, dual dual-core Opterons with 4GB RAM, and some considerably more), a rack to put all this in, etc.

So we’re starting to set all this stuff up. I’ve got Debian installed on an NFS root for testing the blades and how they interact with the SAN.

The blades have an integrated dual-port QLogic QLA2312 Fibre Channel adapter. The Linux kernel has a built-in driver for this (qla2xxx), which detects it and, so far at least, works fine. We want to run kernel 2.6.17 because it’s the first version where XFS has decent semantics for write ordering to prevent corruption after a power failure. Plus we want at least a 2.6.16.x kernel because we want to run the latest Xen 3.0 on these blades. (Live migration of virtual servers from blade to blade — this will be great.)

But we learn that HP does not support the kernel qla2xxx driver. HP does not say WHY they don’t support it, just that their own driver is the only one that they support.

After plowing through several annoying scripts to get to their driver, I realize why it fails to install: it is OLD. At BEST, 2.6.14 is the most recent kernel it would even compile against (release date: October 2005), and I think the most recent version it supports is more like 2.6.8 (almost TWO YEARS OLD now). They reference a whole bunch of kernel symbols and macros that were removed somewhere between 2.6.8 and 2.6.17.

I sent a ticket to HP support. Their first request was to run their system information gathering tool and send them the results. Fine, that’s reasonable. I did so. Next they say, gee, you’re running Debian, and we don’t support that.

Argh…. If they tried to compile it against 2.6.17.1 on RedHat or SuSE, they’d get the exact same problem. I told them what symbols they were erroneously using, and a simple grep would have showed them that.

Besides, how many customers are going to be pleased with no upgrade path available for 2 years? I wouldn’t want our kernel version to be held hostage to HP’s slow driver development process.

Sigh.