Here’s something you never want to see:
ZFS has detected a checksum error:

  eid: 138
  class: checksum
  host: alexandria
  time: 2017-01-29 18:08:10-0600
  vtype: disk
This means there was a data error on the drive. But it's worse than a typical data error: this is an error the hardware itself did not detect. Unlike most filesystems, ZFS and btrfs write a checksum with every block (both data and metadata) written to the drive, and verify that checksum at read time. Most filesystems skip this because, in theory, the hardware should detect all errors. But in practice it doesn't always, which can lead to silent data corruption. That's why I use ZFS wherever I possibly can.
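To make that concrete, here is a tiny sketch of the idea in Python (an illustration only, not ZFS's on-disk format; ZFS actually keeps each checksum in the parent block pointer rather than next to the data): store a checksum with every block at write time, and refuse to return data whose checksum no longer matches at read time.

    import zlib

    def write_block(store, addr, data):
        # Store the block together with a checksum computed over its contents.
        store[addr] = (data, zlib.crc32(data))

    def read_block(store, addr):
        # Recompute the checksum on every read; a mismatch means the device
        # returned something other than what was written, even though it
        # reported no error itself.
        data, stored = store[addr]
        if zlib.crc32(data) != stored:
            raise IOError(f"checksum error at block {addr}")
        return data

    store = {}
    write_block(store, 0, b"hello")
    store[0] = (b"hellp", store[0][1])   # simulate silent corruption
    read_block(store, 0)                 # raises IOError instead of returning bad data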
As I looked into this issue, I saw that ZFS repaired about 400KB of data. I thought, “well, that was unlucky” and just ignored it.
Then a week later, it happened again. Pretty soon I noticed it happened every Sunday, and always on the same drive in my pool. It so happens that Sunday is when the machine sees its highest I/O load, because a cron job runs zpool scrub then. That operation forces ZFS to read and verify the checksums on every block of data in the pool, and is a nice way to guard against unreadable sectors in rarely-used data.
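For reference, here is roughly what that weekly job amounts to, sketched in Python (the pool name tank is just a placeholder; the actual crontab entry can run zpool scrub directly):

    import subprocess
    import sys

    POOL = "tank"   # placeholder; substitute your own pool name

    # `zpool scrub` starts the scrub and returns immediately; the scrub itself
    # runs in the background and its progress shows up in `zpool status`.
    result = subprocess.run(["zpool", "scrub", POOL])
    if result.returncode != 0:
        sys.exit(f"starting the scrub of {POOL} failed (exit {result.returncode})")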
I finally swapped out the drive, but to my frustration, the new drive exhibited the same issue. The SATA protocol does include a CRC32 checksum, so it seemed (to me, at least) that the problem was unlikely to be a cable or chassis issue. I suspected the motherboard.
It so happened I had a 9211-8i SAS card. I had purchased it off eBay a while back when I built the server, but could never get it to see the drives. I wound up not installing as many drives as planned, so the on-board SATA did the trick. Until now.
As I poked at the 9211-8i, noticing that even its configuration utility didn’t see any devices, I finally started wondering if the SAS/SATA breakout cables were a problem. And sure enough – I realized I had a “reverse” cable and needed a “forward” one. $14 later, I had the correct cable and things are working properly now.
One other note: RAM errors can sometimes cause issues like this, but this system uses ECC DRAM and the errors would be unlikely to always manifest themselves on a particular drive.
So over the course of this, had I not been using ZFS, I would have had several megabytes of reads with undetected errors. Thanks to using ZFS, I know my data integrity is still good.
The HAMMER filesystem from DragonFly BSD also writes checksums with its data blocks.
https://en.wikipedia.org/wiki/HAMMER
Back in the day I had something equally unpleasant happen with a dodgy SCSI cable. The checksum that caught the problem was in my SCCS source control file.
Whoops!
Rgds
Damon
Shitty disk, sure, though it leaves a bitter taste that the problem always occurred during a scrub.
That's why every halfway serious disk array lets you throttle the scrub/sniff/patrol/verify rate.
Yes, the disk was smelly, but a scrub should be a scrub, not a stress test at *the same* time.
(You know drives have had per-block CRC/ECC for ages, plus T10 protection information for error detection, right?)
ZFS does implement throttling. There are two tunables that influence the load from scrub and resilver. The first is a sleep interval in “ticks” (multiples of 1 ms on a FreeBSD system, unless you changed kern.hz). The more relevant one is the per-VDEV I/O scheduling. ZFS issues I/O requests in batches and implements its own scheduling; requests are typed ({ async, sync } x { read, write } + { scrub }). For each type of I/O there is a reservation and an upper bound per scheduler invocation. These can be tuned to a workload, and the defaults limit scrubbing to two outstanding I/O requests at a time per VDEV.
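For intuition, here is a toy Python sketch of that reservation-plus-cap scheduling (an illustration, not ZFS's actual code; the class names follow the comment above, and all numbers except the scrub cap of two are made up):

    from collections import deque

    # name: (min_active, max_active), listed in priority order
    CLASSES = {
        "sync_read":   (10, 10),
        "sync_write":  (10, 10),
        "async_read":  (1, 3),
        "async_write": (1, 10),
        "scrub":       (1, 2),    # cap of 2 mirrors the default described above
    }

    pending = {c: deque() for c in CLASSES}   # queued requests per class
    active = {c: 0 for c in CLASSES}          # requests currently in flight per class

    def pick_next():
        """Pick which class may issue its next I/O, or None if nothing is eligible."""
        # Pass 1: honour reservations; any class below its min_active goes first.
        for c, (lo, _hi) in CLASSES.items():
            if pending[c] and active[c] < lo:
                return c
        # Pass 2: fill up to each class's cap, in priority order.  A busy scrub
        # can therefore never have more than two requests outstanding here.
        for c, (_lo, hi) in CLASSES.items():
            if pending[c] and active[c] < hi:
                return c
        return None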
I once had a SATA drive with some sort of hardware problem related to EDAC. I used it on and off (at work) for a few months and kept getting strange crashes on the code I was building. My co-worker kept snickering and mumbling that my code must be bad, but it worked fine in the debugger. Eventually I suspected the disk and began comparing large binary files that should have been identical. I saw random errors every few hundred kilobytes that were different each time I read the file. That one bad drive cost me a lot of time. NEVER ASSUME ANYTHING.
Your experience shows we’re not past the problems that were reported in previous years:
A 2007 CERN paper describing a userspace checksumming survey used to identify data integrity problems: http://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf
ZFS can only repair corrupted data if it has another drive with parity data on it. Do you run mirrored drives or RAIDZ-1 or 2?
ZFS requires redundancy to recover corrupted data. By default, at least two copies of metadata are maintained on top of any VDEV-level redundancy (mirroring, RAID-Z), and the storage allocator tries to spread those copies over different VDEVs.
Yes, these drives happened to be mirrored. But this is not the only situation in which recovery could be done; see the copies property in zfs(8).
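A small Python illustration of why that redundancy matters (an illustration, not ZFS code): given two copies of a block and its stored checksum, a read can both detect the bad copy and heal it from the good one; with a single copy it could only report the error.

    import hashlib

    def read_with_self_heal(copies, stored_checksum):
        """copies: a list of byte strings that should all hold the same block."""
        good = None
        for data in copies:
            if hashlib.sha256(data).hexdigest() == stored_checksum:
                good = data
                break
        if good is None:
            raise IOError("unrecoverable: no copy matches the checksum")
        # Repair any copy that does not match, as a mirror scrub/resilver would.
        for i, data in enumerate(copies):
            if data != good:
                copies[i] = good
        return good

    block = b"important data"
    cksum = hashlib.sha256(block).hexdigest()
    mirror = [block, b"important dat\x00"]       # second copy silently corrupted
    assert read_with_self_heal(mirror, cksum) == block
    assert mirror[1] == block                    # the bad copy has been healed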
How were you notified of ZFS errors? Did /usr/bin/zed notify you?
Yes, that’s exactly it.
An article on this problem in distributed filesystems (yep, bad there too!):
https://blog.acolyer.org/2017/03/08/redundancy-does-not-imply-fault-tolerance-analysis-of-distributed-storage-reactions-to-single-errors-and-corruptions/
Perhaps it was Cosmic Rays? :)
http://tekhead.it/blog/2016/06/data-corruption-the-silent-killer/