Unreported Disk Data Corruption – Kernel Bug?

March 3rd, 2012

Well this is new, and I’m utterly baffled. Here’s a file that’s not in use by anything.


$ md5sum xppro.vdi
589cbb5501dcddda047344a3550aaa95 xppro.vdi
$ md5sum xppro.vdi
a69806ec60d39e06473edbb0abd71637 xppro.vdi

Every time I run md5sum on it, I get a different answer. Same story with sha256sum. If I grab just the first 100MB, it gives the same answer each time. dmesg doesn’t show any sort of errors whatsoever during the time I’m running the tools. The file is 13GB, and was copied from one laptop to another (the new one being a Thinkpad T420s). The old laptop gives the same answer every time. The new one doesn’t.
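Since the first 100MB hashes the same way every time, one way to narrow things down is to hash progressively larger prefixes and see where repeatability breaks. Here is a minimal sketch, assuming a POSIX shell with md5sum and head -c available; hash_prefix and prefix_is_stable are hypothetical helper names made up for illustration, not existing tools.

```shell
# Hash the first $2 bytes of file $1 and print only the digest.
hash_prefix() {
    head -c "$2" "$1" | md5sum | cut -d' ' -f1
}

# Hash the same prefix $3 times in total.
# Returns 0 if every digest matches the first one, 1 otherwise.
prefix_is_stable() {
    ref=$(hash_prefix "$1" "$2")
    i=1
    while [ "$i" -lt "$3" ]; do
        [ "$(hash_prefix "$1" "$2")" = "$ref" ] || return 1
        i=$((i + 1))
    done
    return 0
}
```

For example, prefix_is_stable xppro.vdi 104857600 5 hashes the first 100MB five times. Repeated reads of a small prefix may be served from the page cache rather than the disk, so a stable result for a small prefix does not by itself clear either the disk or the memory.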

I’ve put the file on different ext4 filesystems on the same machine (one using LUKS encryption, the other not, both under LVM) – same result. This will have also guaranteed different placement on the underlying hard disk.

I verified that nothing is modifying the file by using lsof and inotify. The system is a freshly-installed Debian wheezy running kernel 3.2.0-1-amd64. Any ideas how I go about troubleshooting/fixing this? So far I don’t know if it’s hardware or software, though my gut says software; SMART isn’t showing issues here, and the kernel didn’t log hardware issues, either.


9 Comments

  1. John Goerzen

    Here’s an update. All of a sudden the problem went away. There are three things I did right around then, having given up for the day. One, I power-cycled the 802.11n router/switch/access point that’s about 2ft from the laptop in question. Two, I moved a different laptop, and three, I unplugged the Ethernet port. Oh, also I unloaded the VirtualBox kernel modules.

    Putting all those things back where they were doesn’t cause the problem to recur. I can get the same md5sum on every run, and copying the file back produces the correct result.

    However, I still have a corrupted file on disk. I can md5sum it and get the same result each time, but what’s stored isn’t correct.

    Scary.


    John Goerzen Reply:

    Spoke too soon. I guess I had a respite, but now it’s all broken again.


  2. Jim

    Sounds like a hardware problem to me. You could try something like booting from a live USB and doing an md5sum of the full disk (while unmounted, i.e. just the raw disk itself). Also check the usual culprits: RAM (memtest) is probably the most likely. Try to figure out what the actual corruption is… Maybe with something as simple as 'cp bigfile bigfile2; cmp -l bigfile bigfile2'.
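The output of cmp -l can run to many lines on a large file, so it may help to condense it. The following is a minimal sketch, assuming a POSIX shell with cmp and awk; summarize_diffs is a hypothetical helper name, not an existing tool.

```shell
# Summarize how two files differ, byte by byte.
# cmp -l prints one line per differing byte:
#   <1-based byte offset> <octal value in file 1> <octal value in file 2>
summarize_diffs() {
    cmp -l "$1" "$2" | awk '
        NR == 1 { first = $1 }
        { last = $1 }
        END {
            if (NR == 0)
                print "identical"
            else
                printf "%d differing bytes, first at %d, last at %d\n", NR, first, last
        }'
}
```

Usage would be summarize_diffs bigfile bigfile2. If the summary changes between runs on the same pair of files, that points at the read path (RAM, controller, cabling) rather than at what is stored on disk.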


  3. John

    I had similar problems on a 6 month old machine.

    Run md5sum 50 times on a file – get the same answer
    Run md5sum 50 times again – get 3 or 4 different answers
    Tried running from tmpfs – same problem
    Tried different fs, different disk – still happened
    memtest didn’t find anything.
    Turned out to be dodgy RAM.
    Changed memory for new RAM, OK now.
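The repeated-hash check described above can be scripted. This is a minimal sketch, assuming a POSIX shell with md5sum; count_distinct_hashes is a hypothetical helper name, and the default of 50 runs simply mirrors the count mentioned in the comment.

```shell
# Hash file $1 repeatedly ($2 times, default 50) and print how many
# distinct digests were seen. Anything other than 1 means the reads
# are not repeatable.
count_distinct_hashes() {
    file=$1
    runs=${2:-50}
    i=0
    while [ "$i" -lt "$runs" ]; do
        md5sum "$file" | cut -d' ' -f1
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}
```

Usage would be count_distinct_hashes somefile 50. A result of 1 on a single batch is not proof of health, since the failures described here were intermittent.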


  4. John Goerzen

    Thanks for the tips. Over on the G+ discussion, I got a similar one: https://plus.google.com/107171595803164194992/posts/TnxZM4agwuS. I ran memtest86+ and got errors almost immediately. Yeow.

    Now I’m debating whether to nuke my Debian install and restart from scratch. That’s a lot of time down the drain, but it might be worth it. I have no idea what could have been corrupted. debsums could verify my installed files, of course. I don’t have much in /home that would go unnoticed, or so I hope. And maybe a few fsck runs would catch any filesystem issues. What do you think?


    Chow Loong Jin Reply:

    If you’re getting errors in memtest86+, it would be best to replace your memory first, and then run debsums afterwards to see how it goes.


  5. Rohit

    This happens to me too on Debian/Ubuntu, but the same file works fine under Windows Vista on the same HDD. I wrote to WD about it and they replaced my HDD, but the problem still exists. It usually happens while copying. Maybe it is a RAM problem, but the RAM passed the test at boot time. Sometimes I download a file and it shows up as corrupted; I have to download it again, and then it extracts successfully. ISO files always get corrupted in Debian; I have to use visit to download them and to write them to CD/DVD.

    I posted this on the Debian forum a long time ago: http://forums.debian.net/viewtopic.php?f=30&t=63313


    Rohit Reply:

    error correction:
    I have to use “Win Vista” to download & to write them to cd/dvd


    John Goerzen Reply:

    I wouldn’t count on the boot-time memory test catching issues. It is not particularly thorough. Try apt-get install memtest86+, then reboot and select memory test from the GRUB menu, and let it run. See what it finds.



http://changelog.complete.org / Unreported Disk Data Corruption – Kernel Bug?