Today, my workstation sent me this email:
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
and then a little later, this one:
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], 1 Offline uncorrectable sectors
From the hard disk’s SMART data, this is a clue that the drive is failing or will soon. Sigh. Incidentally, if smartmontools isn’t installed on your machine, whether it’s a laptop, desktop, or server, it should be.
Although most of you know I run Linux on the metal on my machines almost exclusively, I do maintain a small drive with a Windows installation that I boot into every few months for various reasons. This is that drive.
The drive is non-redundant (no RAID), and although it is backed up, the backup is made via backuppc from the NTFS filesystem mounted on Linux, and is a partial backup – backing up certain data, not the OS. There are, of course, bare metal Windows backup solutions, but I generally don’t want to back up Windows from within Windows on this machine. Restoring Windows isn’t quite as simple as an mkfs, an untar, and a grub-install, either.
So my first thought is: immediately save whatever of the drive I can. So I ran apt-get install gddrescue to install the GNU ddrescue tool. ddrescue is somewhat similar to dd, but deals much more intelligently with bad blocks on the drive. It will try to read them repeatedly, with decreasing block sizes, in an effort to get every last good byte off the disk that it can. If it ultimately fails to get certain bytes read, it will write placeholder data to the output file in place of the missing data, so that the output file maintains proper size and alignment. It also saves a log file that notes what it found (see info ddrescue for more on that.)
So I created an LVM volume for the purpose (not enough free space on /home, and didn’t want to have to shrink it somehow later), and ran:
ddrescue /dev/sda /mnt/sdasave.ddrescue /mnt/sdasave.logfile
Then I went to dinner.
When I got back, I discovered there were 1 or 2 bad sectors, about halfway through the disk, but everything else was fine. So now, the question became: did I lose any data? If so, what? I needed to know if I had to revert to a backup for anything or not.
To answer THAT question, first I had to figure out the offset of the bad spots on the disk. That’s not too hard; the logfile gives it to me:
# Rescue Logfile. Created by GNU ddrescue version 1.15
# Command line: ddrescue /dev/sda /mnt/sdasave.ddrescue sdasave.logfile
# current_pos current_status
0x3BBB8BFC00 +
# pos size status
0x00000000 0x3BBB8BF000 +
0x3BBB8BF000 0x00001000 -
0x3BBB8C0000 0x38B5346000 +
what we see is that the bad sector starts at byte 0x3BBB8BF000 (256549580800 decimal) and extends for 0x1000 bytes (4096 decimal). Both the drive and NTFS use 512-byte sectors. So dividing by 512, we get sector 501073400 – 501073407 (4096 bytes is 8 sectors).
As a check, I ran grep sector /var/log/kern.log and turned up a bunch of lines like this:
Jun 14 21:39:11 hephaestus kernel: [35346.929957] end_request: I/O error, dev sda, sector 501073404
Which is within my calculated range.
But this is an absolute sector on the disk. We need the sector within the partition, so for that, we have to enlist fdisk to make that calculation.
fdisk shows, among other things:
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 2048 976771071 488384512 7 HPFS/NTFS/exFAT
So the Windows partition starts at disk sector 2048.
Let’s just confirm that. If I use dd if=/dev/sda1 bs=512 count=1 | hd | head, I see a line beginning with “.R.NTFS”. Exactly the same as with dd if=/dev/sda bs=512 count=1 skip=2048 | hd | head, so I read the partition table information correctly.
Subtract offset of 2048 from the earlier values, and I get relative sectors 501071352-501071359.
That’s enough to get some solid info from the filesystem via ntfscluster, part of Debian’s ntfs-3g package. I pass -s to it, and ignoring some irrelevant stuff, get my answer:
ntfscluster -s 501071352-501071359 /dev/sda1
Inode 190604 /System Volume Information/{b4816feb-b609-11e1-a908-50e549b934f7}{3808876b-c176-4e48-b7ae-04046e6cc752}/$DATA
I even reran it with a much larger sector range, just to be absolutely sure I had wiggle room in case calculations had an off-by-one error or something somewhere.
This is really great news, because the file in question is pretty much useless – I believe it’s a system restore point, which I won’t be needing anyhow.
So at this point, all that remains is to reinstall this on a different drive. For that, I could just use my ddrescue image. I thought I would take a second image, just to be very extra careful, and use that; I used:
partclone.ntfs --rescue -c -s /dev/sda1 -o sda1.partclone
although ntfsclone would work just as well. This captures only the partition; I’ll need the partition table as well, and perhaps also the space between the partition table and the first partition. I could capture it separately with dd, but it’s already in the ddrescue image, so there’s no need. (GRUB is installed on this drive, but there is no Linux filesystem on it, so it may well exceed the size of the MBR).
Note that for Linux ext[234] filesystems, debugfs can provide the same (and more) info as I got from ntfscluster.
I happen to have a drive of the right size sitting here, which I was about to install in a different machine. So a wipe and a swap and a restore later, and I should be good to go.
This scenario is commonplace enough that I thought I’d post how I dealt with it, in case anyone else ever has hard drive issues.