Category Archives: Technology

Windows & a dying hard disk: Solving with Linux

Today, my workstation sent me this email:

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

and then a little later, this one:

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Offline uncorrectable sectors

These warnings come from the hard disk’s SMART data, and they are a clue that the drive is failing or soon will be. Sigh. Incidentally, if smartmontools isn’t installed on your machine, whether it’s a laptop, desktop, or server, it should be.
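If you want to look at these counters by hand rather than wait for the email, here is a minimal sketch. It assumes the usual attribute table printed by smartctl -A, with the attribute name in the second column and the raw value in the tenth. The function name smart_bad_sector_counts is made up for illustration.

```shell
#!/bin/bash
# Sketch: pull the two counters smartd warned about out of `smartctl -A` output.
# Assumption: attribute rows have the name in column 2 and the raw value in column 10.
smart_bad_sector_counts() {
    awk '$2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable" { print $2, $10 }'
}

# Typical use (needs root and smartmontools installed):
#   smartctl -A /dev/sda | smart_bad_sector_counts
```

A non-zero raw value on either line is the same condition the smartd emails report.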

Although most of you know I run Linux on the metal on my machines almost exclusively, I do maintain a small drive with a Windows installation that I boot into every few months for various reasons. This is that drive.

The drive is non-redundant (no RAID), and although it is backed up, the backup is made via backuppc from the NTFS filesystem mounted on Linux, and is a partial backup – backing up certain data, not the OS. There are, of course, bare metal Windows backup solutions, but I generally don’t want to back up Windows from within Windows on this machine. Restoring Windows isn’t quite as simple as an mkfs, an untar, and a grub-install, either.

So my first thought is: immediately save whatever of the drive I can. So I ran apt-get install gddrescue to install the GNU ddrescue tool. ddrescue is somewhat similar to dd, but deals much more intelligently with bad blocks on the drive. It will try to read them repeatedly, with decreasing block sizes, in an effort to get every last good byte off the disk that it can. If it ultimately fails to get certain bytes read, it will write placeholder data to the output file in place of the missing data, so that the output file maintains proper size and alignment. It also saves a log file that notes what it found (see info ddrescue for more on that.)

So I created an LVM volume for the purpose (not enough free space on /home, and didn’t want to have to shrink it somehow later), and ran:

ddrescue /dev/sda /mnt/sdasave.ddrescue /mnt/sdasave.logfile

Then I went to dinner.

When I got back, I discovered there were 1 or 2 bad sectors, about halfway through the disk, but everything else was fine. So now, the question became: did I lose any data? If so, what? I needed to know if I had to revert to a backup for anything or not.

To answer THAT question, first I had to figure out the offset of the bad spots on the disk. That’s not too hard; the logfile gives it to me:

# Rescue Logfile. Created by GNU ddrescue version 1.15
# Command line: ddrescue /dev/sda /mnt/sdasave.ddrescue sdasave.logfile
# current_pos  current_status
0x3BBB8BFC00     +
#      pos        size  status
0x00000000  0x3BBB8BF000  +
0x3BBB8BF000  0x00001000  -
0x3BBB8C0000  0x38B5346000  +

What we see is that the bad area starts at byte 0x3BBB8BF000 (256549580800 decimal) and extends for 0x1000 bytes (4096 decimal). Both the drive and NTFS use 512-byte sectors. So dividing by 512, we get sectors 501073400 – 501073407 (4096 bytes is 8 sectors).
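The same arithmetic can be scripted. This is only a sketch: it assumes the three-column "pos size status" layout shown in the logfile above, treats only "-" lines as bad, and the function name bad_sector_ranges is made up.

```shell
#!/bin/bash
# Sketch: list the 512-byte sector range of each bad ("-") area in a ddrescue logfile.
# Assumption: data lines have three fields "pos size status", with pos and size in hex.
bad_sector_ranges() {
    local logfile=$1 sector_size=${2:-512}
    local pos size status first last
    while read -r pos size status; do
        # Comment lines and the two-field current_pos line never have "-" as field 3.
        [ "$status" = "-" ] || continue
        first=$(( $pos / $sector_size ))
        last=$(( ($pos + $size - 1) / $sector_size ))
        echo "$first-$last"
    done < "$logfile"
}

# Usage: bad_sector_ranges /mnt/sdasave.logfile
```

Shell arithmetic understands the 0x prefix, so the hex values from the logfile can be used directly.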

As a check, I ran grep sector /var/log/kern.log and turned up a bunch of lines like this:

Jun 14 21:39:11 hephaestus kernel: [35346.929957] end_request: I/O error, dev sda, sector 501073404

Which is within my calculated range.

But this is an absolute sector on the disk. We need the sector within the partition, so for that, we have to enlist fdisk to make that calculation.

fdisk shows, among other things:

Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048   976771071   488384512    7  HPFS/NTFS/exFAT

So the Windows partition starts at disk sector 2048.

Let’s just confirm that. If I use dd if=/dev/sda1 bs=512 count=1 | hd | head, I see a line beginning with “.R.NTFS”. Exactly the same as with dd if=/dev/sda bs=512 count=1 skip=2048 | hd | head, so I read the partition table information correctly.

Subtract offset of 2048 from the earlier values, and I get relative sectors 501071352-501071359.
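As a quick sanity check on that subtraction, a tiny helper (the name partition_relative_range is made up) takes the partition start from fdisk and the absolute sector range:

```shell
#!/bin/bash
# Sketch: convert an absolute disk sector range into a partition-relative one.
# Arguments: partition start sector, first bad sector, last bad sector.
partition_relative_range() {
    local part_start=$1 first=$2 last=$3
    echo "$(( $first - $part_start ))-$(( $last - $part_start ))"
}

partition_relative_range 2048 501073400 501073407
# prints 501071352-501071359
```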

That’s enough to get some solid info from the filesystem via ntfscluster, part of Debian’s ntfs-3g package. I pass -s to it, and ignoring some irrelevant stuff, get my answer:

ntfscluster -s 501071352-501071359 /dev/sda1
Inode 190604 /System Volume Information/{b4816feb-b609-11e1-a908-50e549b934f7}{3808876b-c176-4e48-b7ae-04046e6cc752}/$DATA

I even reran it with a much larger sector range, just to be absolutely sure I had wiggle room in case calculations had an off-by-one error or something somewhere.

This is really great news, because the file in question is pretty much useless – I believe it’s a system restore point, which I won’t be needing anyhow.

So at this point, all that remains is to reinstall this on a different drive. For that, I could just use my ddrescue image. I thought I would take a second image, just to be very extra careful, and use that; I used:

partclone.ntfs --rescue -c -s /dev/sda1 -o sda1.partclone

although ntfsclone would work just as well. This captures only the partition; I’ll need the partition table as well, and perhaps also the space between the partition table and the first partition. I could capture it separately with dd, but it’s already in the ddrescue image, so there’s no need. (GRUB is installed on this drive, but there is no Linux filesystem on it, so it may well exceed the size of the MBR).

Note that for Linux ext[234] filesystems, debugfs can provide the same (and more) info as I got from ntfscluster.
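For the ext[234] case, a rough sketch of the equivalent lookup might go like this. debugfs works in filesystem blocks rather than sectors, so the partition-relative sector has to be converted first. The 4096-byte block size, the device name /dev/sdXN, and the helper name sector_to_fs_block are assumptions for illustration, not something I ran on this drive.

```shell
#!/bin/bash
# Sketch: convert a partition-relative 512-byte sector to an ext2/3/4 block number.
# Assumption: block size defaults to 4096 bytes; confirm it with `tune2fs -l`.
sector_to_fs_block() {
    local sector=$1 block_size=${2:-4096}
    echo $(( $sector * 512 / $block_size ))
}

# Then, as root, on the real device (placeholder name /dev/sdXN):
#   debugfs -R "icheck BLOCK" /dev/sdXN    # block number -> inode number
#   debugfs -R "ncheck INODE" /dev/sdXN    # inode number -> path name
```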

I happen to have a drive of the right size sitting here, which I was about to install in a different machine. So a wipe and a swap and a restore later, and I should be good to go.

This scenario is commonplace enough that I thought I’d post how I dealt with it, in case anyone else ever has hard drive issues.

How to debug a Linux failure to resume from suspend?

I’m running a computer with a Gigabyte Z68A-D3H-B3 motherboard, and have never been able to get it to properly resume from suspend to RAM in Linux. It has worked fine on the rare occasion I’ve tried it in Windows 7.

My somewhat limited usual debugging techniques aren’t particularly helpful. The system appears to suspend perfectly fine. It just doesn’t resume. To be more precise, when I push the button to resume, the power comes up (fans whir, HDD spins up, etc.) but nothing happens. The USB keyboard and mouse don’t respond, Caps Lock doesn’t toggle any LEDs, it doesn’t respond on the wired LAN, and the display stays off.

Although it’s a desktop, I’d really like to save power on this thing by suspending it when it’s not in use. There’s no sense in wasting power I don’t need to be consuming.

I’ve tried what I used to try on laptops. I tried running in single-user mode, without X, or even the kernel modules for video acceleration loaded. I tried unloading whatever hardware modules I thought I could without completely destabilizing the system. I updated the BIOS to the latest release. I tried various combinations of video tweaks. I tried using s2ram from uswsusp instead of pm-suspend. Nothing made any difference. They all behaved exactly the same.

Googling showed a lot of resources for people that had trouble getting their machines to go to sleep. And also for people whose machines would wake up but just wouldn’t re-activate the display. But precious little for people with my particular symptoms.

What’s a good place to start looking to fix something like this?

Some details…

CPU is Core i5-2400. Kernel is wheezy’s 3.2.0-2-amd64, though this problem has persisted as long as I’ve had this machine, which was running squeeze at install time. Video is NVidia GeForce GTX 560 (GF114). Hard drives are SATA, Ethernet is integrated RTL8111/8168B. Userland is up-to-date amd64 wheezy.

XMPP for Children

When Jacob was just born, I wondered how I might introduce him to computing. I thought over various things, but that wasn’t really the most pressing thing right then.

I don’t suppose that I could have predicted installing an XMPP IM server (Prosody) for the boys. And I certainly couldn’t have predicted creating accounts named: jacob, oliver, butterfly, bear. Because, as Jacob pointed out to me, if (Jacob’s favorite toy) butterfly is typing with his wings, then he shouldn’t be logged in as Jacob. I admire my 5-year-old’s security consciousness…

Anyhow, as I mentioned yesterday, Jacob and Oliver enjoy “their” computer, which I recently put on the LAN. The firewall does not pass any of its traffic to the Internet, though, with very limited exceptions.

Jacob can read, and is starting to enjoy typing as well. So I thought he would enjoy sending IMs to me. As his computer has no GUI, I needed a text-mode client. Something with an IRC-like interface that could be scripted to open up a window with me directly sounded perfect. Initially I tried irssi’s XMPP plugin, but it proved to be too buggy (wanting to always latch on to a particular resource on the remote end, not having very predictable window behavior, etc.) So I switched to mcabber. With a couple of quick configuration bits to get him automatically logged in, remove superfluous windows, and connect him directly to a chat with me, it was set. And well-loved. He sent me a mix of real words and random things he created by replacing letters in “Jacob” or by holding down keys.

In the mcabberrc, besides the obvious setting of username and password, there is:


set log_win_height = 1
set hook-post-connect = source ~/.mcabber/post-connect.rc

The hook is simply:


roster search Dad
roster hide

After a while, Jacob wanted to switch computers. He wanted to use my laptop, and have me use his computer. He refused to switch back. I asked him why. “Because on your computer, my name is red.” I should have known. I set it to bright white on his computer, but I think tomorrow we may need to upgrade him to the color monitor I’ve been saving for just such an occasion… It will be a whole new set of discoveries, I’m sure.

Update: I also tried out freetalk, which looked like it would meet my goals nicely. The problem was it didn’t have a dedicated “everything typed goes to this person” mode. It did have a mode where it put the person’s JID on the command line by default, but excessive use of the backspace key by a 5-year-old could wipe that out and leave it in a state where he’d be confused.

Shell Scripts For Preschoolers

It probably comes as no surprise to anybody that Jacob has had a computer since he was 3. Jacob and I built it from spare parts, together.

It may come as something of a surprise that it has no graphical interface, and Jacob uses the command line and loves it — and did even before he could really read.

A few months ago, I wrote about the fun Jacob had with speakers and a microphone, and posted a copy of the cheat sheet he has with his computer. Lately, Jacob has really enjoyed playing with the speech synthesizer — both trying to make it say real words and nonsense words. Sometimes he does that for an hour.

I was asked for a copy of the scripts I wrote. They are really simple. I gave them names that would be easy for a preschooler to remember and spell, even if they conflicted with existing Unix/Linux commands. I put them in /usr/local/bin, which occurs first on the PATH, so it doesn’t matter if they conflict.

First, for speech synthesis, /usr/local/bin/talk:


#!/bin/bash
echo "Press Ctrl-C to stop."
espeak -v en-us -s 150

espeak comes from the espeak package. It seemed to give the most consistently useful response.

Now, on to the sound-related programs. Here’s /usr/local/bin/ssl, the “sound steam locomotive”. It starts playing a train sound if one isn’t already playing:


#!/bin/bash
pgrep mpg321 > /dev/null || mpg321 -q /usr/local/trainsounds/main.mp3 &
sl "$@"

And then there’s /usr/local/bin/record:


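#!/bin/bash
# Record from the microphone into a timestamped WAV file.
cd "$HOME/recordings" || exit 1
echo "Now recording. Press Ctrl-C to stop."
DATE=$(date +%Y-%m-%dT%H-%M-%S)
FILENAME="$DATE-$$.wav"
# Write-protect earlier recordings; stay quiet if there are none yet.
chmod a-w -- *.wav 2>/dev/null
exec arecord -c 1 -f S16_LE -r 44100 "$FILENAME"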

This simply records in a timestamped file. Then, its companion, /usr/local/bin/play:


#!/bin/bash
case "$1" in
    train)
        mpg321 /usr/local/trainsounds/main.mp3
        ;;
    song)
        /usr/bin/play /usr/local/trainsounds/traindreams.flac
        ;;
    *)
        # No argument: play back the most recent recording.
        cd "$HOME/recordings" || exit 1
        exec aplay "$(ls -tr | tail -n 1)"
        ;;
esac

So, Jacob can run just “play”, which will play back his most recent recording. As something of a bonus, the history of recordings is saved for us to listen to later. If he types “play train”, there is the sound of a train passing. And, finally, “play song” plays Always a Train in My Dreams by Steve Gillette (I heard it on the radio once and bought the CD).

Some of these commands kick off sound playing in the background, so here is /usr/local/bin/bequiet:


#!/bin/bash
killall mpg321 &> /dev/null
killall play &> /dev/null
killall aplay &> /dev/null
killall cw &> /dev/null

A 4-year-old, Linux command line, and microphone

There are certain times when I’m really glad that we have Linux in the house for our boys to play with. I’ve already written how our 4-year-old Jacob has fun with bash and can chain together commands to draw ASCII animated steam locomotives. Today I thought it might be fun to install cw, a program that can take text on standard input and play it on the console speaker or sound card as Morse code. Just the sort of thing that I could see Jacob eventually getting a kick out of.

But his PC was mute. We opened it up and discovered it didn’t have a console speaker. So we traipsed downstairs, dug out an external speaker, and I figured out how to enable the on-board audio chipset in the BIOS. So now the cw command worked, but also there were a lot of other possibilities. We also brought up a microphone.

While Jacob was busy with other things, I set to work getting things hooked up, volume levels adjusted, and wrote some shell scripts for him. I also printed out this reference sheet for Jacob:

He is good at reading but not so good at spelling. I intentionally didn’t write down what the commands do, hoping that this would provide some avenue for exploration for him. He already is generally familiar with the ones under the quiet category.

I wrote a shell script called “record”. It simply records from the microphone and drops a timestamped WAV file in a holding directory. He can then type “play” to simply play back whatever he recorded most recently. Easy enough.

But what he really wanted was sound for his ASCII steam locomotive. So with the help of a Google search for “steam train mp3”, I wrote a script “ssl” (sound steam locomotive) that starts playing the sound in the background if it isn’t already going, and then runs sl to show the animation. This was a big hit.

I also set it up so he can type “play train” to hear that audio, or “play song” to play our favorite train song (Always a Train in My Dreams by Steve Gillette). Jacob typed that in and sat still for the entire 3 minutes listening to it.

I had to hook up an Ethernet cable to his machine to do all this, and he was very interested that I was hooking his computer up to mine in some way. He thought all the stuff about cables in the walls was quite exciting.

The last thing I did was install flite, a speech synthesis program. I wrote a small shell script called “talk” which reads a line at a time from stdin and invokes flite for each one (to give more instant feedback rather than not starting playback until after having read a large block from stdin). He had some fun hearing it say his name and other favorite words, but predictably the most fun was when he typed gibberish at it, and heard it try to pronounce or spell nonsense words.
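Purely as an illustration of the line-at-a-time idea (this is a sketch, not the script itself), a loop like the following would do it. The function name speak_lines is made up, and it takes the synthesizer command as its arguments, so flite -t, which speaks the text given on the command line, can be swapped for something else.

```shell
#!/bin/bash
# Sketch: run a command once per input line, so speech starts as soon as Enter is pressed.
# The command to run is passed as arguments, e.g.: speak_lines flite -t
speak_lines() {
    local line
    while IFS= read -r line; do
        # Skip empty lines rather than invoking the synthesizer with nothing to say.
        [ -n "$line" ] || continue
        # Redirect stdin so the command cannot swallow the lines still waiting to be read.
        "$@" "$line" < /dev/null
    done
}

# Usage: speak_lines flite -t
```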

In all, he was so excited about this new world of computer sound opened up to him. I’m sure there will be lots of happy experimentation and discovery going on.

Update Feb 10, 2012: I have posted the shell scripts behind this.

Geeks, Hobbies, and Free/Open Source: Feedback Wanted

I’ve been thinking lately about ways to improve ways in which I interact with Free Software projects, and ways in which they interact with me. Before I proceed to take steps or make suggestions, I’d like to see if others share my traits and observations.

Here are some questions I have been thinking of. If you’d like to help give me anecdotal evidence, please post a comment below this post. Identify the question numbers you are answering. It helps me if you can give specific examples, but if you don’t have the time or memory for that, no problem.

I will post my own answers in a day or two, but the point of this post is listening, not talking, so I’ll not post them immediately.

Hobbies (General – any geeks)

  • H1: To what degree do you like your hobbies to be challenging vs. easy? If something isn’t challenging, does that make it a good, bad, or indifferent candidate for a hobby?
  • H2: To what degree do you like your hobbies to be educational or enlightening?
  • H3: How do you pick up new hobbies? Do you go looking for them? Do you stumble upon them? What excites you to commit time and/or money to them at the beginning?
  • H4: How does your interest wane? What causes you to lose interest in hobbies?
  • H5: For how long do you tend to maintain hobbies? Sub-hobbies?
  • H6: Are your hobbies or sub-hobbies cyclical? In other words, do you lose interest in a hobby for a time, then regain interest for a time, then lose it again? What is the length of time of these cycles, if any?
  • H7: Do you prefer social hobbies or solitary hobbies? (Note that many hobbies, including programming, video gaming, reading, knitting, etc. could be either social or solitary, depending on the inclination of individuals.)
  • H8: Have you ever felt guilt about wanting to stop a hobby or sub-hobby? (For instance, from stopping supporting users of your software project, readers of your e-zine, etc) Did the guilt keep you going? Was that a good thing?

Examples: video games might be a challenging hobby (depending on the person) but in most cases aren’t educational.

A hobby might be “video game playing” or “being a Debian developer.” A sub-hobby might be “playing GTA IV”, “playing RPGs”, or “maintaining mutt”.

Free/Open Source Hobbies

  • F1: Considering your answers above, do your FLOSS activities follow the same general pattern as your other hobbies/interests, or are there differences? If there are differences, what are they?
  • F2: Has concern for being expected to support software longer than you will have an interest in it ever been a factor in a decision whether to release source code publicly, or how public to make a release?
  • F3: Has concern over the long-term interest of a submitter in maintaining their patch/contribution ever caused you to consider rejecting it? (Or caused you to avoid using software over the same concern about its author)
  • F4: In general, do you find requirements FLOSS projects place on first-time contributors to be too stringent, not stringent enough, or about right?
  • F5: Have you ever continued contributing to a project past the point where your interest would otherwise motivate you to do so? If so, what caused you to do this? Do you believe that cause is a general positive or negative force for members of the FLOSS community?
  • F6: Have there ever been factors that caused you to stop contributing to a project even though you still had an active interest in doing so? What were they?
  • F7: Have you ever wanted to be able to take a break as a contributor or maintainer of a project, and be able to return to contributing to it later? If so, have you found it easy to do so?
  • F8: What is your typical length of engagement with FLOSS projects (such as Debian) and sub-projects (such as maintaining a particular package)?
  • F9: Does a change in social group ever encourage or discourage you from changing hobbies or sub-hobbies?
  • F10: Have you ever wanted to stop working on a project/sub-project because the problems involved were no longer challenging or educational to you?
  • F11: Have you ever wanted to stop working on a project/sub-project because of issues with the people involved?

Examples on F9: If, say, you are a long-time Perl user and have gone to Perl conferences, but now you are interested in Ruby, would your involvement with the Perl community cause you to avoid taking up the Ruby programming hobby? Or would it cause you to cut your ties with Perl less quickly than your changing interest might dictate? (This is a completely arbitrary example and isn’t meant to start a $LANGUAGE thread.)

Changes over time

  • C1: Do you believe that your answers to any of the above questions have changed over time? If yes, then:
  • C2: What kinds of changes have happened?
  • C3: What caused the change?
  • C4: Do you believe the changes produced positive results for you? For the community?

APRS: World’s Best Social Mapping and Wide-Area Ad-Hoc Wireless Mesh Network

That was quite a headline, and I’m going to try to back it up below.

APRS is the Automatic Packet Reporting System. It’s a system for exchanging brief packets of information. It is most frequently used for mapping applications, but it really does a lot more than that. It has its biggest home in the amateur radio world, but isn’t limited to that, either.

The most common way to use APRS is to have some device hooked up to a GPS transmit packets with the GPS information in them. These packets can then be plotted on a map in real-time or with history. That in itself isn’t particularly newsworthy these days.

An interesting thing about APRS is that it’s not just positioning. Let’s say that there was a search-and-rescue operation. A person could draw a rectangle on the map indicating the search area, and within about 3 seconds everyone else’s map also shows that rectangle. People have even been known to play chess by sharing and moving objects on APRS!

The next piece that makes this interesting is that APRS is an ad-hoc mesh network. In its traditional implementation, VHF amateur radio, a radio emits a packet with a geolocation in it (a “beacon”) and any other radio within direct range of that can receive it. Radios can display basic information (such as distance to the other radio, heading, etc.) or be hooked up to a laptop or mapping device for a better display. So if everyone is within a few miles, APRS works without any pre-existing infrastructure at all. This makes it wonderful for use in disaster areas, and it was put to heavy use in Joplin after the tornado there.

But what about radios that are too far away? Any APRS station could also be a digipeater. When a packet is transmitted, it has a maximum hop count. A digipeater hears the packet, decrements the maximum hop count, and re-transmits it. With this mechanism, packets can travel hundreds of miles. It creates a highly resilient network, one that can route around trouble without even having to have an explicit backup route. I could bring in a digipeater in my car — it can be small enough to hold in my hand — and instantly improve APRS reception in an area.
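The hop-count rule is simple enough to sketch in a few lines of shell. This is a toy model, not real APRS code: on the air the remaining hops are carried in the packet’s path field (the WIDEn-N convention) rather than as a bare number, and the function name digipeat is made up.

```shell
#!/bin/bash
# Toy model of a digipeater's decision for one received packet.
# Argument: hops remaining on the packet as heard.
# Prints the hop count to retransmit with, or returns 1 if the packet must not be repeated.
digipeat() {
    local hops_left=$1
    [ "$hops_left" -gt 0 ] || return 1
    echo $(( $hops_left - 1 ))
}

# Follow a packet that starts with 2 hops through a chain of digipeaters.
hops=2
while next=$(digipeat "$hops"); do
    echo "repeated: $hops -> $next hops left"
    hops=$next
done
echo "hop count exhausted; packet is not repeated again"
```

Each digipeater makes this decision on its own, with no routing table, which is why adding one more digipeater to an area just works.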

One interesting aspect is that a packet can be digipeated more times in total than its maximum hop count. For instance, if a packet leaves my radio and is picked up by a digipeater to my west and one to my east, it can keep on traveling in both directions. This is part of what leads to resiliency.

APRS also functions over the Internet. There is a large network of interconnected Internet servers that exchange all global APRS traffic amongst themselves. Gateways between the radio (RF) and Internet (APRS-IS) services exist, and are called iGates. They are not generally required, but make useful websites like aprs.fi and email gateways possible. As long as an iGate is within a reasonable number of hops from you, you’re effectively linked. And again, if one iGate drops off, another iGate is probably monitoring your traffic too and you never notice. It’s an ad-hoc mesh network that is actually reliable – how about that?

On the PC side, there are many programs for using APRS. The most common one for Windows is called UI-View, but I don’t use Windows so I can’t comment. On Linux, there are programs (such as aprx) for running your digipeater, but the best-known program is Xastir. Xastir lets you download map files to your local disk, and can interface with the APRS-IS Internet service, radios, weather stations, or simply other arbitrary machines to exchange information. Xastir is a very nice program and is well worth the install, despite its somewhat dated-looking interface.

APRS clients, such as APRSDroid, exist for Android and iOS platforms as well.

So let’s say you’re doing something like helping handle food/water stations for a long bike ride. Even if you don’t have anybody with an amateur radio license, you can use APRS to great effect. At your headquarters, you can run Xastir and turn on its “server mode”. This puts everyone on a map. Then you can have everyone turn on APRS on their phone, and have it report to your custom server instead of APRS-IS. Now you have instant visibility into your entire team’s location and status. If you have transport people driving supplies between locations, that’s especially helpful.

In an amateur radio scenario, you would instead have people with radios at each location, and one laptop hooked up to a radio at HQ. This provides an added bonus of not relying on third-party infrastructure such as cellphone towers.

APRS also has a messaging system, similar in concept to text messaging. It works the same as other things. If I want to send a message to Jane, my radio simply emits a packet that lists the message and Jane as a recipient. It’s digipeated up to its maximum hop count. If Jane is within RF range of one of those digipeaters, she gets the message and her radio ACKs it. Otherwise, it’s delivered into the APRS-IS network — probably several times, which isn’t a problem — and the APRS-IS network delivers it to the iGate closest to her, and from there it gets digipeated the rest of the way to her.

Here’s an example of something created with APRS. While I was on a bus choir tour last weekend, I had a radio with me that was beaconing all the while. Now it was a small handheld radio inside a large metal bus, so it didn’t always have a digipeater in range. But still, you can go see a detailed map with the trail and even see exactly what path each packet took before it hit the Internet.

If you want to try out Xastir, please grab at least version 2.0 – the version in squeeze has some bugs.

A Proud Dad

I saw this on my computer screen the other day, and I’ve got to say it really warmed my heart. I’ll explain below if it doesn’t provoke that reaction for you.

Evidence a 4-year-old has been using my computer

So here’s why that made me happy. Well for one, it was the first time Jacob had left stuff on my computer that I found later. And of course he left his name there.

But moreover, he’s learning a bit about the Unix shell. sl is a command that displays an animated steam locomotive. I taught him how to use the semicolon to combine commands. So he has realized that he can combine calls to sl with the semicolon to get a series of a LOT of steam trains all at once. And was very excited about this discovery.

Also he likes how error messages start with the word “bash”.

How do you hire programmers and sysadmins? How should employers evaluate you?

Reading job listings for any sort of IT job is depressing. It’s been quite some time since I’ve had to do that, but how many times have you seen something like this:

  • “5 years of Java experience required.”
  • “3 years of Java experience with modules X, Y, Z required.”
  • “6 years of experience administering Linux machines running RHEL 4 on a Windows 2000 domain with 1500 clients in an educational setting preferred.”

I could go on and on. As a job seeker, that sort of thing is fundamentally devaluing to someone who has strengths in being adaptable and quickly learning new tools, languages, or even entire environments. As an employer, it sends a message that you’re not interested in more than a surface look at someone’s strengths, and probably don’t care to hire the best and the brightest. After all, would you turn away a rockstar programmer simply because he or she had been writing filesystem code in C the last 3 years instead of the latest whizbang Java web widget that will probably be obsolete in a year and unsupported in two? I am quite certain that there are plenty of managers that do. Even if you are a company large enough to have an entire team of people that do nothing but work on that whizbang app, don’t you still want the best you can find, realizing that some of the best people to work on that app may not have even heard of it yet? (And that when the app goes obsolete in 5 years, you’d rather not have to lay off a large team of single-skill people)

Some of you may know that I work in IT at a manufacturing company. We have a small IT team here, about seven people, and are a heavy Debian shop. And we have had a vacancy open up in our development/Linux admin group. I’m the manager of that group, which is why I’m thinking about this right now.

We’re too small for single-subject specialists to make sense, yet we’re big enough to appreciate skill, experience, flexibility, and rigor. Consequently, when the occasion arises for me to look for new employees, I don’t prepare a laundry list of things we use in-house and would like experience with.

The list of almost-required things generally begins with “Linux” and ends with “experienced”, and has nothing else in between. In other words, I’d like it if I don’t have to explain to you what a symlink or a hardlink is, but I’d be willing to do so if I think you’d internalize it quickly. On the “experienced” side, it would be nice if you already have a well-developed sense of fear at running rm when you’re root, or have designed a storage infrastructure for a network before, or are paranoid about security. But again, if people can pick up those traits on the job, we are usually still interested. If learning how to package up software for Debian, fix bugs in software you’ve never seen in a language you’ve never heard of, raise good questions about things you may not have lots of experience with, and write documentation for it all on a wiki sounds like fun, then that’s probably the kind of person I want, even if you’ve never used our particular tools before.

If I were to judge based on the stuff I normally see in job postings, I guess you might conclude I’m nuts. I don’t think I am, but then again I’m also the only person I know that formats his own resume in hand-crafted LaTeX. What do you all think?

The next question is: how should one evaluate candidates given this sort of philosophy? I’m not a fan of canned tests, or even “whiteboard tests” that tend to be some sort of canned topic that may test the applicant’s specific knowledge base more than overall skill and flexibility. Similarly, as an applicant in years past, I’ve struggled with how to present the “I’ve never used $LANGUAGE, but I know I could pick it up quickly and do it very well” vibe. To certain people, that might sound like BS. To the more geeky managers, perhaps it sounds like what they want.

We’ve built a fairly diverse team on the back of this approach, and it’s worked out well for us so far. I’m interested to hear your thoughts.

Oh, and if you’d like to work for us, you should probably be sending me an email. No, I’m not going to list the address here on this blog post. If you can’t figure it out, I don’t want to hear from you <grin>

Unix Password and Authority Management

One of the things that everyone seems to do differently is managing passwords. We haven’t looked at that in quite some time, despite growth both of the company and the IT department.

As I look to us moving some things to the cloud, and shifting offsite backups from carrying tapes to a bank to backups via the Internet, I’m aware that the potential for mischief — whether intentional or not — is magnified. With cloud hosting, a person could, with the press of a button, wipe out the equivalent of racks of machines in a few seconds. With disk-based local and Internet-based offsite backups, the potential for malicious behavior may be magnified; someone could pretty quickly wipe out local and remote backups.

Add to that the mysterious fact that many enterprise-targeted services allow only a single username/password for an account, and make no provision for ACLs to delegate permissions to others. Even Rackspace Cloud has this problem, as does their JungleDisk backup product, and many, many other offsite backup products. Amazon AWS seems to be the only real exception to this rule, and their ACL support is more than a little complicated.

So one of the questions we will have to address is the balance of who has these passwords. Too many people and the probability of trouble, intentional or not, rises. Too few and productivity is harmed, and potentially also the ability to restore. (If only one person has the password, and that person is unavailable, company data may be as well.) The company does have some storage locations, including locked vaults and safe deposit boxes, that no IT people have access to. I am thinking that putting a record of passwords in those locations may be a good first step, as putting the passwords in the control of those that can’t use them seems a reasonable step.

But we’ve been thinking of this as it pertains to our local systems as well. We have, for a number of years now, assigned a unique root password to every server. These passwords are then stored in a password-management tool, encrypted with a master password, and stored on a shared filesystem. Everyone in the department therefore can access every password.

Many places where I worked used this scheme, or some variant of it. The general idea was that if root on one machine was compromised and the attacker got root’s password, it would prevent the person from being able to just try that password on the other servers on the network and achieve a greater level of intrusion.

However, the drawback is that we now have more servers than anyone can really remember the passwords for. So many people are just leaving the password tool running. Moreover, while the attack described above is still possible, these days I worry more about automated intrusion attempts that most likely won’t try that attack vector.

A couple of ways we could go may include using a single root password everywhere, or a small set of root passwords. Another option may be to not log in to root accounts at all — possibly even disabling their password — and requiring the use of user accounts plus sudo. This hasn’t been practical to date. We don’t want to make a dependency on LDAP from a bunch of machines just to be able to use root, and we haven’t been using a tool such as puppet or cfengine to manage this stuff. Using such a tool is on our roadmap and could let us manage that approach more easily. But this approach has risks too. One is that if user accounts can get to root on many machines, then we’re not really more secure than a standard root password. Second is that it makes it more difficult to detect and enforce password expiration and systematic password changes.

I’m curious what approaches other people are taking on this.