The Python Unicode Mess

Unicode has solved a lot of problems. Anyone who remembers the mess of ISO-8859-* vs. CP437 (and of course it’s even worse for non-Western languages) can attest to that. And of course, these days they’re doing the useful work of…. codifying emojis.

Emojis aside, things aren’t all so easy. Today’s cause of pain: Python 3. So much pain.

Python decided to fully integrate Unicode into the language. Nice idea, right?

But here come the problems. And they are numerous.

gpodder, for instance, frequently exits with tracebacks due to Python errors converting podcast titles with smartquotes into ASCII. Then you have the case where the pexpect docs say to use logfile = sys.stdout to show the interaction with the virtual terminal. Only that causes an error these days.

But processing of filenames takes the cake. I was recently dealing with data from 20 years ago, before UTF-8 was a filename standard. These filenames are still valid on Unix. tar unpacks them, and they work fine. But you start getting encoding errors from Python trying to do things like store filenames in strings. For a Python program to properly support all valid Unix filenames, it must use “bytes” instead of strings, which has all sorts of annoying implications. What are the chances that all Python programs do this correctly? Yeah. Not high, I bet.

I recently was processing data generated by mtree, which uses octal escapes for special characters in filenames. I thought this should be easy in Python, eh?

That second link had a mention of an undocumented function, codecs.escape_decode, which does it right. I finally had to do this:

    if line.startswith(b'#'):
        continue
    fields = line.split()
    filename = codecs.escape_decode(fields[0])[0]
    filetype = getfield(b"type", fields[1:])
    if filetype == b"file":

And, whatever you do, don’t accidentally write if filetype == "file". That will silently always evaluate to False, because the str "file" never compares equal to the bytes b"file". Not that I, uhm, wrote that and didn’t notice it at first…
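
For illustration, here is a self-contained sketch of that kind of loop. The getfield helper below is a hypothetical stand-in (the real one isn't shown here), the sample lines are made up, and real mtree output has more cases (continuation lines, /set directives) that this ignores. codecs.escape_decode is undocumented, so treat its behavior as something that could change.

```python
import codecs


def getfield(name, fields):
    """Hypothetical helper: return the value of a b'name=value' keyword, or None."""
    prefix = name + b"="
    for field in fields:
        if field.startswith(prefix):
            return field[len(prefix):]
    return None


def regular_files(lines):
    """Yield the decoded filename (as bytes) of every entry with type=file."""
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith(b"#") or not line.strip():
            continue
        fields = line.split()
        # Turns the octal escape \351 into the single byte 0xE9.
        filename = codecs.escape_decode(fields[0])[0]
        # Compare bytes with bytes; comparing with a str is always False.
        if getfield(b"type", fields[1:]) == b"file":
            yield filename


# Made-up sample input: everything stays bytes from start to finish.
sample = [
    b"# comment line\n",
    b"./caf\\351.txt type=file size=3\n",
    b"./dir type=dir\n",
]
result = list(regular_files(sample))
```

Running the interpreter as python3 -b makes it warn on bytes-versus-str comparisons (and -bb turns that into an error), which would have caught the "file" versus b"file" mistake.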

So if you want to actually handle Unix filenames properly in Python, you:

  • Must have a processing path that fully avoids Python strings.
  • Must use sys.{stdin,stdout}.buffer instead of just sys.stdin/stdout.
  • Must supply filenames as bytes to various functions. See PEP 471 for this comment: “Like the other functions in the os module, scandir() accepts either a bytes or str object for the path parameter, and returns the DirEntry.name and DirEntry.path attributes with the same type as path. However, it is strongly recommended to use the str type, as this ensures cross-platform support for Unicode filenames. (On Windows, bytes filenames have been deprecated since Python 3.3).” So if you want to be cross-platform, it’s even worse, because you can use neither str on Unix nor bytes on Windows.
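
A minimal sketch of such a bytes-only path, assuming a Unix-like system. The function names list_names and print_names are made up for illustration, and the demo writes to an in-memory buffer so it can run anywhere; in a real program the default of sys.stdout.buffer would be used.

```python
import io
import os
import sys
import tempfile


def list_names(directory):
    """Return directory entries as bytes by giving os.listdir a bytes path."""
    return sorted(os.listdir(os.fsencode(directory)))


def print_names(names, out=None):
    """Write raw filename bytes, one per line, to a binary stream."""
    out = sys.stdout.buffer if out is None else out
    for name in names:
        out.write(name + b"\n")
    out.flush()


# Demo on a throwaway directory; no str filename is handled after fsencode.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "plain.txt"), "w").close()
    names = list_names(tmp)
    buf = io.BytesIO()
    print_names(names, buf)
```

The os functions return the same type they are given, so passing a bytes path to os.listdir is what keeps the results as bytes.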

Update: Would you like to receive filenames on the command line? I’ll hand you this fine mess. And the environment? It’s not even clear.

24 thoughts on “The Python Unicode Mess”

  1. Norbert Preining says:

    Thanks a lot, I was fighting with similar things when trying to read filenames encoded in ShiftJIS from old Japanese computers … I really don’t understand what Python developers were thinking!

    1. John Goerzen says:

      So good to know I’m not alone!

      As near as I can tell, they were assuming an ordered world. Not a machine with files that date back years, probably with multiple different encodings that would have to be sorted out manually if someone ever bothered. I remember encountering this issue with JFS way back, probably 15 years ago.

      1. Orivej Desh says:

        > So good to know I’m not alone!

        This topic has been covered well in http://www.catb.org/esr/faqs/practical-python-porting/ and http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

        1. John Goerzen says:

          Thanks for that. ESR’s article was thorough and depressingly long.

  2. Matt Campbell says:

    > the useful work of…. codifying emojis.

    I think that might have been sarcasm. But it actually is useful, for accessibility. If we grant that people will want to use emojis in their messages, however silly it may seem, then they should at least be accessible by default to blind users. Having them as part of the Unicode standard accomplishes that.

    That said, it does sound like Python’s filename handling is a mess.

    1. John Goerzen says:

      Thanks for mentioning that, Matt. I hadn’t realized that Unicode emojis were helpful for accessibility.

  3. Banpei-kun says:

    I’m not going to defend how Python 3 handles Unicode, but things are not as brain-damaged as your final bullet point and the command line/environment update suggest (they’re just, you know, very stinky).

    Python has the “surrogateescape” encode/decode error handler that lets you do a bytes -> unicode -> bytes round trip without losing data. For example, b'\xaa' becomes '\udcaa'.

    You can give a str like that to open() and friends and they will operate on the right file. I.e. open('\udcaa') is equivalent to open(b'\xaa').

    sys.argv and os.environ contain strings that were decoded using surrogateescape, functions like os.scandir() will use it to escape binary filenames and os.fsencode()/os.fsdecode() are there for convenience.

    The end result is that an open(sys.argv[1]) should work even if you passed a binary filename on the command line.

    But yeah, you’ll have to use sys.stdout.buffer to print the binary filename.
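
A minimal sketch of the round trip described in this comment, assuming a POSIX system where the filesystem error handler is surrogateescape. The sample filename and the printable helper are made up for illustration.

```python
import os

raw = b"caf\xe9.txt"  # Latin-1 bytes; not valid UTF-8

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
as_str = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = as_str.encode("utf-8", errors="surrogateescape")

# os.fsdecode/os.fsencode do the same with the filesystem encoding (POSIX assumed).
fs_round_trip = os.fsencode(os.fsdecode(raw))


def printable(name):
    """Swap escaped bytes for U+FFFD so the name can be shown to a person."""
    data = name.encode("utf-8", errors="surrogateescape")
    return data.decode("utf-8", errors="replace")


display = printable(as_str)
```

A str carrying such surrogates can be handed to open() and the os functions, but print() on a UTF-8 stdout rejects it with UnicodeEncodeError, which is why the raw bytes have to go to sys.stdout.buffer instead.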

    1. John Goerzen says:

      I read up on that yesterday, and was overall thoroughly confused about how it works, and how it interacts with the locale set in the environment. If it just always uses surrogateescape for os.environ and sys.argv, then I guess it works to read but various operations could be broken (though that may not matter in some cases.) Still, rather confusing.

  4. Christoph Pfister says:

    I cannot agree with your conclusion that “Must have a processing path that fully avoids Python strings.”; as mentioned in the above comment, “surrogateescape” is a solution that is (a) consistent and (b) used implicitly:

    lxuser@debian:/tmp/jg$ vdir
    total 4
    drwxr-xr-x 2 lxuser lxuser 4096 Oct 6 11:00 tmp
    -rw-r--r-- 1 lxuser lxuser 0 Oct 6 11:00 \377
    lxuser@debian:/tmp/jg$ python3
    Python 3.5.3 (default, Sep 27 2018, 17:25:39)
    [GCC 6.3.0 20170516] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> filenames = os.listdir()
    >>> filenames
    ['tmp', '\udcff']
    >>> os.remove(filenames[1])
    >>>
    lxuser@debian:/tmp/jg$ vdir
    total 4
    drwxr-xr-x 2 lxuser lxuser 4096 Oct 6 11:00 tmp
    lxuser@debian:/tmp/jg$

    Of course, interfacing with the external world needs some care, but this is unavoidable — there is no magic solution to that problem (think for example about the zip filename encoding mess).

    1. John Goerzen says:

      I stand corrected. All I can say is it was completely unclear from the docs how this worked (and is harder to get right than it should be). Leads to the weird thing that reading from stdin produces different results than reading from the command line.

  5. Christoph says:

    With newer Python 3 using str for filenames works just fine on all platforms.

    If you want to support Python 2+3 have a look at https://senf.readthedocs.io

    1. John Goerzen says:

      How new is “newer”? I have 3.5.3

      1. Ewen McNeill says:

        Python 3.6 or later has more backwards compatibility. I’d suggest going straight from 2.7 to 3.7 if you can. I’ve had pain on Python 3.5 that didn’t happen on 2.6/2.7/3.6/3.7. It took a while but eventually Python 3’s handling of the real world got better, including surrogate escapes. IIRC most of the important bits landed by/in Python 3.6.

        Ewen

  6. nemo says:

    http://m8y.org/tmp/ibmfilter_tmp.txt this simple ibmfilter program in python2 has resisted the attempts of all pythonistas I’ve asked to translate to python3. It’s used for tradewars2002 and nethack and converts single byte stream to utf-8 with optional character replacement.

    ~/bin/ibmfilter.tmp mplayer -vo caca -really-quiet -ao null hw_bounce.ogv
    http://m8y.org/tmp/hw_bounce.ogv
    http://m8y.org/tmp/ibmfilter_tmp.txt
    PASS – video plays normally, all 't' are replaced with '-'
    (and you can quit at any time with q key)
    FAIL – video does not play normally, you cannot quit with q key, some or all 't' are not replaced with '-'

    Prior requests for this have resulted in blocking or thought that it would take a large amount of python3 to replicate – I didn’t write this so if python2 goes away I’ll probably just rewrite it in something else.

    1. Andrew says:

      Launching python just to shell out and read a pipe seems not a great approach to this. That said: add a “b” prefix to your byte literals on lines 9 (the key, not the value) and 19, set bufsize=0 in the Popen constructor, and remove the call to encode() on line 29 and your program will work as described on python3.

      1. nemo says:

        Uh… Removing the call to encode defeats the entire purpose of the program.
        It’s intended to convert ibm437 to utf-8 (tradewars 2002, nethack, other old programs).

        My simple testcase was just one in case you didn’t have those handy or didn’t feel like firing up a game.

        Here’s how I used it for tradewars on one old mirror.
        alias tw2002='kbdfix ibmfilter telnet sk-twgs.com 2002'
        http://m8y.org/tmp/ibmfilter_nh.txt here’s one with more minecraft mappings.

        But yeah, the whole point is single byte to multibyte, so that’s cheating 😉

        1. nemo says:

          er… nethack, not minecraft ofc ☺

        2. nemo says:

          For example │ is 0xE2 0x94 0x82 in UTF-8 but 0xB3 in IBM437 – and that script handles that conversion just fine, making old stuff usable.

        3. Andrew says:

          Nope, you’re not understanding how sys.stdout works in py3 — it is implicitly encoding based on your system and locale. The standard io streams are always file objects in text mode, so write() should take a string, not bytes, and it will automatically encode that string with the encoding of the stdout stream.

          The default encoding is system dependent, but at least on any modern linux should be using utf-8 by default. If you need to override the default you can use the PYTHONIOENCODING environment variable to control it.

          If you *really* want to do the encoding yourself, you need to write the encoded bytes to the underlying buffer instead, so it would be sys.stdout.buffer.write(out.encode('utf-8')). But doing that risks mixing encodings on stdout if it isn’t utf-8, so I wouldn’t recommend it unless you absolutely must.
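
A minimal sketch of the two write paths discussed in this thread, using the IBM437 box-drawing byte mentioned above. An in-memory TextIOWrapper stands in for sys.stdout here so the result can be inspected; that substitution is an assumption for the demo, not how the ibmfilter script works.

```python
import io

box = b"\xb3"               # vertical line in IBM437
text = box.decode("cp437")  # the str "│" (U+2502)
utf8 = text.encode("utf-8")  # three bytes: E2 94 82

# A text-mode stream takes str and encodes it on write, like sys.stdout.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="utf-8")
stream.write(text)
stream.flush()
via_text_layer = raw.getvalue()

# Bytes you encoded yourself go to the underlying binary buffer instead.
stream.buffer.write(utf8)
stream.buffer.flush()
both_writes = raw.getvalue()
```

Passing bytes to the text layer (stream.write(utf8)) raises TypeError, which is the usual symptom when Python 2 code that wrote encoded bytes to sys.stdout is run unchanged on Python 3.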

          1. nemo says:

            ok. I made your changes as described and it just hung. had to kill the process – tested nethack and mplayer. Did you try them?
            tested against python 3.4.5

          2. Andrew says:

            Your original mplayer example works as described for me w/ every version of python3 I have handy — 3.4, 3.7, 3.8; I don’t have any of those games set up to test with. (Nor, to be honest, any particular interest in debugging your gaming setup unless you are planning to hire me to do so. ;)

            Back to my original point that this might not be the best approach — I’ve seen issues just throwing stdout from long running interactive tui type applications into a pipe like this, and am mildly surprised it’s worked correctly for you at all. There are potential issues with signal and terminal handling that often crop up, and to really handle this robustly you’d probably want to be setting up a full blown pty for the subprocess. But this is a bigger topic than belongs in blog comments, and neither a python3 nor a unicode issue, so rather off topic.

          3. nemo says:

            Heh. I didn’t write it as noted, was written a decade ago by someone on #nethack , and has worked perfectly for me over the last decade (in python 2) So definitely not going to be hiring anyone to rewrite anything ☺ I just thought it was interesting in how a simple python2 program did not seem to translate to python 3 at all. If python2 goes away I’ll just rewrite it in some other language.
            But. Yeah.
            ~/bin/ibmfilter.tmp mplayer -vo caca -really-quiet -ao null hw_bounce.ogv

            Does absolutely nothing. Just sits there until I interrupt it. Python2 variant works perfectly, as it always has, the other few thousand times I’ve launched it. And, yeah, I wasn’t the only one to use his program by far, so it hasn’t done anything surprising or unusual w/ stdout (in python2 anyway).

  7. mirabilos says:

    You seem to have missed PEP 383. While not as nice as my OPTU-8/16 encoding scheme (which uses a 128-codepoint block in the private use area, even actually registered with the ConScript Unicode Registry), PEP 383 allows you to encode raw octets (as broken surrogates), so I’d wager there must be a way to generate them programmatically.

    And indeed, there is:

    tglase@tglase:~ $ rm m?h; cat x.py
    # coding: UTF-8
    with open('m\uDCE4h', 'w') as f:
        print('hi', file=f)
    tglase@tglase:~ $ python3 x.py
    tglase@tglase:~ $ ls m?h | hd
    00000000 6d e4 68 0a |m.h.|
    00000004

  8. Anon says:

    This is really great information. I keep running into this same sort of Python problem. Sigh. Maybe Rust?
