“In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move.”
– Douglas Adams
This expands on my recent post The Incredible Disaster of Python 3. I seem to have annoyed the Internet…
Back in the mists of time, Unix was invented. Today the descendants of Unix, whether literal or in spirit, power the majority of the world’s cell phones, most of the most popular sites on the Internet, and more. And across this very popular architecture, there lies something that has made people very angry at times: on a Unix filesystem, 254 of the 256 possible byte values are valid in filenames. The only two that are not are 0x00 and the slash character. Otherwise, bytes are valid in virtually any combination (the special entries “.” and “..” being the exception).
This property has led to a whole host of bugs, particularly in shell scripts. A filename with a leading dash might look like a parameter to a tool. Filenames can contain newline characters, space characters, control characters, and so forth; running ls in a directory with maliciously-named files could certainly scramble one’s terminal. These bugs continue to persist, though modern shells offer techniques that — while optional — can be used to avoid most of these classes of bugs.
It should be noted here that not every stream of bytes can be decoded as UTF-8. This is a departure from earlier encodings such as iso-8859-1 and cp437: with those, you might get gibberish, but “garbage in, garbage out” was a thing, and if your channel was 8-bit clean, your gibberish would survive unmodified.
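A two-line Python illustration of this point (the byte 0xF7 here is just an example; any byte above 0x7F that does not form a valid UTF-8 sequence behaves the same way):

```python
# 0xF7 is the division sign in iso-8859-1 and survives a round trip there.
raw = b"\xf7"
print(raw.decode("iso-8859-1"))   # '÷'

# The same byte, on its own, is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("not valid UTF-8:", e)
```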
Unicode brings many advantages, and has rightly become the predominant standard for text encoding. But the previous paragraph highlights one of the challenges, and this challenge, along with some others, is at the heart of the problem with Python 3. That is because, at a fundamental level, Python 3’s notion of a filename is based on a fiction. Not only that, but it tries to introduce strongly-typed strings into what is fundamentally a weakly-typed, dynamic language.
A quick diversion: The Rust Perspective
The Unicode problem is a thorny one, and it may be impossible to deal with it with complete elegance. Many approaches exist; here I will describe Rust’s string-related types, of which there are three for our purposes:
- The String (and related &str) is used for textual data and contains bytes that are guaranteed to be valid UTF-8 at all times
- The Vec<u8> (and related [u8]) is a representation of pure binary bytes, in which all 256 possible byte values are valid in any combination, whether or not it forms valid UTF-8
- And the Path, which represents a path name on the system.
The Path uses the underlying operating system’s appropriate data type (here I acknowledge that Windows is very different from POSIX in this regard, though I don’t go into that here). Compile-time errors are generated when these types are mixed without proper safe conversion.
The Python Fiction
Python, in contrast, has only two types, roughly analogous to the String and the Vec<u8> in Rust. Critically, most of the Python standard library treats a filename as a String – that is, a sequence of valid Unicode code points – which can represent only a subset of valid POSIX filenames.
Do you see what we just did here? We’ve set up another shell-like situation in which filenames that are valid on the system create unexpected behaviors in a language. Only this time, it’s not \n, it’s things like \xF7.
From a POSIX standpoint, the correct action would have been to use the bytes type for filenames; this would mandate proper encode/decode calls by the user, but it would have been quite clear. It should be noted that some of the most core calls in Python, such as open(), do accept both bytes and strings, but this behavior is by no means consistent in the standard library, and some parts of the library that process filenames (for instance, listdir in its most common usage) return strings.
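To make the inconsistency concrete, here is a small sketch (assuming a POSIX filesystem that accepts arbitrary bytes, such as Linux ext4; the filename is invented for illustration):

```python
import os
import tempfile

# Create a file whose name is not valid UTF-8. POSIX allows this.
d = tempfile.mkdtemp()
with open(os.path.join(d.encode(), b"bad\xf7name"), "w") as f:
    f.write("hello")

# listdir mirrors the type of its argument:
print(os.listdir(os.fsencode(d)))  # [b'bad\xf7name'] -- faithful bytes
print(os.listdir(d))               # ['bad\udcf7name'] -- surrogate-escaped str
```

So whether you get the real bytes or a surrogate-escaped string depends entirely on which type the caller happened to pass in.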
The Plot Thickens
At some point, it was clearly realized that this behavior was leading to a lot of trouble on POSIX systems. Having a listdir() function be unable (in its common usage; see below) to handle certain filenames was clearly not going to work. So Python introduced its surrogate escapes. With surrogate escapes, each byte that cannot be decoded as UTF-8 is replaced with a lone low surrogate code point (U+DC80 through U+DCFF) that cannot appear in well-formed Unicode text. Then, when converted back to a binary sequence, each such code point is turned back into the original byte. However, this is not a systemwide default and in many cases must be specifically requested.
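Here is the mechanism in miniature (the filename bytes are invented for illustration):

```python
# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN,
# and encoding with the same handler reverses the mapping.
name = b"caf\xe9"                    # 'café' in latin-1; not valid UTF-8
s = name.decode("utf-8", "surrogateescape")
print(repr(s))                       # 'caf\udce9'
assert s.encode("utf-8", "surrogateescape") == name  # round-trips faithfully

# But only if everyone remembers to ask for it: the default is strict,
# so s.encode("utf-8") would raise UnicodeEncodeError here.
```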
And now you see this is both an ugly kludge and a violation of the promise of what a string is supposed to be in Python 3, since this doesn’t represent a valid Unicode character at all, but rather a token for the notion that “there was a byte here that we couldn’t convert to Unicode.” Now you have a string that the system thinks is Unicode, that looks like Unicode, that you can process as Unicode — substituting, searching, appending, etc — but which is actually at least partially representing things that should rightly be unrepresentable in Unicode.
And, of course, surrogate escapes are not universally used by even the Python standard library either. So we are back to the problem we had in Python 2: what the heck is a string, anyway? It might be all valid Unicode, it might have surrogate escapes in it, it might have been decoded from the wrong locale (because life isn’t perfect), and so forth.
Unicode Realities
The article pragmatic Unicode highlights these facts:
- Computers are built on bytes
- The world needs more than 256 symbols
- You cannot infer the encoding of bytes — you must be told, or have to guess
- Sometimes you are told wrong
I have no reason to quibble with this. How, then, does that stack up with this code from Python? (From zipfile.py, distributed as part of Python)
if flags & 0x800:
    # UTF-8 file names extension
    filename = filename.decode('utf-8')
else:
    # Historical ZIP filename encoding
    filename = filename.decode('cp437')
There is a reason that Python can’t extract a simple ZIP file properly. The snippet above violates the third rule by guessing a cp437 encoding when it has not been told one. But it’s worse; the combination of factors leads extractall() to essentially convert a single byte from CP437 to a multibyte Unicode code point on extraction, rather than simply faithfully reproducing the bytestream that was the filename. Oh, and it doesn’t use surrogate escapes. Good luck with that one.
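To see the corruption concretely, here is a sketch of what that guess does to a single byte (the filename is invented; I am assuming a name stored as latin-1, which zip(1) on a latin-1 system would produce):

```python
# Suppose the real filename was latin-1 'café' (b'caf\xe9'), stored
# in a ZIP without bit 11 set.
stored = b"caf\xe9"
guessed = stored.decode("cp437")     # zipfile's guess: 0xE9 -> 'Θ'
print(repr(guessed))                 # 'cafΘ'

# On extraction the name is re-encoded for the filesystem, so the single
# byte 0xE9 has silently become the multibyte UTF-8 sequence for 'Θ':
print(guessed.encode("utf-8"))       # b'caf\xce\x98'
```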
It gets even worse
Let’s dissect Python’s disastrous documentation on Unicode filenames.
First, we begin with the premise that there is no filename encoding in POSIX. Filenames are just blobs of bytes. There is no filename encoding!
What about $LANG and friends? They give hints about the environment, languages for interfaces, and terminal encoding. They can often be the best HINT as to how we should render characters and interpret filenames. But they do not subvert the fundamental truth, which is that POSIX filenames do not have to conform to UTF-8.
So, back to the Python documentation. Here are the problems with it:
- It says that there will be a filesystem encoding if you set LANG or LC_CTYPE, falling back to UTF-8 if not specified. As we have already established, UTF-8 can’t represent every POSIX filename.
- It gets worse: “The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both”. So we are somewhat tacitly admitting here that str was a poor choice for filenames, but now we try to have it every which way. This is going to end badly.
- And then there’s this gem: “Note that on most occasions, you can just stick with using Unicode with these APIs. The bytes APIs should only be used on systems where undecodable file names can be present; that’s pretty much only Unix systems now.” Translation: Our default advice is to pretend the problem doesn’t exist, and will cause your programs to be broken or crash on POSIX.
Am I just living in the past?
This was the most common objection raised to my prior post. “Get over it, the world’s moved on.” Sorry, no. I laid out the case for proper handling of this in my previous post. But let’s assume that your filesystems are all new, with shiny UTF-8 characters. It’s STILL a problem. Why? Because it is likely that an errant or malicious non-UTF-8 sequence will cause a lot of programs to crash or malfunction.
We know how this story goes. All the shell scripts that do the wrong thing when “; rm” is in a filename, for instance. Now, Python is not a shell interpreter, but if you have a program that crashes on a valid filename, you have — at LEAST — a vector for denial of service. Depending on the circumstances, it could turn into more.
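A minimal sketch of that failure mode (the filename is invented for illustration):

```python
# A filename containing byte 0xF7 comes back from os.listdir() as a
# surrogate-escaped str, and the first naive strict encode -- the kind
# print() does when writing to a strictly-encoded stream -- blows up.
name = "bad\udcf7name"   # what os.listdir() returns for b'bad\xf7name'

try:
    name.encode("utf-8")
except UnicodeEncodeError as e:
    print("boom:", e)
```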
Conclusion
- Some Python 3 code is going to crash or be unable to process certain valid POSIX filenames.
- Some Python 3 code might use surrogate escapes to handle them.
- Some Python 3 code — part of Python itself even — just assumes it’s all from cp437 (DOS) and converts it that way.
- Some people recommend using latin-1 instead of surrogate escapes – even official Python documentation covers this.
The fact is: A Python string is the WRONG data type for a POSIX filename, and so numerous, incompatible kludges have been devised to work around this problem. There is no consensus on which kludge to use, or even whether or not to use one, even within Python itself, let alone the wider community. We are going to continue having these problems as long as Python continues to use a String as the fundamental type of a filename.
Doing the right thing in Python 3 is extremely hard, not obvious, and rarely taught. This is a recipe for a generation of buggy code. Easy things should be easy; hard things should be possible. Opening a file correctly should be easy. Sadly I fear we are in for many years of filename bugs in Python, because this would be hard to fix now.
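For what it’s worth, here is one defensive pattern (a sketch, not the only correct approach, and the function name is my own): keep filenames as bytes throughout, and decode only at the display boundary with os.fsdecode(), which applies surrogate escapes.

```python
import os

def robust_scan(directory: bytes):
    """List a directory without ever losing or crashing on filename bytes."""
    for entry in os.listdir(directory):        # bytes in, bytes out
        path = os.path.join(directory, entry)  # never decoded, never lossy
        size = os.stat(path).st_size
        # Decode only for display, where mojibake is annoying but harmless:
        print(f"{os.fsdecode(entry)!r}: {size} bytes")

robust_scan(b".")
```

The point is that the actual filesystem operations only ever see the original bytes; the str form exists solely for human eyes.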
Resources
- Everything you did not want to know about Unicode in Python 3
- Practical Python porting for systems programmers
- Pragmatic Unicode
(For even more fun, consider command line parameters and environment variables! I’m annoyed enough with filenames to leave those alone for now.)
Ok, I want strict unicode mount option on my filesystems. Just fail to create garbage.
Ditto parsing archives. If they contain garbage, I don’t want it.
And make interface names strict ASCII. Turns out one can call their NIC 💩.
After mirroring some stuff with wget I ended up with an invalid Unicode filename. Perhaps this is a wget bug, although it was actually a different encoding (incorrectly) sent as UTF-8 by the server, but the simple fact is that there isn’t a single program on my computer that cares. I only found out because a Python 3 program crashed while scanning the folder it’s in.
So yeah, garbage, whatever, a filename doesn’t display as nicely as it should. Big whoop. Python 3 program, broken.
This is EXACTLY the kind of experience I’ve had all too often with Python 3 programs. Sigh.
Indeed, this is one of the most common problems I face here. Here in Japan, the encodings in use have changed very recently, and far too much stuff is still in iso-2022-1 or some other encoding. There is a simple reason why UTF-8 took so long to catch on here – the encoding of Japanese characters just doubled in length (from 2 to 4 bytes).
For those who remember, the initial internal encoding for MULE, the multilingual part of Emacs, was iso-2022-1 AFAIR.
Anyway, fortunately Perl has no problems reading file names in arbitrary encodings, but Python made me stumble …
As long as you work on your pet projects in your basement, sure, demand whatever you want from filenames.
If you work in the real world, and have to process stuff sent by users, or run your program on users’ machines, this won’t fly.
Right, so Python sucks. The problem is, when it comes to file names pretty much all programming languages suck to various degrees. Remember that various parts of the file system can use different encodings for file names, and nobody can tell which is which, because Linux file systems themselves certainly aren’t generally interested in remembering that sort of information in the first place. So even in Rust, if you want to present a file name you just read from the disk to the user, then unless the language is unusually clairvoyant there will be some amount of guesswork involved. And even if your complete file system is UTF-8, someone can bring a USB drive along that has some file names in old Macintosh encoding and everything goes haywire. Life is hard.
As somebody who writes software for other people, the best solution certainly is to stick to POSIX-portable file names (i.e., ASCII letters and digits plus a few select special characters like the dot and dash) and exhort users to do the same unless they really know what they’re doing. If they absolutely must dig their own graves by using outlandish code points then on their own heads be it. Perhaps 10 or 20 years from now everyone will be using UTF-8 for file names and this problem will become insignificant but I’m not holding my breath.
+1
Is the assumption that because POSIX supports these types of filenames, zip does too? I don’t think that’s the case.
I think the Python implementation is adhering to the zip specification.
From the specification v6.3.6 (Revised: April 26, 2019):
If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment MUST support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification.
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
I can tell you that the zip(1) on Unix systems has never done re-encoding to cp437; on a system that uses latin-1 (or any other latin-* for that matter) the filenames in the ZIP will be encoded in latin-1. Furthermore, this doesn’t explain the corruption that extractall() causes.
Wait, I’m not sure this is correct.
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
Open file and return a stream. Raise OSError upon failure.
file is either a text or byte string giving the name (and the path
if the file isn’t in the current working directory) of the file to
be opened or an integer file descriptor of the file to be
wrapped. (If a file descriptor is given, it is closed when the
returned I/O object is closed, unless closefd is set to False.)
You’re free to use Unicode strings or byte strings for file names across a wide set of Python 3 APIs.
I specifically mentioned that open() accepts both bytes and strings. However, this is not the case across even the standard library included with Python 3. I highlighted two common examples: zipfile and dbm.
I was really surprised that this was posted recently — it sure seemed like a rant from the past, pre-py3.5 era or so. Is this really still such a big issue? (note, never been one for me :-) ).
But the key thing that this post ALMOST does, that I haven’t seen before, is acknowledge that POSIX IS BROKEN. Sure, “a filename is simply a set of bytes” works great with the old C char* way of thinking about the world, but if you are not a programmer, you want filenames to be: readable, printable, storable in and readable from a text file, usable on a command line, etc, etc. Having filenames that will break shell use, terminals, and who knows what else, was always a really bad idea.
That being said, posix has been around a long time, so it would be nice if Python was able to deal with it as it is, not as it should be.
So yeah, it probably would have been better to have a “filename” type in Python that would be able to enforce the local rules, and allow essentially broken filesystems not to break Python programs. Maybe even the Path object could do that. As far as I know, no one has written such an object — maybe because the stdlib wouldn’t have been able to deal with it. But we now have the __fspath__ protocol — so *maybe* one could write a Path object that worked with arbitrary byte strings on a filesystem now.
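Something along those lines does seem possible today; here is a rough sketch (BytesPath is a hypothetical name, not a stdlib class, and the filename is invented for illustration):

```python
import os
import tempfile

class BytesPath:
    """A minimal path wrapper carrying an arbitrary POSIX byte-string name."""
    def __init__(self, raw: bytes):
        self.raw = raw
    def __fspath__(self) -> bytes:
        # __fspath__ may return bytes, which open(), os.stat(), etc. honor.
        return self.raw

d = tempfile.mkdtemp()
p = BytesPath(os.path.join(d.encode(), b"odd\xf7name"))
with open(p, "w") as f:          # open() calls __fspath__ on p
    f.write("it works")
print(os.stat(p).st_size)        # 8
```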
But the biggest issue with these rants is that they don’t acknowledge that filenames with weird-ass characters in them have always caused problems (this one does, but then kinda glosses over the significance of that) — sure, they don’t cause problems when passed around as a char*, but pretty much any situation where a Python3 program will break now, there would have been breakage somewhere else instead. So maybe it’s not SO important that your Python programs be able to hide what will be problems later on anyway.
A big example is really, really common in my work: I need to store a filename in a text file, to be read and the file opened later. Unless I declare that all my text files are arbitrarily encoded (that is, not text files :-) ), I need filenames to be valid Unicode.
And heck, I’m still fighting with programs that can’t handle f-ing spaces in filenames!
Hi Chris,
I agree with you, and acknowledged in the post, that the fact that things like \n are valid in POSIX filenames was not a good move. Even today, shell code is written that breaks with spaces in filenames. Fun.
I also fully support applications putting restrictions on filenames – especially ones doing automated processing, etc. But the language shouldn’t. Anyhow, thanks for the conversation!
“So yeah, it probably would have been better to have a “filename” type in Python that would be able to enforce the local rules, and allow essentially broken filesystems not to break Python programs.”
So yeah, it probably would have been better if the core developers had listened to people telling them that various abstractions and the standard library needed improving, instead of them getting lots of ambitious ideas about tweaking the language, with various fundamental CPython runtime deficiencies remaining unaddressed over a decade later, and Python generally having missed the boat completely in whole realms of modern computing.
(And for the Debian angle, given that I read these articles – not rants, by the way – via Planet Debian, I see that PyPy is in danger of being kicked out of Debian due to the paper-shuffling exercise that is the Python 2 removal “transition”. It must be very fulfilling to dedicate so much time to improving the Python ecosystem and to develop a modern runtime for Python only to be berated for “not keeping up”, or whatever the armchair commentators tend to say, and seeing your contribution towards a better language ecosystem withheld from the people who might benefit from it.)
For what it is worth, systems like Plan 9 – the effective origin of UTF-8 encoding – showed the way for POSIX, but obviously those responsible for the systems we use tend to cherry-pick the ideas they like from such systems while badmouthing their originators.
@jgoerzen Neat post, thx for the link! Didn’t come up often for me (possibly because I’m very disciplined with file names), but from reading what you write it’s pretty obvious and sad that a big breaking change like Python 2 to 3 didn’t handle this right.
@holger @cadadr Hah, well that’s one way :-) Personally I have become a fan of #Rust, but then my #Haskell background perhaps makes me predisposed to safety-focused languages with elegant type systems. There’s always elisp if you get bored 🙂
@jgoerzen @holger Elisp is the programming language supreme for me. Maybe not the language itself, but the paradigm it’s in; it’s been a really productive experience. I love Haskell, but two things make it a no-go for me: its packaging/build stuff is slow and hard to get right, and it’s hard to do interactive programming/debugging, especially when you’re not familiar enough to avoid much trial and error.
@feoh @ashwinvis Well I said that in the context of “what’s your go-to scripting language?”. So bash has a lot of similar (or worse) bugs handling filenames. People still use it. It does have workarounds, though you have to know how to use them carefully. Python 3 doesn’t even have workarounds in many places. Do you want something that will crash if \xF7 is in a filename? That’s just not robust coding practice. You’ve got to handle the unexpected-but-valid. And it’s HARD in Python 3.
@joeyh @ngate Also I am bothered almost DAILY that Github has turned a distributed system into a centralized one in so many people’s minds.
@liw Also the next step would be to verify that those that accept a Path work with strings that aren’t valid UTF-8. Maybe they just convert to UTF-8 internally.
@jgoerzen Wow! Looking forward to reading this.
@JonYoder If you want a coffee with your reading, after that first link, you can check out https://changelog.complete.org/archives/10053-the-incredible-disaster-of-python-3 and https://changelog.complete.org/archives/9938-the-python-unicode-mess . The upshot of it is, as far as I can tell, it is impossible to write cross-platform #Python code that handles filenames correctly on both POSIX and Windows. #Rust gets this right, and Python’s attempt to assume the whole world has used #Unicode since the beginning of time is a real pain.
@JonYoder It is astonishing to me that #Python still has a Global Interpreter Lock in 2022. https://wiki.python.org/moin/GlobalInterpreterLock Multithreading in Python is mostly a fiction. There are kludges like https://docs.python.org/3/library/multiprocessing.html which use fork, pipes, pickling, and message passing to simulate threads. But there are so many dragons down that path — performance and platform-specific ones (different things can be pickled on Windows vs. Linux) that it is a poor substitute. 3/
@JonYoder Sure, people use #Python for things like #AI work. In this case, Python is merely a shell; the real multithreaded code is in a different language (often C). The way to get performant multithreading out of Python is to not use Python at all. 4/
@JonYoder When I started using #Python more than 20 years ago now, it was an attractive alternative to Perl: like Perl, you don’t have to worry about memory management as with C, but Python code was more maintainable. By now, though, even writing a Unix-style cat command in Python is extraordinarily complicated https://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ . All the “foo-like objects” are an interesting abstraction until they break horribly, and the lack of strong types makes it hard to scale code size. 5/
@JonYoder These days, we have credible alternatives to #Python: #Rust, #Go, and #Haskell (among many others). All three of these are performant, avoid all the manual legwork of #C or the boilerplate of #Java, and provide easy ways to do simple things. 6/
@JonYoder The one place I still see #Python being used is situations where the #REPL is valuable. (Note, #Haskell also has this). #Jupyter is an example of this too. People use #Python for rapid testing of things and interactive prototyping. For a time, when I had date arithmetic problems, I’d open up the Python CLI and write stuff there. Nowadays it’s simpler to just write a Rust program to do it for me, really. 7/
@JonYoder So that leaves me thinking: We’re thinking about #Python wrong these days. Its greatest utility is as a shell, not a language to write large programs in. As a shell, it is decent, especially for scientific work. Like other shells, most of the serious work is farmed out to code not written in Python, but there is utility in having it as a shell anyhow. And like a shell, once your requirements get to a certain point, you reach for something more serious. end/
@jgoerzen @JonYoder I refuse to learn a new language unless it has some sort of multithreading built into it from the start. Even C / C++ multithreading is a nightmare hack. Julia and Go are my favorites these days
@Wraptile @JonYoder The fundamental point I was trying to make is not that Python is bad for all tasks. Just that it makes simple things (dealing with filenames and stdin) extremely difficult, has semantics that lead to vastly counterintuitive results with comparisons, and with more modern type inference, the weakly-typed nature of it isn’t holding up well. This is why I personally would not reach for Python, given a choice.
@Wraptile @JonYoder I have long since passed having an emotional attachment to languages. I have had books published (by O’Reilly, APress, and others) on C, Python, and Haskell. I’ve contributed significant libraries to OCaml and Haskell, including a popular database layer for Haskell. And here I am writing code in Rust. Use what makes you happy, I don’t care. But don’t try to tell me Python’s filename handling is correct, because it objectively isn’t.
@jgoerzen @JonYoder I’m not saying it’s correct I’m saying it’s statistically insignificant. Having more people being able to work and solve problems in an elegant, sustainable fashion is more important than handling 1 edge case that will be encountered by one program and that can be patched with a line of code or a community package. If you want to play this game of edge cases then you can put any language down even Rust – it’s an absurdist, pointless, argument.
@Wraptile @JonYoder Except none of your premises are correct. The problem is pervasive in the Python ecosystem. Community packages get it wrong, the standard library gets it wrong, and it has significant real-world consequences (eg, gpodder crashing), and it is not a matter of a one-line fix. Using the wrong data type for a filename is a pretty fundamental problem. And it’s not the only problem; the GIL is another fundamental problem; multithreading in Python is a fiction.
@Wraptile @JonYoder That’s not to say there are zero cases where Python is a nice fit; I pointed to, eg, rapid prototyping with Jupyter. But what I’m saying is if a modern language is incapable of multithreading and has a community-wide problem around such basics as filenames and comparisons, and better alternatives exist, why reach for the one where it is freaking difficult to open a file properly or parallelize algorithms in the general case?