The Incredible Disaster of Python 3

November 20, 2019 | Software | python | John Goerzen

Update 2019-11-22: A successor article to this one dives into some of the underlying complaints.

I have long noted issues with Python 3’s bytes/str separation, which is designed to have a type “bytes” that is a simple list of 8-bit characters, and “str” which is a Unicode string. After apps started using Python 3, I started noticing issues: they couldn’t open filenames that were in ISO-8859-1, gpodder couldn’t download podcasts with 8-bit characters in their title, etc. I have files on my system dating back to well before widespread Unicode support in Linux.

Due to both upstream and Debian deprecation of Python 2, I have been working to port pygopherd to Python 3. I was not looking forward to this task. It turns out that the string/byte types in Python 3 are even more of a disaster than I had at first realized.

Background: POSIX filenames

On POSIX platforms such as Unix, a filename consists of one or more 8-bit bytes, which may be any 8-bit value other than 0x00 or 0x2F (‘/’). So a file named “test\xf7.txt” is perfectly acceptable on a Linux system, and in ISO-8859-1, that filename would contain the division sign ÷. Any language that can’t process valid filenames has serious bugs – and Python is littered with these bugs.

Inconsistencies in Types

Before we get to those bugs, let’s look at this:

>>> "/foo"[0]
'/'
>>> "/foo"[0] == '/'
True
>>> b"/foo"[0]
47
>>> b"/foo"[0] == '/'     # this will fail anyhow because bytes never equals str
False
>>> b"/foo"[0] == b'/'
False
>>> b"/foo"[0] == b'/'[0]
True

Look at those last two items. With the bytes type, you can’t compare a single element of a list to a single character, even though you still can with a str. I have no explanation for this mysterious behavior, though thankfully the extensive tests I wrote in 2003 for pygopherd did cover it.
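
For readers puzzled by the same thing, here is a minimal sketch of comparisons that do behave as expected. It rests on one fact: indexing a bytes object yields an int, while slicing yields a length-1 bytes object. The variable names are purely illustrative.

```python
path = b"/foo"

# Indexing a bytes object yields an int (the byte's value), not a length-1 bytes.
first = path[0]

# Three ways to ask "does this start with a slash?" that behave as expected:
by_slice = path[0:1] == b"/"         # slicing keeps the bytes type
by_ordinal = path[0] == ord("/")     # compare int with int
by_method = path.startswith(b"/")    # avoids indexing altogether

print(first, by_slice, by_ordinal, by_method)
```

Of the three, startswith() is probably the least surprising, since the same call works on both str and bytes as long as the argument is of the matching type.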

Bugs in the standard library

A whole class of bugs arises because parts of the standard library will accept str or bytes for filenames, while other parts accept only str. Here are the particularly egregious examples I ran into.

Python 3’s zipfile module is full of absolutely terrible code. As I reported in Python bug 38861, even a simple zipfile.extractall() fails to faithfully reproduce filenames contained in a ZIP file. Not only that, but there is egregious code like this in zipfile.py:

            if flags & 0x800:
                # UTF-8 file names extension
                filename = filename.decode('utf-8')
            else:
                # Historical ZIP filename encoding
                filename = filename.decode('cp437')

I can assure you that zip on Unix was not mystically converting filenames from iso-8859-* to cp437 (which was from DOS, and almost unheard-of on Unix). Or how about this gem:

    def _encodeFilenameFlags(self):
        try:
            return self.filename.encode('ascii'), self.flag_bits
        except UnicodeEncodeError:
            return self.filename.encode('utf-8'), self.flag_bits | 0x800

This combines to a situation where perfectly valid filenames cannot be processed by the zipfile module, valid filenames are mangled on extraction, and unwanted and incorrect character set conversions are performed. zipfile has no mechanism to access ZIP filenames as bytes.
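
To make the mangling concrete, here is a small sketch of what the cp437 branch does to the ISO-8859-1 name used earlier, along with a workaround that can recover the stored bytes. The helper raw_zip_name is hypothetical and not part of zipfile; it assumes the entry went through one of the two branches shown in the excerpt above, and it only relies on the filename and flag_bits attributes of ZipInfo.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11, as tested in the zipfile.py excerpt


def raw_zip_name(info: zipfile.ZipInfo) -> bytes:
    """Hypothetical helper: recover the filename bytes stored in the archive."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename.encode("utf-8")
    # cp437 maps every byte value to some character, so re-encoding reverses
    # the decode step and gives back the original bytes.
    return info.filename.encode("cp437")


stored = b"test\xf7.txt"            # the division-sign name in ISO-8859-1
decoded = stored.decode("cp437")    # what the cp437 branch produces
print(repr(decoded))                # 0xF7 has become U+2248, not a division sign
print(decoded.encode("utf-8"))      # different bytes if written out as UTF-8

info = zipfile.ZipInfo(decoded)     # stand-in for an entry read from an archive
print(raw_zip_name(info) == stored)
```

This only undoes the decoding step. Code using it still has to create the output file itself rather than call extractall().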

How about the dbm module? It simply has no way to specify a filename as bytes, and absolutely can’t open a file named “text\x7f”. There is simply no way to make that happen. I reported this in Python bug 38864.

Update 2019-11-20: As is pointed out in the comments, there is a way to encode this byte in a Unicode string in Python, so “absolutely can’t open” was incorrect. However, I strongly suspect that little code uses that approach and it remains a problem.

I should note that a simple open(b"foo\x7f.txt", "w") works. The lowest-level calls are smart enough to handle this, but the ecosystem built atop them is uneven at best. It certainly doesn’t help that things like b"foo" + "/" are runtime crashers.
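
One way to sidestep the mixing crash is to normalise every path component to bytes before joining. This is a minimal sketch, assuming a POSIX system; join_bytes is a hypothetical helper, while os.fsencode() and os.path.join() are the standard library calls.

```python
import os


def join_bytes(*parts):
    """Hypothetical helper: join str and bytes components into one bytes path."""
    # os.fsencode leaves bytes alone and encodes str with the filesystem
    # encoding, mapping surrogate escapes back to their original bytes on POSIX.
    return os.path.join(*(os.fsencode(p) for p in parts))


try:
    b"foo" + "/"                    # mixing the two types directly
except TypeError as exc:
    print("TypeError:", exc)

print(join_bytes(b"foo\x7f", "bar.txt"))
```

The result can be passed straight to open(), which accepts bytes paths as noted above.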

Larger Consequences of These Issues

I am absolutely convinced that these are not the only two modules distributed with Python itself that are incapable of opening or processing valid files on a Unix system. I fully expect that these issues are littered throughout the library. Nobody appears to be testing for them. Nobody appears to care about them.

It is part of a worrying trend I have been seeing lately of people cutting corners and failing to handle valid things that have been part of the system for years. We are, by example and implementation, teaching programmers that these shortcuts are fine, that it’s fine to use something that is required to be utf-8 to refer to filenames on Linux, etc. A generation of programmers will grow up writing code that is incapable of processing files with perfectly valid names. I am thankful that grep, etc. aren’t written in Python, because if they were, they’d crash all the time.

Here are some other examples:

When running “git status” on my IBM3151 terminal connected to Linux, I found it would clear the screen each time. Huh. Apparently git assumes that if you’re using it from a terminal, the terminal supports color, and it doesn’t bother using terminfo; it just sends ANSI sequences assuming that everything uses them. The IBM3151 doesn’t by default. (GNU tools like ls get this right.) This is but one egregious example of a whole suite of tools that fail to use the ncurses/terminfo libraries that we’ve had for years to properly abstract these things.
A whole suite of tools, including ssh, tmux, and so forth, blindly disable handling of XON/XOFF on the terminal, neglecting the fact that this is actually quite important for some serial lines. Thankfully I can at least wrap things in GNU Screen to get proper XON/XOFF handling.
The Linux Keyspan USB serial driver doesn’t even implement XON/XOFF handling at all.

Now, you might make an argument “Well, ISO-8859-* is deprecated. We’ve all moved on to Unicode!” And you would be, of course, wrong. Unix had roughly 30 years of history before xterm supported UTF-8. It would be quite a few more years until UTF-8 reached the status of default for many systems; it wasn’t until Debian etch in 2007 that Debian used utf-8 by default. Files with contents or names in other encoding schemes exist and people find value in old files. “Just rename them all!” you might say. In some situations, that might work, but consider: how many symlinks would it break? How many scripts that refer to things by filenames would it break? The answer is most certainly nonzero. There is no harm in having files lying about the system in other encoding schemes, except to buggy software that can’t cope. And this post doesn’t even concern the content of files, which is a whole additional problem, though thankfully the situation there is generally at least somewhat better.

There are also still plenty of systems that can’t handle multibyte characters (and in various embedded or mainframe contexts, can’t even handle 8-bit characters). Not all terminals support ANSI. It requires only correct thinking (“What is a valid POSIX filename? OK, our datatypes better support that then”) to do the right thing.

Update 1, 2019-11-21: Here is an article dating back to 2014 about the Unicode issues in Python 3, which goes into quite a bit of detail about it. It lays out a compelling case for the issues with its attempt to implement a replacement for cat in python 2 and 3. The Practical Python porting for systems programmers is also relevant and, like me, highlights many of these same issues. Finally, this is not the first time I raised issues; I wrote The Python Unicode Mess more than a year ago. Unfortunately, as I am now working to port a larger codebase, the issues I raised before are more acute, and I have discovered more. At this point, I am extremely unlikely to use Python for any new project due to these issues.

38 thoughts on “The Incredible Disaster of Python 3”

uau says:

November 20, 2019 at 12:08 pm

The description of Python behavior is technically inaccurate. It’s not quite as naive about arbitrary-byte filenames as described. There’s an explicitly designed way to embed arbitrary byte escapes for filenames in a Unicode string type. An example:

>>> f=open(b"foo\x7f.txt", "w")
>>> f=open(b"foo\xf7.txt", "w")
>>> import os
>>> os.listdir()
['foo\udcf7.txt', 'foo\x7f.txt']

Note that os.listdir() returns not bytes but Unicode strings. And you can use such a Unicode string to open the file.

Reply
1. John Goerzen says:
  
  November 20, 2019 at 1:54 pm
  
  But this only goes so far. Sure, if you’re taking the output from os.listdir() it might work. But we get filenames from all sorts of other sources as well: other files, network requests (as is the case here), etc., which are not going to be encoded in that way. In the case of zipfile, it does not transform filenames using this method and thus the problem persists. I suppose with dbm one could perhaps work that way… But still, I maintain that this is a hack on top of bad design rather than a proper approach.
  
  Contrast with Rust’s Path type, which is always explicitly clear.
  
  Reply
uau says:

November 20, 2019 at 12:38 pm

Addition:
The way to use unicode strings to open an arbitrary filename given as a bytes object with the dbm module:
>>> import dbm
>>> file = b"test\xf7"
>>> dbm.open(file.decode('utf-8', 'surrogateescape')) #works

Reply
Adam says:

November 20, 2019 at 3:39 pm

Also puzzling:

>>> "/foo"[0:1]
'/'
>>> "/foo"[0]
'/'
>>> b"/foo"[0:1]
b'/'
>>> b"/foo"[0]
47

Reply
1. Sour Caustic says:
  
  November 22, 2019 at 11:23 am
  
  Let’s demystify it for you then.
  
  First forget the unicode strings, focus only on the byte strings. They don’t obey the same laws.
  
  >>> b'/foo'
  
  The above is a string of 8 bit bytes, i.e. a string of numbers that may or may not have a corresponding ascii character. The fact that you can input the string using ascii literal characters is only a convenience. What’s important is that you should consider this as nothing other than a string of 8 bit bytes.
  
  >>> b''
  
  Another string of 8 bit bytes, which happens to be empty.
  
  >>> b'/foo'[0:1]
  
  A string of 8 bit bytes starting at the first byte of some other string of bytes and ending (exclusively) at the second byte.
  
  >>> b'/foo'[0:1]==b'/'
  
  The two strings of bytes contain the same bytes.
  
  >>> b'/'[0]
  
  The first byte in a string of 8 bit bytes.
  
  >>> b'/'
  
  A string of 8 bit bytes that you created with only one ascii literal, but a string of bytes is not a byte and so…
  
  >>> b'/'==b'/'[0]
  False
  
  A string of bytes is not a byte.
  
  Reply
Norman (((factchequeUK))) WilsonⓂ️ says: @ twitter.com

November 20, 2019 at 10:38 am

You make an eternal point. We keep encouraging all sorts of bad habits by example, from ignoring potential buffer overflows to failing to handle errors. I don’t know how to cure this given that it’s decades-old. (See Elements of Programming Style for mid-1970s exemplars.)

Reply
Leo says:

November 20, 2019 at 4:47 pm

b"/foo"[0] is a byte (int, for lack of more specific typing)
b"/" is a byte slice

That’s why b"/foo"[0:1] == b"/", but b"/foo"[0] != b"/"

It may be slightly surprising if you’re used to py2’s types, but if you take a step back from what “feels” right purely because of comfort, py3’s behavior turns out to be more consistent (at least in this specific case).

As for the stdlib, yeah, there’s a lot of sub-optimally maintained code out there, but IMHO it’s been a side effect of the stdlib’s size since well before py3.

Reply
1. Adam says:
  
  November 21, 2019 at 3:18 pm
  
  I would expect indexing a single element of a list to yield the same element as a slice that contains that single element, yes.
  
  This seems to happen for strings, but not for bytes.
  
  That is indeed surprising to me.
  
  Note: I have never programmed in Python 2. I came to Python 3 from primarily Perl and before that C.
  
  How is b"/"[0] being 47 and b"/"[0:1] being something different “more consistent”? To me it “feels” like both should be the same thing, a single element of the original list.
  
  What if I had a list of objects? Would a slice of one element also be different than indexing that same element?
  
  Reply
  1. Anselm Lingnau says:
    
    November 21, 2019 at 8:01 pm
    
    It might help you to think of the `bytes` type as similar to the `tuple` type. If `t = (1, 2, 3)`, then `t[0]` is `1`, an `int`, but `t[0:1]` is `(1,)`, a length-1 tuple. The same would apply to tuples or lists of any objects.
    
    Strings are weird in that respect because the individual elements of a `str` are length-1 `str` objects – Python does not have a `char` type.
    
    If you’ve been programming Perl you should be aware that in Perl, `$foo[0]` and `@foo[0]` are completely different beasts even though they look deceptively similar. Python’s foibles take a lot less getting used to than Perl’s.
    
    Reply
Maik Zumstrull says: @ twitter.com

November 20, 2019 at 1:42 pm

This is a fascinating combination of completely correct and completely divorced from reality, and I’m gonna need a minute to decide how I feel about it.

changelog.complete.org/archives/10053…

Reply
NOKUBI Takatsugu野首貴嗣 says: @ twitter.com

November 20, 2019 at 8:00 pm

The Incredible Disaster of Python 3 | The Changelog : changelog.complete.org/archives/10053…
Handling filenames is still a hassle, even now.

Reply
nobody says:

November 21, 2019 at 2:14 am

Incredible how people wait until the very last minute (of a long-announced end of life) before they bother looking at the new version (in this special case, Python 3 has only been available for such a short time).
All these questions could have been raised long before and perhaps have led to improvement – or to explanations (thanks, uau) of how things are meant to be used.
The incredible disaster of procrastination…

Reply
1. John Goerzen says:
  
  November 21, 2019 at 8:57 am
  
  I first wrote about this more than a year ago: https://changelog.complete.org/archives/9938-the-python-unicode-mess This is the REASON I was waiting with porting pygopherd — I was hoping some sanity might arrive to this situation.
  
  That post highlights the filename issues, but also some other issues — that many answers on Stackoverflow are wrong, the difficulties in handling environment variables. Reviewing some of the links from that post, I see os.fsencode() and os.fsdecode() which look to be perhaps close to the right answers. Unfortunately, these seem to be almost universally ignored; they’re not used in zipfile, not used in most answers to these questions I see, etc.
  
  So perhaps core Python gives us a workaround for a bad situation. But if this workaround is used rarely by commonly-used libraries — even those included with Python itself — how useful is it?
  
  The problem with the current design is that it’s **broken by default**. You have to KNOW to do things like surrogateescape or os.fsencode() and almost no code I’ve seen does. Even things like zipfile that are aware of the problem have an incorrect solution.
  
  Reply
Anselm Lingnau says:

November 21, 2019 at 1:30 pm

I’m in the process of helping update a large program from Python 2.7 to Python 3. This is a very unsavoury exercise simply because the people who wrote the code originally had been playing fast and loose with strings vs. binary data, and it is now upon us to clean up this mess.

On the whole I’m way happier with the way Python 3 does things because there is a clear distinction between strings (as in, sequences of Unicode code points) and sequences of arbitrary bytes, and as a programmer it’s just as well to keep the two separate. I agree that (a) legacy file names are an issue, and (b) bugs in the standard library suck, but all things considered I believe we’re better off with Python 3’s approach.

Reply
1. Paul Boddie says:
  
  November 23, 2019 at 1:58 pm
  
  “On the whole I’m way happier with the way Python 3 does things because there is a clear distinction between strings (as in, sequences of Unicode code points) and sequences of arbitrary bytes”
  
  And such a distinction existed in Python 1.6 and the entire 2.x series. The most significant difference those versions have to Python 3 is the automatic coercion between these two sequence types that has a tendency to go wrong when plain (byte) strings contain character values outside the ASCII range.
  
  But contrary to popular misconception – not stated here but annoyingly recurrent on the Web – it was always possible to support Unicode in Python 2 (and 1.6) programs. One might wonder whether Python 2 could have evolved to be more acceptable and less troublesome, but I guess people would not have had so much “fun” rearranging the furniture.
  
  In turn, evolving Python 2 would have been far less disruptive, and we would not now be seeing opportunistic finger wagging from random freeloaders about “procrastination”, nor be making those with investments in stable and mature software do make-work to keep what they have. Which is what the Debian Python 2 purge ultimately is.
  
  Reply
Petr Baudis says:

November 22, 2019 at 11:09 am

This is a nostalgic article, as underlined in the closing section about XON/XOFF and mainframe-compatible escape sequences.
The world is moving on, and while historic systems are beautiful (I still have a 2.11 BSD emulator running – or rather runnable – somewhere), at some point you need to weigh the breakage for legacy users against the cost of maintaining compatibility.

Indeed, POSIX is still mandating that filenames are arbitrary byte sequences. But it is just becoming impractical, and in the end it’s up to whoever has the motivation to have it working to keep it working, and if there’s not enough people with this motivation it’s just going to inevitably rot.

It’s likely that 10 years from now, anything non-Unicode will be completely broken on modern (desktop, at least) systems and perhaps Linux even gets an opt-in mount option for enforcing filenames to be utf-8-compatible (which may change to opt-out another 10 years on, just as POSIX is going to evolve too in this regard).

Yes, it’s a pity and I likely still have some ISO-8859-2 files from 1999 on my filesystem. But I think it’s unreasonable for anyone to waste time on that support. And I wouldn’t advise anyone to waste an extra 20 hours of developer life on building things around ncurses instead of a more direct approach – build a cool feature in that time instead!

Reply
HN Front Page says: @ twitter.com

November 22, 2019 at 10:13 am

The Disaster of Python 3
L: changelog.complete.org/archives/10053…
C: news.ycombinator.com/item?id=216064…

Reply
Angsuman Chakraborty says: @ twitter.com

November 22, 2019 at 11:31 am

The Disaster of Python 3 changelog.complete.org/archives/10053…

Reply
Oliver Hunt says: @ twitter.com

November 22, 2019 at 9:24 pm

Ongoing saga of python3 needlessly breaking behavior that worked in p2, and ensuring that “deprecated” python2 code will continue to be necessary for an extremely long time: changelog.complete.org/archives/10053…

Reply
xatier@命短し飲め飲め乙女リクリ says: @ twitter.com

November 22, 2019 at 11:27 pm

The Incredible Disaster of Python 3 | The Changelog
changelog.complete.org/archives/10053…

Reply
Lup Yuen Lee 李立源 says: @ twitter.com

November 23, 2019 at 5:11 am

Flashback to my struggles with Unicode strings in Python 3…

changelog.complete.org/archives/10053…

Reply
Well, actually ... says: @ twitter.com

November 23, 2019 at 6:59 am

Maybe I’m gonna have a look at C# instead🤔changelog.complete.org/archives/10053…

Reply
Jon Nalley says:

November 24, 2019 at 7:10 pm

Is the assumption that because POSIX supports these types of filenames, zip does too? I don’t think that’s the case.

I think the Python implementation is adhering to the zip specification.

From the specification v6.3.6 (Revised: April 26, 2019):

If general purpose bit 11 is unset, the file name and comment SHOULD conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment MUST support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification.

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

Reply
1. John Goerzen says:
  
  November 24, 2019 at 9:02 pm
  
  I can tell you that the zip(1) on Unix systems has never done re-encoding to cp437; on a system that uses latin-1 (or any other latin-* for that matter) the filenames in the ZIP will be encoded in latin-1. Furthermore, this doesn’t explain the corruption that extractall() causes.
  
  Reply
John Goerzen says: @ changelog.complete.org

December 9, 2020 at 10:55 pm

“In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move.”
– Douglas Adams

This expands on my recent post The Incredible Disaster of Python 3. I seem to have annoyed the Internet…
Back in the mists of time, Unix was invented. Today the descendants of Unix, whether literal or in spirit, power the majority of the world’s cell phones, most of the most popular sites on the Internet, etc. And among this very popular architecture, there lies something that has made people very angry at times: on a Unix filesystem, 254 bytes are valid in filenames. The two that are not are 0x00 and the slash character. Otherwise, they are valid in virtually any combination (the special entries “.” and “..” being the exception).
This property has led to a whole host of bugs, particularly in shell scripts. A filename with a leading dash might look like a parameter to a tool. Filenames can contain newline characters, space characters, control characters, and so forth; running ls in a directory with maliciously-named files could certainly scramble one’s terminal. These bugs continue to persist, though modern shells offer techniques that — while optional — can be used to avoid most of these classes of bugs.
It should be noted here that not every valid stream of bytes constitutes a stream of bytes that can be decoded as UTF-8. This is a departure from earlier encoding schemes such as iso-8859-1 and cp437; you might get gibberish, but “garbage in, garbage out” was a thing and if your channel was 8-bit clean, your gibberish would survive unmodified.
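
The difference is easy to demonstrate. The sketch below runs every possible byte value through both codecs; it uses only the built-in latin-1 (ISO-8859-1) and utf-8 codecs, and the variable names are illustrative.

```python
every_byte = bytes(range(256))

# ISO-8859-1 assigns a character to each of the 256 byte values, so decoding
# never fails and encoding the result gives back the identical bytes.
as_latin1 = every_byte.decode("latin-1")
latin1_round_trip = as_latin1.encode("latin-1") == every_byte

# UTF-8 has structural rules, so an arbitrary byte stream is usually rejected.
try:
    every_byte.decode("utf-8")
    utf8_accepts = True
except UnicodeDecodeError:
    utf8_accepts = False

print(latin1_round_trip, utf8_accepts)
```
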
Unicode brings many advantages, and has rightly become the predominant standard for text encoding. But the previous paragraph highlights one of the challenges, and this challenge, along with some others, is at the heart of the problem with Python 3. That is because, at a fundamental level, Python 3’s notion of a filename is based on a fiction. Not only that, but it tries to introduce strongly-typed strings into what is fundamentally a weakly-typed, dynamic language.
A quick diversion: The Rust Perspective
The Unicode problem is a problematic one, and it may be impossible to deal with it with complete elegance. Many approaches exist; here I will describe Rust’s string-related types, of which there are three for our purposes:
The String (and related &str) is used for textual data and contains bytes that are guaranteed to be valid UTF-8 at all times
The Vec<u8> (and related [u8]) is a representation of pure binary bytes, of which all 256 possible characters are valid in any combination, whether or not it forms valid UTF-8
And the Path, which represents a path name on the system.
The Path uses the underlying operating system’s appropriate data type (here I acknowledge that Windows is very different from POSIX in this regard, though I don’t go into that here). Compile-time errors are generated when these types are mixed without proper safe conversion.
The Python Fiction
Python, in contrast, has only two types; roughly analogous to the String and the Vec<u8> in Rust. Critically, most of the Python standard library treats a filename as a String – that is, a sequence of valid Unicode code points, which is a subset of the valid POSIX filenames.
Do you see what we just did here? We’ve set up another shell-like situation in which filenames that are valid on the system create unexpected behaviors in a language. Only this time, it’s not \n, it’s things like \xF7.
From a POSIX standpoint, the correct action would have been to use the bytes type for filenames; this would mandate proper encode/decode calls by the user, but it would have been quite clear. It should be noted that some of the most core calls in Python, such as open(), do accept both bytes and strings, but this behavior is by no means consistent in the standard library, and some parts of the library that process filenames (for instance, listdir in its most common usage) return strings.
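
A short sketch of the two listdir modes follows. It assumes a POSIX filesystem that accepts arbitrary bytes in names, which is typical on Linux but not guaranteed elsewhere, so it falls back to a plain ASCII name if the system refuses. The helper list_both is hypothetical.

```python
import os
import tempfile


def list_both(directory: str):
    """Hypothetical helper: list one directory as str names and as bytes names."""
    as_str = os.listdir(directory)                  # str in, str out
    as_bytes = os.listdir(os.fsencode(directory))   # bytes in, bytes out
    return as_str, as_bytes


with tempfile.TemporaryDirectory() as tmp:
    name = b"test\xf7.txt"                          # not valid UTF-8
    try:
        open(os.path.join(os.fsencode(tmp), name), "wb").close()
    except (OSError, UnicodeError):
        # Assumption: systems that refuse such names fail here; use ASCII instead.
        name = b"test.txt"
        open(os.path.join(os.fsencode(tmp), name), "wb").close()
    str_names, bytes_names = list_both(tmp)

print(str_names, bytes_names)
```

Only the bytes listing is guaranteed to show the name exactly as the filesystem stores it. What the str listing looks like depends on the filesystem encoding in effect.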
The Plot Thickens
At some point, it was clearly realized that this behavior was leading to a lot of trouble on POSIX systems. Having a listdir() function be unable (in its common usage; see below) to handle certain filenames was clearly not going to work. So Python introduced its surrogate escape. With surrogate escapes, a byte that cannot be decoded as UTF-8 is replaced with a lone surrogate code point in the range U+DC80 to U+DCFF, a part of the Unicode code space that is not otherwise used for text. Then, when the string is converted back to a binary sequence, this code point is converted to the same original byte. However, this is not a systemwide default and in many cases must be specifically requested.
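
The round trip described above can be sketched with the codec machinery directly. The surrogateescape error handler is standard (it comes from PEP 383); the byte string is just the example name used earlier.

```python
raw = b"test\xf7.txt"                       # 0xF7 is not valid UTF-8 on its own

# Decoding with surrogateescape maps the undecodable byte 0xF7 to U+DCF7.
text = raw.decode("utf-8", "surrogateescape")
print(repr(text))

# Encoding with the same handler turns U+DCF7 back into the original byte.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)

# Without the handler, the string cannot be encoded at all.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```
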
And now you see this is both an ugly kludge and a violation of the promise of what a string is supposed to be in Python 3, since this doesn’t represent a valid Unicode character at all, but rather a token for the notion that “there was a byte here that we couldn’t convert to Unicode.” Now you have a string that the system thinks is Unicode, that looks like Unicode, that you can process as Unicode — substituting, searching, appending, etc — but which is actually at least partially representing things that should rightly be unrepresentable in Unicode.
And, of course, surrogate escapes are not universally used by even the Python standard library either. So we are back to the problem we had in Python 2: what the heck is a string, anyway? It might be all valid Unicode, it might have surrogate escapes in it, it might have been decoded from the wrong locale (because life isn’t perfect), and so forth.
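
One modest defence is to check, at the point where a string enters the program, whether it carries escaped bytes. This is a sketch only: it assumes any lone surrogates came from a surrogateescape decode, and has_escaped_bytes is a hypothetical name.

```python
def has_escaped_bytes(name: str) -> bool:
    """Hypothetical check: True if name contains surrogateescape code points."""
    # surrogateescape maps undecodable bytes 0x80-0xFF to U+DC80-U+DCFF.
    return any("\udc80" <= ch <= "\udcff" for ch in name)


clean = "test.txt"
escaped = b"test\xf7.txt".decode("utf-8", "surrogateescape")

print(has_escaped_bytes(clean), has_escaped_bytes(escaped))
```

A check like this does not say which encoding the original bytes were in. It only flags that the string is not purely text.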
Unicode Realities
The article pragmatic Unicode highlights these facts:
Computers are built on bytes
The world needs more than 256 symbols
You cannot infer the encoding of bytes — you must be told, or have to guess
Sometimes you are told wrong
I have no reason to quibble with this. How, then, does that stack up with this code from Python? (From zipfile.py, distributed as part of Python)

if flags & 0x800:
    # UTF-8 file names extension
    filename = filename.decode('utf-8')
else:
    # Historical ZIP filename encoding
    filename = filename.decode('cp437')

There is a reason that Python can’t extract a simple ZIP file properly. The snippet above violated the third rule by inferring a cp437 encoding when it shouldn’t. But it’s worse; the combination of factors leads extractall() to essentially convert a single byte from CP437 to a multibyte Unicode code point on extraction, rather than simply faithfully reproducing the bytestream that was the filename. Oh, and it doesn’t use surrogate escapes. Good luck with that one.
It gets even worse
Let’s dissect Python’s disastrous documentation on Unicode filenames.
First, we begin with the premise that there is no filename encoding in POSIX. Filenames are just blobs of bytes. There is no filename encoding!
What about $LANG and friends? They give hints about the environment, languages for interfaces, and terminal encoding. They can often be the best HINT as to how we should render characters and interpret filenames. But they do not subvert the fundamental truth, which is that POSIX filenames do not have to conform to UTF-8.
So, back to the Python documentation. Here are the problems with it:
It says that there will be a filesystem encoding if you set LANG or LC_CTYPE, falling back to UTF-8 if not specified. As we have already established, UTF-8 can’t handle POSIX filenames.
It gets worse: “The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both”. So we are somewhat tacitly admitting here that str was a poor choice for filenames, but now we try to have it every which way. This is going to end badly.
And then there’s this gem: “Note that on most occasions, you should can just stick with using Unicode with these APIs. The bytes APIs should only be used on systems where undecodable file names can be present; that’s pretty much only Unix systems now.” Translation: Our default advice is to pretend the problem doesn’t exist, and will cause your programs to be broken or crash on POSIX.
Am I just living in the past?
This was the most common objection raised to my prior post. “Get over it, the world’s moved on.” Sorry, no. I laid out the case for proper handling of this in my previous post. But let’s assume that your filesystems are all new, with shiny UTF-8 characters. It’s STILL a problem. Why? Because it is likely that an errant or malicious non-UTF-8 sequence will cause a lot of programs to crash or malfunction.
We know how this story goes. All the shell scripts that do the wrong thing when “; rm” is in a filename, for instance. Now, Python is not a shell interpreter, but if you have a program that crashes on a valid filename, you have — at LEAST — a vector for denial of service. Depending on the circumstances, it could turn into more.
Conclusion
Some Python 3 code is going to crash or be unable to process certain valid POSIX filenames.
Some Python 3 code might use surrogate escapes to handle them.
Some Python 3 code — part of Python itself even — just assumes it’s all from cp437 (DOS) and converts it that way.
Some people recommend using latin-1 instead of surrogate escapes – even official Python documentation covers this.
The fact is: A Python string is the WRONG data type for a POSIX filename, and so numerous, incompatible kludges have been devised to work around this problem. There is no consensus on which kludge to use, or even whether or not to use one, even within Python itself, let alone the wider community. We are going to continue having these problems as long as Python continues to use a String as the fundamental type of a filename.
Doing the right thing in Python 3 is extremely hard, not obvious, and rarely taught. This is a recipe for a generation of buggy code. Easy things should be easy; hard things should be possible. Opening a file correctly should be easy. Sadly I fear we are in for many years of filename bugs in Python, because this would be hard to fix now.
Resources
Everything you did not want to know about Unicode in Python 3
Practical Python porting for systems programmers
Pragmatic Unicode
(For even more fun, consider command line parameters and environment variables! I’m annoyed enough with filenames to leave those alone for now.)

Reply
John Goerzen says: @ floss.social

July 27, 2021 at 9:56 am

@joeyh @ngate Also I am bothered almost DAILY that Github has turned a distributed system into a centralized one in so many people’s minds.

Reply
John Goerzen says: @ floss.social

June 1, 2022 at 7:47 am

@JonYoder You got me thinking in more detail why I reflexively avoid #Python now, despite the fact that I wrote two large programs (#OfflineIMAP and #pygopherd) in it, and published a book about it. 1/

John Goerzen says: @ floss.social

June 1, 2022 at 7:49 am

@JonYoder@mastodon.technology Avoiding #Python – Besides the absurd inconsistencies in https://changelog.complete.org/archives/10053-the-incredible-disaster-of-python-3 and the extreme difficulty verging on the impossibility of properly handling filenames in POSIX (see https://changelog.complete.org/archives/10063-the-fundamental-problem-in-python-3 and https://changelog.complete.org/archives/9938-the-python-unicode-mess ), there is more that makes me shy away. 2/

John Goerzen says: @ floss.social

June 1, 2022 at 7:52 am

@JonYoder It is astonishing to me that #Python still has a Global Interpreter Lock in 2022. https://wiki.python.org/moin/GlobalInterpreterLock Multithreading in Python is mostly a fiction. There are kludges like https://docs.python.org/3/library/multiprocessing.html which use fork, pipes, pickling, and message passing to simulate threads. But there are so many dragons down that path — performance and platform-specific ones (different things can be pickled on Windows vs. Linux) that it is a poor substitute. 3/

John Goerzen says: @ floss.social

June 1, 2022 at 7:53 am

@JonYoder Sure, people use #Python for things like #AI work. In this case, Python is merely a shell; the real multithreaded code is in a different language (often C). The way to get performant multithreading out of Python is to not use Python at all. 4/

John Goerzen says: @ floss.social

June 1, 2022 at 7:55 am

@JonYoder When I started using #Python more than 20 years ago now, it was an attractive alternative to Perl: like Perl, you don’t have to worry about memory management as with C, but Python code was more maintainable. By now, though, even writing a Unix-style cat command in Python is extraordinarily complicated https://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ . All the “foo-like objects” are an interesting abstraction until they break horribly, and the lack of strong types makes it hard to scale code size. 5/

John Goerzen says: @ floss.social

June 1, 2022 at 7:56 am

@JonYoder These days, we have credible alternatives to #Python: #Rust, #Go, and #Haskell (among many others). All three of these are performant, avoid all the manual legwork of #C or the boilerplate of #Java, and provide easy ways to do simple things. 6/

John Goerzen says: @ floss.social

June 1, 2022 at 7:58 am

@JonYoder The one place I still see #Python being used is situations where the #REPL is valuable. (Note, #Haskell also has this). #Jupyter is an example of this too. People use #Python for rapid testing of things and interactive prototyping. For a time, when I had date arithmetic problems, I’d open up the Python CLI and write stuff there. Nowadays it’s simpler to just write a Rust program to do it for me, really. 7/

John Goerzen says: @ floss.social

June 1, 2022 at 7:59 am

@JonYoder So that leaves me thinking: We’re thinking about #Python wrong these days. Its greatest utility is as a shell, not a language to write large programs in. As a shell, it is decent, especially for scientific work. Like other shells, most of the serious work is farmed out to code not written in Python, but there is utility in having it as a shell anyhow. And like a shell, once your requirements get to a certain point, you reach for something more serious. end/

Hieronymus says: @ mastodon.sdf.org

June 1, 2022 at 9:10 am

@jgoerzen @JonYoder I refuse to learn a new language unless it has some sort of multithreading built into it from the start. Even C / C++ multithreading is a nightmare hack. Julia and Go are my favorites these days
