Daily Archives: May 1, 2008

DjVu and the scourge of the PDF

A little while back, I wrote a blog post called “DjVu: Almost Awesome”, where I pointed out the strengths of the three formats in the DjVu family, but lamented the fact that there was no Free Software to create DjVu files in the most interesting of them, DjVu Document.

Well, now there is: pdf2djvu is out and works, and it’s been ITP’d to Debian, too.
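If you want to try it from a script, here is a rough Python sketch of a conversion. It assumes pdf2djvu is on your PATH and uses its -o (output file) and -d (resolution in dpi) options; the function names and the default of 300 dpi are my own choices for illustration, not anything prescribed by the tool.

```python
import subprocess


def pdf2djvu_command(pdf_path, djvu_path, dpi=300):
    """Build the argument list for a pdf2djvu run.

    Assumption: pdf2djvu accepts -d for the rendering resolution
    and -o for the output file, with the input PDF given last.
    """
    return ["pdf2djvu", "-d", str(dpi), "-o", djvu_path, pdf_path]


def convert(pdf_path, djvu_path, dpi=300):
    """Run pdf2djvu; raises CalledProcessError if the tool fails."""
    subprocess.run(pdf2djvu_command(pdf_path, djvu_path, dpi), check=True)
```

Something like convert("paper.pdf", "paper.djvu") would then do the job, with both file names being placeholders.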

As a very quick recap, DjVu is a family of raster image codecs that often create files much smaller than PDFs, PNGs, TIFFs, etc. It has a ton of advanced features for things like partial downloads from websites. It’s pretty amazing that a raster format can create smaller files than PDFs, even at output resolutions of 300 or 600 dpi. Of course, for some ultra-high-end press work, PDF would still be needed, but DjVu is compelling for quite a few uses. Since it is a raster format, it is simpler to decode and, unlike PDF, is not subject to local system variations such as installed fonts.

Which brings me to the scourge of PDF. Recently we got a trouble ticket at work from someone saying there was a bug in our Linux environment, because Linux users didn’t see the correct results when they opened his PDF file. A quick inspection with one of the xpdf utilities (pdffonts, to be specific) revealed that the correct fonts were not embedded in the file. The user didn’t believe me, and still wanted to blame Linux, saying that it worked fine on his PC with Acrobat. So I tried opening the file on a Windows 2003 terminal server, and it looked worse there than it did with any Free Linux viewer; it was really quite terribly corrupted. He still wasn’t entirely convinced, until he happened to try printing the file in question, and even Acrobat couldn’t print it right.
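The pdffonts check is easy to script if you have a pile of files to look at. Here is a rough Python sketch, not a polished tool: it runs pdffonts and lists the fonts whose “emb” column says “no”. The function names are mine, and I am assuming the usual table layout (a header line, a line of dashes, then one row per font ending in the emb/sub/uni flags and an object ID), which may differ between xpdf and poppler versions.

```python
import subprocess


def unembedded_fonts(pdffonts_output):
    """Return the names of fonts that pdffonts reports as not embedded.

    Assumed layout: header line, dashes line, then one row per font
    whose last columns are: emb sub uni <object number> <generation>.
    """
    missing = []
    for row in pdffonts_output.splitlines()[2:]:
        fields = row.split()
        # Count columns from the right, so a multi-word font type
        # (e.g. "Type 1") does not shift the flags. Some versions
        # print "[none]" in place of the two object ID numbers.
        tail = 4 if fields and fields[-1] == "[none]" else 5
        if len(fields) <= tail:
            continue  # blank or unrecognised line
        if fields[-tail] == "no":
            # First token only: a font name containing spaces
            # would be truncated here.
            missing.append(fields[0])
    return missing


def fonts_not_embedded_in(pdf_path):
    """Run pdffonts on a file and report its unembedded fonts."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return unembedded_fonts(result.stdout)
```

An empty list from fonts_not_embedded_in("ticket.pdf") (a made-up file name) would mean every font pdffonts found is embedded; anything else is a candidate for the kind of substitution trouble described above.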

PDF was supposed to be a “read anywhere” format that produces exact results. But it hasn’t really lived up to that. Font embedding is one reason; the spec lists a handful of fonts that need not be embedded, but for some reason it is routine to go beyond that list and leave out quite a few more. Then you have to deal with font substitution on the receiving end, which is inexact at best. Add all sorts of complex differences between versions, and it becomes quite the mess. (And don’t even get me started on broken PDF editors, such as the ones Adobe sells…) Somehow, quite a few people seem to have built up the idea in their heads that PDF is both an exact format and an editable format, when really it is neither. (Last week, I was asked to convert a PDF file to a Word document. Argh.)

DjVu keeps looking more and more pleasant to my eyes.