DjVu and the scourge of the PDF

A little while back, I wrote a blog post called “DjVu: Almost Awesome”, where I pointed out the strengths of the three formats in the DjVu family, but lamented the fact that there was no Free Software to create DjVu files in the most interesting format, DjVu Document.

Well, now there is: pdf2djvu is out and works, and it’s been ITP’d to Debian, too.

As a very quick recap, DjVu is a family of raster image codecs that often creates files much smaller than PDFs, PNGs, TIFFs, etc. It has a ton of advanced features for things like partial downloads from websites. It’s pretty amazing that a raster format can create smaller files than PDFs, even at 300 or 600 dpi resolutions in the output. Of course, for some ultra-high-end press work, PDF would still be needed, but DjVu is compelling for quite a few uses. Since it is a raster format, it is simpler to decode and, unlike PDF, is not subject to local system variations such as installed fonts.

Which brings me to the scourge of PDF. Recently we got a trouble ticket at work from someone saying there was a bug with our Linux environment because Linux users didn’t see the correct results when they opened his PDF file. A quick inspection with some of the xpdf utilities (pdffonts, to be specific) revealed that the correct fonts were not embedded in the file. The user didn’t believe me, and still wanted to blame Linux, saying that it worked fine on his PC with Acrobat. So I tried opening the file on a Windows 2003 terminal server, and it looked worse there than it did with any Free Linux viewer — really quite terribly corrupted. He still wasn’t entirely convinced, until he happened to try printing the file in question, and even Acrobat couldn’t print it right.
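
For anyone who wants to run the same kind of check, here is a minimal sketch in Python. It assumes the table layout that pdffonts prints (a header row, a row of dashes, then one row per font, with an "emb" column holding yes or no). The function names and the sample table are made up for illustration, and very long font names may push the columns out of alignment in real output.

```python
import re
import subprocess


def parse_pdffonts(output):
    """Parse the table printed by `pdffonts` into one dict per font."""
    lines = output.splitlines()
    # The dashed rule under the header marks where each column starts and ends.
    rule_idx = next(
        (i for i, line in enumerate(lines) if line.startswith("---")), None
    )
    if rule_idx is None or rule_idx == 0:
        return []
    header, rule = lines[rule_idx - 1], lines[rule_idx]
    spans = [m.span() for m in re.finditer(r"-+", rule)]
    names = [header[a:b].strip() for a, b in spans]
    fonts = []
    for line in lines[rule_idx + 1:]:
        if line.strip():
            fonts.append({n: line[a:b].strip() for n, (a, b) in zip(names, spans)})
    return fonts


def unembedded_fonts(output):
    """Return the names of fonts whose 'emb' column says 'no'."""
    return [f["name"] for f in parse_pdffonts(output) if f.get("emb") == "no"]


def run_pdffonts(pdf_path):
    """Run the real tool; requires pdffonts (xpdf or poppler-utils) on PATH."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical sample output, built with fixed column widths for illustration.
_ROW = "{:<36} {:<17} {:<16} {:<3} {:<3} {:<3} {:>9}"
SAMPLE = "\n".join([
    _ROW.format("name", "type", "encoding", "emb", "sub", "uni", "object ID"),
    _ROW.format("-" * 36, "-" * 17, "-" * 16, "---", "---", "---", "-" * 9),
    _ROW.format("ABCDEF+Times-Roman", "Type 1", "Custom", "yes", "yes", "no", "8  0"),
    _ROW.format("Helvetica", "Type 1", "Standard", "no", "no", "no", "12  0"),
])

if __name__ == "__main__":
    print(unembedded_fonts(SAMPLE))
```

On a real file, something like `unembedded_fonts(run_pdffonts("ticket.pdf"))` (the filename is a placeholder) would list the fonts a viewer has to substitute.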

PDF was supposed to be a “read anywhere” format that produces exact results. But it hasn’t really lived up to that. Font embedding is one reason; the spec lists a handful of fonts that are allowed to not be embedded, but it is routine for some reason to violate that and fail to embed quite a few more. Then you have to deal with font substitution on the receiving end, which is inexact at best. Then you have all sorts of complex differences between versions, and it becomes quite the mess. (And don’t even get me started on broken PDF editors, such as the ones Adobe sells…) Somehow, quite a few people seem to have this idea built up in their heads that PDF is both an exact format, and an editable format, when really it is neither. (Last week, I was asked to convert a PDF file to a Word document. Argh.)

DjVu keeps looking more and more pleasant to my eyes.

12 thoughts on “DjVu and the scourge of the PDF”

  1. The DjVu compression is based on the same ideas as JBIG2 – which PDFs can include. Indeed, Google Books PDFs use exactly this system[1].

    Now, PDF certainly has issues – the file format is pretty bad from a parsing point of view. But it can compress as well as DjVu.

    (disclaimers: I work with the guy who did DjVu and the guy who edited the JBIG2 spec. Also I implemented jbig2enc and did the work on Google Books PDFs)

    [1] http://www.imperialviolet.org/binary/google-books-pdf.pdf

    John Goerzen Reply:

    Hi Adam,

    This is interesting. From some quick reading, it appears that JBIG2 is roughly the same as DjVuBitonal. That is, for 1-bit scans, they’re competing algorithms. As far as I can tell — and I’m no expert here — PDF has no equivalent of DjVuDocument, the really slick algorithm that encodes foreground and background separately.

    Adam Langley Reply:

    DjVuDocument (as I understand it, and that could be off) is a mixed raster format: it finds areas which wouldn’t encode well as bilevel and uses a different format for them. (I also believe that the mask for these areas is encoded as a bilevel image, but that’s somewhat beside the point).

    The same can be achieved in PDF, indeed we do the same in Google Books, by finding such areas and saving them as JPEG2000 images. Thus the page is a list of drawing instructions; one to draw the JBIG2 background and then a list of JPEG2000 images to place on top.

    Again, I find myself defending PDF, and I don’t even like PDF ;) But, sadly, I don’t believe there’s a compelling technical reason for DjVu which outweighs PDF’s numerical superiority.

    John Goerzen Reply:

    Is there any freely-available “pdf2pdf” encoder that will do that sort of encoding, though? I don’t know of one off-hand.

  2. Hopefully pdf2djvu will land in Debian soon. The first package by Steve had some glitches, so I sent him my comments instead of uploading it.
    I’m waiting for a fixed package before proceeding.

  3. Does it allow you to convert from txt, doc, rtf, odf to djvu? If not, what is the point?

    Saving a few MB, if that much, on a file is not worth the trouble.

    Now, if we could create a whole workflow based on djvu… Now that would be something to cheer about.

    Convert from OpenOffice.org, Firefox or any other app directly to djvu, preserving bookmarks, etc…

    There’s no appeal, at least to me, converting from one “final format” to another. If the pdfs are only text, convert them to txt files.

    As I said, to make this format useful we have to recreate the whole workflow: save, edit, view, convert from and to anywhere. That’s the only way it can be useful for the general public, besides some Russians…

    If not, pdfs are good enough…

  4. DjVu’s main strength is also its main limitation: it is basically a raster image. Because of that, it is inferior to PDF. With PDF, you can embed text and blow it up to arbitrarily large sizes. You can also extract the textual data from the PDF fairly trivially, which is essential for accessibility. DjVu doesn’t have those capabilities. It also doesn’t have trivial conversion to and from PostScript, the Unix printing standard.

    Also, every typesetting program that I know of (groff, TeX, and Apache FOP) outputs PostScript or PDF. None produce DjVu. DjVu is not likely to gain in that market, where precision is essential.

    The fact that people produce broken PDFs is not unusual, nor unexpected. Even large corporations produce broken HTML (BMW comes to mind). Yet you would not suggest replacing HTML.

    DjVu is fine if what you want is basically a small raster image. But if you care about the textual content of the document, then it’s inferior to PDF.

    John Goerzen Reply:

    There are a number of misconceptions here.

    PDF is neither a raster nor a vector format entirely; it can be either. Just because DjVu is only raster doesn’t mean it’s inferior. As I pointed out, it means that it has the ability to actually always faithfully reproduce a document — which PDF doesn’t!

    The easy PostScript to DjVu is ps2pdf14 then pdf2djvu. A 1-line shell script.
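
    The two-step conversion just mentioned could look roughly like this in Python. This is a sketch, not a tested pipeline: it assumes ps2pdf14 (from Ghostscript) and pdf2djvu are on the PATH and that pdf2djvu takes its output file via -o. The function names and the choice to write the intermediate PDF next to the input are illustrative assumptions.

```python
import subprocess
from pathlib import Path


def ps_to_djvu_commands(ps_path):
    """Build the two commands: PostScript -> PDF, then PDF -> DjVu."""
    src = Path(ps_path)
    pdf = src.with_suffix(".pdf")    # intermediate file, kept beside the input
    djvu = src.with_suffix(".djvu")  # final output
    return [
        ["ps2pdf14", str(src), str(pdf)],
        ["pdf2djvu", "-o", str(djvu), str(pdf)],
    ]


def ps_to_djvu(ps_path):
    """Run both steps in order, stopping at the first failure."""
    for cmd in ps_to_djvu_commands(ps_path):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the commands; call ps_to_djvu() to actually run them.
    for cmd in ps_to_djvu_commands("report.ps"):
        print(" ".join(cmd))
```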

    And I guess you’re not aware of the djvups command? It can even let you specify PS level 1, 2, or 3; color or gray mode; color matching; booklet mode; etc.

    Also, DjVu can embed text to exactly the same precision and extent that PDF can. pdf2djvu will convert that information for you, in fact.

    I didn’t suggest it would be used for typesetting output to printing presses. But it makes an excellent Internet distribution and archival format.

    Samuel Bronson Reply:

    PDF has the ability to faithfully reproduce documents — the main thing is to embed ALL fonts, even those standard Adobe ones, since Adobe themselves have stopped shipping them with their PDF reader. And since it is a vector format, it will do this to whatever resolution your printer can handle (or fake), unlike DjVu which will only go up to whatever resolution the document is rasterized to.

    That said, I think DjVu is excellent for on-screen use, especially for people with 56k modems. I think it would be great if CiteSeer would bring back this option.

    John Goerzen Reply:

    I guess what I would say is that PDF *can* faithfully reproduce documents, but doesn’t guarantee that it always will, depending on how it was created.

    Lee Reply:

    If it’s inaccessible, where the other is not, then it’s inferior, by definition. More than definitions, though: accessibility is a HUGE issue, not something to be ignored in your response. No inaccessible format should be used for documents which can be accessible. It’s discrimination, pure and simple.
