EPUB OCF, mimetype, and kittens
Created 07 Jun 2014 • Written by Alberto Pettarin
The EPUB Open Container Format (OCF) specification has quite a long list of requirements about the ZIP container encapsulating all the resources of an EPUB Publication, but they all boil down to some (arguably useful) agreed conventions to ease the work of processors dealing with EPUB files (e.g., reading systems).
Perhaps the first "strange" requirement novices crash into is that the `mimetype` file inside the EPUB must:
- contain only the ASCII string `application/epub+zip`, resulting in a file of exactly 20 bytes,
- be stored uncompressed, and,
- be the first entry of the ZIP container.
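If you would rather check those three rules on an existing file than take my word for it, here is a minimal sketch using Python's standard `zipfile` module. The function name `mimetype_problems` is made up for this post, and the check relies on `infolist()`, which follows the ZIP central directory: that usually, but not necessarily, matches the physical order of the entries, so treat it as a quick sanity check and not as a replacement for EpubCheck.

```python
import zipfile


def mimetype_problems(epub_path):
    """Return a list of violated mimetype rules (empty if all three hold).

    Hypothetical helper: it inspects the central directory order, which
    usually (not always) matches the physical order of the entries.
    """
    problems = []
    with zipfile.ZipFile(epub_path) as zf:
        entries = zf.infolist()
        # Rule 3: mimetype must be the first entry of the container.
        if not entries or entries[0].filename != "mimetype":
            return ["mimetype is not the first entry"]
        first = entries[0]
        # Rule 2: mimetype must be stored, not deflated.
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype is compressed")
        # Rule 1: exactly the 20 ASCII bytes, no newline, no padding.
        if zf.read(first) != b"application/epub+zip":
            problems.append("mimetype content is not exactly application/epub+zip")
    return problems
```

An empty list means the three rules hold; anything else tells you which one you broke.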
In plain English, if you just zip your assets up and then try to validate the resulting file with EpubCheck, you will get an error.
OK, so how should I compress the EPUB to make the validator happy?
Well, just follow the specification:
- first, just store (= no compression) the `mimetype` file inside a new ZIP file (for convenience, with `.epub` extension);
- then, add the remaining files (which the specs allow you to store in compressed form and in any order).
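If you prefer scripting to typing, the same two steps can be sketched with Python's standard `zipfile` module. This is an illustration under stated assumptions, not a drop-in tool: `build_epub` is a name invented here, the source directory is assumed to contain your unpacked EPUB tree, and the output file is assumed to live outside that directory.

```python
import zipfile
from pathlib import Path

# Exactly 20 ASCII bytes, with no trailing newline.
MIMETYPE = b"application/epub+zip"


def build_epub(src_dir, epub_path):
    """Zip src_dir into epub_path, mimetype first and uncompressed.

    Hypothetical helper: epub_path is assumed to be outside src_dir.
    """
    src = Path(src_dir)
    with zipfile.ZipFile(epub_path, "w") as zf:
        # Step 1: store (no compression) the mimetype entry first.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        # Step 2: add everything else, deflated at the highest level.
        for path in sorted(src.rglob("*")):
            arcname = path.relative_to(src).as_posix()
            if path.is_file() and arcname != "mimetype":
                zf.write(path, arcname,
                         compress_type=zipfile.ZIP_DEFLATED,
                         compresslevel=9)
```

Writing the `mimetype` entry from a constant, instead of copying whatever file happens to be on disk, avoids the classic stray newline added by a text editor.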
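Using the console, that takes two runs of `zip`: the first with `-X0` to store the `mimetype` file, the second with `-rX9` to add everything else.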
Great, mission accomplished, let's have a break watching a kitten video!
Wait a minute.
Do you know what `-X0` and `-rX9` mean?
Or are you one of those bad, bad folks who try stuff randomly googled from the Internet
until they find the "magic one" which solves their problem?
Perusing the `zip` manual (`man zip`) is always very informative,
but here is a summary for you, lazy-ass:
- `-0` means "do not compress, just store the file"
- `-9` means "use the highest compression possible"
- `-r` means "add files and directories recursively"
- `-X` means "do not save extra file attributes inside the ZIP"
I suggest using `-X`
(unless you are an evil person, and you know it...),
because OCF processors are not required to honor extra file information,
and including those extra bits might cause trouble for some processors. KISS.
(`-D` is also interesting, but I will leave it to you, as an exercise,
to find out what it does and why you might want to use it.)
Thank you for the lesson, can we watch kitten videos now?
No!
I promised to explain the reason for this strange convention: thanks to the way a ZIP file is created (and some additional constraints from the EPUB OCF specs that I would not touch with a ten-foot pole), we will have the following magic numbers:
- `0x50 0x4b 0x03 0x04` (at bytes 0-3)
- `mimetype` (at bytes 30-37)
- `application/epub+zip` (at bytes 38-57)
Indeed, if you open an EPUB with a hex editor, you will see these values sitting at exactly those offsets.
In theory, this allows an OCF processor to recognize that a given file might be an EPUB file, by just looking at those magic number bytes.
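Why 30 and 38? The fixed part of a ZIP local file header is 30 bytes long; the file name follows it, then the optional "extra" field, then the file data. So `mimetype` lands at bytes 30-37 because it is the first entry, and its content lands at bytes 38-57 only if it is stored uncompressed and the extra field is empty (which is where `-X` earns its keep). A rough sketch of such a sniffer, where `looks_like_epub` is a name invented for this post and not anything the specs define:

```python
import struct

# Fixed part of a ZIP local file header (30 bytes, little-endian):
# signature, version, flags, method, time, date,
# CRC-32, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4s5H3I2H")


def looks_like_epub(data: bytes) -> bool:
    """Guess whether data starts like an EPUB, using only its first 58 bytes."""
    if len(data) < 58:
        return False
    (signature, _version, _flags, method, _time, _date,
     _crc, _csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack(data[:30])
    return (
        signature == b"PK\x03\x04"                       # bytes 0-3
        and method == 0                                  # stored, not deflated
        and name_len == 8 and extra_len == 0             # so data starts at 38
        and data[30:38] == b"mimetype"                   # bytes 30-37
        and data[38:58] == b"application/epub+zip"       # bytes 38-57
    )
```

In practice you would feed it the first 58 bytes of the file, e.g. `looks_like_epub(open(path, "rb").read(58))`. A `True` only means "might be an EPUB", nothing more.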
(Note: the usefulness and modernity of such a convention might be the subject of a heated debate. I personally think that relying on these "old days tricks" is dangerous, but it is also true that complying is quite easy and cheap, once you know the "trick".)
Finally, `application/epub+zip` is a Media Type registered with IANA.
You said: "Finally", so are we done now?
Well, the constraints on the `mimetype` and the OCF zipping are just some
of the many rules governing the EPUB OCF.
If you want to delve into the details,
or you need specific features (e.g., asset obfuscation),
go read the specs.
If you want a TL;DR version, follow this recipe:
- download this `mimetype` file and always use it
- download this `container.xml` file, put it into your `META-INF` directory, and always use it
- the previous point implies that your OPF file should be `OEBPS/content.opf`
- put your eBook assets inside the `OEBPS` directory
- stick to ASCII names for your asset files, using only [0-9A-Za-z.] to avoid characters that might need escaping and/or might not be supported by (old) EPUB processors (e.g., space, slash, non-ASCII, etc.)
- decide a naming convention for your asset files (e.g., `OEBPS/Text`, `p001.xhtml`, etc.)
- compress the ZIP container properly, as explained above
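For the impatient, the first half of the recipe can be sketched as a tiny scaffold script. Mind the assumptions: the `container.xml` below is a typical minimal one pointing at `OEBPS/content.opf`, written from memory of the OCF format and not a copy of the file linked above; `scaffold` and `is_safe_asset_name` are names made up here; and the script does not write the OPF file or any content for you.

```python
import re
from pathlib import Path

# Assumed minimal container.xml: it declares OEBPS/content.opf as the root file.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Only digits, ASCII letters, and the dot, as suggested in the recipe.
SAFE_NAME = re.compile(r"[0-9A-Za-z.]+")


def is_safe_asset_name(name):
    """True if name (a single file name, not a path) uses only [0-9A-Za-z.]."""
    return SAFE_NAME.fullmatch(name) is not None


def scaffold(root):
    """Create the bare directory layout of the recipe under root."""
    root = Path(root)
    (root / "META-INF").mkdir(parents=True, exist_ok=True)
    (root / "OEBPS" / "Text").mkdir(parents=True, exist_ok=True)
    # write_bytes avoids any newline or encoding surprises: exactly 20 bytes.
    (root / "mimetype").write_bytes(b"application/epub+zip")
    (root / "META-INF" / "container.xml").write_text(CONTAINER_XML,
                                                     encoding="utf-8")
```

With this, `is_safe_asset_name("p001.xhtml")` passes while a name like `"chapter one.xhtml"` does not, which is exactly the kind of file you want to rename before zipping.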
If you made it all the way down here, you earned it: go watch some cute kitten videos.