Script usage
The script must be invoked with at least the following three parameters:
- -p prefix_list : prefix_list is a list of comma-separated names of your input dictionaries, without extension
- -f xx : xx is the ISO 639-1 code of the language "from" of the dictionary
- -t yy : yy is the ISO 639-1 code of the language "to" of the dictionary
The following optional parameters are available:
- -h : print usage message and exit
- -d : enable debug mode and do not delete temporary files
- -i : ignore word case while building the dictionary index
- -z : create the .install zip file containing the dictionary and the index
- --sd : input dictionary in StarDict format (default)
- --odyssey : input dictionary in Bookeen Cybook Odyssey format
- --xml : input dictionary in XML format
- --kobo : input dictionary in Kobo format (reads the index only!)
- --csv : input dictionary in CSV format
- --output-odyssey : output dictionary in Bookeen Cybook Odyssey format (default)
- --output-sd : output dictionary in StarDict format
- --output-xml : output dictionary in XML format
- --output-kobo : output dictionary in Kobo format
- --output-csv : output dictionary in CSV format
- --output-epub : output EPUB file containing the index of the input dictionary
- --title string : set the title string shown on the Odyssey screen to string
- --license string : set the license string to string
- --copyright string : set the copyright string to string
- --description string : set the description string to string
- --year string : set the year string to string
- --parser parser.py : use parser.py to parse the input dictionary
- --collation coll.py : use coll.py as collation function when outputting in Bookeen Cybook Odyssey format
- --fs string : use string as CSV field separator, escaping ASCII sequences (default: \t)
- --ls string : use string as CSV line separator, escaping ASCII sequences (default: \n)
The order of the parameters is irrelevant.
Examples (use penelope3.py instead of penelope.py if you have Python 3.x installed):
-
$ python penelope.py -h
Print usage message and exit
-
$ python penelope.py -p foo -f en -t en
Create English monolingual dictionary en.foo.dict and en.foo.dict.idx from StarDict files foo.*
-
$ python penelope.py -p bar -f en -t it
Create English-to-Italian dictionary en-it.dict and en-it.dict.idx from StarDict files bar.*
-
$ python penelope.py -p "bar,foo,zam" -f en -t it
Create English-to-Italian dictionary en-it.dict and en-it.dict.idx merging together StarDict dictionaries bar, foo, and zam.
-
$ python penelope.py --xml -p foo -f en -t en
Create English monolingual dictionary en.foo.dict and en.foo.dict.idx, but the input dictionary foo.xml is in XML format
-
$ python penelope.py --xml -p foo -f en -t en --output-sd
As above, but output in StarDict format instead of Bookeen Cybook Odyssey format
-
$ python penelope.py -p bar -f en -t it --output-kobo
Create English-to-Italian dictionary from StarDict files bar.*, but output in Kobo format, creating dicthtml-en-it.zip
-
$ python penelope.py -p bar -f en -t it --output-xml -i
Reads from StarDict format and outputs in XML format, creating bar.xml, lowercasing all the keywords
-
$ python penelope.py --kobo -p bar -f it -t it --output-epub
Reads from Kobo format and outputs the dictionary index as an EPUB file, creating bar.epub
-
$ python penelope.py --odyssey -p bar -f en -t en --output-epub
As above, but input is in Bookeen Cybook Odyssey format
-
$ python penelope.py -p bar -f en -t it --title "My EN-IT dictionary" --year 2012 --license "CC-BY-NC-SA 3.0"
Create English-to-Italian dictionary but also set title, year and license metadata
-
$ python penelope.py -p foo -f en -t en --parser foo_parser.py --title "Custom EN dictionary"
Create English monolingual dictionary, setting its title and using foo_parser.py to parse the input dictionary definitions
|
Dictionary management on the Odyssey
- Dictionaries are located in the Dictionaries/ directory
in the root directory of the Odyssey.
- Each dictionary has two files: an index ($NAME.dict.idx)
and a definition file ($NAME.dict),
where $NAME is the dictionary name.
- For a monolingual dictionary, $NAME must be $XX.$STRING,
where $XX is the ISO 639-1 code of the language,
and $STRING is an arbitrary label.
Example: en.foo.dict and en.foo.dict.idx.
- For a bilingual dictionary, $NAME must be $XX-$YY,
where $XX (resp., $YY) is the ISO 639-1 code of the language
from (resp., to) of the dictionary.
Example 1: en-it.dict and
en-it.dict.idx is the English-to-Italian dictionary.
Example 2: it-fr.dict and
it-fr.dict.idx is the Italian-to-French dictionary.
- Right now, the selection of a dictionary is done in the following way.
If the book you are reading has no language metadatum, then the default French dictionary is used.
(This dictionary is stored in the system partition of the Odyssey, which is not accessible by the user.)
Otherwise, let $XX be the language of the book,
and let $YY be the language of the Odyssey's interface.
If the bilingual dictionary $XX-$YY.dict exists, then it is used.
Otherwise, if a monolingual dictionary $XX.*.dict exists, then it is used.
Otherwise, the default French dictionary is used.
- Right now, the user cannot directly select the dictionary to be used.
I hope Bookeen will implement this feature in a future firmware.
Apparently, you can have only one bilingual dictionary for each language pair ($XX-$YY)
while it is not clear to me which dictionary is used if you have two monolingual dictionaries
$XX.1.dict and $XX.2.dict
for the same language $XX.
-
Apparently, when you select a word, the index is queried for a stemmed version of the word.
The rules applied might vary depending on the language.
Unfortunately, I have not been able to look at this issue extensively, but I noticed
that, for example, plurals are recognized in English and French, but not in Italian.
However, you can bypass this issue by inserting declined/conjugated forms in the index,
making them point to the definition of the base form.
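The selection rules listed above can be summarized with a minimal sketch. This is only an illustration of the observed behaviour, not code from the device or from the script: the function name select_dictionary and its arguments are hypothetical, and picking the first monolingual match is an assumption, since it is not known which one the Odyssey prefers.

```python
import fnmatch


def select_dictionary(available, book_lang, ui_lang):
    """Return the dictionary file name that would be used, or None when the
    built-in default French dictionary applies.

    available -- file names found in the Dictionaries/ directory
    book_lang -- ISO 639-1 code from the book metadata ("" if missing)
    ui_lang   -- ISO 639-1 code of the Odyssey's interface language
    """
    if not book_lang:
        # No language metadatum: default French dictionary
        return None
    bilingual = "%s-%s.dict" % (book_lang, ui_lang)
    if bilingual in available:
        return bilingual
    monolingual = sorted(fnmatch.filter(available, "%s.*.dict" % book_lang))
    if monolingual:
        # ASSUMPTION: with several candidates, the first one is taken
        return monolingual[0]
    return None


files = ["en-it.dict", "en-it.dict.idx", "en.foo.dict", "en.foo.dict.idx"]
print(select_dictionary(files, "en", "it"))
print(select_dictionary(files, "en", "fr"))
```

With the file list above, the first call finds the bilingual dictionary, while the second falls back to the monolingual one because en-fr.dict does not exist.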
|
Format of the definition file
- The dictionary file (say, en.foo.dict) is simply a zip file of plain text files,
c_1, c_2, ..., c_n.
- Each chunk file c_i contains utf-8 encoded definitions of words,
concatenated one after another. Two consecutive definitions do not need to be separated by
newlines or other special separator, since the index specifies the boundaries of
each definition as an offset and a length, in bytes, from the beginning of the chunk (see below).
- Apparently, each definition is an HTML fragment. Hence, you can use HTML tags to specify
bold or italic face, divs, etc. I have not yet performed an exhaustive search for the supported tags.
- Each chunk file has (uncompressed) size between 2^18 = 262,144 bytes and 2^19 = 524,288 bytes.
This is probably due to the memory management of the device, and it is consistent with
the EPUB requirement of having single files of at most 300 KB.
In fact, my script closes the current chunk (and opens a new one)
whenever its size reaches 2^18 bytes.
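As a rough illustration of the layout described above, the following sketch packs definitions into chunks, zips them, and reads one definition back using its offset and size. It is not the actual code of the script: the helper names (build_chunks, write_dict, read_definition) are hypothetical, and starting the chunk numbering at 1 simply follows the c_1, ..., c_n naming used in this section.

```python
import io
import zipfile

CHUNK_LIMIT = 2 ** 18  # close the current chunk once it reaches this size


def build_chunks(entries, limit=CHUNK_LIMIT):
    """Pack (word, definition) pairs into chunks.

    Returns (chunks, index), where index holds one
    (word, offset, size, chunk_number) tuple per entry, in bytes.
    """
    chunks, index = [], []
    current = bytearray()
    for word, definition in entries:
        data = definition.encode("utf-8")
        index.append((word, len(current), len(data), len(chunks) + 1))
        current.extend(data)
        if len(current) >= limit:
            chunks.append(bytes(current))
            current = bytearray()
    if current:
        chunks.append(bytes(current))
    return chunks, index


def write_dict(chunks):
    """Return the bytes of a zip archive holding chunks c_1, c_2, ..."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for number, chunk in enumerate(chunks, start=1):
            zf.writestr("c_%d" % number, chunk)
    return buf.getvalue()


def read_definition(dict_bytes, offset, size, chunk_number):
    """Extract one definition from the zipped definition file."""
    with zipfile.ZipFile(io.BytesIO(dict_bytes)) as zf:
        chunk = zf.read("c_%d" % chunk_number)
    return chunk[offset:offset + size].decode("utf-8")


entries = [("a", "<b>a</b> first"), ("b", "caff\u00e8"), ("c", "third")]
chunks, index = build_chunks(entries, limit=16)  # tiny limit, for the demo only
archive = write_dict(chunks)
for word, offset, size, chunk_number in index:
    print(word, read_definition(archive, offset, size, chunk_number))
```

Note that offsets and sizes are counted on the UTF-8 encoded bytes, not on characters, which matters as soon as a definition contains non-ASCII text.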
|
Format of the index file
- The index file (say, en.foo.dict.idx)
is an sqlite3 database, with four tables
(T_DictVersion,
T_DictInfo,
T_DictIndex,
T_RefKey) and
an index (F_WordIndex) based
on a collation (IcuNoCase).
- Table T_DictVersion contains two fields:
F_Version (INTEGER) and
F_DictType (TEXT).
There is only one record, and it seems to me that
the latter is used for documentation reasons only.
- Table T_DictInfo contains the metadata associated with the dictionary.
It has only one record, with the following TEXT fields:
- F_Title
- F_Description
- F_Licence
- F_Copyright
- F_Year
- F_LanguageFrom
- F_LanguageTo
- F_Alphabet
- F_xhtmlHeader
Field names are quite self-explanatory.
Let me observe that F_Title represents the string
shown on the Odyssey as the dictionary's heading.
I do not know what F_Alphabet means
(it always has value "Z"), perhaps it represents the encoding
used in the dictionary definitions.
- Table T_DictIndex contains the dictionary lookup table.
It has one record for each word, with the following fields:
- F_Key (INTEGER)
- F_Word (TEXT)
- F_Offset (INTEGER)
- F_Size (INTEGER)
- F_ChunckNum (INTEGER)
For example (0, foo, 350, 45, 7) means that
the definition of word foo
starts at byte 350 of file c_7
and it has length 45 bytes.
- Table T_RefKey contains two fields:
F_Key (INTEGER) and
F_RefKey (INTEGER).
In all the index files from Bookeen that I have seen, this table is empty,
and its meaning is unknown to me.
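To make the lookup table more concrete, here is a minimal sketch that creates T_DictIndex in an in-memory sqlite3 database and looks a word up. Several things are assumptions made only so that the example runs: the column constraints are not documented here, the helper names create_index and lookup are hypothetical, and the IcuNoCase collation is replaced by a plain lowercase comparison, which is certainly simpler than whatever the device uses.

```python
import sqlite3


def _nocase(a, b):
    # ASSUMPTION: simple stand-in for the ICU-based IcuNoCase collation
    a, b = a.lower(), b.lower()
    return (a > b) - (a < b)


def create_index(path=":memory:"):
    """Create a database containing only the lookup table and its index."""
    conn = sqlite3.connect(path)
    conn.create_collation("IcuNoCase", _nocase)
    conn.executescript("""
        CREATE TABLE T_DictIndex (
            F_Key INTEGER,
            F_Word TEXT,
            F_Offset INTEGER,
            F_Size INTEGER,
            F_ChunckNum INTEGER
        );
        CREATE INDEX F_WordIndex ON T_DictIndex (F_Word COLLATE IcuNoCase);
    """)
    return conn


def lookup(conn, word):
    """Return (offset, size, chunk_number) for word, or None if missing."""
    return conn.execute(
        "SELECT F_Offset, F_Size, F_ChunckNum FROM T_DictIndex "
        "WHERE F_Word = ? COLLATE IcuNoCase",
        (word,),
    ).fetchone()


conn = create_index()
conn.execute("INSERT INTO T_DictIndex VALUES (0, 'foo', 350, 45, 7)")
print(lookup(conn, "foo"))
```

The same create_collation call is likely to be needed before querying a real .dict.idx file with the standard sqlite3 module, since the IcuNoCase collation is not built into SQLite.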
|
Converting a StarDict dictionary into the Odyssey format
|
Converting a StarDict dictionary into the Kobo format
|
Converting an XML dictionary into the Odyssey format
|
Converting an XML dictionary into the Kobo format
|
Converting a CSV dictionary into the Odyssey format
|
Creating the index of a dictionary as EPUB file
|
Custom parser for the input dictionary
- By default, the script will just convert the given StarDict dictionary to the Cybook Odyssey format.
In other words, it will create the same index of words as it appears in the input dictionary,
and it will simply copy the associated definitions with their original formatting.
- However, you might want to aggregate different definitions for the same word
into a single index entry, even if in the original dictionary they appeared as separate entries.
(Example: "Word (1)" and "Word (2)", etc.)
Moreover, you might want to perform some changes in the formatting of the definitions.
Clearly this operation is input-dependent, as different StarDict dictionaries have different formatting.
- To do so, you can issue the optional argument --parser parser.py
to instruct my script to process the input dictionary with the parser defined
in file parser.py.
- Your parser will contain a function
parse(data, type_sequence, ignore_case)
that will take the input dictionary data
(as a list of pairs [word, definition]),
the type_sequence of the input dictionary
and the ignore_case switch.
- The output of your parse function is a list of tuples with the following format:
[ word, include, synonyms, substitutions, definition ]
where:
- word is the index key (STRING).
- include is a BOOLEAN telling the script
if the current record should be included in the index.
- synonyms is a LIST of STRINGs that will be added
to the index and will point to the current definition.
It will be used only if include is True.
This is useful if you can extract declined/conjugated forms from the input definition
and you want them to point to the base form word.
- substitutions is a LIST of pairs [replace_what, replace_with].
Each "replace_what" will be added to the index and will point to "replace_with",
if the latter exists in the dictionary.
It will be used only if include is False.
This is useful if you can infer that the current word is a declined/conjugated form
and you want to directly refer to its base form instead of showing a rather uninformative
definition like "cats is the plural of cat".
- definition is the STRING containing the text
of the definition for the current word.
- Please see the included webster_parser.py
parser for the Webster 1913 StarDict dictionary
(you can find it as StarDict-comn_sdict_axm05_webster_1913-2.4.2.tar.bz2 on the Web)
to get an idea of how the parser is supposed to work.
Reading the source code of webster_parser.py will help as well.
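For a much smaller starting point than webster_parser.py, the following sketch shows a parser that merges numbered entries such as "Word (1)" and "Word (2)" into a single index entry. It only follows the parse signature and the output format described above; the regular expression, the "<br/>" separator between merged definitions, and the assumption that words and definitions arrive as text strings are illustrative choices, not taken from the script.

```python
import re
from collections import OrderedDict

# Matches keys like "Word (1)" and captures the part before the number
_NUMBERED = re.compile(r"^(.*?)\s*\(\d+\)$")


def parse(data, type_sequence, ignore_case):
    """Merge "Word (1)", "Word (2)", ... into a single entry per word.

    data is a list of [word, definition] pairs; type_sequence is accepted
    for compatibility with the expected signature but not used here.
    """
    merged = OrderedDict()
    for word, definition in data:
        match = _NUMBERED.match(word)
        key = match.group(1) if match else word
        if ignore_case:
            key = key.lower()
        merged.setdefault(key, []).append(definition)

    parsed = []
    for key, definitions in merged.items():
        # include=True, no synonyms, no substitutions
        parsed.append([key, True, [], [], "<br/>".join(definitions)])
    return parsed


sample = [["Word (1)", "first"], ["Word (2)", "second"], ["cat", "feline"]]
for record in parse(sample, "h", False):
    print(record)
```

A parser that also fills synonyms or substitutions would follow the same structure, adding the extra forms to the third or fourth field of each record.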
|
Custom collation function when outputting to Bookeen Cybook Odyssey format
|
Notes and Comments
- I tried to comment every key point of my script and it should be easy to follow.
I took this as a "practical exercise" to learn Python, so please forgive me
if you find my code "naive", and drop me an email
with your advice to improve it, thanks!
-
I chose to use as few external modules as possible to minimize potential
porting problems across different platforms. Please let me know if you experience
problems running my script, by sending an email with
a brief description of your environment (OS, Python version, ...)
and I will try to help you.
- If you find bugs in my script or errors in this page
or if you want to contribute some code,
please let me know by sending an email, thank you!
- If you wish to convert a MOBI/PRC dictionary, use MobiUnpack to unpack it,
parse the resulting HTML file and output to the XML format described above.
Then, you can use my script to create your own Odyssey dictionary.
|
Links
|