A Practical Introduction To The aeneas Package
Created 21 May 2015 • Written by Alberto Pettarin
This post is a practical introduction to the aeneas package,
with concrete examples of how to use it
to compute audio/text sync maps.
The Problem aeneas Solves
aeneas is a Python library and a set of tools
to automagically synchronize audio and text.
In other words, the main function of this software is to automate the computation of a synchronization map file ("sync map" for short) between an audio file and a list of text fragments. Sync maps have a variety of uses, including reflowable EPUB 3 Audio-eBooks or FXL Read-aloud EPUB 3 ebooks (SMIL files) and closed captioning videos (SRT/WebVTT/TTML files).
In abstract terms, a sync map associates each text fragment with the time interval, in the audio file, when that text fragment is spoken:
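As a rough picture of this association (not the internal representation used by aeneas), a sync map can be thought of as a list of records, each holding a fragment and its time interval. The record type, the helper function and the timings below are made up for illustration only.

```python
from collections import namedtuple

# One entry of a sync map: a text fragment and the audio interval
# (in seconds) during which it is spoken. Hypothetical record type.
SyncMapFragment = namedtuple(
    "SyncMapFragment", ["identifier", "text", "begin", "end"]
)

# Illustrative values only: these timings are invented for the sketch.
sync_map = [
    SyncMapFragment("f000001", "1", 0.0, 2.5),
    SyncMapFragment("f000002", "From fairest creatures we desire increase,", 2.5, 5.5),
    SyncMapFragment("f000003", "That thereby beauty's rose might never die,", 5.5, 8.5),
]


def fragment_at(sync_map, time):
    """Return the fragment being spoken at the given time, or None."""
    for fragment in sync_map:
        if fragment.begin <= time < fragment.end:
            return fragment
    return None


print(fragment_at(sync_map, 3.0).identifier)
```

A lookup like `fragment_at` is what a reading system needs in order to highlight the text while the audio plays.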
The major advantage of aeneas is to eliminate
the need for human labor to produce the timings
(which usually involves painfully long "listen-and-mark" sessions),
while still producing a "correct" output,
that is, sync maps indistinguishable from those that a human operator
would produce manually.
The repo on GitHub includes the library source code, some pre-built programs to compute the sync maps (which cover the most frequent use cases), unit tests, and the documentation.
Installing aeneas
Assuming you have Python 2.7.x and Git on your machine,
installing aeneas is easy:
Note that you might need to install ffmpeg and espeak first,
if you do not have them installed already.
If you are running an (old) stable version of Debian,
you might get an error when installing
the scikits.audiolab Python package.
In that case, please see this thread.
(I will see if I can remove the dependency from this library,
by switching to a less-problematic-to-get one.)
Right now the only supported OS is Linux (Debian),
but I have aeneas configured and running on my Mac Mini (OS X) and
it was confirmed to be working on a Windows 8 machine too.
Please see the online documentation for more information.
Computing A Sync Map With execute_task
In the aeneas jargon, a Task represents the atomic unit of work,
that is, an audio file and a list of text fragments
to be synchronized, and for which you want to obtain a sync map file,
in the format (SMIL, SRT, TXT, etc.) you need.
To generate the sync map file, you can use the execute_task
script included in the package:
The script takes the following parameters:
- the path to the audio file (audio.mp3)
- the path to the file containing the text fragments (text.txt)
- the configuration string (config_string)
- the path to the sync map file to be created (map.smil)
Let's examine each argument.
The Audio File
The audio file contains the narration of the text to be synchronized.
Any format readable by ffmpeg can be used, including the popular
MP3, MP4, AAC, OGG, WAV, FLAC, WebM.
(Make sure you have the relevant codecs installed.)
The Text File
The text file contains the text fragments to be synchronized. Currently, three formats are supported:
plain
parsed
unparsed
In all three cases, the file must be encoded using UTF-8 (without BOM).
plain Format
The first format, plain, simply lists the fragments, one per line.
For example, if text.txt contains the following 15 lines:
execute_task will align 15 fragments, one for the title (1)
and 14 others, one for each verse.
If text.txt contains the following 107 lines:
execute_task will align 107 fragments,
at word-level granularity.
If you specify the text fragments using the plain text file format,
aeneas will automatically assign to each fragment,
in the same order they appear in the input text file,
the following ids: f000001, f000002, f000003, etc.
This is done because for certain sync map formats, like SMIL,
you need a (unique) id for each text fragment.
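The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration of the naming pattern, not the code aeneas runs; the function name assign_ids is hypothetical.

```python
def assign_ids(lines):
    """Pair each line of a plain text file with a progressive id.

    The ids follow the pattern f000001, f000002, ... in input order.
    """
    return [
        ("f%06d" % index, line.rstrip("\n"))
        for index, line in enumerate(lines, start=1)
    ]


fragments = assign_ids([
    "1\n",
    "From fairest creatures we desire increase,\n",
    "That thereby beauty's rose might never die,\n",
])
for identifier, text in fragments:
    print(identifier, text)
```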
parsed Format
The second format, parsed, is similar, but
it allows the user to explicitly provide the id of each text fragment.
To do so, each line still corresponds to a text fragment but now
it must contain the id, the | (pipe) character as the separator,
and the text of the fragment.
For example, the following text.txt:
is equivalent to the first plain example above.
Clearly, a best practice is to generate the ids
as valid XML ids (i.e., as shown above,
one letter followed by a fixed number of digits,
forming progressive, consecutive numbers).
However, nothing prevents you from providing something like:
(whatever logic is behind the choice of the ids!)
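To make the parsed line layout concrete, here is a minimal reader for it. It is a sketch under two assumptions of mine: the line is split at the first pipe only, and duplicate ids are rejected. The function name read_parsed_fragments is hypothetical and not part of aeneas.

```python
def read_parsed_fragments(lines):
    """Split each 'id|text' line into an (id, text) pair.

    Assumptions of this sketch: only the first pipe is a separator,
    and ids must be unique within the file.
    """
    fragments = []
    seen = set()
    for line in lines:
        identifier, text = line.rstrip("\n").split("|", 1)
        if identifier in seen:
            raise ValueError("duplicate id: %s" % identifier)
        seen.add(identifier)
        fragments.append((identifier, text))
    return fragments


parsed = read_parsed_fragments([
    "f001|1\n",
    "f002|From fairest creatures we desire increase,\n",
])
print(parsed)
```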
unparsed Format
If you are working with EPUB 3 eBooks with Media Overlays,
you have probably already produced the (X)HTML file,
where each text fragment to be highlighted has its id attribute.
If this is the case, the unparsed text file format allows
aeneas to extract the text fragments by directly parsing the XML DOM.
Suppose you have the following text.xhtml file:
Clearly, you must instruct aeneas to identify
the elements that contain the text to be actually used for the synchronization.
In the above example, you want to extract the text from elements
with an id attribute matching the following
regular expression: f[0-9][0-9][0-9] (an f followed by three digits).
To do so, you will specify is_text_unparsed_id_regex=f[0-9][0-9][0-9]
in the configuration string (see below).
If not ambiguous (know your source!),
you can also use the quantifiers + and *.
In the above example, you can use f[0-9]+ (an f followed by one or more digits)
instead of f[0-9][0-9][0-9].
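The following Python 3 sketch shows the idea of selecting elements by matching their id attribute against a regular expression. The XHTML snippet and the function name extract_fragments are made up for illustration, and I assume the regex must match the whole attribute value; how aeneas anchors the match internally is not covered here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input, standing in for a real text.xhtml file.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <h1 id="f001">1</h1>
  <p id="f002">From fairest creatures we desire increase,</p>
  <p id="intro">Not a fragment</p>
 </body>
</html>"""


def extract_fragments(xhtml, id_regex):
    """Return (id, text) pairs for elements whose id matches id_regex."""
    pattern = re.compile(id_regex)
    fragments = []
    for element in ET.fromstring(xhtml).iter():
        identifier = element.get("id")
        # Assumption: the whole id value must match the regex.
        if identifier is not None and pattern.fullmatch(identifier):
            text = "".join(element.itertext()).strip()
            fragments.append((identifier, text))
    return fragments


print(extract_fragments(XHTML, "f[0-9][0-9][0-9]"))
print(extract_fragments(XHTML, "f[0-9]+"))
```

On this input both expressions select the same two elements; they would differ only if some id had an f followed by a number of digits other than three.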
To reduce ambiguity, you might also instruct aeneas
to look for elements with
a given value in their class attribute. If your input file is:
you might want to specify both the following requirements:
- id must match f[0-9]+, and
- class must match (that is, must contain the value) ra.
Similarly to the previous case, your configuration string will contain
is_text_unparsed_id_regex=f[0-9]+ and is_text_unparsed_class_regex=ra.
Finally, aeneas asks you to specify the order in which the extracted
text fragments should be aligned.
In fact, the order in which the elements appear in the DOM
might be different from their order in the audio file.
For example, you might have the following portion of DOM:
and you want the extracted fragments to appear in this order:
- f001 (1)
- f002 (From fairest creatures we desire increase,)
- f003 (That thereby beauty's rose might never die,)
- f004 (But as the riper should by time decease,)
- f005 (His tender heir might bear his memory:)
In this case, you will specify the following parameter in the configuration string:
which will instruct aeneas to disregard any non-digit appearing in the id values,
and sort the text fragments according to the remaining numeric part (leading zeroes are ignored).
Other options for is_text_unparsed_id_sort include
unsorted (do not reorder the text fragments) and
lexicographic (sort the ids based on their lexicographic order).
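The difference between the three orderings is easiest to see on ids without zero padding. The sketch below mimics the behaviour described above; it is not the aeneas implementation, the function name sort_ids is hypothetical, and the label "numeric" for the digit-based ordering is my assumption.

```python
import re


def sort_ids(ids, algorithm):
    """Order ids as described for is_text_unparsed_id_sort."""
    if algorithm == "unsorted":
        return list(ids)  # keep the DOM order
    if algorithm == "lexicographic":
        return sorted(ids)  # plain string comparison
    if algorithm == "numeric":  # assumed name for the digit-based ordering
        # Drop every non-digit, then compare the rest as integers,
        # so that leading zeroes do not matter.
        return sorted(ids, key=lambda i: int(re.sub(r"[^0-9]", "", i)))
    raise ValueError("unknown algorithm: %s" % algorithm)


ids = ["f10", "f2", "f001"]
print(sort_ids(ids, "unsorted"))
print(sort_ids(ids, "lexicographic"))
print(sort_ids(ids, "numeric"))
```

With zero-padded ids of equal length (f001, f002, ...), the lexicographic and digit-based orderings coincide, which is one more reason to generate ids that way.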
The Configuration String
As mentioned above, there are a few parameters you must specify
to execute_task, in order to have your input files processed correctly.
To that end, you need to write a configuration string,
which is a UTF-8 encoded string that looks like this:
The order of the key=value pairs does not matter,
but you must use the | (pipe) character to separate them.
(I know this syntax looks a bit clumsy and cumbersome,
but it is very compact and it can be directly passed to APIs,
like we did in ReadBeyond Sync.
If I have time, I will enhance execute_task and execute_job
with an argument parser,
allowing the user to specify parameters using switches like --language en
or -f smil.)
You need to specify at least three parameters:
- the language of your input materials (e.g., task_language=en)
- the format of the text file (e.g., is_text_type=plain)
- the format of the sync map to be output (e.g., os_task_file_format=srt)
The resulting string is:
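Joining the three example pairs above with the pipe separator can be sketched in Python as follows. The two helper functions are hypothetical conveniences for building and reading back such strings in your own scripts; they are not part of aeneas.

```python
def build_config_string(parameters):
    """Join (key, value) pairs into a key=value|key=value string."""
    return "|".join("%s=%s" % (key, value) for key, value in parameters)


def parse_config_string(config_string):
    """Split a configuration string back into a dictionary."""
    return dict(pair.split("=", 1) for pair in config_string.split("|"))


config_string = build_config_string([
    ("task_language", "en"),
    ("is_text_type", "plain"),
    ("os_task_file_format", "srt"),
])
print(config_string)
# task_language=en|is_text_type=plain|os_task_file_format=srt
print(parse_config_string(config_string)["is_text_type"])
# plain
```

When passing such a string on a shell command line, remember to quote it, since the pipe character is otherwise interpreted by the shell.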
For example, assuming you have an audio file /tmp/audio.mp3,
a plain text file /tmp/subs.txt, both in English (en),
and you want to output a file /tmp/subs.srt in SRT format (srt),
you will issue the following command:
If you need to run several tasks sharing the same configuration string,
you might want to assign the latter to a shell variable CONFIG_STRING:
This mechanism is adequate as long as you have few tasks and/or
you want to run them one-by-one.
A handier mechanism leverages the execute_job
program, described below.
Optional Parameters
The configuration string might have additional, optional parameters.
The two most useful ones are:
- is_audio_file_head_length=X: ignore the first X seconds of the audio file
- is_audio_file_process_length=Y: synchronize only Y seconds of the audio file
which allow you to "cut" (for synchronization purposes) the head of the audio file,
its tail or both. For example, if you have an audio file of total length 60s:
- is_audio_file_head_length=20: sync from 20s to 60s in the audio file
- is_audio_file_process_length=50: sync from 0s to 50s in the audio file
- is_audio_file_head_length=20|is_audio_file_process_length=30: sync from 20s to 20s+30s=50s in the audio file
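The arithmetic behind the three cases can be written down as a small function. This is only a restatement of the examples above; the function name sync_interval is hypothetical, and clamping the end to the audio length is an assumption of mine.

```python
def sync_interval(audio_length, head_length=None, process_length=None):
    """Return the (begin, end) interval, in seconds, that gets synchronized."""
    begin = 0.0 if head_length is None else head_length
    if process_length is None:
        end = audio_length
    else:
        end = begin + process_length
    # Assumption: never go past the end of the audio file.
    return begin, min(end, audio_length)


print(sync_interval(60, head_length=20))                     # (20, 60)
print(sync_interval(60, process_length=50))                  # (0.0, 50)
print(sync_interval(60, head_length=20, process_length=30))  # (20, 50)
```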
Implied Parameters
As discussed above while describing the unparsed text format,
when you specify the is_text_type=unparsed parameter, you must also specify:
- is_text_unparsed_id_regex
- is_text_unparsed_id_sort
- optionally, you might also set is_text_unparsed_class_regex
When you want to output in SMIL format (os_task_file_format=smil),
you must also specify the values for the src attribute of:
- the <audio> elements, with os_task_file_smil_audio_ref
- the <text> elements, with os_task_file_smil_page_ref
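If you generate configuration strings from a script, a small pre-flight check can catch a forgotten implied parameter before running the task. The sketch below encodes only the rules listed in this section; the function name is hypothetical, and it uses os_task_file_format as the key for the output format, as in the minimal example earlier.

```python
def missing_implied_parameters(config_string):
    """List the implied parameters that the configuration string lacks."""
    config = dict(pair.split("=", 1) for pair in config_string.split("|"))
    required = []
    if config.get("is_text_type") == "unparsed":
        required += ["is_text_unparsed_id_regex", "is_text_unparsed_id_sort"]
    if config.get("os_task_file_format") == "smil":
        required += ["os_task_file_smil_audio_ref", "os_task_file_smil_page_ref"]
    return [key for key in required if key not in config]


incomplete = (
    "task_language=en|is_text_type=unparsed|"
    "is_text_unparsed_id_regex=f[0-9]+|os_task_file_format=smil|"
    "os_task_file_smil_audio_ref=../audio/audio.mp3"
)
print(missing_implied_parameters(incomplete))
# ['is_text_unparsed_id_sort', 'os_task_file_smil_page_ref']
```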
For example, you might have the following configuration string:
For the sake of clarity, I will break it down into pairs:
which will instruct aeneas to produce a SMIL file like this:
Please note that, for <audio> elements, the relative path ../audio/audio.mp3
has been used,
as specified in the configuration string.
References To The Documentation
- Languages: docs
- Input text formats: docs
- Output sync map formats: docs
- ID sorting algorithms: docs
- Parameter keys: docs
Please also refer to the examples you can find in the
aeneas/tests/res and long_tests directories
of the cloned repo.
Computing Multiple Sync Maps At Once
As briefly mentioned above, especially if you work with EPUB 3 eBooks, you might have dozens of tasks to run, all with the same configuration parameters.
In this case, you can create a Job, that is, a set of Tasks,
and process them in batch using the aeneas.tools.execute_job command.
In its simplest form, this command takes two arguments:
- /path/to/job.zip is a ZIP file containing all the input assets (i.e., a pair of audio/text files for each task) and a special configuration file config.txt (or config.xml) containing the runtime instructions
- /path/to/output/dir/ is the directory where the output archive, containing the output sync maps, one for each task, should be created
Note that, instead of creating an input ZIP file, you can also pass a path
to an uncompressed directory /path/to/job/:
In the aeneas/tests/res/example_jobs directory you can find
several examples of job directories, with different ways
of arranging the input files inside the input container directory hierarchy,
and with different runtime parameters.
In what follows, I will describe the contents of the config.txt
textual/INI-like configuration file,
which is the simplest way of specifying a job configuration,
yet it should cover the vast majority of use cases.
If you need finer control over the job configuration,
for example if you have different tasks with different languages,
you can create a config.xml XML configuration file:
see the documentation for more details.
The config.txt Configuration File: Flat Case
Suppose you have the following files in the flat_example directory:
The config.txt file contains the following:
If you run the following command:
you will get a ZIP file /tmp/output_flat.zip
containing three SMIL files,
one for each of the three tasks found:
The is_hierarchy_type=flat parameter tells aeneas that the assets
in the input container are contained within the same directory,
positioned at is_hierarchy_prefix=OEBPS/Resources/.
(Note that all the paths for the input assets
are relative to the config.txt file.)
The audio and text files for each task are identified
by matching the is_text_file_name_regex and is_audio_file_name_regex
regular expressions.
A task is created only if both the audio file and the text file
are matched and they share the same name prefix.
Similarly, the os_job_file_hierarchy_type=flat and
os_job_file_hierarchy_prefix=OEBPS/Resources/ parameters
specify the desired output directory hierarchy.
Note that the $PREFIX placeholder will be replaced
by each task name
(i.e., sonnet001, sonnet002, sonnet003 in the example).
Finally, please note that the language is set (for all the tasks)
to English by the job_language=en line.
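The pairing rule and the $PREFIX substitution described above can be sketched in Python 3 as follows. Everything specific here is an assumption made for illustration: the two regular expressions, the output name template, the function name, and the interpretation of "name prefix" as the file name without its extension. The sketch does not reproduce how aeneas walks the container.

```python
import os
import re


def pair_flat_tasks(file_names, text_regex, audio_regex):
    """Map each task name to its (text file, audio file) pair."""
    texts, audios = {}, {}
    for name in file_names:
        # Assumption: the "name prefix" is the name without extension.
        prefix = os.path.splitext(name)[0]
        if re.fullmatch(text_regex, name):
            texts[prefix] = name
        elif re.fullmatch(audio_regex, name):
            audios[prefix] = name
    # A task exists only when both files were found for the same prefix.
    return {p: (texts[p], audios[p]) for p in sorted(texts) if p in audios}


files = [
    "sonnet001.mp3", "sonnet001.xhtml",
    "sonnet002.mp3", "sonnet002.xhtml",
    "sonnet003.xhtml",  # no matching audio: no task is created
]
tasks = pair_flat_tasks(files, r".*\.xhtml", r".*\.mp3")

# Hypothetical output name template using the $PREFIX placeholder.
template = "$PREFIX.smil"
for prefix in tasks:
    print(prefix, tasks[prefix], template.replace("$PREFIX", prefix))
```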
The config.txt Configuration File: Paged Case
If your tasks are divided into subdirectories of the main directory paged_example:
you must specify the paged hierarchy in your config.txt:
Executing:
will create the ZIP file /tmp/output_paged.zip
containing:
Please note that if you use is_hierarchy_type=paged,
you must provide a regex for is_task_dir_name_regex,
which will be used to identify the tasks
by matching the subdirectory names
(is_task_dir_name_regex=[0-9]+ in the example).
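As a quick way to check a candidate regex against your own directory layout, the matching step can be sketched in Python 3 like this. The directory names and the function name are hypothetical, and I assume the regex must match the whole subdirectory name.

```python
import re


def find_task_directories(dir_names, task_dir_name_regex):
    """Keep only the subdirectory names that match the task regex."""
    pattern = re.compile(task_dir_name_regex)
    # Assumption: the whole name must match, not just a part of it.
    return [name for name in dir_names if pattern.fullmatch(name)]


# Hypothetical listing of the subdirectories of a paged job.
subdirectories = ["0001", "0002", "0003", "META-INF", "page4b"]
print(find_task_directories(subdirectories, "[0-9]+"))
# ['0001', '0002', '0003']
```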