Solving (Generalized) Ruzzle Boards With elzzur
RSS • Permalink • Created 29 May 2016 • Written by Alberto Pettarin
Ruzzle is a word board game app which went viral a few years ago.
A 4-by-4 board of letters is shown to the player:
and she has two minutes to find as many valid words in the board as she can.
A word can be formed by connecting at least two adjacent letters (including above/below, left/right, or diagonally), and it is valid if it is present in the app dictionary for the language chosen by the user. For example, "THE" in the above board:
The total score is the sum of the scores of the words found, which in turn is the sum of the points of each letter in the word, potentially multiplied by two or three if the path forming the word includes letters with the DL/TL/DW/TW modifiers. Also, 5-letters words or longer are awarded bonus length points.
Note that the score of each letter differs from language to language, since it basically depends on the frequency of the letter in each language, as in other word board games like Scrabble. For example, a "C" is worth 4 points in English, but only 2 points in Italian, etc.
(For details on the Ruzzle rules, see the Scoring Rules Section.)
Kids Challenging The Wrong Uncle
Two of my nephews are quite addicted to this game app, which allows remote players to compete on the same board, the winner being the player achieving a score higher than the opponent.
While I like playing the game myself from time to time, especially with or against my nephews, I also see some serious problems in it. The worst one is that, given a board, it is extremely easy to achieve an astronomical score by cheating. In fact, I mentioned to my nephews (they are 7 and 10 years old) that it is rather straightforward to write a program that, in few seconds, enumerates all the words contained in the very board displayed by the app. They were skeptical, so I had to show them some magic!
Enters elzzur
I coded elzzur
, a Python package
that contains an executable script which in fact solves any given Ruzzle board.
(Yes, "elzzur" is "ruzzle" backwards...)
Unlike other solvers out there, elzzur
solves generalized boards,
that is, N-by-M boards (N, M >= 1), not just 4-by-4 boards.
Moreover, it does not just find the words, but it also computes their exact score, thanks to built-in letter scores for a handful of languages, taken from the actual Ruzzle app. The user can sort the results by score, word length, or begin/end cell. In particular, sorting by (decreasing) score ensures the highest degree of "cheatiness"...
Finally, elzzur
has built-in dictionaries
derived from the GNU aspell project.
Although the built-in dictionaries are not the Ruzzle official ones,
they are sufficiently extensive to prove my point.
(I guess you can obtain the official Ruzzle dictionaries
by decompiling the app, and maybe they are already out there,
in the dark corners of the Internet...)
Alternatively, the user can supply her own dictionary.
Using elzzur
OK, enough talk, let's play!
First, install it:
Installing via pip
will make the elzzur
script available in your shell:
(If you clone the GitHub repo or you use Windows, replace elzzur
with python -m elzzur
.)
For the impatient, elzzur
comes with a "demo" mode,
which shows a built-in 4-by-4 board,
generated by the official Ruzzle app,
and solves it.
For example, for English (-l en
), it prints:
As you can see from the output above,
you also get the (0,0)-based coordinates
of the cells of the board that generate each word.
Clearly, for Italian (-l it
)
you will get a different board
and a different list of words:
Run with the languages
command to list the supported languages,
that is, languages for which the letter score and frequency,
a demo board, and a dictionary,
are built in elzzur
:
At this point, you might want to solve your own board.
To do so, create a plain text file, say /tmp/my.board
,
containing something like the following four lines
(details on the format are here):
and then run:
In all the above examples the dictionary used is the built-in one.
However, you can pass your own dictionary
(details are here)
to the solver, using the -d
switch:
If you plan to create a full dictionary with thousands of words, you might want to compile it to MARISA format, so that loading it into memory will be faster than loading it from a plain-text file:
Finally, if you want to generate random boards,
elzzur
can do it for you, sampling letters
according to the letter frequency of the selected language:
And, as promised, the solver can take it!
Who Is MARISA? Or, A Few Words On The Solver
I really enjoyed coding elzzur
, because it gave me an excuse to use
one of my favourite data structures: a trie, also known as prefix tree.
In particular, I used a MARISA trie, a very efficient C++ implementation
of a recursive, bit-indexed trie,
wrapped by the Python binding marisa-trie
.
Given the dictionary of valid words, "solving" a Ruzzle board means returning the list of words on the board maximing the total score.
Since the total score is the sum of the scores of the words found, this optmization problem can be solved by:
- finding all the possible valid "snakes", that is, paths of adjacent cells forming words present in the dictionary; and
- computing the score of each snake: if two or more snakes form the same word, only one yielding the highest score must be kept, since the Ruzzle rules say that the player cannot form the same word twice.
While the second step is trivial, the first one requires some thinking.
Attempt 1: Brute Force Search Of The Board
A naive attempt is the following: enumerating all the possible snakes, for example by performing a depth-first search (DFS) or a breadth-first search (BFS) of the board, checking if the word corresponding to the each snake is contained in the dictionary. To check efficiently, the latter can be stored as an hash table in memory.
This naive approach takes a lot of time, since there are about 12 million possible snakes in a 4-by-4 board. To meet the 2-minutes constraint of the original app, each snake should be generated and checked against the dictionary in about 10 microseconds, which is not quite feasible in Python (or even C) with a 100k word dictionary. Even worse, the number of snakes increases exponentially with the size of the board: solving a (say) 10x10 board will take ages on a regular PC.
Attempt 2: A Smarter Search
A better approach consists in reversing the role of the dictionary and the board, since natural languages have less than a million words (including inflected forms), so looping through the dictionary, checking if the current word can be formed on the given board is faster. A further speedup stems from the observation that one can filter the dictionary, removing words that cannot be formed with the letters present on the board, for example all those containing "Z" if "Z" is not on the board, or all those containing three "E"s if the board contains only two "E"s, etc.
However, even with this optimization, the (Python) code I wrote to benchmark it took a few minutes to solve a 4-by-4 board, and not making the 2-minutes cut of the official Ruzzle app!
Attempt 3: Trie-based Dictionary Search
The two attempts described above highlight a key characteristic of the problem.
Suppose you are examining a "current" snake. First, you check if it is a valid word: if so, you store it in the list of valid snakes. Second, you need to extend it, that is, you need to generate new "candidate" snakes, to be examined later, by appending letters adjacent to the current end letter. But of course you can avoid exploring these extensions if you know that no word in the dictionary has a prefix equal to the current snake. An example will help understanding the process:
Suppose you have the snake "BURN". Since it is a valid English word, you will store it. Now, you examine the letters adjacent to the "N": except for the "R" and "U" letters, which are already part of the current snake and hence they cannot be used again, you will add "BURNT" (top), "BURNE" (top-right), "BURNG", "BURNA", "BURNT" (bottom), and "BURNE" (bottom-left) to the queue of snakes to be examined. In fact, at a later iteration:
- both "BURNT" will be added to the the list of valid snakes, since "BURNT" is in the dictionary, and it will not be explored further, as no other word starts with prefix "BURNT";
- "BURNE" is not a valid word, but the dictionary has words like "BURNED", "BURNER" or "BURNETT" which start with that prefix, hence it will be explored further: "BURNED" is indeed on the board and it will be found in a later iteration;
- "BURNA" is not a valid word, but the dictionary contains "BURNABLE", so it will not be added but it will be explored: since none of "BURNAG", "BURNAH", "BURNAE", or "BURNAT" is a prefix of a dictionary word, the exploration will be pruned in the subsequent iteration;
- "BURNG" is neither a valid word nor a prefix of a valid word, so it will not be added and it will not be explored.
This observation immediately suggests to store the dictionary in a data structure that answers the following two questions efficiently:
- deos the data structure contain word
w
? - does the data structure contain words starting with prefix
p
?
Of course, such a data structure is a trie, aka prefix-tree aka radix-tree.
A trie is a tree with an empty root, while other nodes corresponds to one or more letters. Nodes that are children of the same node represent words that share the same prefix, which is precisely the concatenation of letters in the path from the root to the common parent node, but that then differ in the next letter. The following illustration makes the concept intuitive:
The words stored in the trie depicted above are: a, an, elegant, element, elzzur, i, is, rumble, ruzzle, simple, solve, solver.
Since a dictionary might contain words that are prefixes of other words,
like "solve"/"solver" in the example, a virtual character, $
in the example,
is added at the end of a word, to indicate that a prefix ("solve") is also a full word ("solve$").
(Actual implementations use a bit field for each node, instead of adding these virtual leaves,
but I find easier to understand the $
-convention shown above.)
To determine whether a word is stored in the trie, a visit of the trie starts from the root, until a leaf node is reached (which means the word is present) or a failure occurs. When reaching an internal node, its children are scanned to see if one contains a prefix of the word: if not, the search fail; if yes, the search continues from the child. For this reason, siblings are usually kept sorted, to allow binary search. In the above trie, seaching for "elementary", "solver" and "solving" will visit the following nodes:
Answering both the above questions ("has a given word?" and "has words with given prefix?") requires to traverse at most h nodes, where h is the height of the tree, each costing a binary search over the prefixes stored in its children. (For alphabets of limited cardinality, one can use an hash table or an array, trading space to achieve access in constant time.) Due to the statistical properties of natural languages, not all sequences of letters appear with the same frequency, and hence most of the times the search of the trie is terminated early, which in turn means that the exploration of the board is heavily pruned.
In elzzur
, the solver is thus implemented as in the following Python-like pseudo-code:
Here q
is a queue simulating a BFS of the board grid,
but the type of search does not really matter,
a DFS would have worked equally well.
What really matters is the fact that dictionary
supports fast has_word()
and has_words_with_prefix()
operations.
The actual Python code for the above function is here.
Once all the valid snakes are found, it is just a matter of keeping the highest scoring one (if two or more yield the same word), and sorting them according to the method requested by the user: score, word length, or begin or end cell of the snake.
As promised, the code using a MARISA trie to store the dictionary is really fast, taking only a few hundred milliseconds to solve a 4-by-4 board:
The sharp-eyed reader will notice that the run time is roughly linear in the number of cells of the board...
More Fun (aka Open Problems And Future Work)
On the GitHub README I listed a few extensions that are worth exploring.
If you fancy combinatorics, obtaining a closed formula for the number of possible snakes in a N-by-M grid seems to be an open problem, the main difficulty being the "geometric" constraints induced by the "non-self-intersecting rule".
Did you enjoy reading this post? Let me know by email or via Twitter (@acutebit).
(I am considering writing a series of posts similar to this one, about non-trivial algorithms and data structures, and their applications to practical and/or fun problems, so feedback is very appreciated.)
Enjoy playing with elzzur
!