Spell-checking using vcsn

This notebook shows how commands from Vcsn can be used to build a spell-checker (actually a spello-fixer).

We first build a minimal automaton (on $\mathbb{N}_\text{min}$) to represent a (finite) language. To make it minimal, we built a codeterministic automaton, which we determinize.

In [1]:
import vcsn
c = vcsn.context('lan_char, nmin')
lang = c.cotrie(filename='/usr/share/dict/words', format='words').determinize().strip()

context.cotrie returns an automaton on letterset (not nullableset), which is a good idea for determinize. Since later we will want the labelset to be nullable, we convert lang to an automaton using context c. The resulting automaton is quite compact: compare the number of states with the number of words and the total number of characters.

In [2]:
lang = lang.automaton(c)
lang.info()
Out[2]:
{'is codeterministic': 'N/A',
 'is complete': 'N/A',
 'is deterministic': 'N/A',
 'is empty': False,
 'is eps-acyclic': True,
 'is normalized': False,
 'is proper': True,
 'is standard': True,
 'is trim': True,
 'is useless': False,
 'is valid': True,
 'number of accessible states': 131164,
 'number of coaccessible states': 131164,
 'number of codeterministic states': 'N/A',
 'number of deterministic states': 'N/A',
 'number of final states': 15679,
 'number of initial states': 1,
 'number of lazy states': 0,
 'number of spontaneous transitions': 0,
 'number of states': 131164,
 'number of transitions': 289227,
 'number of useful states': 131164,
 'type': 'mutable_automaton<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nmin>'}
In [3]:
# The number of words, and then the total number of letters.
!wc -lc /usr/share/dict/words
  235886 2493109 /usr/share/dict/words

Then we turn this automaton into a partial-identity transducer, i.e., each single-tape transition ($s \xrightarrow{\langle n \rangle a} s'$) is turned into a double-tape transition ($s \xrightarrow{\langle n \rangle a|a} s'$). This is because we want to compose these automata.

In [4]:
lang2 = lang.partial_identity()
lang2.info()
Out[4]:
{'is codeterministic': 'N/A',
 'is complete': 'N/A',
 'is deterministic': 'N/A',
 'is empty': False,
 'is eps-acyclic': True,
 'is normalized': False,
 'is proper': True,
 'is standard': True,
 'is trim': True,
 'is useless': False,
 'is valid': True,
 'number of accessible states': 131164,
 'number of coaccessible states': 131164,
 'number of codeterministic states': 'N/A',
 'number of deterministic states': 'N/A',
 'number of final states': 15679,
 'number of initial states': 1,
 'number of lazy states': 0,
 'number of spontaneous transitions': 0,
 'number of states': 131164,
 'number of transitions': 289227,
 'number of useful states': 131164,
 'type': 'mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>'}

Now we build the Levenhstein automaton for the same context.

In [5]:
lev = lang2.context().levenshtein()
lev
Out[5]:
%3 I0 0 0 I0->0 ⟨0⟩ F0 0->F0 ⟨0⟩ 0->0 ⟨1⟩[^\-|\-A|AB|BC|CD|DE|EF|FG|GH|HI|IJ|JK|KL|LM|MN|NO|OP|PQ|QR|RS|ST|TU|UV|VW|WX|XY|YZ|Za|ab|bc|cd|de|ef|fg|gh|hi|ij|jk|kl|lm|mn|no|op|pq|qr|rs|st|tu|uv|vw|wx|xy|yz|z], ⟨0⟩[\-|\-A|AB|BC|CD|DE|EF|FG|GH|HI|IJ|JK|KL|LM|MN|NO|OP|PQ|QR|RS|ST|TU|UV|VW|WX|XY|YZ|Za|ab|bc|cd|de|ef|fg|gh|hi|ij|jk|kl|lm|mn|no|op|pq|qr|rs|st|tu|uv|vw|wx|xy|yz|z]

Compose the Levenshtein automaton to the language transducer. This results into a transducer which on the left-tape is accepting the language fuzzily, and on the second tape, features the exact words.

In [6]:
fixer = lev.compose(lang2).strip()
fixer.info()
Out[6]:
{'is codeterministic': 'N/A',
 'is complete': 'N/A',
 'is deterministic': 'N/A',
 'is empty': False,
 'is eps-acyclic': True,
 'is normalized': False,
 'is proper': True,
 'is standard': False,
 'is trim': True,
 'is useless': False,
 'is valid': True,
 'number of accessible states': 131164,
 'number of coaccessible states': 131164,
 'number of codeterministic states': 'N/A',
 'number of deterministic states': 'N/A',
 'number of final states': 15679,
 'number of initial states': 1,
 'number of lazy states': 0,
 'number of spontaneous transitions': 0,
 'number of states': 131164,
 'number of transitions': 22569950,
 'number of useful states': 131164,
 'type': 'mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>'}

For any candidate word, build the linear transducer that represents it.

In [7]:
w = c.expression('automathon').automaton().partial_identity()
w
Out[7]:
%3 I0 0 0 I0->0 ⟨0⟩ F10 1 1 0->1 ⟨0⟩a|a 2 2 1->2 ⟨0⟩u|u 3 3 2->3 ⟨0⟩t|t 4 4 3->4 ⟨0⟩o|o 5 5 4->5 ⟨0⟩m|m 6 6 5->6 ⟨0⟩a|a 7 7 6->7 ⟨0⟩t|t 8 8 7->8 ⟨0⟩h|h 9 9 8->9 ⟨0⟩o|o 10 10 9->10 ⟨0⟩n|n 10->F10 ⟨0⟩

Now compose the word and the fixer to get a computation of the Levenstheim distance of w to any word on the language.

In [8]:
fix = w.compose(fixer)
fix.info()
Out[8]:
{'is codeterministic': 'N/A',
 'is complete': 'N/A',
 'is deterministic': 'N/A',
 'is empty': False,
 'is eps-acyclic': True,
 'is normalized': False,
 'is proper': True,
 'is standard': True,
 'is trim': True,
 'is useless': False,
 'is valid': True,
 'number of accessible states': 2754434,
 'number of coaccessible states': 2754434,
 'number of codeterministic states': 'N/A',
 'number of deterministic states': 'N/A',
 'number of final states': 31358,
 'number of initial states': 1,
 'number of lazy states': 0,
 'number of spontaneous transitions': 0,
 'number of states': 2754434,
 'number of transitions': 14060199,
 'number of useful states': 2754434,
 'type': 'tuple_automaton<mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>, focus_automaton<1, mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>>, insplit_automaton<focus_automaton<0, mutable_automaton<lat<nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>, nullableset<letterset<char_letters(\\-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)>>>, nmin>>>>'}

Display the proposed correction.

In [9]:
fix.lightest(1)
Out[9]:
$\left\langle 1\right\rangle \mathit{automathon}|\mathit{automaton}$

Find the lightest path in this transducer: it maps our word to its closest word in the language.

In [10]:
fix.lightest_automaton()
Out[10]:
%3 I10 10 10 I10->10 ⟨0⟩ F0 0 0 0->F0 ⟨0⟩ 1 1 1->0 ⟨0⟩n|n 2 2 2->1 ⟨0⟩o|o 3 3 3->2 ⟨1⟩h|ε 4 4 4->3 ⟨0⟩t|t 5 5 5->4 ⟨0⟩a|a 6 6 6->5 ⟨0⟩m|m 7 7 7->6 ⟨0⟩o|o 8 8 8->7 ⟨0⟩t|t 9 9 9->8 ⟨0⟩u|u 10->9 ⟨0⟩a|a

We can ask for several proposals of close words.

In [11]:
fix.lightest(10).project(1)
Out[11]:
$\left\langle 3\right\rangle \mathit{automat} \oplus \left\langle 3\right\rangle \mathit{autobahn} \oplus \left\langle 3\right\rangle \mathit{automata} \oplus \left\langle 3\right\rangle \mathit{autophon} \oplus \left\langle 3\right\rangle \mathit{automatic} \oplus \left\langle 2\right\rangle \mathit{automatin} \oplus \left\langle 1\right\rangle \mathit{automaton} \oplus \left\langle 3\right\rangle \mathit{automelon} \oplus \left\langle 3\right\rangle \mathit{automotor} \oplus \left\langle 2\right\rangle \mathit{autochthon}$