context.levenshtein

Generate a transducer encoding the Levenshtein distance for two alphabets.

The resulting transducer can be composed with transducers representing two languages to compute the distance between the two languages (among other things, like the edit-distance automaton).

Preconditions:

  • the labelset has exactly two tapes
  • the labelsets must have the empty word
  • the weightset must be nmin ($\langle \mathbb{N}, \min, +\rangle$)

References:

Examples

In [1]:
import vcsn
c = vcsn.context('lat<lan_char(abc), lan_char(bce)>, nmin')
l = c.levenshtein()
l
Out[1]:
%3 I0 0 0 I0->0 ⟨0⟩ F0 0->F0 ⟨0⟩ 0->0 ⟨1⟩[^b|bc|c], ⟨0⟩b|b, ⟨0⟩c|c

The Levenshtein automaton only has one state, but has $n^2$ transitions, for a common alphabet between tapes of size $n$.

To show its use, we will create automata for two languages, a1 and a2, and compute the edit-distance automaton.

In [2]:
a1 = vcsn.context('lan_char(abc), b').expression("bac+cab").derived_term().strip().partial_identity()
a1
Out[2]:
%3 I0 0 0 I0->0 F4 1 1 0->1 b|b 2 2 0->2 c|c 5 5 1->5 a|a 3 3 2->3 a|a 4 4 3->4 b|b 4->F4 5->4 c|c
In [3]:
a2 = vcsn.context('lan_char(bce), b').expression("bec+bebe").automaton().cominimize().strip().partial_identity()
a2
Out[3]:
%3 I0 0 0 I0->0 F4 1 1 0->1 b|b 2 2 1->2 e|e 3 3 2->3 b|b 4 4 2->4 c|c 3->4 e|e 4->F4
In [4]:
edit = a1.compose(l).compose(a2).trim()
edit
Out[4]:
%3 I0 0 (0, (0, !ε)), (0, !ε) I0->0 ⟨0⟩ F43 F49 1 (1, (0, !ε)), (0, !ε) 0->1 ⟨1⟩b|ε 2 (2, (0, !ε)), (0, !ε) 0->2 ⟨1⟩c|ε 3 (0, (0, ε)), (1, !ε) 0->3 ⟨1⟩ε|b 4 (1, (0, !ε)), (1, !ε) 0->4 ⟨0⟩b|b 5 (2, (0, !ε)), (1, !ε) 0->5 ⟨1⟩c|b 6 (5, (0, !ε)), (0, !ε) 1->6 ⟨1⟩a|ε 7 (1, (0, ε)), (1, !ε) 1->7 ⟨1⟩ε|b 8 (5, (0, !ε)), (1, !ε) 1->8 ⟨1⟩a|b 9 (3, (0, !ε)), (0, !ε) 2->9 ⟨1⟩a|ε 10 (2, (0, ε)), (1, !ε) 2->10 ⟨1⟩ε|b 11 (3, (0, !ε)), (1, !ε) 2->11 ⟨1⟩a|b 3->4 ⟨1⟩b|ε 3->5 ⟨1⟩c|ε 12 (0, (0, ε)), (2, !ε) 3->12 ⟨1⟩ε|e 13 (1, (0, !ε)), (2, !ε) 3->13 ⟨1⟩b|e 14 (2, (0, !ε)), (2, !ε) 3->14 ⟨1⟩c|e 4->8 ⟨1⟩a|ε 15 (1, (0, ε)), (2, !ε) 4->15 ⟨1⟩ε|e 16 (5, (0, !ε)), (2, !ε) 4->16 ⟨1⟩a|e 5->11 ⟨1⟩a|ε 17 (2, (0, ε)), (2, !ε) 5->17 ⟨1⟩ε|e 18 (3, (0, !ε)), (2, !ε) 5->18 ⟨1⟩a|e 19 (4, (0, !ε)), (0, !ε) 6->19 ⟨1⟩c|ε 20 (5, (0, ε)), (1, !ε) 6->20 ⟨1⟩ε|b 21 (4, (0, !ε)), (1, !ε) 6->21 ⟨1⟩c|b 7->8 ⟨1⟩a|ε 7->15 ⟨1⟩ε|e 7->16 ⟨1⟩a|e 8->21 ⟨1⟩c|ε 22 (5, (0, ε)), (2, !ε) 8->22 ⟨1⟩ε|e 23 (4, (0, !ε)), (2, !ε) 8->23 ⟨1⟩c|e 9->19 ⟨1⟩b|ε 9->21 ⟨0⟩b|b 24 (3, (0, ε)), (1, !ε) 9->24 ⟨1⟩ε|b 10->11 ⟨1⟩a|ε 10->17 ⟨1⟩ε|e 10->18 ⟨1⟩a|e 11->21 ⟨1⟩b|ε 11->23 ⟨1⟩b|e 25 (3, (0, ε)), (2, !ε) 11->25 ⟨1⟩ε|e 12->13 ⟨1⟩b|ε 12->14 ⟨1⟩c|ε 26 (0, (0, ε)), (3, !ε) 12->26 ⟨1⟩ε|b 27 (1, (0, !ε)), (3, !ε) 12->27 ⟨0⟩b|b 28 (2, (0, !ε)), (3, !ε) 12->28 ⟨1⟩c|b 29 (0, (0, ε)), (4, !ε) 12->29 ⟨1⟩ε|c 30 (1, (0, !ε)), (4, !ε) 12->30 ⟨1⟩b|c 31 (2, (0, !ε)), (4, !ε) 12->31 ⟨0⟩c|c 13->16 ⟨1⟩a|ε 32 (1, (0, ε)), (3, !ε) 13->32 ⟨1⟩ε|b 33 (5, (0, !ε)), (3, !ε) 13->33 ⟨1⟩a|b 34 (1, (0, ε)), (4, !ε) 13->34 ⟨1⟩ε|c 35 (5, (0, !ε)), (4, !ε) 13->35 ⟨1⟩a|c 14->18 ⟨1⟩a|ε 36 (2, (0, ε)), (3, !ε) 14->36 ⟨1⟩ε|b 37 (3, (0, !ε)), (3, !ε) 14->37 ⟨1⟩a|b 38 (2, (0, ε)), (4, !ε) 14->38 ⟨1⟩ε|c 39 (3, (0, !ε)), (4, !ε) 14->39 ⟨1⟩a|c 15->16 ⟨1⟩a|ε 15->32 ⟨1⟩ε|b 15->33 ⟨1⟩a|b 15->34 ⟨1⟩ε|c 15->35 ⟨1⟩a|c 16->23 ⟨1⟩c|ε 40 (5, (0, ε)), (3, !ε) 16->40 ⟨1⟩ε|b 41 (4, (0, !ε)), (3, !ε) 16->41 ⟨1⟩c|b 42 (5, (0, ε)), (4, !ε) 16->42 ⟨1⟩ε|c 43 (4, (0, !ε)), (4, !ε) 16->43 ⟨0⟩c|c 17->18 ⟨1⟩a|ε 17->36 ⟨1⟩ε|b 17->37 ⟨1⟩a|b 17->38 ⟨1⟩ε|c 17->39 ⟨1⟩a|c 18->23 ⟨1⟩b|ε 18->41 ⟨0⟩b|b 18->43 ⟨1⟩b|c 44 (3, (0, ε)), (3, !ε) 18->44 ⟨1⟩ε|b 45 (3, (0, ε)), (4, !ε) 18->45 ⟨1⟩ε|c 46 (4, (0, ε)), (1, !ε) 19->46 ⟨1⟩ε|b 20->21 ⟨1⟩c|ε 20->22 ⟨1⟩ε|e 20->23 ⟨1⟩c|e 47 (4, (0, ε)), (2, !ε) 21->47 ⟨1⟩ε|e 22->23 ⟨1⟩c|ε 22->40 ⟨1⟩ε|b 22->41 ⟨1⟩c|b 22->42 ⟨1⟩ε|c 22->43 ⟨0⟩c|c 48 (4, (0, ε)), (3, !ε) 23->48 ⟨1⟩ε|b 49 (4, (0, ε)), (4, !ε) 23->49 ⟨1⟩ε|c 24->21 ⟨1⟩b|ε 24->23 ⟨1⟩b|e 24->25 ⟨1⟩ε|e 25->23 ⟨1⟩b|ε 25->41 ⟨0⟩b|b 25->43 ⟨1⟩b|c 25->44 ⟨1⟩ε|b 25->45 ⟨1⟩ε|c 26->27 ⟨1⟩b|ε 26->28 ⟨1⟩c|ε 26->29 ⟨1⟩ε|e 26->30 ⟨1⟩b|e 26->31 ⟨1⟩c|e 27->33 ⟨1⟩a|ε 27->34 ⟨1⟩ε|e 27->35 ⟨1⟩a|e 28->37 ⟨1⟩a|ε 28->38 ⟨1⟩ε|e 28->39 ⟨1⟩a|e 29->30 ⟨1⟩b|ε 29->31 ⟨1⟩c|ε 30->35 ⟨1⟩a|ε 31->39 ⟨1⟩a|ε 32->33 ⟨1⟩a|ε 32->34 ⟨1⟩ε|e 32->35 ⟨1⟩a|e 33->41 ⟨1⟩c|ε 33->42 ⟨1⟩ε|e 33->43 ⟨1⟩c|e 34->35 ⟨1⟩a|ε 35->43 ⟨1⟩c|ε 36->37 ⟨1⟩a|ε 36->38 ⟨1⟩ε|e 36->39 ⟨1⟩a|e 37->41 ⟨1⟩b|ε 37->43 ⟨1⟩b|e 37->45 ⟨1⟩ε|e 38->39 ⟨1⟩a|ε 39->43 ⟨1⟩b|ε 40->41 ⟨1⟩c|ε 40->42 ⟨1⟩ε|e 40->43 ⟨1⟩c|e 41->49 ⟨1⟩ε|e 42->43 ⟨1⟩c|ε 43->F43 ⟨0⟩ 44->41 ⟨1⟩b|ε 44->43 ⟨1⟩b|e 44->45 ⟨1⟩ε|e 45->43 ⟨1⟩b|ε 46->47 ⟨1⟩ε|e 47->48 ⟨1⟩ε|b 47->49 ⟨1⟩ε|c 48->49 ⟨1⟩ε|e 49->F49 ⟨0⟩

The automaton can be evaluated on one tape to get the edit distance between a word and a language.

In [5]:
exp = edit.lift(1).proper().evaluate("bac")
exp
Out[5]:
$ \left\langle 3 \right\rangle \,\left(\left(b\right) \, \left(e\right) \, \left(b\right) \, \left(e\right)\right) + \left\langle 1 \right\rangle \,\left(\left(b\right) \, \left(e\right) \, \left(c + \left\langle 2 \right\rangle \,\left(\left(b\right) \, \left(e\right)\right)\right)\right) + \left\langle 2 \right\rangle \,\left(\left(b\right) \, \left(e\right) \, \left( \left\langle 1 \right\rangle \,\left(c\right) + \left\langle 2 \right\rangle \,\left(\left(b\right) \, \left(e\right)\right)\right)\right) + \left\langle 1 \right\rangle \,\left( \left\langle 3 \right\rangle \,\left(\left(b\right) \, \left(e\right) \, \left(b\right) \, \left(e\right)\right) + \left\langle 1 \right\rangle \,\left(\left(b\right) \, \left(e\right) \, \left( \left\langle 1 \right\rangle \,\left(c\right) + \left\langle 2 \right\rangle \,\left(\left(b\right) \, \left(e\right)\right)\right)\right)\right)$
In [6]:
vcsn.context("lan_char(bce), nmin").expression(exp.format('text'), 'distributive')
Out[6]:
$ \left\langle 1 \right\rangle \,\left(b \, e \, c\right) \oplus \left\langle 3 \right\rangle \,\left(b \, e \, b \, e\right)$