Tsa Lectures 1
The set of all nonempty strings over A is A+ = A* \ {ε}. The complement of alphabet A for some set of symbols B, B ⊆ A, is denoted B̄ = A \ B. Notation ā means A \ {a}. The operation of concatenation is defined on the set of strings in this way: if x and y are strings over A, then the concatenation of these strings is xy. This operation is associative, i.e. (xy)z = x(yz). On the other hand, it is not commutative, i.e. xy ≠ yx. The empty string ε is the neutral element: εx = xε = x. The set of all strings over A is denoted A*.
Definition 1.1
Set Pref(x), x ∈ A*, is the set of all prefixes of string x:
Pref(x) = {y : x = yu, y, u ∈ A*}. □
Definition 1.2
Set Suff(x), x ∈ A*, is the set of all suffixes of string x:
Suff(x) = {y : x = uy, u, y ∈ A*}. □
Definition 1.3
Set Fact(x), x ∈ A*, is the set of all factors (substrings) of string x:
Fact(x) = {y : x = uyv, u, y, v ∈ A*}. □
Definition 1.4
Set Sub(x), x ∈ A*, is the set of all subsequences of string x:
Sub(x) = {a_1 a_2 ... a_m : x = y_0 a_1 y_1 a_2 y_2 ... a_m y_m, y_i ∈ A*, i = 0, 1, 2, ..., m, a_j ∈ A, j = 1, 2, ..., m, m ≥ 0}. □
Definition 1.5
The terms proper prefix, proper suffix, proper factor, proper subsequence are used for a prefix, suffix, factor, subsequence of string x which is not equal to x. □
The definitions of the sets Pref, Suff, Fact and Sub can be extended to finite and infinite sets of strings.
Definition 1.6
Set Pref(X), X ⊆ A*, is the set of all prefixes of all strings of X:
Pref(X) = {y : x = yu, x ∈ X, y, u ∈ A*}. □
Definition 1.7
Set Suff(X), X ⊆ A*, is the set of all suffixes of all strings of X:
Suff(X) = {y : x = uy, x ∈ X, u, y ∈ A*}. □
Definition 1.8
Set Fact(X), X ⊆ A*, is the set of all factors of all strings of X:
Fact(X) = {y : x = uyv, x ∈ X, u, y, v ∈ A*}. □
Definition 1.9
Set Sub(X), X ⊆ A*, is the set of all subsequences of all strings of X:
Sub(X) = {a_1 a_2 ... a_m : x = y_0 a_1 y_1 a_2 y_2 ... a_m y_m, x ∈ X, y_i ∈ A*, i = 0, 1, 2, ..., m, a_j ∈ A, j = 1, 2, ..., m, m ≥ 0}. □
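For a concrete string, the four sets can be computed directly from Definitions 1.1-1.4. The following Python sketch (the function names pref, suff, fact and sub are ours, not notation from the text) enumerates them; note that Sub(x) may contain up to 2^|x| elements, so it should only be enumerated for short strings:

```python
from itertools import combinations

def pref(x):
    # All prefixes, including the empty string and x itself (Def. 1.1).
    return {x[:i] for i in range(len(x) + 1)}

def suff(x):
    # All suffixes (Def. 1.2).
    return {x[i:] for i in range(len(x) + 1)}

def fact(x):
    # All factors (substrings): every suffix of every prefix (Def. 1.3).
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}

def sub(x):
    # All subsequences: any subset of positions taken in order (Def. 1.4).
    return {"".join(c) for r in range(len(x) + 1) for c in combinations(x, r)}
```

For example, fact("ab") yields the four factors "", "a", "b" and "ab", while sub("abc") contains "ac" (a subsequence that is not a factor).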
The definitions of the above-mentioned sets can also be extended to approximate cases. In the following definitions D is a metric and k is the distance limit.
Definition 1.10
The set of approximate prefixes APref of string x is:
APref(x) = {u : v ∈ Pref(x), D(u, v) ≤ k}. □
Definition 1.11
The set of approximate suffixes ASuff of string x is:
ASuff(x) = {u : v ∈ Suff(x), D(u, v) ≤ k}. □
Definition 1.12
The set of approximate factors AFact of string x is:
AFact(x) = {u : v ∈ Fact(x), D(u, v) ≤ k}. □
Definition 1.13
The set of approximate subsequences ASub of string x is:
ASub(x) = {u : v ∈ Sub(x), D(u, v) ≤ k}. □
The term pattern matching is used for both string matching and sequence matching. The term subpattern matching is used for matching substrings or subsequences of a pattern.
Definition 1.14
The don't care symbol is a special universal symbol that matches any other symbol, including itself.
Definition 1.15 (Basic pattern matching problems)
Given text T = t_1 t_2 ... t_n and pattern P = p_1 p_2 ... p_m, we may define:
1. String matching: verify whether string P is a substring of text T.
2. Sequence matching: verify whether sequence P is a subsequence of text T.
3. Subpattern matching: verify whether a subpattern of P (a substring or a subsequence) occurs in text T.
4. Approximate pattern matching: verify whether string x occurs in text T so that distance D(P, x) ≤ k for a given k < m.
5. Pattern matching with don't care symbols: verify whether pattern P containing don't care symbols occurs in text T. □
Definition 1.16 (Matching a sequence of patterns)
Given text T = t_1 t_2 ... t_n and a sequence of patterns (strings and/or sequences) P_1, P_2, ..., P_s. Matching of the sequence of patterns P_1, P_2, ..., P_s is a verification whether an occurrence of pattern P_i in text T is followed by an occurrence of P_{i+1}, 1 ≤ i < s. □
Definitions 1.15 and 1.16 define pattern matching problems as decision problems, because the output is a Boolean value. A modified version of these problems consists in searching for the first, the last, or all occurrences of a pattern; moreover, the result may be the set of positions of the pattern in the text.
Instead of just one pattern, one can consider a finite or infinite set of patterns.
Definition 1.17 (Distances of strings - general alphabets)
Three variants of distances between two strings x and y are defined as the minimum number of editing operations:
1. replace (Hamming distance, R-distance),
2. delete, insert and replace (Levenshtein distance, DIR-distance),
3. delete, insert, replace and transpose of neighbouring symbols (Damerau distance, generalized Levenshtein distance, DIRT-distance),
needed to convert string x into string y. □
The Hamming distance is a metric on a set of strings of equal length. The Levenshtein distance and the generalized Levenshtein distance are metrics on a set of strings not necessarily of equal length.
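The first two distances of Definition 1.17 can be computed directly; the following Python sketch (function names are ours) implements the R-distance and the DIR-distance, the latter by the classical dynamic programming over the edit matrix:

```python
def hamming(x, y):
    # R-distance: number of replace operations; defined only for equal lengths.
    if len(x) != len(y):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for a, b in zip(x, y) if a != b)

def levenshtein(x, y):
    # DIR-distance: dynamic programming with a single rolling row.
    m, n = len(x), len(y)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                          # delete
                       d[j - 1] + 1,                      # insert
                       prev + (x[i - 1] != y[j - 1]))     # replace / match
            prev = cur
    return d[n]
```

For instance, levenshtein("kitten", "sitting") is 3 (two replacements and one insertion).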
Definition 1.18 (Distance of strings - ordered alphabet)
Let A = {a_1, a_2, ..., a_p} be an ordered alphabet and let a_i, a_j be symbols from alphabet A. Then the Δ-distance of a_i, a_j is defined as
Δ(a_i, a_j) = |i − j|.
1. Δ-distance:
Let x, y be strings over alphabet A such that |x| = |y|. Then the Δ-distance of x, y is defined as
Δ(x, y) = max_{i ∈ {1..|x|}} Δ(x_i, y_i).
2. Γ-distance:
Let x, y be strings over alphabet A such that |x| = |y|. Then the Γ-distance of x, y is defined as
Γ(x, y) = Σ_{i ∈ {1..|x|}} Δ(x_i, y_i). □
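Both distances of Definition 1.18 are straightforward to compute once every symbol is mapped to its index in the ordered alphabet. A Python sketch (the order argument, a mapping from symbol to index, is this sketch's encoding of the alphabet ordering):

```python
def delta(x, y, order):
    # Δ-distance: maximum per-position distance in the alphabet ordering.
    assert len(x) == len(y) and len(x) > 0
    return max(abs(order[a] - order[b]) for a, b in zip(x, y))

def gamma(x, y, order):
    # Γ-distance: sum of the per-position distances.
    assert len(x) == len(y)
    return sum(abs(order[a] - order[b]) for a, b in zip(x, y))
```

Note that Δ(x, y) ≤ Γ(x, y) always holds, since the maximum of non-negative terms never exceeds their sum.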
Definition 1.19 (Approximate pattern matching - ordered alphabet)
Given text T = t_1 t_2 ... t_n and pattern P = p_1 p_2 ... p_m over a given ordered alphabet A, we define:
1. Δ-matching:
Find all occurrences of string x in text T such that |x| = |P| and Δ(P, x) ≤ k, where k is a given positive integer.
2. Γ-matching:
Find all occurrences of string x in text T such that |x| = |P| and Γ(P, x) ≤ k, where k is a given positive integer.
3. (Δ, Γ)-matching:
Find all occurrences of string x in text T such that |x| = |P|, Δ(P, x) ≤ l and Γ(P, x) ≤ k, where k, l are given positive integers such that l ≤ k. □
1.2 Classification of pattern matching problems
One-dimensional pattern matching problems for a finite size alphabet can be classified according to several criteria. We will use six criteria for a classification leading to a six-dimensional space in which one point corresponds to a particular pattern matching problem.
Let us make a list of all dimensions including possible values in each dimension:
1. Nature of the pattern:
- string,
- sequence.
2. Integrity of the pattern:
- full pattern,
- subpattern.
3. Number of patterns:
- one,
- finite number greater than one,
- infinite number.
4. Way of matching:
- exact,
- approximate matching with Hamming distance (R-matching),
- approximate matching with Levenshtein distance (DIR-matching),
- approximate matching with generalized Levenshtein distance (DIRT-matching),
- Δ-approximate matching,
- Γ-approximate matching,
- (Δ, Γ)-approximate matching.
5. Importance of symbols in a pattern:
- take care of all symbols,
- don't care about some symbols.
6. Sequences of patterns:
- one,
- finite sequence.
The above classification is represented in Figure 1.3. If we count the number of possible pattern matching problems, we obtain N = 2 · 2 · 3 · 7 · 2 · 2 = 336.
In order to facilitate references to a particular pattern matching problem, we will use abbreviations for all problems. These abbreviations are summarized in Table 1.1 (D means DIR-matching and G means DIRT-matching, i.e. matching using the generalized Levenshtein distance).
Figure 1.3: Classification of pattern matching problems

Using this method, we can, for example, refer to exact string matching of one string as an SFOECO problem.
Instead of a single pattern matching problem we will use the notion of a family of pattern matching problems. In this case we will use the symbol ? instead of a particular letter. For example, SFO??? is the family of all problems concerning matching of one full string.
Each pattern matching problem has several instances. For example, an SFOECO problem has the following instances:
1. verify whether a given string occurs in the text or not,
2. find the first occurrence of a given string,
3. find the number of all occurrences of a given string,
4. find all occurrences of a given string and their positions.
If we take into account all possible instances, the number of pattern matching
problems grows further.
Dimension:  1   2   3   4        5   6
            S   F   O   E        C   O
            Q   S   F   R        D   S
                    I   D
                        G
                        Δ
                        Γ
                        (Δ, Γ)

Table 1.1: Abbreviations for pattern matching problems
1.3 Two ways of pattern matching
There are two different ways in which matching of patterns can be performed:
- forward pattern matching,
- backward pattern matching.
The basic principle of forward pattern matching is depicted in Fig. 1.4.

Figure 1.4: Forward pattern matching

The text and the pattern are matched in the forward direction. This means that the comparison of symbols is performed from left to right. All algorithms for forward pattern matching must compare each symbol of the text at least once. Therefore the lowest achievable number of comparisons is equal to the length of the text.
The basic principle of backward pattern matching is depicted in Fig. 1.5.

Figure 1.5: Backward pattern matching

The comparison of symbols is performed from right to left. There are three main principles of backward pattern matching:
- looking for a repeated suffix of the pattern,
- looking for a prefix of the pattern,
- looking for an antifactor (a string which is not a factor) of the pattern.
Algorithms for backward pattern matching allow us to skip some parts of the text, and therefore the number of comparisons can be lower than the length of the text.
1.4 Finite automata
We will use finite automata in all subsequent Chapters as a formalism for the description of various aspects of pattern matching. In this Section we introduce basic notions from the theory of finite automata and we also show some basic algorithms concerning them. The material included is not exhaustive and we recommend using the specialized literature covering this area in detail.
Definition 1.20 (Deterministic finite automaton)
A deterministic finite automaton (DFA) is a quintuple M = (Q, A, δ, q_0, F), where
- Q is a finite set of states,
- A is a finite input alphabet,
- δ is a mapping from Q × A to Q (δ: Q × A → Q),
- q_0 ∈ Q is an initial state,
- F ⊆ Q is the set of final states.
Definition 1.21 (Configuration of FA)
Let M = (Q, A, δ, q_0, F) be a finite automaton. A pair (q, w) ∈ Q × A* is called a configuration of automaton M.

Definition 1.22 (Transition of FA)
A transition of finite automaton M is the relation ⊢_M ⊆ (Q × A*) × (Q × A*): if p ∈ δ(q, a), then (q, aw) ⊢_M (p, w) for all w ∈ A*. Symbols ⊢+_M and ⊢*_M denote a transitive and a transitive reflexive closure of relation ⊢_M, respectively.

Definition 1.23 (Language accepted by DFA)
We will say that input string w ∈ A* is accepted by DFA M if (q_0, w) ⊢*_M (q, ε) for some q ∈ F. Language L(M) = {w : w ∈ A*, (q_0, w) ⊢*_M (q, ε), q ∈ F} is the language accepted by DFA M.

A nondeterministic finite automaton (NFA) differs from a DFA in that δ is a mapping from Q × A into the set of subsets of Q; its transition relation is again a subset of (Q × A*) × (Q × A*), with (q, aw) ⊢_M (p, w) whenever p ∈ δ(q, a).

Definition 1.27 (Language accepted by NFA)
String w ∈ A* is accepted by NFA M if (q_0, w) ⊢*_M (q, ε) for some q ∈ F. Language L(M) = {w : w ∈ A*, (q_0, w) ⊢*_M (q, ε), q ∈ F} is the language accepted by NFA M.
Definition 1.30 (CLOSURE)
Function CLOSURE for finite automaton M = (Q, A, δ, q_0, F) is defined as:
CLOSURE(q) = {p : (q, ε) ⊢*_M (p, ε), p ∈ Q}.
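Computed iteratively, CLOSURE(q) is the set of all states reachable from q using ε-transitions only. A short Python sketch (the eps mapping from a state to the set of its ε-successors is this sketch's representation, not notation from the text):

```python
def closure(q, eps):
    # ε-CLOSURE: all states reachable from q via ε-transitions only,
    # computed by a standard graph traversal over the ε-edges.
    seen, stack = {q}, [q]
    while stack:
        for p in eps.get(stack.pop(), set()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen
```

For example, with ε-transitions 0 → 1 → 2, closure(0, eps) returns {0, 1, 2}.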
Definition 1.31 (NFA with a set of initial states)
A nondeterministic finite automaton M with the set of initial states I is a quintuple M = (Q, A, δ, I, F), where:
- Q is a finite set of states,
- A is a finite input alphabet,
- δ is a mapping from Q × A into the set of subsets of Q,
- I ⊆ Q is the non-empty set of initial states,
- F ⊆ Q is the set of final states.
Definition 1.32 (Accessible state)
Let M = (Q, A, δ, q_0, F) be a finite automaton. State q ∈ Q is called accessible if there exists a string w ∈ A* such that (q_0, w) ⊢*_M (q, ε).

Algorithm (removal of ε-transitions)
Construction of finite automaton M' = (Q, A, δ', q_0, F') without ε-transitions equivalent to M.
Method:
1. δ'(q, a) = ⋃_{p ∈ CLOSURE(q)} δ(p, a),
2. F' = {q : CLOSURE(q) ∩ F ≠ ∅, q ∈ Q}. □
Algorithm 1.39
Construction of a nondeterministic finite automaton with a single initial state equivalent to a nondeterministic finite automaton with several initial states.
Input: Finite automaton M = (Q, A, δ, I, F) with a nonempty set I of initial states.
Output: Finite automaton M' = (Q', A, δ', q_0', F) with a single initial state q_0'.
Method: Automaton M' = (Q ∪ {q_0'}, A, δ', q_0', F), where q_0' ∉ Q, is constructed as follows:
1. δ'(q, a) = δ(q, a) for all q ∈ Q, a ∈ A,
2. δ'(q_0', ε) = I. □

Algorithm 1.40
Transformation of a nondeterministic finite automaton to an equivalent deterministic finite automaton.
Input: NFA M = (Q, A, δ, q_0, F).
Output: DFA M' = (Q', A, δ', q_0', F') such that L(M) = L(M').
Method:
1. Set Q' = {{q_0}} will be defined; state q_0' = {q_0} will be treated as unmarked. (Please note that each state of the deterministic automaton consists of a set of states of the nondeterministic automaton.)
2. If each state in Q' is marked then continue with step 4.
3. Take an unmarked state q' from Q' and perform the following operations:
(a) determine δ'(q', a) = ⋃ δ(p, a) for p ∈ q' and for all a ∈ A,
(b) add the state δ'(q', a) to Q' for all a ∈ A; a state not yet present in Q' is treated as unmarked,
(c) state q' will be marked,
(d) continue with step 2.
4. q_0' = {q_0}.
5. F' = {q' : q' ∈ Q', q' ∩ F ≠ ∅}. □
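The subset construction above can be sketched in Python as follows (names are ours; each DFA state is represented as a frozenset of NFA states, and the NFA transition mapping takes (state, symbol) pairs to sets of states):

```python
def determinize(delta, q0, finals, alphabet):
    # Subset construction: a DFA state is a frozenset of NFA states.
    start = frozenset([q0])
    dstates, unmarked, ddelta = {start}, [start], {}
    while unmarked:
        S = unmarked.pop()                      # take an unmarked state
        for a in alphabet:
            # union of the NFA successors over all members of S
            T = frozenset(p for q in S for p in delta.get((q, a), set()))
            ddelta[(S, a)] = T
            if T not in dstates:                # new state: keep it unmarked
                dstates.add(T)
                unmarked.append(T)
    dfinals = {S for S in dstates if S & finals}
    return dstates, ddelta, start, dfinals
```

Applied to the NFA with a self-loop in state 0 and transitions 0 -a-> {0, 1}, 1 -b-> {2}, the construction yields the three DFA states {0}, {0, 1} and {0, 2}.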
Note: Let us mention that all states of the resulting deterministic finite automaton M' are accessible.

Regular expressions over A are defined as usual: ∅, ε and a, a ∈ A, are regular expressions over A, and whenever x and y are regular expressions over A, then also x + y (union), x · y (concatenation) and x* (closure) are regular expressions over A.
Definition 1.46
The value h(x) of regular expression x is defined as follows:
1. h(∅) = ∅, h(ε) = {ε}, h(a) = {a},
2. h(x + y) = h(x) ∪ h(y),
h(x · y) = h(x) · h(y),
h(x*) = (h(x))*.

The value of any regular expression is a regular language, and each regular language can be represented by some regular expression. Unnecessary parentheses in regular expressions can be avoided by the convention that precedence is given to the regular operations. The closure operator has the highest precedence, and the union operator has the lowest precedence.
The following axioms are defined for regular expressions:
A1: x + (y + z) = (x + y) + z (union associativity),
A2: x · (y · z) = (x · y) · z (concatenation associativity),
A3: x + y = y + x (union commutativity),
A4: (x + y) · z = x · z + y · z (distributivity from the right),
A5: x · (y + z) = x · y + x · z (distributivity from the left),
A6: x + x = x (union idempotence),
A7: ε · x = x (ε is a unit element for the operation concatenation),
A8: ∅ · x = ∅ (∅ is a zero element for the operation concatenation),
A9: x + ∅ = x (∅ is a neutral element for the operation union),
A10: x* = ε + x* · x,
A11: x* = (ε + x)*,
A12: if x = α · x + β and ε ∉ h(α), then x = α* · β (solution of a regular equation).
2. Build up the set of start symbols:
Z = {x_i : x ∈ A, some string from h(V') may start with symbol x_i}.
3. Construct the set P of adjacent symbols:
P = {x_i y_j : symbols x_i and y_j may be adjacent in some string from h(V')}.
4. Construct the set of final symbols F in the following way:
F = {x_i : some string from h(V') may end with symbol x_i}.
5. The set of states of the finite automaton is
Q = {q_0} ∪ {x_i : x ∈ A, i ∈ <1, n>}.
6. Mapping δ will be constructed in the following way:
(a) δ(q_0, x) includes x_i for each x_i ∈ Z such that x_i was created by the numbering of x,
(b) δ(x_i, y) includes y_j for each pair x_i y_j ∈ P such that y_j was created by the numbering of y,
(c) set F is the set of final states.
2 Forward pattern matching
The basic principles of the forward pattern matching approach are discussed in this Chapter. We will discuss two approaches from the classification shown in Fig. 1.2:
1. Neither the pattern nor the text is preprocessed. We will introduce two programs implementing elementary algorithms for exact and approximate pattern matching using Hamming distance, in both cases for a single pattern.
2. The pattern is preprocessed but the text is not preprocessed.
The preprocessing of the pattern is divided into several steps. The first step is the construction of a nondeterministic finite automaton which will serve as a model of the solution of the pattern matching problem in question. Using this model as a basis for the next step, we can construct either a deterministic finite automaton or a simulator, both equivalent to the basic model. If the result of the preprocessing is a deterministic finite automaton, then the pattern matching is performed by reading the text as the input of the automaton. Both the nondeterministic finite automaton serving as a model and the equivalent deterministic finite automaton are constructed as automata that are able to read any text. An occurrence of the pattern in the text is found when the automaton reaches a final state. Finding the pattern is then reported and reading of the text continues in order to find all following occurrences of the pattern, including overlapping ones.
This approach using deterministic finite automata has one advantage: each symbol of the text is read just once. If we take the number of steps performed by the automaton as a measure of the time complexity of the forward pattern matching, then it is equal to the length of the text. On the other hand, the use of deterministic finite automata may lead to problems with space complexity. The number of states of a deterministic finite automaton may in some cases be very large in comparison with the length of the pattern. Approximate pattern matching is an example of this case. It is a limitation of this approach. A solution of this space problem is the use of simulators of nondeterministic finite automata. We will show three types of such simulators in the next Chapters:
1. use of the fail function,
2. dynamic programming, and
3. bit parallelism.
The space complexity of all of these simulators is acceptable. The time complexity is greater than for deterministic finite automata in almost all cases. It is even quadratic for dynamic programming.
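As a foretaste of the third simulation technique, here is a minimal Shift-And sketch in Python (the function name and the bit encoding are this sketch's assumptions, not notation from the text): each bit of a machine word stands for one state of the nondeterministic pattern matching automaton, so a single word operation advances all active states at once:

```python
def shift_and(text, pattern):
    # Bit i-1 of `state` set means: a prefix of length i of the pattern
    # ends at the current text position (one bit per NFA state q_1..q_m).
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    state, found = 0, []
    for pos, c in enumerate(text):
        # shift in a 1 for the initial state's self-loop, apply transitions
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & (1 << (len(pattern) - 1)):
            found.append(pos - len(pattern) + 1)   # 0-based start of match
    return found
```

The simulation reads each text symbol once and uses a constant number of word operations per symbol, which is why the space and time behaviour of bit parallelism is attractive for short patterns.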
Let us recall that text T = t_1 t_2 ... t_n and pattern P = p_1 p_2 ... p_m, and all symbols of both the text and the pattern are from alphabet A.
2.1 Elementary algorithm
The elementary algorithm compares the symbols of the pattern with the symbols of the text. The principle of this approach is shown in Fig. 2.1.

var TEXT: array[1..N] of char;
    PATTERN: array[1..M] of char;
    I,J: integer;
begin
  I:=0;
  while I <= N-M do
  begin
    J:=0;
    while (J<M) and (PATTERN[J+1]=TEXT[I+J+1]) do J:=J+1;
    if J=M then output(I+1);
    I:=I+1;  { length of shift = 1 }
  end;
end;

Figure 2.1: Elementary algorithm for exact matching of one pattern
The program presented here implements the algorithm performing the exact matching of one pattern. The meanings of the variables used in the program are represented in Fig. 2.2.

Figure 2.2: Meaning of variables in the program from Fig. 2.1

When the pattern is found, the value of variable I is the index of the position just before the first symbol of the occurrence of the pattern in the text. The comment { length of shift = 1 } means that the pattern is shifted one position to the right after each mismatch or after finding the pattern. The term shift will be used later.
We will use the number of symbol comparisons (see the expression PATTERN[J+1]=TEXT[I+J+1]) as the measure of the complexity of the algorithm. The maximum number of symbol comparisons for the elementary algorithm is

NC = (n − m + 1) · m,     (1)

where n is the length of the text and m is the length of the pattern. We assume that n ≥ m. The time complexity is O(n · m). The maximum number of comparisons NC is reached for text T = a^{n−1}b and pattern P = a^{m−1}c, where a, b, c ∈ A, c ≠ a. The elementary algorithm has no extra space requirements.
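The program of Fig. 2.1 can be transcribed into Python as follows (a sketch; the function name elementary_match is ours, and occurrences are reported 1-based as in the program):

```python
def elementary_match(text, pattern):
    # Elementary (brute-force) exact matching: try every alignment of the
    # pattern, shifting one position to the right after each attempt.
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):
        j = 0
        while j < m and pattern[j] == text[i + j]:
            j += 1
        if j == m:
            occurrences.append(i + 1)   # 1-based position, as in Fig. 2.1
    return occurrences
```

For text "ababab" and pattern "ab" the function reports the three overlapping-free occurrences at positions 1, 3 and 5.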
Experimental measurements show that the number of comparisons of the elementary algorithm for texts written in natural languages is linear with respect to the length of the text. It has been observed that a mismatch of the symbols is reached very soon (at the first or second symbol of the pattern). The number of comparisons in this case is

NC_nat = C_L · (n − m + 1),     (2)

where C_L is a constant given by the experiments for a given language L. The value of this constant for English is C_E = 1.07. Thus, the elementary algorithm has linear time complexity (O(n)) for pattern matching in natural language texts.
The elementary algorithm can be used for matching a finite set of patterns. In this case, the algorithm is used for each pattern separately. The time complexity is

O(n · Σ_{i=1}^{s} m_i),

where s is the number of patterns in the set and m_i is the length of the i-th pattern, i = 1, 2, ..., s.
The next variant of the elementary algorithm is for approximate pattern matching of one pattern using Hamming distance. It is shown in Fig. 2.3.
2.2 Pattern matching automata
In this Section, we will show basic models of pattern matching algorithms. Moreover, we will show how to construct models for more complicated problems using models of simple problems.
Notational convention:
We replace the names of states q_i, q_ij by i, ij, respectively, in the subsequent transition diagrams. The reason for this is to improve readability.
2.2.1 Exact string and sequence matching
The model of the algorithm for exact string matching (SFOECO problem) for pattern P = p_1 p_2 p_3 p_4 is shown in Fig. 2.4. The SFOECO nondeterministic finite automaton is constructed in this way:
1. Create the automaton accepting pattern P.
2. Insert the self-loop for all symbols from alphabet A in the initial state.
var TEXT: array[1..N] of char;
    PATTERN: array[1..M] of char;
    I,J,K,NERR: integer;
begin
  K:=number of errors allowed;
  I:=0;
  while I <= N-M do
  begin
    J:=0;
    NERR:=0;
    while (J<M) and (NERR<=K) do
    begin
      if PATTERN[J+1]<>TEXT[I+J+1] then NERR:=NERR+1;
      J:=J+1
    end;
    if (J=M) and (NERR<=K) then output(I+1);
    I:=I+1;  { length of shift = 1 }
  end;
end;

Figure 2.3: Elementary algorithm for approximate matching of one pattern using Hamming distance
Figure 2.4: Transition diagram of NFA for exact string matching (SFOECO automaton) for pattern P = p_1 p_2 p_3 p_4
Algorithm 2.1 describes the construction of the SFOECO automaton in detail.
Algorithm 2.1
Construction of the SFOECO automaton.
Input: Pattern P = p_1 p_2 ... p_m.
Output: SFOECO automaton M.
Method: NFA M = ({q_0, q_1, ..., q_m}, A, δ, q_0, {q_m}), where mapping δ is constructed in the following way:
1. q_{i+1} ∈ δ(q_i, p_{i+1}) for 0 ≤ i < m,
2. q_0 ∈ δ(q_0, a) for all a ∈ A. □
The SFOECO automaton has m + 1 states for a pattern of length m.
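Algorithm 2.1 can be sketched in Python: the NFA is represented by its transition mapping δ, and simulated on a text by maintaining the set of active states (function names are ours):

```python
def build_sfoeco(pattern, alphabet):
    # Algorithm 2.1: forward-matching NFA with a self-loop in state q_0.
    delta = {(0, a): {0} for a in alphabet}          # self-loop in q_0
    for i, p in enumerate(pattern):
        delta.setdefault((i, p), set()).add(i + 1)   # q_i --p_{i+1}--> q_{i+1}
    return delta, len(pattern)                       # final state q_m

def run_nfa(delta, final, text):
    # Simulate the NFA; report 1-based end positions of all occurrences,
    # including overlapping ones.
    states, hits = {0}, []
    for pos, c in enumerate(text, 1):
        states = {q for s in states for q in delta.get((s, c), set())}
        if final in states:
            hits.append(pos)
    return hits
```

Thanks to the self-loop in the initial state the automaton can read any text, and the final state becomes active exactly at the end positions of the pattern's occurrences.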
A model of the algorithm for exact sequence matching (QFOECO problem) for pattern P = p_1 p_2 p_3 p_4 is shown in Fig. 2.5.

Figure 2.5: Transition diagram of NFA for exact sequence matching (QFOECO automaton) for pattern P = p_1 p_2 p_3 p_4

The QFOECO nondeterministic finite automaton is constructed as the SFOECO automaton with the addition of some new self-loops. The new self-loops are added in all states but the initial and final ones, for all symbols with the exception of the symbol for which there is already a transition to the next state. Algorithm 2.2 describes the construction of the QFOECO automaton in detail.
Algorithm 2.2
Construction of the QFOECO automaton.
Input: Pattern P = p_1 p_2 ... p_m.
Output: QFOECO automaton M.
Method: NFA M = ({q_0, q_1, ..., q_m}, A, δ, q_0, {q_m}), where mapping δ is constructed in the following way:
1. q_i ∈ δ(q_i, a) for 0 < i < m and all a ∈ A such that a ≠ p_{i+1},
2. q_0 ∈ δ(q_0, a) for all a ∈ A,
3. q_{i+1} ∈ δ(q_i, p_{i+1}) for 0 ≤ i < m. □
The QFOECO automaton has m + 1 states for a pattern of length m.
2.2.2 Substring and subsequence matching
A model of the algorithm for exact substring matching (SSOECO problem) for pattern P = p_1 p_2 p_3 p_4 is shown in Fig. 2.6.
Notational convention:
The following nondeterministic finite automata have a regular structure. For clarity of the exposition, we will use the following terminology:
State q_ij is at depth i (a position in the pattern) and on level j.
The SSOECO nondeterministic finite automaton is constructed by composing a collection of m copies of the SFOECO automaton. The composition is done by inserting ε-transitions. These ε-transitions are inserted in the diagonal direction. The first one starts from the initial state on level zero and is directed to the next level; each subsequent ε-transition starts from the end state of the previous ε-transition. As the final step, the inaccessible states are removed. Algorithm 2.3 describes the construction of the SSOECO automaton in detail.

Figure 2.6: Transition diagram of NFA for exact substring matching (SSOECO automaton) for pattern P = p_1 p_2 p_3 p_4
Algorithm 2.3
Construction of the SSOECO automaton.
Input: Pattern P = p_1 p_2 ... p_m, SFOECO automaton M' = (Q', A, δ', q_0', F') for P.
Output: SSOECO automaton M.
Method:
1. Create a sequence of m instances of SFOECO automata for pattern P:
M_j = (Q_j, A, δ_j, q_0j, F_j) for j = 0, 1, 2, ..., m−1. Let the states in Q_j be q_0j, q_1j, ..., q_mj.
2. Construct automaton M = (Q, A, δ, q_0, F) as follows:
Q = ⋃_{j=0}^{m−1} Q_j,
δ(q, a) = δ_j(q, a) for all q ∈ Q_j, a ∈ A, j = 0, 1, 2, ..., m−1,
δ(q_00, ε) = {q_11},
δ(q_11, ε) = {q_22},
...
δ(q_{m−2,m−2}, ε) = {q_{m−1,m−1}},
q_0 = q_00,
F = Q \ {q_00, q_11, ..., q_{m−1,m−1}}.
3. Remove all states which are inaccessible from state q_0. □
The SSOECO automaton has (m+1) + m + (m−1) + ... + 2 = m(m+3)/2 states.
The SSOECO automaton can be minimized. The direct construction of the main part of the minimized version of this automaton is described by Algorithm 3.14 (construction of the factor automaton) and shown in Example 3.15. It is enough to add self-loops in the initial state for all symbols of the alphabet in order to obtain the SSOECO automaton. The advantage of the non-minimized SSOECO automaton is that a unique state corresponds to each substring of the pattern.
A model of the algorithm for exact subsequence matching (QSOECO problem) for pattern P = p_1 p_2 p_3 p_4 is shown in Fig. 2.7. The construction of the QSOECO nondeterministic finite automaton starts in the same way as for the SSOECO automaton. The final part of this construction is the addition of the ε-transitions. The diagonal ε-transitions start in all states having transitions to following states, on levels from 0 to m−1. Algorithm 2.4 describes the construction of the QSOECO automaton in detail.
Algorithm 2.4
Construction of the QSOECO automaton.
Input: Pattern P = p_1 p_2 ... p_m, QFOECO automaton M' = (Q', A, δ', q_0', F') for P.
Output: QSOECO automaton M.
Method:
1. Create a sequence of m instances of QFOECO automata for pattern P:
M_j = (Q_j, A, δ_j, q_0j, F_j) for j = 0, 1, 2, ..., m−1. Let the states in Q_j be q_0j, q_1j, ..., q_mj.
2. Construct automaton M = (Q, A, δ, q_0, F) as follows:
Q = ⋃_{j=0}^{m−1} Q_j,
δ(q, a) = δ_j(q, a) for all q ∈ Q_j, a ∈ A, j = 0, 1, 2, ..., m−1,
δ(q_ij, ε) = {q_{i+1,j+1}} for i = 0, 1, ..., m−1, j = 0, 1, 2, ..., m−1,
q_0 = q_00,
F = Q \ {q_00, q_11, ..., q_{m−1,m−1}}.
3. Remove the states which are inaccessible from state q_0. □

Figure 2.7: Transition diagram of NFA for exact subsequence matching (QSOECO automaton) for pattern P = p_1 p_2 p_3 p_4

The QSOECO automaton has (m+1) + m + (m−1) + ... + 2 = m(m+3)/2 states.
2.2.3 Approximate string matching - general alphabet
We will discuss three variants of approximate string matching corresponding to the three definitions of distances between strings over a general alphabet: Hamming distance, Levenshtein distance, and generalized Levenshtein distance.
Note:
The notion level of a state corresponds to the number of errors in the nondeterministic finite automata for approximate pattern matching.
2.2.3.1 Hamming distance
Let us recall that the Hamming distance (R-distance) between strings x and y is equal to the minimum number of editing operations replace which are necessary to convert string x into string y (see Def. 1.17). This type of string matching using R-distance is called string R-matching.
A model of the algorithm for string R-matching (SFORCO problem) for string P = p_1 p_2 p_3 p_4 is shown in Fig. 2.8. The construction of the SFORCO nondeterministic finite automaton again uses a composition of SFOECO automata, similarly to the construction of the SSOECO or QSOECO automata. The composition is done in this case by inserting diagonal transitions starting in all states having transitions to next states, on levels from 0 to k−1. The diagonal transitions are labelled by all symbols for which no transition to the next state exists. These transitions represent replace operations. Algorithm 2.5 describes the construction of the SFORCO automaton in detail.

Figure 2.8: Transition diagram of NFA for string R-matching (SFORCO automaton) for pattern P = p_1 p_2 p_3 p_4, k = 3
Algorithm 2.5
Construction of the SFORCO automaton.
Input: Pattern P = p_1 p_2 ... p_m, k, SFOECO automaton M' = (Q', A, δ', q_0', F') for P.
Output: SFORCO automaton M.
Method:
1. Create a sequence of k + 1 instances of SFOECO automata
M_j = (Q_j, A, δ_j, q_0j, F_j) for j = 0, 1, 2, ..., k. Let the states in Q_j be q_0j, q_1j, ..., q_mj.
2. Construct SFORCO automaton M = (Q, A, δ, q_0, F) as follows:
Q = ⋃_{j=0}^{k} Q_j,
δ(q, a) = δ_j(q, a) for all q ∈ Q_j, a ∈ A, j = 0, 1, 2, ..., k,
δ(q_ij, a) = {q_{i+1,j+1}} for all i = 0, 1, ..., m−1, j = 0, 1, 2, ..., k−1, a ∈ A \ {p_{i+1}},
q_0 = q_00,
F = ⋃_{j=0}^{k} F_j.
3. Remove all states which are inaccessible from state q_0. □
The SFORCO automaton has (m+1) + m + (m−1) + ... + (m−k+1) = m(k + 1) + 1 − k(k − 1)/2 states.
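Algorithm 2.5 can be sketched in Python (names are ours); states are represented as pairs (i, j) of depth i and level j, and the diagonal replace transitions lead to the next level:

```python
def build_sforco(pattern, alphabet, k):
    # States are pairs (i, j): depth i in the pattern, level j = errors so far.
    m = len(pattern)
    delta = {((0, 0), a): {(0, 0)} for a in alphabet}    # self-loop in q_00
    for j in range(k + 1):
        for i in range(m):
            # matching transition within level j
            delta.setdefault(((i, j), pattern[i]), set()).add((i + 1, j))
            if j < k:
                # replace: any other symbol, diagonal to level j + 1
                for a in alphabet:
                    if a != pattern[i]:
                        delta.setdefault(((i, j), a), set()).add((i + 1, j + 1))
    finals = {(m, j) for j in range(k + 1)}              # union of all F_j
    return delta, finals
```

Simulating this NFA on a text (by maintaining the set of active states, as for the SFOECO automaton) reports every position where the pattern occurs with at most k mismatches.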
2.2.3.2 Levenshtein distance Let us recall that the Levenshtein dis-
tance (DIR distance) between strings x and y is equal to the minimum
number of editing operations delete, insert and replace which are necessary
to convert string x into string y (see Def. 1.17). This type of string matching
using DIR-distance is called string DIR-matching. A model of the algorithm
for string DIR-matching (SFODCO problem) for the string P = p
1
p
2
p
3
p
4
is shown in Fig. 2.9. Construction of the SFODCO nondeterministic nite
automaton is performed by an extension of the SFORCO automaton. The
extension is done by the following two operations:
1. Adding -transitions parallel to the diagonal transition of the
SFORCO automaton. These represent delete operations.
2. Adding vertical transitions starting in all states as -transitions. La-
belling added vertical transitions is the same as for diagonal transi-
tions. The vertical transitions represent insert operations.
Algorithm 2.6 describes the construction of the SFODCO automaton in
detail.
Algorithm 2.6
Construction of the SFODCO automaton.
Input: Pattern P = p_1 p_2 ... p_m, k, SFORCO automaton M' = (Q, A, δ', q_0, F) for P.
Output: SFODCO automaton M for P.
Method: Let the states in Q be
q_00, q_10, q_20, ..., q_m0,
q_11, q_21, ..., q_m1,
...
q_kk, ..., q_mk.
Construct SFODCO automaton M = (Q, A, δ, q_0, F) as follows:
δ(q, a) contains all transitions of δ'(q, a); moreover,
δ(q_ij, ε) ∋ q_{i+1,j+1} for i = 0, 1, ..., m−1, j = 0, 1, ..., k−1 (delete operations),
δ(q_ij, a) ∋ q_{i,j+1} for all a ∈ A \ {p_{i+1}}, i = 0, 1, ..., m−1, j = 0, 1, ..., k−1 (insert operations). □

The construction of the SFOGCO automaton for string DIRT-matching (generalized Levenshtein distance) is analogous.
Input: Pattern P = p_1 p_2 ... p_m, k, SFODCO automaton M' = (Q', A, δ', q_0', F) for P.
Output: SFOGCO automaton M for P.
Method: Let the states in Q' be
Figure 2.10: Transition diagram of NFA for string DIRT-matching (SFOGCO automaton) for pattern P = p_1 p_2 p_3 p_4, k = 3

q_00, q_10, q_20, ..., q_m0,
q_11, q_21, ..., q_m1,
...
q_kk, ..., q_mk.
Construct SFOGCO automaton M = (Q, A, δ, q_0, F) as follows:
Q = Q' ∪ {r_ij : j = 1, 2, ..., k, i = j−1, j, ..., m−2},
δ(q, a) contains all transitions of δ'(q, a); moreover, for the transpose operations,
δ(q_ij, p_{i+2}) ∋ r_{i+1,j+1} and δ(r_{i+1,j+1}, p_{i+1}) ∋ q_{i+2,j+1} for i = 0, 1, ..., m−2, j = 0, 1, ..., k−1. □
2.2.4 Approximate string matching - ordered alphabet
2.2.4.1 Δ-distance
A model of the algorithm for string Δ-matching (SFOΔCO problem) for string P = p_1 p_2 p_3 p_4 is shown in Fig. 2.12. The construction of the SFOΔCO nondeterministic finite automaton is based on a composition of k + 1 copies of the SFOECO automaton. The composition is done by inserting diagonal transitions having different angles and starting in all non-final states of all copies but the last one. They lead to all following copies. The inserted transitions represent replace operations. Algorithm 2.9 describes the construction of the SFOΔCO automaton in detail.
Algorithm 2.9
Construction of the SFOΔCO automaton.
Figure 2.12: Transition diagram of NFA for string Δ-matching (SFOΔCO problem) for pattern P = p_1 p_2 p_3 p_4
Input: Pattern P = p_1 p_2 ... p_m, k, SFOECO automaton M' = (Q', A, δ', q_0', F') for P.
Output: SFOΔCO automaton M.
Method:
1. Create a sequence of k + 1 instances of SFOECO automata
M_j = (Q_j, A, δ_j, q_0j, F_j) for j = 0, 1, 2, ..., k. Let the states in Q_j be q_0j, q_1j, ..., q_mj.
2. Construct SFOΔCO automaton M = (Q, A, δ, q_0, F) as follows:
Q = ⋃_{j=0}^{k} Q_j,
δ(q, a) = δ_0(q, a) for all q ∈ Q_0, a ∈ A,
δ(q_ij, b) = δ_j(q_ij, p_{i+1}) for all b with Δ(b, p_{i+1}) ≤ j, i = 1, 2, ..., m−1, j = 1, 2, ..., k−1,
δ(q_ij, a) ∋ q_{i+1,j+1}, q_{i+1,j+2}, ..., q_{i+1,k} for all i = 0, 1, ..., m−1, j = 0, 1, 2, ..., k−1, a with 0 < Δ(a, p_{i+1}) ≤ k,
q_0 = q_00,
F = ⋃_{j=0}^{k} F_j.
3. Remove all states which are inaccessible from state q_0. □
The SFOΔCO automaton has m(k + 1) + 1 states.
2.2.4.2 Γ-distance
Let us note that two strings x and y have Γ-distance at most k if the symbols on equal positions have Δ-distance less than or equal to k and the sum of all these Δ-distances is less than or equal to k. The Γ-distance may be equal to the Δ-distance (see Def. 1.18). This type of string matching using Γ-distance is called string Γ-matching (see Def. 1.19).
A model of the algorithm for string Γ-matching (SFOΓCO problem) for string P = p_1 p_2 p_3 p_4 is shown in Fig. 2.13. The construction of the SFOΓCO nondeterministic finite automaton is based on a composition of k + 1 copies of the SFOECO automaton. The composition is done by insertion of diagonal transitions having different angles and starting in all non-final states of all copies but the last one. They lead to all following copies. The inserted transitions represent replace operations. Algorithm 2.10 describes the construction of the SFOΓCO automaton in detail.

Figure 2.13: Transition diagram of NFA for string Γ-matching (SFOΓCO problem) for pattern P = p_1 p_2 p_3 p_4, k = 3
Algorithm 2.10
Construction of the SFOΓCO automaton.
Input: Pattern P = p_1 p_2 ... p_m, k, SFOECO automaton M' = (Q', A, δ', q_0', F') for P.
Output: SFOΓCO automaton M.
Method:
1. Create a sequence of k + 1 instances of SFOECO automata
M_j = (Q_j, A, δ_j, q_0j, F_j) for j = 0, 1, 2, ..., k. Let the states in Q_j be q_0j, q_1j, ..., q_mj.
2. Construct SFOΓCO automaton M = (Q, A, δ, q_0, F) as follows:
Q = ⋃_{j=0}^{k} Q_j,
δ(q, a) = δ_j(q, a) for all q ∈ Q_j, a ∈ A, j = 0, 1, 2, ..., k,
δ(q_ij, a) ∋ q_{i+1, j+Δ(a, p_{i+1})} for all i = 0, 1, ..., m−1, j = 0, 1, 2, ..., k−1, a with 0 < Δ(a, p_{i+1}) ≤ k − j,
q_0 = q_00,
F = ⋃_{j=0}^{k} F_j.
3. Remove all states which are inaccessible from state q_0. □
The SFOΓCO automaton has m(k + 1) + 1 states.
2.2.4.3 (Δ, Γ)-distance Let us note that two strings x and y have
(Δ, Γ)-distance (l, k) if the symbols at equal positions have Δ-distance less
than or equal to l and the sum of these Δ-distances is less than or equal to k.
The Δ-distance l is strictly less than the Γ-distance k (see Def. 1.18). This
type of string matching using the (Δ, Γ)-distance is called string
(Δ, Γ)-matching (see Def. 1.19).
A model of the algorithm for string (Δ, Γ)-matching (the SFO(Δ, Γ)CO problem)
for the string P = p_1p_2p_3p_4 is shown in Fig. 2.14. Construction of the
Figure 2.14: Transition diagram of NFA for string (Δ, Γ)-matching (SFO(Δ, Γ)CO problem) for pattern P = p_1p_2p_3p_4, l = 2, k = 3
SFO(Δ, Γ)CO nondeterministic finite automaton is similar to the construction
of the SFOΓCO automaton. The only difference is that the number of diagonal
transitions is limited by l. Algorithm 2.11 describes the construction of the
SFO(Δ, Γ)CO automaton in detail.
Algorithm 2.11
Construction of the SFO(Δ, Γ)CO automaton.
Input: Pattern P = p_1p_2 . . . p_m, k, l, SFOECO automaton for P.
Output: SFO(Δ, Γ)CO automaton M.
Method: Let M' = (Q', A, δ', q'_0, F') be the SFOECO automaton for P.
1. Create a sequence of k + 1 instances of SFOECO automata
M_j = (Q_j, A, δ_j, q_{0j}, F_j) for j = 0, 1, 2, . . . , k. Let the states in Q_j be
q_{0j}, q_{1j}, . . . , q_{mj}.
2. Construct the SFO(Δ, Γ)CO automaton M = (Q, A, δ, q_0, F) as follows:
Q = ∪_{j=0}^{k} Q_j,
δ(q, a) = δ_j(q, a) for all q ∈ Q_j, a ∈ A, j = 0, 1, 2, . . . , k,
δ(q_{ij}, a) = {q_{i+1,j+1}, q_{i+1,j+2}, . . . , q_{i+1,j+l}} for all i = 0, 1, . . . , m - 1,
j = 0, 1, 2, . . . , k - 1, a ∈ p_{i+1}^l,
q_0 = q_{00},
F = ∪_{j=0}^{k} F_j.
3. Remove all states which are inaccessible from state q_0.
The SFO(Δ, Γ)CO automaton has less than m(k + 1) + 1 states. □
2.2.5 Approximate sequence matching
Figure 2.15: Transition diagram of NFA for sequence R-matching (QFORCO automaton)
Figure 2.16: Transition diagram of NFA for sequence DIR-matching (QFODCO automaton)
Here we discuss six variants of approximate sequence matching, in which the
following distances are used: the Hamming distance, the Levenshtein distance,
the generalized Levenshtein distance, the Δ-distance, the Γ-distance, and the
(Δ, Γ)-distance. There are two ways of constructing nondeterministic finite
automata for approximate sequence matching:
- The first way is to construct a QFO?CO nondeterministic finite automaton
by the corresponding algorithms for approximate string matching, to which we
give the QFOECO automaton as the input.
- The second way is to transform an SFO?CO automaton to a QFO?CO automaton
by adding self-loops for all symbols to all non-initial states that have at
least one outgoing transition, as shown in Algorithm 2.12.
Algorithm 2.12
Transformation of an SFO?CO automaton to a QFO?CO automaton.
Input: SFO?CO automaton M' = (Q', A, δ', q'_0, F').
Output: QFO?CO automaton M.
Method: Construct automaton M = (Q, A, δ, q_0, F) as follows:
1. Q = Q'.
2. δ(q, a) = δ'(q, a) for each q ∈ Q, a ∈ A.
3. δ(q, a) = δ(q, a) ∪ {q} for each a ∈ A and each non-initial state q ∈ Q
having at least one outgoing transition.
4. q_0 = q'_0.
5. F = F'. □
Figure 2.17: Transition diagram of NFA for sequence DIRT-matching (QFOGCO automaton)
Transition diagrams of the resulting automata are depicted in Figs. 2.15,
2.16 and 2.17 for pattern P = p_1p_2p_3p_4, k = 3.
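The transformation of Algorithm 2.12 can be sketched as follows. The dictionary-based NFA representation (a mapping from (state, symbol) pairs to sets of target states) and the function name are illustrative assumptions, not notation from the text:

```python
def add_self_loops(transitions, initial, alphabet):
    # Algorithm 2.12, step 3: add the self-loop delta(q, a) = delta(q, a) + {q}
    # for every symbol a and every non-initial state q that has at least one
    # outgoing transition.  `transitions` maps (state, symbol) -> set of states.
    sources = {q for (q, _a) in transitions}          # states with outgoing moves
    result = {key: set(targets) for key, targets in transitions.items()}
    for q in sources - {initial}:
        for a in alphabet:
            result.setdefault((q, a), set()).add(q)
    return result
```

Applied to an SFOECO automaton, this yields the corresponding QFOECO automaton: final states and states without outgoing transitions are left untouched.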
2.2.6 Matching of finite and infinite sets of patterns
A model of the algorithm for matching a finite set of patterns is constructed
as the union of NFAs for matching the individual patterns.
As an example we show the model for exact matching of the set of patterns
P = {p_1p_2p_3, p_4p_5, p_6p_7p_8} (the SFFECO problem). This is shown in
Fig. 2.18. This automaton is in some contexts called a dictionary matching
automaton. A dictionary is a finite set of strings.
The operation of union of nondeterministic finite automata is the general
approach for the whole family of matching of finite sets of patterns (the
??F??? family). Moreover, the way of matching each individual pattern must be
defined.
Figure 2.18: Transition diagram of the nondeterministic finite automaton for exact matching of the finite set of strings P = {p_1p_2p_3, p_4p_5, p_6p_7p_8} (SFFECO automaton)
The next algorithm describes this approach and assumes that the way of
matching is fixed (exact, approximate, . . . ).
Algorithm 2.13
Construction of the ??F??? automaton.
Input: A set of patterns with a specification of the way of matching
P = {P_1(w_1), P_2(w_2), . . . , P_r(w_r)}, where P_1, P_2, . . . , P_r are patterns and
w_1, w_2, . . . , w_r are specifications of the ways of matching them.
Output: The ??F??? automaton.
Method:
1. Construct an NFA for each pattern P_i, 1 ≤ i ≤ r, with respect to the
specification of matching w_i.
2. Create the NFA for the language which is the union of all input languages
of the automata constructed in step 1. The resulting automaton is the ??F???
automaton. □
Example 2.14
Let the input to Algorithm 2.13 be P = {abc(SFOECO), def(QFOECO),
xyz(SSOECO)}. The result of step 1 of Algorithm 2.13 is shown in Fig. 2.19.
The transition diagram of the final automaton, which combines the SFOECO,
QFOECO and SSOECO ways of matching, is shown in Fig. 2.20. □
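Step 2 of Algorithm 2.13 can be sketched by merging the initial states of the individual automata; this simplification is sound for the matching automata used here, whose initial states have no incoming transitions other than their own self-loops (a hypothetical representation, not the text's construction):

```python
def union_nfas(automata, merged_initial='q0'):
    # Step 2 of Algorithm 2.13: one NFA for the union of the input languages.
    # Each element of `automata` is (transitions, initial, finals) with
    # pairwise-disjoint state names; the initial states are fused into one
    # fresh state `merged_initial`, and self-loops on the old initial states
    # become self-loops on the merged state.
    transitions, finals = {}, set()
    for trans, init, fin in automata:
        rename = lambda s: merged_initial if s == init else s
        for (q, a), targets in trans.items():
            transitions.setdefault((rename(q), a), set()).update(map(rename, targets))
        finals |= set(map(rename, fin))
    return transitions, merged_initial, finals
```

For automata whose initial states do have other incoming transitions, the standard alternative is a fresh initial state connected by ε-transitions.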
The model of the algorithm for matching an infinite set of patterns is based
on a finite automaton accepting this set. The infinite set of patterns is in
this case defined by a regular expression. Let us present the exact matching
of an infinite set of strings (the SFIECO problem). The construction of the
SFIECO nondeterministic finite automaton is performed in two steps. In the
first step, a finite automaton accepting the language defined by the given
regular expression is constructed. The self-loop in its initial state for all
symbols of the alphabet is added in the second step. Algorithm 2.15 describes
the construction of the SFIECO automaton in detail.
Figure 2.19: Transition diagram of nondeterministic finite automata for individual patterns from Example 2.14
Algorithm 2.15
Construction of the SFIECO automaton.
Input: Regular expression R describing a set of strings over alphabet A.
Output: SFIECO automaton M.
Method:
1. Construct finite automaton M' = (Q, A, δ', q_0, F) such that L(M') = h(R),
where h(R) is the value of the regular expression R.
2. Construct nondeterministic finite automaton M = (Q, A, δ, q_0, F), where
δ(q_0, a) = δ'(q_0, a) ∪ {q_0} for all a ∈ A. □
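The two steps of Algorithm 2.15 can be exercised on a small example. The hand-built NFA for h(ab*c + bc) below (states 0 to 3, final state 3) is an illustrative assumption; the function adds the initial self-loop and then simulates the NFA over a text, reporting end positions of matches:

```python
def sfieco_search(transitions, initial, finals, alphabet, text):
    # Algorithm 2.15, step 2: delta(q0, a) = delta'(q0, a) + {q0} for all a in A,
    # then simulate the NFA and report every position of the text where a
    # final state is active, i.e. where some string of the infinite set ends.
    trans = {key: set(v) for key, v in transitions.items()}
    for a in alphabet:
        trans.setdefault((initial, a), set()).add(initial)
    active, ends = {initial}, []
    for i, a in enumerate(text):
        active = set().union(*(trans.get((q, a), set()) for q in active))
        if active & finals:
            ends.append(i)
    return ends

# NFA for h(ab*c + bc): 0 -a-> 1, 1 -b-> 1, 1 -c-> 3, 0 -b-> 2, 2 -c-> 3
nfa = {(0, 'a'): {1}, (1, 'b'): {1}, (1, 'c'): {3}, (0, 'b'): {2}, (2, 'c'): {3}}
```

In the text "aabbcabc" the occurrences "abbc"/"bc" end at position 4 and "abc"/"bc" end at position 7.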
Figure 2.20: Transition diagram of the resulting finite automaton for the set of patterns P from Example 2.14
Example 2.16
Let the regular expression be R = ab*c + bc over alphabet A = {a, b, c}.
The result of Algorithm 2.17 is the automaton having the transition diagram
depicted in Fig. 2.22. □
2.2.7 Pattern matching with don't care symbols
Let us recall that the don't care symbol ◦ is the symbol matching any other
symbol from alphabet A, including itself. The transition diagram of the
nondeterministic finite automaton for exact string matching with the don't
care symbol (the SFOEDO problem) for pattern P = p_1p_2◦p_4 is shown in
Fig. 2.23.
Figure 2.22: Transition diagram of the resulting QFIECO automaton from Example 2.18
Figure 2.23: Transition diagram of the nondeterministic finite automaton for exact string matching with the don't care symbol (SFOEDO automaton) for pattern P = p_1p_2◦p_4
An interesting point of this automaton is the transition from state 2 to
state 3 corresponding to the don't care symbol. This is in fact a set of
transitions for all symbols of alphabet A. The rest of the automaton is the
same as for the SFOECO automaton.
The transition diagram of the nondeterministic finite automaton for exact
sequence matching with the don't care symbol (the QFOEDO problem) for the
pattern P = p_1p_2◦p_4 is shown in Fig. 2.24.
Figure 2.24: Transition diagram of the nondeterministic finite automaton for exact sequence matching with the don't care symbol (QFOEDO automaton) for pattern P = p_1p_2◦p_4
The transition for the don't care symbol is the same as for string matching.
However, the rest of the automaton is a slightly changed QFOECO automaton:
the self-loop in state 2 is missing. The reason is that the symbol following
symbol p_2 is always taken as the third element of the given sequence, because
we do not care which symbol it is.
The construction of automata for other problems with don't care symbols uses
the principle of inserting sets of transitions for all symbols of the alphabet
at the places corresponding to the positions of the don't care symbols.
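This insertion principle can be sketched as follows; the marker `'?'` standing in for the don't care symbol and the function name are hypothetical choices for illustration:

```python
DONT_CARE = '?'   # hypothetical marker standing in for the don't care symbol

def sfoedo_nfa(pattern, alphabet):
    # SFOEDO sketch: the exact string matching NFA, except that a don't care
    # position contributes one transition per symbol of A (the "set of
    # transitions for all symbols of alphabet A" described above).
    transitions = {}
    for a in alphabet:                       # self-loop in the initial state 0
        transitions[(0, a)] = {0}
    for i, p in enumerate(pattern):
        labels = alphabet if p == DONT_CARE else {p}
        for a in labels:
            transitions.setdefault((i, a), set()).add(i + 1)
    return transitions, 0, {len(pattern)}
```

For P = a◦b over A = {a, b}, state 1 gets a transition to state 2 under every symbol, while the other positions keep single-symbol transitions.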
Example 2.19
Let pattern P = p_1p_2◦p_4 be given. We construct the SFORDO automaton
(approximate string matching of one full pattern using the Hamming distance)
for Hamming distance k = 3. The transition diagram of the SFORDO automaton
is depicted in Fig. 2.25. Let us note that the transitions labelled by A - A
refer to transitions for an empty set of symbols, and may be removed.
Figure 2.25: Transition diagram of the nondeterministic finite automaton for the SFORDO problem for pattern P = p_1p_2◦p_4 (transitions labelled by A - A may be removed)
2.2.8 Matching a sequence of patterns
Matching a sequence of patterns is defined by Definition 1.16. The
nondeterministic finite automaton for matching a sequence of patterns is
constructed by making a cascade of automata for the patterns in the given
sequence. The construction of the nondeterministic finite automaton for a
sequence of patterns starts by constructing finite automata for matching all
elements of the sequence. The next operation (making the cascade) is the
insertion of ε-transitions from all final states of each automaton in the
sequence to the initial state of the next automaton in the sequence, if it
exists. The following algorithm describes this construction.
Algorithm 2.20
Construction of the ?????S automaton.
Input: Sequence of patterns P_1(w_1), P_2(w_2), . . . , P_s(w_s), where
P_1, P_2, . . . , P_s are patterns and w_1, w_2, . . . , w_s are specifications of
their matching.
Output: ?????S automaton.
Method:
1. Construct an NFA M_i = (Q_i, A_i, δ_i, q_{0i}, F_i) for each pattern P_i(w_i),
1 ≤ i ≤ s, s > 1, with respect to the specification w_i.
2. Create automaton M = (Q, A, δ, q_0, F) as a cascade of automata in this way:
Q = ∪_{i=1}^{s} Q_i,
A = ∪_{i=1}^{s} A_i,
δ(q, a) = δ_i(q, a) for all q ∈ Q_i, a ∈ A_i, i = 1, 2, . . . , s,
δ(q, ε) = {q_{0,i+1}} for all q ∈ F_i, 1 ≤ i ≤ s - 1,
q_0 = q_{01},
F = F_s. □
The main point of Algorithm 2.20 is the insertion of ε-transitions from all
final states of automaton M_i to the initial state of automaton M_{i+1}.
Example 2.21
Let the input to Algorithm 2.20 be the sequence abc, def, xyz. The resulting
SFOECS automaton is shown in Fig. 2.26. □
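The cascade of step 2 can be sketched directly; the explicit `eps` map for ε-transitions is a representation chosen for illustration:

```python
def cascade(automata):
    # Algorithm 2.20, step 2: chain the automata by epsilon-transitions from
    # the final states of each automaton to the initial state of the next one.
    # Returns (transitions, eps, initial, finals); `eps` maps a state to the
    # set of its epsilon-successors.  State sets are assumed to be disjoint.
    transitions, eps = {}, {}
    for trans, _init, _fin in automata:
        transitions.update(trans)
    for (_t1, _i1, fin), (_t2, nxt_init, _f2) in zip(automata, automata[1:]):
        for q in fin:
            eps.setdefault(q, set()).add(nxt_init)
    return transitions, eps, automata[0][1], automata[-1][2]
```

Only the final states of the last automaton remain final, exactly as F = F_s prescribes.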
2.3 Some deterministic pattern matching automata
In this Section we will show some deterministic finite automata obtained by
determinising the nondeterministic finite automata of the previous Section.
A deterministic pattern matching automaton needs at most n steps for pattern
matching in the text T = t_1t_2 . . . t_n. This means that the time complexity
of searching is linear (O(n)) for all problems in the classification described
in Section 1.2.
Figure 2.26: Transition diagram of the nondeterministic finite automaton for matching the sequence of patterns P = abc(SFOECO), def(SFOECO), xyz(SFOECO) (SFOECS automaton)
On the other hand, the use of deterministic nite automata has two
drawbacks:
1. The size of the pattern matching automaton depends on the cardinality
of the alphabet. Therefore this approach is suitable primarily for small
alphabets.
2. The number of states of a deterministic pattern matching automaton
can be much greater than the number of states of its nondeterministic
equivalent.
These drawbacks, the time and space complexity of the construction of a
deterministic finite automaton, are the price we have to pay for fast
searching. Several methods for simulating the original nondeterministic
pattern matching automata have been designed to overcome these drawbacks.
They will be discussed in the following Chapters.
2.3.1 String matching
The deterministic SFOECO finite automaton for the pattern P = p_1p_2 . . . p_m
is the result of determinisation of the nondeterministic SFOECO automaton.
Example 2.22
Let us have pattern P = abab over alphabet A = {a, b}. A transition diagram
of the SFOECO(abab) automaton and its deterministic equivalent are depicted
in Fig. 2.27. Transition tables of both automata are shown in Table 2.1. The
deterministic SFOECO automaton is a complete automaton and has just m + 1
states for a pattern of length m. The number of steps (number of transitions)
made during pattern matching in a text of length n is just n. □
Figure 2.27: Transition diagrams of the nondeterministic and deterministic SFOECO automata for pattern P = abab from Example 2.22
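The determinisation in Example 2.22 can be reproduced with a standard subset construction (a sketch, not Algorithm 1.40 itself); the five subset states 0, 01, 02, 013, 024 of Table 2.1 b) come out directly:

```python
def determinise(transitions, initial, alphabet):
    # Subset construction: DFA states are sets of NFA states, exactly as in
    # the composed state names 0, 01, 02, 013, 024 of Table 2.1.
    start = frozenset({initial})
    dfa, seen, todo = {}, {start}, [start]
    while todo:
        state = todo.pop()
        for a in alphabet:
            target = frozenset().union(*(transitions.get((q, a), set()) for q in state))
            dfa[(state, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return dfa, start, seen

# Nondeterministic SFOECO(abab) automaton from Table 2.1 a)
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}, (2, 'a'): {3}, (3, 'b'): {4}}
dfa, start, states = determinise(nfa, 0, {'a', 'b'})
```

The run confirms the m + 1 = 5 states claimed above for m = 4.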
2.3.2 Matching of a finite set of patterns
The deterministic SFFECO automaton for matching a set of patterns
S = {P_1, P_2, . . . , P_s} is the result of the determinisation of the
nondeterministic SFFECO automaton.
Example 2.23
Let us have the set of patterns S = {ab, bb, babb} over alphabet A = {a, b}.
The transition diagram of the SFFECO(S) automaton and its deterministic
equivalent are depicted in Fig. 2.28. Transition tables of both automata are
shown in Table 2.2. □
The deterministic pattern matching automaton for a finite set of patterns
has at most |S| + 1 states, where |S| = Σ_{i=1}^{s} |P_i| for
S = {P_1, P_2, . . . , P_s}. The maximum number of states (equal to |S| + 1) is
reached in the case when no two patterns have a common prefix. In Example 2.23
it holds that |S| + 1 = 9, and patterns bb and babb have the common prefix b.
Therefore the number of states is 8, which is less than 9.
a) Nondeterministic SFOECO(abab) automaton
        a     b
0       0,1   0
1             2
2       3
3             4
4

b) Deterministic SFOECO(abab) automaton
        a     b
0       01    0
01      01    02
02      013   0
013     01    024
024     013   0

Table 2.1: Transition tables of SFOECO(abab) automata from Example 2.22

a) Nondeterministic SFFECO(ab, bb, babb) automaton
        a     b
0       0,1   0,3,7
1             2
2
3       4
4             5
5             6
6
7             8
8

b) Deterministic SFFECO(ab, bb, babb) automaton
        a     b
0       01    037
01      01    0237
037     014   0378
0237    014   0378
014     01    02357
0378    014   0378
02357   014   03678
03678   014   0378

Table 2.2: Transition tables of SFFECO(ab, bb, babb) automata from Example 2.23
Figure 2.28: Transition diagrams of the nondeterministic and deterministic SFFECO automata for S = {ab, bb, babb} from Example 2.23
2.3.3 Regular expression matching
The deterministic finite automaton for matching an infinite set of patterns
is the result of the determinisation of the SFIECO automaton.
Example 2.24
Let us have regular expression R = ab*c + bc over alphabet A = {a, b, c}
(see also Example 2.16). The transition diagram of the SFIECO(R) automaton
and its deterministic equivalent are depicted in Fig. 2.29. The transition
table of the deterministic automaton is shown in Table 2.3. □
Figure 2.29: Transition diagrams of the nondeterministic and deterministic SFIECO automata for R = ab*c + bc

Deterministic SFIECO(ab*c + bc) automaton
        a     b     c
0       01    02    0
01      01    012   03
02      01    02    03
012     01    012   03
03      01    02    0

Table 2.3: Transition table of the deterministic SFIECO(ab*c + bc) automaton
Figure 2.32: Transition diagrams of the nondeterministic and deterministic SFODCO(aba, 1) automata from Example 2.27
Definition 2.28
Strings v and w are denoted by uw^{-1} and v^{-1}u, respectively, when u = vw.
The following Theorem 2.29, adopted from [CH97b], is necessary for the
construction of dictionary matching automata.
Theorem 2.29
Let U ⊆ A*. Then:
1. for each v ∈ A*: v ∈ A*U if and only if h_U(v) ∈ A*U,
2. h_U(ε) = ε,
3. for each v ∈ A*, a ∈ A: h_U(va) = h_U(h_U(v)a).
a) Nondeterministic SFODCO(aba, 1) automaton
        a     b     ε
0       0,1   0,4   4
1       4,5   2,4   5
2       3,5   5,6   6
3
4             5
5       6
6

b) Deterministic SFODCO(aba, 1) automaton
        a       b
0       01      045
01      01456   0245
045     016     045
01456   01456   0245
0245    01356   0456
01356   01456   0245
0456    016     045
016     01456   0245

Table 2.6: Transition tables of SFODCO(aba, 1) automata from Example 2.27
Proof
If v ∈ A*U, then some u ∈ U is a suffix of v. By the definition of h_U, u is
necessarily a suffix of h_U(v); therefore h_U(v) ∈ A*U. Conversely, if
h_U(v) ∈ A*U, we have also v ∈ A*U, because h_U(v) is a suffix of v. This
proves (1).
Property (2) clearly holds.
It remains to prove (3). Both words h_U(va) and h_U(v)a are suffixes of va,
and therefore one of them is a suffix of the other. Two cases are
distinguished according to which word is a suffix of the other.
First case: h_U(v)a is a proper suffix of h_U(va) (hence h_U(va) ≠ ε).
Consider the word w defined by w = h_U(va)a^{-1}. Thus we have: h_U(v) is a
proper suffix of w, w is a suffix of v, and w ∈ Pref(U). Since w is a suffix
of v that belongs to Pref(U) but is strictly longer than h_U(v), there is a
contradiction with the maximality of |h_U(v)|, so this case is impossible.
Second case: h_U(va) is a suffix of h_U(v)a. Then h_U(va) is a suffix of
h_U(h_U(v)a). Since h_U(v)a is a suffix of va, h_U(h_U(v)a) is a suffix of
h_U(va). Both properties imply h_U(va) = h_U(h_U(v)a), and the expected
result follows. □
Now the dictionary matching automaton can be constructed according to
Theorem 2.30, borrowed from [CH97b].
Theorem 2.30
Let X be a finite language. Then the automaton M = (Q, A, δ, q_0, F), where
Q = {q_x | x ∈ Pref(X)},
q_0 = q_ε,
δ(q_p, a) = q_{h_X(pa)} for p ∈ Pref(X), a ∈ A,
F = {q_x | x ∈ Pref(X) ∩ A*X},
accepts the language A*X. This automaton is deterministic and complete.
Proof
Let v ∈ A*X. Then by (1) of Theorem 2.29 it must hold h_X(v) ∈ A*X, so the
state reached after reading v is final by the definition of the automaton.
Conversely, if the state reached after reading v is final, then
h_X(v) ∈ A*X, and this implies that v ∈ A*X from (1) of Theorem 2.29 again. □
Transition diagram of an example of a dictionary matching automaton is shown
in Figure 2.33.
Figure 2.33: Transition diagram of the dictionary matching automaton for language {aba, aab, bab}
2.4.1 Construction of a dictionary matching automaton
The first method of dictionary matching automata construction follows
directly from Theorem 2.30. But, as will be shown in this section, dictionary
matching automata can be built using standard algorithms. This method is
described in Algorithm 2.31.
Algorithm 2.31
Construction of a dictionary matching automaton for a given finite language.
Input: Finite language X.
Output: Dictionary matching automaton accepting language A*X.
Method:
1. Create a tree-like finite automaton M accepting language X (using
Algorithm 2.32).
2. Add a self-loop δ(q_0, a) = δ(q_0, a) ∪ {q_0} for each a ∈ A to M.
3. Using the subset construction (see Algorithm 1.40), make a deterministic
dictionary matching automaton accepting language A*X. □
Algorithm 2.32
Construction of a deterministic automaton accepting a finite set of strings.
Input: Finite language X.
Output: Deterministic finite automaton M = (Q, A, δ, q_0, F) accepting
language X.
Method:
1. Q = {q_x | x ∈ Pref(X)},
2. q_0 = q_ε,
3. δ(q_p, a) = q_{pa} in case pa ∈ Pref(X), and is undefined otherwise,
4. F = {q_p | p ∈ X}. □
Example 2.33
Let us create a dictionary matching automaton for language {aba, aab, bab}
to illustrate this algorithm. The outcome of step (2) of Algorithm 2.31 is
shown in Figure 2.34, and the result of the whole algorithm is shown in
Figure 2.35. □
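Algorithm 2.31 can be sketched end to end; the prefix-string state names mirror Algorithm 2.32, where states are exactly the prefixes of X:

```python
def dictionary_automaton(words, alphabet):
    # Algorithm 2.31 sketch: tree-like automaton for X (Algorithm 2.32, states
    # are the prefixes of X), a self-loop in the initial state, and finally
    # the subset construction.
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    trans = {(p, a): ({p + a} if p + a in prefixes else set())
             for p in prefixes for a in alphabet}
    for a in alphabet:                       # step 2: self-loop in state epsilon
        trans[('', a)].add('')
    start = frozenset({''})                  # step 3: subset construction
    dfa, seen, todo = {}, {start}, [start]
    while todo:
        state = todo.pop()
        for a in alphabet:
            target = frozenset().union(*(trans[(q, a)] for q in state))
            dfa[(state, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    finals = {s for s in seen if s & set(words)}
    return dfa, start, finals, seen
```

For X = {aba, aab, bab} the language has 9 prefixes, and, as Theorem 2.34 below asserts, the determinisation produces no more states than that.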
What is the result of this algorithm? As shown in Theorem 2.34, automata
created according to Theorem 2.30 are equivalent to automata created
according to Algorithm 2.31.
Theorem 2.34
Given a finite language X, the finite automaton M_1 = (Q_1, A, δ_1, q_ε, F_1)
accepting language A*X created by Algorithm 2.31 is equivalent to the finite
automaton M_2 = (Q_2, A, δ_2, q_ε, F_2) accepting the same language but
created by Theorem 2.30.
Proof
The first step is to show that after reading a string w automaton M_1 will be
in state q = δ_1(q_ε, w) = {q_x | x ∈ Suf(w) ∩ Pref(X)}. Let us remind the
reader that the deterministic automaton M_1 was created from the
nondeterministic automaton M'_1 = (Q'_1, A, δ'_1, q_ε, F'_1) by the subset
construction. As follows from the subset construction algorithm, after
reading string w automaton M_1 will be in state
q = δ_1(q_ε, w) = δ'_1(q_ε, w), so it remains to show that
δ'_1(q_ε, w) = {q_x | x ∈ Suf(w) ∩ Pref(X)}. Given a string
v ∈ Suf(w) ∩ Pref(X), w can be written as uv, u ∈ A*, and
(q_ε, uv) ⊢_{M'_1}^{|u|} (q_ε, v) ⊢_{M'_1}^{|v|} (q_v, ε),
so q_v ∈ δ'_1(q_ε, w). Conversely, consider q_v ∈ δ'_1(q_ε, w). Since M'_1
can read arbitrarily long words only by using the self-loop for all a ∈ A in
the initial state, the sequence of moves must be as follows:
(q_ε, w) ⊢_{M'_1}^{*} (q_ε, v) ⊢_{M'_1}^{|v|} (q_v, ε).
Consequently v ∈ Suf(w) ∩ Pref(X).
Figure 2.34: Transition diagram of the nondeterministic dictionary matching automaton for the language {aba, aab, bab}
Since the set of states Q_1 ⊆ 𝒫({q_x | x ∈ Pref(X)}), it is possible to
define an isomorphism f|_{Q_1} : 𝒫({q_x | x ∈ Pref(X)}) → {q_x | x ∈ Pref(X)}
as follows:
f(q) = q_w, where q_w ∈ q and |w| is maximal.
Now it is necessary to show that
1. ∀q_1, q_2 ∈ Q_1 : q_1 ≠ q_2 implies f(q_1) ≠ f(q_2),
2. ∀p ∈ Q_2 there exists q ∈ Q_1 such that f(q) = p,
3. f(q_0^1) = q_0^2,
4. f(δ_1(q, a)) = δ_2(f(q), a),
5. f(F_1) = F_2.
Figure 2.35: Transition diagram of the deterministic dictionary matching automaton for language {aba, aab, bab} created from the nondeterministic one by the subset construction
Let us suppose q = δ_1(q_ε, u) for some u ∈ Pref(X). Then
q = {q_x | x ∈ Suf(u) ∩ Pref(X)} and thus f(δ_1(q_ε, u)) = q_u, from which
properties (1) and (2) follow.
Property (3) clearly holds.
For property (4) it will be shown that f(δ_1(q_ε, u)) = δ_2(q_ε, u): we have
f(δ_1(q_ε, u)) = f({q_x | x ∈ Suf(u) ∩ Pref(X)}) = q_{h_X(u)} and
δ_2(q_ε, u) = q_{h_X(u)}. Thus the proposition holds from the definitions of
f and h_X.
Property (5) clearly holds from the previous properties, and thus both
automata accept the same language. □
The main consequence of the previous Theorem is that during the
transformation of a nondeterministic tree-like automaton with the self-loop
for all a ∈ A in the initial state to the deterministic one, the number of
states does not increase.
But it is possible to show more. It is easy to see that a deterministic
dictionary matching automaton accepting language A*X created by the last two
steps of Algorithm 2.31 contains at most the same number of states as the
finite automaton M_2 = (Q_2, A, δ_2, q_0^2, F_2) accepting the same language
and created by Algorithm 2.31 from scratch.
Proof
Let us remind that the deterministic automaton M_1 was created from the
nondeterministic automaton M' = (Q', A, δ', q_0, F') by the subset
construction. The set of active states of automaton M' after reading u ∈ A*
is δ'(q_0, u), which is equal to δ'(q_0, h_X(u)). Let us denote the active
state of automaton M_1 after reading u by q. The subset construction ensures
that δ'(q_0, u) = q. So, in the worst case, for all q ∈ Q_1 it holds
q = δ'(q_0, u), u ∈ Pref(X), which completes the proof. □
2.4.2 Approximate string matching
In order to prove the upper bound of the state complexity of deterministic
finite automata for approximate string matching, it is necessary to limit the
number of states of the dictionary matching automata accepting language A*X
by 1 + Σ_{w∈X} |w| (Theorem 2.36).
Proof
Because of Theorem 2.35, the number of states of the deterministic dictionary
matching automaton created in this way is at most the same as the number of
states of a tree-like finite automaton accepting X, whose number of states is
in the worst case equal to 1 + Σ_{w∈X} |w|. □
2.4.2.1 Hamming distance At first, it is necessary to define the finite
automaton for approximate string matching using the Hamming distance. It is
the Hamming automaton M(A*H_k(p)) accepting language A*H_k(p), where
H_k(p) = {u | u ∈ A*, D_H(u, p) ≤ k} for the given pattern p ∈ A* and the
number of allowed errors k ≥ 1.
Since H_k(p) is a finite language, it is possible to estimate the number of
states of the deterministic automaton M(A*H_k(p)) using Theorem 2.36. The
only concern is to compute the size of the language H_k(p).
Theorem 2.37
The number of strings generated by at most k replace operations from the
pattern p = p_1p_2 . . . p_m is O(|A|^k m^k). □
Proof
The set of strings created by exactly i (0 ≤ i ≤ k) replace operations is
made by replacing exactly i symbols of p by other symbols. There are C(m, i)
possibilities for choosing i symbols from m. Each chosen symbol can be
replaced by |A| - 1 symbols, so the number of generated strings is at most
C(m, i)(|A| - 1)^i = (|A| - 1)^i O(m^i) = O(|A|^i m^i),
because C(m, i) = O(m^i). The set of strings created by at most k replace
operations is the union of the abovementioned sets of strings. Thus, the
cardinality of this set is
Σ_{i=0}^{k} O(|A|^i m^i) = O(|A|^k m^k). □
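The count in this proof is exact, not just asymptotic, and can be checked directly against brute-force enumeration (the function names are illustrative):

```python
from math import comb
from itertools import product

def hamming_ball_size(m, alphabet_size, k):
    # Counts the strings within Hamming distance k of an m-symbol pattern,
    # directly from the proof of Theorem 2.37: choose i of the m positions
    # and replace each chosen symbol by one of the |A| - 1 other symbols.
    return sum(comb(m, i) * (alphabet_size - 1) ** i for i in range(k + 1))

def hamming_ball_brute(pattern, alphabet, k):
    # Brute-force check over all strings of length m (small cases only).
    m = len(pattern)
    return sum(1 for u in product(alphabet, repeat=m)
               if sum(a != b for a, b in zip(u, pattern)) <= k)
```

For p = abab over {a, b} with k = 1 both give 5 strings, and C(m, i) = O(m^i) then yields the O(|A|^k m^k) bound of the theorem.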
Since the number of strings generated by the replace operation is now known,
it is possible to estimate the number of states of the deterministic Hamming
automaton.
Theorem 2.38
The number of states of the deterministic finite automaton M(A*H_k(p)),
p = p_1p_2 . . . p_m, is O(|A|^k m^{k+1}).
Proof
As shown in Theorem 2.36, the number of states of this automaton is at most
the same as the size of the language H_k(p). As for all u ∈ H_k(p) it holds
|u| = m, the size of the language H_k(p) is
O(Σ_{u∈H_k(p)} |u|) = O(m|A|^k m^k) = O(|A|^k m^{k+1}). □
2.4.2.2 Levenshtein distance The same approach as in Section 2.4.2.1 can
be used to bound the number of states of a deterministic automaton for
approximate string matching using the Levenshtein distance. It is the
Levenshtein automaton M(A*L_k(p)) accepting language A*L_k(p), where
L_k(p) = {u | u ∈ A*, D_L(u, p) ≤ k} for the given pattern p ∈ A*.
Theorem 2.39
The number of strings generated by at most k insert operations from the
pattern p = p_1p_2 . . . p_m is O(|A|^k m^k).
Proof
The number of ways of inserting i symbols into p is bounded by c_{m,i}, and
the binomial coefficient C(x + z, z) can be written as
C(x + z, z) = (x/1 + 1)(x/2 + 1) · · · (x/(z - 1) + 1)(x/z + 1).
In case that x ≥ 2, all fractions but the first one are smaller than (or
equal to) x. Thus
C(x + z, z) ≤ (x + 1)x^{z-1} = x^z + x^{z-1} = O(x^z).
As the number of allowed errors is smaller than the length of the pattern,
i < m,
c_{m,i} = C(m + i, min(m, i)) = C(m + i, i) = O(m^i).
Thus, the number of strings generated by exactly i insert operations is
O(|A|^i m^i). Since the set of strings generated by at most k insert
operations is the union of the abovementioned sets of strings, the
cardinality of this set is
Σ_{i=0}^{k} O(|A|^i m^i) = O(|A|^k m^k). □
Theorem 2.40
The number of strings generated by at most k delete operations from pattern
p = p_1p_2 . . . p_m is O(m^k).
Proof
Sets of strings generated by exactly i delete operations consist of strings
that are made by deleting exactly i symbols from p. There are C(m, i)
possibilities for choosing i symbols from m, so the number of such strings is
at most C(m, i) = O(m^i). Since the set of strings generated by at most k
delete operations is the union of the abovementioned sets, the number of
strings within this set is
Σ_{i=0}^{k} O(m^i) = O(m^k). □
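The C(m, i) choices of deleted positions can be enumerated to see the bound concretely; note that distinct resulting strings can be fewer than the number of position choices, which is why the proof states only an upper bound (the function name is illustrative):

```python
from itertools import combinations

def deletion_ball(pattern, k):
    # Enumerates the distinct strings obtainable from `pattern` by at most k
    # delete operations; Theorem 2.40 bounds their number by the C(m, i)
    # choices of deleted positions, i.e. O(m^k).
    m, out = len(pattern), set()
    for i in range(min(k, m) + 1):
        for kept in combinations(range(m), m - i):
            out.add(''.join(pattern[j] for j in kept))
    return out
```

For p = aba and k = 1 the three single deletions give only three distinct strings (ab, aa, ba), fewer than would be counted if all position choices produced different results.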
Now the number of strings generated by each edit operation is known, so it is
possible to estimate the number of strings generated by all these operations
at once.
Theorem 2.41
The number of strings generated by at most k replace, insert, and delete
operations from pattern p = p_1p_2 . . . p_m is O(|A|^k m^k).
Proof
The number of strings generated by at most k replace, insert, and delete
operations can be computed as a sum of the numbers of strings generated by
these operations for exactly i errors allowed, for 0 ≤ i ≤ k. Such strings
are generated by a combination of the abovementioned operations, so the
number of generated strings is
Σ_{x=0}^{k} Σ_{y=0}^{k-x} Σ_{z=0}^{k-x-y} O(|A|^x m^x) · O(|A|^y m^y) · O(m^z)
(replace x symbols, insert y symbols, delete z symbols)
= Σ_{x=0}^{k} Σ_{y=0}^{k-x} Σ_{z=0}^{k-x-y} O(|A|^{x+y} m^{x+y+z}) = O(|A|^k m^k). □
The last step is to bound the number of states of the deterministic
dictionary matching automaton.
Theorem 2.42
The number of states of the deterministic finite automaton M(A*L_k(p)),
p = p_1p_2 . . . p_m, is O(|A|^k m^{k+1}).
Proof
The proof is the same as for Theorem 2.38. Since for all u ∈ L_k(p) it holds
|u| ≤ m + k, the size of the language L_k(p) is
O(Σ_{u∈L_k(p)} |u|) = O((m + k)|A|^k m^k) = O(|A|^k m^{k+1}). □
2.4.2.3 Generalized Levenshtein distance It is clear that the concepts
from Sections 2.4.2.1 and 2.4.2.2 will be used also for the generalized
Levenshtein distance.
The finite automaton for approximate string matching using the generalized
Levenshtein distance M(A*G_k(p)) will be called the finite automaton
accepting language A*G_k(p), where G_k(p) = {u | u ∈ A*, D_G(u, p) ≤ k} for
the given pattern p ∈ A*.
The number of strings generated by at most k transpose operations from the
pattern is O(m^k): exactly i transpose operations choose i pairs of adjacent
symbols of p, which gives O(m^i) strings, and the union over
i = 0, 1, . . . , k has cardinality
Σ_{i=0}^{k} O(m^i) = O(m^k). □
Now, the number of strings generated by all operations defined by the
generalized Levenshtein distance is to be found.
Theorem 2.44
The number of strings generated by at most k replace, insert, delete, and
transpose operations from the pattern p = p_1p_2 . . . p_m is O(|A|^k m^k).
Proof
The number of strings generated by at most k replace, insert, delete, and
transpose operations can be computed as a sum of the numbers of strings
generated by these operations for exactly i errors allowed, for 0 ≤ i ≤ k.
Such strings are generated by a combination of the abovementioned operations,
so the number of generated strings is
Σ_{w=0}^{k} Σ_{x=0}^{k-w} Σ_{y=0}^{k-w-x} Σ_{z=0}^{k-w-x-y} O(|A|^w m^w) · O(|A|^x m^x) · O(m^y) · O(m^z)
(replace w symbols, insert x symbols, delete y symbols, transpose z pairs)
= Σ_{w=0}^{k} Σ_{x=0}^{k-w} Σ_{y=0}^{k-w-x} Σ_{z=0}^{k-w-x-y} O(|A|^{w+x} m^{w+x+y+z}) = O(|A|^k m^k). □
Finally, the number of states of the deterministic dictionary matching
automaton will be found.
Theorem 2.45
The number of states of the deterministic finite automaton M(A*G_k(p)),
p = p_1p_2 . . . p_m, is O(|A|^k m^{k+1}).
Proof
The proof is the same as for Theorem 2.38. Since for all u ∈ G_k(p) it holds
|u| ≤ m + k, the size of the language G_k(p) is
O(Σ_{u∈G_k(p)} |u|) = O((m + k)|A|^k m^k) = O(|A|^k m^{k+1}). □
2.4.2.4 Γ-distance The finite automaton for approximate string matching
using the Γ-distance M(A*Γ_k(p)) will be called the finite automaton
accepting language A*Γ_k(p), where A is an ordered alphabet and
Γ_k(p) = {u | u ∈ A*, D_Γ(u, p) ≤ k} for the given pattern p ∈ A*.
Theorem 2.46
For all i ≥ 1, j ≥ 1: (j - 1)^{i+1} + (j - 1)^i + j^i ≤ j^{i+1}.
Proof
It will be shown by induction on i. It is satisfied for i = 1:
(j - 1)^2 + (j - 1) + j = j^2 - 2j + 1 + j - 1 + j = j^2.
Consider the assumption is satisfied for i. Then for i + 1, as j ≥ 1,
(j - 1)^{i+2} + (j - 1)^{i+1} + j^{i+1} ≤ j(j - 1)^{i+1} + j(j - 1)^i + j · j^i =
= j((j - 1)^{i+1} + (j - 1)^i + j^i) ≤ j · j^{i+1} = j^{i+2}
from the induction assumption, which completes the proof. □
Theorem 2.47
For all i ≥ 3, j ≥ 2: Σ_{x=0}^{i} (j - 1)^x + Σ_{x=0}^{i-1} (j - 1)^x ≤ j^i.
Proof
It will be shown by induction on i. It is satisfied for i = 3:
Σ_{x=0}^{3} (j - 1)^x + Σ_{x=0}^{2} (j - 1)^x = 2(j - 1)^0 + 2(j - 1)^1 + 2(j - 1)^2 + (j - 1)^3 =
= 2 + 2j - 2 + 2j^2 - 4j + 2 + j^3 - 3j^2 + 3j - 1 = j^3 - j^2 + j + 1 ≤ j^3,
since j ≥ 2.
Consider the assumption is satisfied for i ≥ 3. Then for i + 1:
Σ_{x=0}^{i+1} (j - 1)^x + Σ_{x=0}^{i} (j - 1)^x =
= (j - 1)^{i+1} + (j - 1)^i + Σ_{x=0}^{i} (j - 1)^x + Σ_{x=0}^{i-1} (j - 1)^x ≤
≤ (j - 1)^{i+1} + (j - 1)^i + j^i ≤ j^{i+1},
where the last two inequalities follow from the induction assumption and from
Theorem 2.46. This completes the proof. □
Theorem 2.48
The number of strings generated from the pattern p = p_1p_2 . . . p_m with at
most k allowed errors in Γ-distance is O(m^k).
Proof
The number of strings generated from the pattern p for exactly i errors can
be computed as the number of different paths in the transition diagram of the
automaton M(Γ_k(p)), which is the same as the transition diagram of the
automaton M(A*Γ_k(p)) (shown in Figure REF) without the self-loop for the
whole alphabet in the initial state q_{0,0}, δ(q_{0,0}, a), a ∈ A. The number
of these paths can be computed by the following recurrent formula:
c_{i,j} =
  1                                        if i = 0, j = 0,
  0                                        if i > 0, j = 0,
  c_{i,j-1} + 2 Σ_{x=0}^{i-1} c_{x,j-1}    otherwise.
Let us show in several steps that c_{i,j} ≤ 2j^i.
i = 0, j > 0: c_{0,j} = c_{0,j-1} = c_{0,j-2} = . . . = 1 ≤ 2j^0.
i = 1, j > 0: By induction on j. It is satisfied for j = 1 because
c_{1,1} = 2 ≤ 2 · 1^1. Consider the assumption holds for j > 0. Then for j + 1:
c_{1,j+1} = c_{1,j} + 2c_{0,j} ≤ 2j + 2 = 2(j + 1) ≤ 2(j + 1)^1,
from the induction assumption and the fact that c_{0,j} = 1.
i = 2, j > 0: By induction on j. It is satisfied for j = 1 because
c_{2,1} = 2 ≤ 2 · 1^2. Consider the assumption holds for j ≥ 1. Then for j + 1:
c_{2,j+1} = c_{2,j} + 2c_{1,j} + 2c_{0,j} ≤ 2j^2 + 2 · 2j + 2 · 1 = 2(j^2 + 2j + 1) = 2(j + 1)^2,
from the induction assumption.
i ≥ 3, j = 1: c_{i,1} = c_{i,0} + 2c_{i-1,0} + 2c_{i-2,0} + . . . + 2c_{0,0} = 2.
i ≥ 3, j ≥ 2: By induction on j. It was shown in the previous step that the
assumption holds for j - 1 ≥ 1. Then for j:
c_{i,j} = c_{i,j-1} + 2 Σ_{x=0}^{i-1} c_{x,j-1} ≤
≤ 2(j - 1)^i + 2 Σ_{x=0}^{i-1} 2(j - 1)^x = 2(Σ_{x=0}^{i} (j - 1)^x + Σ_{x=0}^{i-1} (j - 1)^x) ≤ 2j^i,
from the induction assumption and Theorem 2.47.
Thus the number of strings generated from a pattern of length m with exactly
i allowed errors in Γ-distance is O(m^i). Since the set of strings generated
with at most k allowed errors is the union of the abovementioned sets of
strings, its cardinality is
Σ_{i=0}^{k} O(m^i) = O(m^k). □
Finally, the number of states of the deterministic dictionary matching
automaton will be found.
Theorem 2.49
The number of states of the deterministic finite automaton M(A*Γ_k(p)),
p = p_1p_2 . . . p_m, is O(m^{k+1}).
Proof
The proof is the same as for Theorem 2.38. Since for all u ∈ Γ_k(p) it holds
|u| = m, the size of the language Γ_k(p) is
O(Σ_{u∈Γ_k(p)} |u|) = O(m · m^k) = O(m^{k+1}). □
2.5 (Δ, Γ) distance
The finite automaton for approximate string matching using the
(Δ, Γ)-distance M(A*(Δ_lΓ_k)(p)) will be called the finite automaton
accepting language A*(Δ_lΓ_k)(p), where A is an ordered alphabet and
(Δ_lΓ_k)(p) = {u | u ∈ A*, D_Δ(u, p) ≤ l, D_Γ(u, p) ≤ k} for the given
pattern p ∈ A*.
Let us start with the special case l = 1. In order to do that, it is
necessary to prove one auxiliary lemma.
Theorem 2.50
For all i ≥ 2, j ≥ 1: (j - 1)^i + 2(j - 1)^{i-1} ≤ j^i.
Proof
It will be shown by induction on i. It holds for i = 2:
(j - 1)^2 + 2(j - 1) = j^2 - 2j + 1 + 2j - 2 = j^2 - 1 ≤ j^2.
Consider the assumption is fulfilled for i ≥ 2. Then for i + 1:
(j - 1)^{i+1} + 2(j - 1)^i = (j - 1)((j - 1)^i + 2(j - 1)^{i-1}) ≤
≤ (j - 1)j^i = j^{i+1} - j^i ≤ j^{i+1}
from the induction assumption, which completes the proof. □
Theorem 2.51
The number of strings generated from a pattern p of length m with at most
k allowed errors in Γ distance and at most 1 allowed error in Δ distance
is O(m^k).
Proof
The number of strings generated from the pattern p of length m for exactly
k errors can be computed as the number of different paths in the transition
diagram of automaton M((Γ_k, Δ_1)(p)) from the initial state q_{0,0} to the
final state q_{k,m}, where the transition diagram of the automaton
M((Γ_k, Δ_1)(p)) is the same as the transition diagram of the automaton
M(A*(Γ_k, Δ_1)(p)) (shown in Figure REF) without the self-loop for the whole
alphabet in the initial state q_{0,0} (δ(q_{0,0}, a), a ∈ A). The number of
these paths can be computed by the following recurrent formula:
c_{i,j} = 1                           if i = 0 and j = 0,
c_{i,j} = c_{i,j−1} + 2c_{i−1,j−1}    if 0 ≤ i ≤ j and 0 < j,
c_{i,j} = 0                           otherwise.
Let us show in several steps that c_{i,j} ≤ 2j^i.
i = 0, j > 0: c_{0,j} = c_{0,j−1} = ... = c_{0,0} = 1 ≤ 2j^0.
i = 1, j > 0: By induction on j. It is satisfied for j = 1 because
c_{1,1} = c_{1,0} + 2c_{0,0} = 2 ≤ 2. Assume the claim holds for j > 0.
Then for j + 1
c_{1,j+1} = c_{1,j} + 2c_{0,j}
≤ 2j + 2 = 2(j + 1) ≤ 2(j + 1)^1,
using the induction assumption and the fact that c_{0,j} = 1.
i ≥ 2, j ≥ 1: By induction on j. It is satisfied for j = 1 because
c_{i,1} = c_{i,0} + 2c_{i−1,0} = 0. Assume the claim holds for j − 1 ≥ 1.
Then for j
c_{i,j} = c_{i,j−1} + 2c_{i−1,j−1}
≤ 2(j − 1)^i + 2·2(j − 1)^{i−1}   (from the induction assumption)
≤ 2j^i   (from Theorem 2.50).
Thus the number of strings generated from a pattern of length m with
exactly i allowed errors in Γ distance and 1 error in Δ distance is O(m^i).
Since the set of strings generated from a pattern of length m with at most
k allowed errors in Γ distance and 1 error in Δ distance is the union of
the above mentioned sets of strings, its cardinality is
Σ_{i=0}^{k} O(m^i) = O(m^k). □
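The recurrence and the bound c_{i,j} ≤ 2j^i used in the proof above are easy to check numerically. A minimal sketch in Python (the function name is ours):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def c(i, j):
    """Path-counting recurrence from the proof of Theorem 2.51:
    c(0,0) = 1; c(i,j) = c(i,j-1) + 2*c(i-1,j-1) for 0 <= i <= j, 0 < j; else 0."""
    if i == 0 and j == 0:
        return 1
    if 0 <= i <= j and j > 0:
        return c(i, j - 1) + 2 * c(i - 1, j - 1)
    return 0

# check the bound c(i,j) <= 2*j**i for a range of values
assert all(c(i, j) <= 2 * j ** i for j in range(1, 40) for i in range(j + 1))
```

For example, c(1, 1) = 2 and c(0, j) = 1, matching the base cases handled in the proof.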
The other special case is l ≥ k. It is quite easy to see that this is the
same case as when just the Γ distance is used: since D_Δ(u, p) ≤ D_Γ(u, p)
always holds, the condition D_Δ(u, p) ≤ l is implied by D_Γ(u, p) ≤ k ≤ l.
The number of strings generated in this case, which was given by
Lemma 2.48, is O(m^k).
Since the asymptotic number of strings generated by the combined (Γ, Δ)
distance is the same in both cases (l = 1 and l ≥ k), the number of allowed
errors in Δ distance does not affect the asymptotic number of generated
strings.
Now it is possible to estimate the number of states of the deterministic
dictionary matching automaton.
Theorem 2.52
The number of states of the deterministic finite automaton
M(A*(Γ_k, Δ_l)(p)), p = p_1p_2...p_m, is O(m^{k+1}).
Proof
The proof is the same as for Lemma 2.38. Since for all u ∈ (Γ_k, Δ_l)(p) it
holds |u| = m, the size of the language (Γ_k, Δ_l)(p) is
O(Σ_{u∈(Γ_k,Δ_l)(p)} |u|) = O(m · m^k) = O(m^{k+1}). □
2.6 Δ distance
The finite automaton for approximate string matching using the Δ distance,
M(A*Δ_k(p)), will be called the finite automaton accepting language
A*Δ_k(p), where A is an ordered alphabet and
Δ_k(p) = {u | u ∈ A*, D_Δ(u, p) ≤ k}.
Theorem 2.53
The number of states of the deterministic finite automaton M(A*Δ_k(p)),
p = p_1p_2...p_m, is O(m(2k + 1)^m).
Proof
The proof is the same as for Lemma 2.38. Since for all u ∈ Δ_k(p) it holds
|u| = m, the size of the language Δ_k(p) is
O(Σ_{u∈Δ_k(p)} |u|) = O(m(2k + 1)^m). □
3 Finite automata accepting parts of a string
In this chapter we explain how to construct finite automata accepting all
prefixes, suffixes, factors, and subsequences of a given string. At the end
we show the construction of the factor oracle automaton, which accepts all
factors of a given string and moreover some of its subsequences.
3.1 Prefix automaton
Having string x = a_1a_2...a_n, we can express set Pref(x) (see Def. 1.1)
using the following two forms of regular expressions:
R_Pref = ε + a_1 + a_1a_2 + ... + a_1a_2...a_n
       = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n)...)).
Using the first form of regular expression R_Pref, we can construct the
finite automaton accepting set Pref(x) using Algorithm 3.1.
Algorithm 3.1
Construction of the prefix automaton I (union of prefixes).
Input: String x = a_1a_2...a_n.
Output: Finite automaton M accepting language Pref(x).
Method: We use the description of language Pref(x) by regular expression:
R_Pref = ε + a_1 + a_1a_2 + ... + a_1a_2...a_n.
1. Construct n + 1 finite automata M_i accepting strings a_1a_2...a_i for all
i = 0, 1, ..., n.
2. Construct automaton M accepting the union of languages L(M_i),
i = 0, 1, ..., n:
L(M) = L(M_0) ∪ L(M_1) ∪ L(M_2) ∪ ... ∪ L(M_n). □
Transition diagram of the prefix automaton constructed by Algorithm 3.1 is
depicted in Fig. 3.1.
If we use the second form of the regular expression:
R_Pref = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n)...)),
we can construct the finite automaton using Algorithm 3.2.
Algorithm 3.2
Construction of the prefix automaton II (set of neighbours).
Input: String x = a_1a_2...a_n.
Output: Finite automaton M accepting language Pref(x).
Method: We use the description of language Pref(x) by regular expression:
R_Pref = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n)...)).
1. We will use the method of neighbours.
(a) The set of initial symbols: IS = {a_1}.
Figure 3.1: Transition diagram of the finite automaton accepting language
Pref(a_1a_2...a_n) constructed for regular expression
R_Pref = ε + a_1 + a_1a_2 + ... + a_1a_2...a_n from Algorithm 3.1
(b) The set of neighbours: NS = {a_1a_2, a_2a_3, ..., a_{n−1}a_n}.
(c) The set of final symbols: FS = {a_1, a_2, ..., a_n}.
2. Construct automaton
M = ({q_0, q_1, q_2, ..., q_n}, A, δ, q_0, F),
where δ(q_0, a_1) = q_1 because IS = {a_1},
δ(q_i, a_{i+1}) = q_{i+1} for all i = 1, 2, ..., n − 1, because
a) each state q_i, i = 1, 2, ..., n − 1, corresponds to the prefix a_1a_2...a_i,
b) a_i a_{i+1} ∈ NS,
F = {q_0, q_1, ..., q_n} because the set of final symbols is
FS = {a_1, a_2, ..., a_n} and ε ∈ h(R_Pref). □
Transition diagram of the resulting prefix automaton M is depicted in
Fig. 3.2.
Figure 3.2: Transition diagram of the finite automaton accepting language
Pref(a_1a_2...a_n) constructed for regular expression
R_Pref = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n)...)) from Algorithm 3.2
Example 3.3
Let us have string x = abab. Construct automata accepting Pref(x) using
both methods of their construction. Using Algorithm 3.1, we obtain prefix
automaton M_1 having the transition diagram depicted in Fig. 3.3.
Algorithm 3.2 yields prefix automaton M_2 having the transition diagram
depicted in Fig. 3.4. Both automata M_1 and M_2 accept language Pref(abab)
and therefore they should be equivalent. Let us show it:
As prefix automaton M_1 is nondeterministic, let us construct its
deterministic equivalent M_1′. Its transition diagram is depicted in
Fig. 3.5 and it is obvious that both automata M_1′ and M_2 are equivalent. □
Figure 3.3: Transition diagram of prefix automaton M_1 accepting Pref(abab)
from Example 3.3
Figure 3.4: Transition diagram of prefix automaton M_2 accepting Pref(abab)
from Example 3.3
Figure 3.5: Transition diagram of deterministic automaton M_1′ from
Example 3.3
The second variant of the construction of the prefix automaton is more
straightforward than the first one. Therefore we will simplify it for
practical use in the following algorithm. As the states in this automaton
correspond to the lengths of the respective prefixes, we will use integer
numbers as labels of states.
Algorithm 3.4
The construction of a finite automaton accepting set Pref(x).
Input: String x = a_1a_2...a_n.
Output: Deterministic finite automaton M = (Q, A, δ, q_0, F) accepting set
Pref(x).
Method:
Q = {0, 1, 2, ..., n},
A is the set of all different symbols in x,
δ(i − 1, a_i) = i, i = 1, 2, ..., n,
q_0 = 0,
F = {0, 1, 2, ..., n}. □
Example 3.5
Let us construct the deterministic finite automaton accepting Pref(abba) =
{ε, a, ab, abb, abba} using Algorithm 3.4. The resulting automaton is
M = ({0, 1, 2, 3, 4}, {a, b}, δ, 0, {0, 1, 2, 3, 4}). Its transition diagram
is depicted in Fig. 3.6. □
Figure 3.6: Transition diagram of finite automaton M accepting set
Pref(abba) from Example 3.5
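Algorithm 3.4 translates directly into code. A minimal sketch (function names are ours): state i is the prefix of length i, and every state is final.

```python
def prefix_automaton(x):
    """Algorithm 3.4: deterministic automaton for Pref(x).
    State i represents the prefix of length i; every state is final."""
    return [{a: i + 1} for i, a in enumerate(x)] + [{}]

def accepts(delta, w):
    state = 0
    for a in w:
        if a not in delta[state]:
            return False
        state = delta[state][a]
    return True   # all states of the prefix automaton are final
```

For x = abba this accepts exactly {ε, a, ab, abb, abba}, as in Example 3.5.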
3.2 Suffix automaton
Having string x = a_1a_2...a_n, we can express set Suff(x) (see Def. 1.2)
using the following two forms of regular expressions:
R_Suff(x) = a_1a_2...a_n + a_2a_3...a_n + ... + a_n + ε
          = (...((a_1 + ε)a_2 + ε)a_3 + ε ... + ε)a_n + ε.
Using the first form of regular expression R_Suff we can construct the
finite automaton accepting set Suff(x) using Algorithm 3.6. Let us call it
the suffix automaton for string x.
Algorithm 3.6
Construction of the suffix automaton I (union of suffixes).
Input: String x = a_1a_2...a_n.
Output: Finite automaton M accepting language Suff(x).
Method: We use the description of language Suff(x) by regular expression:
R_Suff = a_1a_2...a_n + a_2...a_n + ... + a_n + ε.
1. Construct n finite automata M_i accepting strings a_ia_{i+1}...a_n for
i = 1, 2, ..., n. Construct automaton M_0 accepting the empty string.
2. Construct automaton M_N accepting the union of languages L(M_i),
i = 0, 1, 2, ..., n, i.e. L(M_N) = L(M_0) ∪ L(M_1) ∪ L(M_2) ∪ ... ∪ L(M_n).
3. Construct deterministic automaton M equivalent to automaton M_N. □
Transition diagram of the suffix automaton constructed by Algorithm 3.6 is
depicted in Fig. 3.7.
Figure 3.7: Transition diagram of finite automaton accepting language
Suff(a_1a_2...a_n) constructed for regular expression
R_Suff = a_1a_2...a_n + a_2...a_n + ... + a_n + ε
If we use the second form of the regular expression:
R_Suff(x) = (...((a_1 + ε)a_2 + ε)a_3 + ε ... + ε)a_n + ε,
we can construct the finite automaton using Algorithm 3.7.
we can construct the nite automaton using Algorithm 3.7.
Algorithm 3.7
Construction of the sux automaton II (use of transitions).
Input: String x = a
1
a
2
. . . a
n
.
Output: Finite automaton M accepting language Su(x).
Method: We use description of language Su(x) by regular expression
R
Su
(x) = (. . . ((a
1
+)a
2
+)a
3
+. . . +)a
n
+.
1. Construct nite automaton M
1
accepting string x = a
1
a
2
. . . a
n
.
M
1
= (q
0
, q
1
, . . . , q
n
, A, , q
0
, q
n
),
where (q
i
, a
i+1
) = q
i+1
for all i = 0, 1, . . . , n 1.
2. Construct nite automaton M
2
= (q
0
, q
1
, . . . , q
n
, A,
, q
0
, q
n
) from
the automaton M
1
by inserting transitions:
(q
0
, ) = q
1
, q
2
, . . . , q
n1
, q
n
.
3. Replace all transitions in M
2
by nontransitions. The resulting
automaton is M
3
.
4. Construct deterministic nite automaton M equivalent to automaton
M
3
. 2
Suffix automaton M_2 constructed by Algorithm 3.7 has, after step 2, the
transition diagram depicted in Fig. 3.8. Suffix automaton M_3 has, after
step 3 of Algorithm 3.7, the transition diagram depicted in Fig. 3.9.
Figure 3.8: Transition diagram of suffix automaton M_2 with ε-transitions
constructed in step 2 of Algorithm 3.7
Figure 3.9: Transition diagram of suffix automaton M_3 after the removal of
ε-transitions in step 3 of Algorithm 3.7
We can use an alternative method for the construction of the suffix
automaton described by the second form of the regular expression:
R_Suff(x) = (...((a_1 + ε)a_2 + ε)a_3 + ε ... + ε)a_n + ε.
Algorithm 3.8 uses this method.
Algorithm 3.8
Construction of the suffix automaton III (using more initial states).
Input: String x = a_1a_2...a_n.
Output: Finite automaton M accepting language Suff(x).
Method: We use the description of language Suff(x) by regular expression
R_Suff(x) = (...((a_1 + ε)a_2 + ε)a_3 + ε ... + ε)a_n + ε.
1. Construct finite automaton M_1 accepting string x = a_1a_2...a_n:
M_1 = ({q_0, q_1, q_2, ..., q_n}, A, δ, q_0, {q_n}),
where δ(q_i, a_{i+1}) = q_{i+1} for all i = 0, 1, ..., n − 1.
2. Construct finite automaton M_2 = ({q_0, q_1, q_2, ..., q_n}, A, δ, I, {q_n})
from automaton M_1 having this set of initial states:
I = {q_0, q_1, ..., q_{n−1}, q_n}.
3. Construct deterministic automaton M equivalent to automaton M_2.
We use the following steps for the construction:
(a) Using Algorithm 1.39 construct automaton M_3 equivalent to automaton
M_2 having just one initial state and ε-transitions from the initial state
to all other states.
(b) Using Algorithm 1.38 construct automaton M_4 without ε-transitions
equivalent to automaton M_3.
(c) Using Algorithm 1.40 construct deterministic automaton M equivalent to
automaton M_4. □
Transition diagram of suffix automaton M_2 constructed in step 2 of
Algorithm 3.8 is depicted in Fig. 3.10.
Figure 3.10: Transition diagram of suffix automaton M_2 accepting language
Suff(a_1a_2...a_n) constructed in step 2 of Algorithm 3.8
Example 3.9
Let us have string x = abab. Construct automata accepting Suff(x) using all
three methods of their construction. Using Algorithm 3.6 we obtain (after
step 2) suffix automaton M_1 having the transition diagram depicted in
Fig. 3.11. Algorithm 3.7 yields, after step 2, suffix automaton M_2 with the
transition diagram depicted in Fig. 3.12. Algorithm 3.8 yields, after step 2,
suffix automaton M_3 with the transition diagram depicted in Fig. 3.13. All
automata M_1, M_2 and M_3 accept language Suff(abab) and therefore they
should be equivalent. Let us show it.
Figure 3.11: Transition diagram of suffix automaton M_1 accepting
Suff(abab) from Example 3.9
As suffix automaton M_1 is nondeterministic, let us construct the equivalent
deterministic automaton M_1′. The transition table and transition diagram of
suffix automaton M_1′ are depicted in Fig. 3.14.
Figure 3.12: Transition diagram of suffix automaton M_2 accepting
Suff(abab) from Example 3.9
Figure 3.13: Transition diagram of nondeterministic suffix automaton M_3
accepting Suff(abab) from Example 3.9
Figure 3.14: Transition table and transition diagram of deterministic suffix
automaton M_1′ from Example 3.9
Automaton M_1′ can be minimized, as the states in the pairs
({2, 4″}, {2′, 4‴}), ({3}, {3′}), and ({4}, {4′}) are equivalent. Transition
diagram of the minimized suffix automaton M_1′ is depicted in Fig. 3.15.
Figure 3.15: Transition diagram of deterministic suffix automaton M_1′
after minimization from Example 3.9
Suffix automaton M_2 is also nondeterministic and after the determinization
we obtain automaton M_2′ having the transition diagram depicted in
Fig. 3.16.
Figure 3.16: Transition diagram of deterministic suffix automaton M_2′
from Example 3.9
Suffix automaton M_3 has five initial states. Construction of the equivalent
automaton M_3′ is shown step by step in Fig. 3.17.
Suffix automata M_1′, M_2′, and M_3′ (see Figs 3.15, 3.16, 3.17) are
obviously equivalent. □
We can use the experience from the possible constructions of the suffix
automaton in the following practical algorithm.
Algorithm 3.10
Construction of a finite automaton accepting set Suff(x).
Input: String x = a_1a_2...a_n.
Output: Deterministic finite automaton M = (Q, A, δ, q_0, F) accepting set
Suff(x).
Method:
1. Construct finite automaton M_1 = (Q_1, A, δ_1, q_0, F_1) accepting string
x and the empty string:
Q_1 = {0, 1, 2, ..., n},
A is the set of all different symbols in x,
δ_1(i − 1, a_i) = i, i = 1, 2, ..., n,
q_0 = 0,
F_1 = {0, n}.
2. Insert additional transitions into automaton M_1 leading from initial
state 0 to states 2, 3, ..., n:
δ(0, a) = i if δ(i − 1, a) = i, for all a ∈ A, i = 2, 3, ..., n.
The resulting automaton is M_2.
a) Transition diagram of the suffix automaton with one initial state and
ε-transitions
b) Transition diagram of the suffix automaton after removal of
ε-transitions
c) Transition diagram of deterministic suffix automaton M_3′
Figure 3.17: Three steps of the construction of deterministic suffix
automaton M_3′ from Example 3.9
3. Construct deterministic finite automaton M equivalent to automaton
M_2. □
Definition 3.11 (Terminal state of the suffix automaton)
The final state of the suffix automaton having no outgoing transition is
called the terminal state. □
Definition 3.12 (Backbone of the suffix automaton)
The backbone of suffix automaton M for string x is the longest continuous
sequence of states and transitions leading from the initial state to the
terminal state of M. □
Example 3.13
Let us construct the deterministic finite automaton accepting Suff(abba) =
{abba, bba, ba, a, ε} using Algorithm 3.10. Automaton M_1 = ({0, 1, 2, 3, 4},
{a, b}, δ_1, 0, {0, 4}) accepting strings ε, abba has the transition diagram
depicted in Fig. 3.18. Finite automaton M_2 = ({0, 1, 2, 3, 4}, {a, b}, δ_2,
0, {0, 4}) with additional transitions has the transition diagram depicted
in Fig. 3.19. The final result of this construction is deterministic finite
automaton M = ({0, 14, 2, 23, 3, 4}, {a, b}, δ, 0, {0, 14, 4}), whose
transition table is shown in Table 3.1. The d-subsets of automaton M are:
0, 14, 2, 23, 3, 4. Transition diagram of automaton M is depicted in
Fig. 3.20. □
Figure 3.18: Transition diagram of finite automaton M_1 accepting string
abba from Example 3.13
Figure 3.19: Transition diagram of nondeterministic finite automaton M_2
with additional transitions from Example 3.13
Figure 3.20: Transition diagram of deterministic finite automaton M
accepting set Suff(abba) = {abba, bba, ba, a, ε} from Example 3.13
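Algorithm 3.10 is short enough to sketch directly. The following Python version (names are ours) builds the nondeterministic automaton M_2 of step 2 and determinizes it by the usual subset construction:

```python
def suffix_automaton(x):
    """Algorithm 3.10 sketch: DFA for Suff(x); states are frozensets (d-subsets)."""
    n = len(x)
    nd = {}                      # nondeterministic transitions of M_2
    for i, a in enumerate(x):
        nd.setdefault((i, a), set()).add(i + 1)
        nd.setdefault((0, a), set()).add(i + 1)   # step 2: extra moves from state 0
    finals = {0, n}              # F_1 = {0, n}
    start = frozenset({0})
    delta, todo, seen = {}, [start], {start}
    while todo:                  # step 3: subset construction
        S = todo.pop()
        for a in set(x):
            T = frozenset(t for s in S for t in nd.get((s, a), ()))
            if T:
                delta[(S, a)] = T
                if T not in seen:
                    seen.add(T)
                    todo.append(T)
    return start, delta, lambda S: bool(S & finals)

def run(start, delta, final, w):
    S = start
    for a in w:
        if (S, a) not in delta:
            return False
        S = delta[(S, a)]
    return final(S)
```

For x = abba the reachable d-subsets are {0}, {1,4}, {2,3}, {2}, {3}, {4}, matching the d-subsets 0, 14, 23, 2, 3, 4 of Example 3.13.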
3.3 Factor automaton
The factor automaton is in some sources called the Directed Acyclic Word
Graph (DAWG).
Having string x = a_1a_2...a_n, we can express set Fact(x) (see Def. 1.3)
using regular expressions:
1. R_Fac1(x) = (...((a_1 + ε)a_2 + ε)a_3 + ε ... + ε)a_n
+ (...((a_1 + ε)a_2 + ε)a_3 + ε ... + ε)a_{n−1}
+ ...
+ a_1
+ ε,
2. R_Fac2(x) = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n)...))
+ a_2(ε + a_3(ε + ... + a_{n−1}(ε + a_n)...))
+ ...
+ a_n.
     a    b
0    14   23
14        2
2         3
23   4    3
3    4
4
Table 3.1: Transition table of deterministic finite automaton M from
Example 3.13
The first variant of the regular expression corresponds to the fact that set
Fact(x) is exactly the set of all suffixes of all prefixes of x. The second
variant corresponds to the fact that set Fact(x) is exactly the set of all
prefixes of all suffixes of x. It follows from these two readings of the
regular expressions that a combination of the methods for constructing the
prefix and suffix automata can be used for the construction of factor
automata. For the first variant of the regular expression we can use
Algorithms 3.6, 3.7, 3.8 and 3.10 as a base for the construction of suffix
automata and modify them in order to accept all prefixes of all suffixes
accepted by the suffix automata. Algorithm 3.14 makes this modification by
setting all states final.
Algorithm 3.14
Construction of the factor automaton.
Input: String x = a_1a_2...a_n.
Output: Finite automaton M accepting language Fact(x).
Method:
1. Construct suffix automaton M_1 for string x = a_1a_2...a_n using any of
Algorithms 3.6, 3.7, 3.8 or 3.10.
2. Construct automaton M_2 by setting all states of automaton M_1 to be
final states.
3. Perform minimization of automaton M_2. The resulting automaton is
automaton M. □
The resulting deterministic automaton need not be minimal and therefore
the minimization takes place as the final operation.
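A sketch of Algorithm 3.14 in Python (our naming), using the suffix automaton of Algorithm 3.10 as the base and making every subset state final; step 3, the minimization, is omitted here, which does not change the accepted language:

```python
def factor_automaton(x):
    """Algorithm 3.14 sketch: determinized suffix NFA with every state final;
    step 3 (minimization) is omitted, so the result need not be minimal."""
    nd = {}
    for i, a in enumerate(x):
        nd.setdefault((i, a), set()).add(i + 1)
        nd.setdefault((0, a), set()).add(i + 1)   # a factor may start anywhere
    start = frozenset({0})
    delta, todo, seen = {}, [start], {start}
    while todo:
        S = todo.pop()
        for a in set(x):
            T = frozenset(t for s in S for t in nd.get((s, a), ()))
            if T:
                delta[(S, a)] = T
                if T not in seen:
                    seen.add(T)
                    todo.append(T)
    return start, delta          # every reachable state is final

def is_factor(start, delta, w):
    S = start
    for a in w:
        S = delta.get((S, a))
        if S is None:
            return False
    return True
```

For x = abbbc the accepted set coincides with Fact(abbbc) of Example 3.15.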
Example 3.15
Let us have string x = abbbc. We construct the factor automaton accepting
set Fact(x) using all four possible ways of its construction. The first
method is based on Algorithm 3.6. Factor automaton M_1 has, after step 2
of Algorithm 3.6, the transition diagram depicted in Fig. 3.21.
Figure 3.21: Transition diagram of factor automaton M_1 accepting set
Fact(abbbc) from Example 3.15
As factor automaton M_1 is nondeterministic, we perform its determinization.
The transition table and transition diagram of deterministic factor
automaton M_1′ are depicted in Fig. 3.22. This automaton is not minimal
because the sets of states {4, 4′} and {5, 5′, 5″, 5‴, 5⁗} are equivalent.
The transition diagram of the minimal factor automaton is depicted in
Fig. 3.23. □
The second method of the construction of the factor automaton is based on
Algorithm 3.7. Factor automaton M_2 after step 2 of Algorithm 3.7 has the
transition diagram depicted in Fig. 3.24. Factor automaton M_2 is
nondeterministic and therefore we perform its determinization. The resulting
deterministic factor automaton M_2′ is minimal and its transition table and
transition diagram are depicted in Fig. 3.25.
Figure 3.22: Transition table and transition diagram of deterministic
factor automaton M_1′ accepting set Fact(abbbc) from Example 3.15
Figure 3.23: Transition diagram of minimal factor automaton M_1′ accepting
set Fact(abbbc) from Example 3.15
Figure 3.24: Transition diagram of nondeterministic factor automaton M_2
accepting set Fact(abbbc) from Example 3.15
     a    b     c
0    1    234   5
1         2
2         3
3         4
4               5
234       34    5
34        4     5
5
Figure 3.25: Transition table and transition diagram of deterministic
factor automaton M_2′ accepting set Fact(abbbc) from Example 3.15
The third method of the construction of the factor automaton is based on
Algorithm 3.8. Factor automaton M_3 has, after step 2 of Algorithm 3.8, the
transition diagram depicted in Fig. 3.26. Automaton M_3 has more than one
initial state, therefore we transform it to automaton M_3′ having just one
initial state. Its transition table and transition diagram are depicted in
Fig. 3.27. □
Figure 3.26: Transition diagram of factor automaton M_3 accepting set
Fact(abbbc) from Example 3.15
          a    b     c
012345    1    234   5
1              2
2              3
3              4
4                    5
234            34    5
34             4     5
5
Figure 3.27: Transition table and transition diagram of factor automaton
M_3′ with just one initial state accepting set Fact(abbbc) from Example 3.15
Note: We keep the notions of the terminal state and the backbone (see
Defs. 3.11 and 3.12) also for the factor automaton.
3.4 Parts of suffix and factor automata
In some applications we will need only some parts of suffix and factor
automata instead of their complete forms. We identify three cases of useful
parts of both automata:
1. Backbone of the suffix or factor automaton.
2. Front end of the suffix or factor automaton.
3. Multiple front end of the suffix or factor automaton.
3.4.1 Backbone of suffix and factor automata
The backbone of the suffix or factor automaton for string x is the part of
the automaton in which all states and transitions that do not correspond to
prefixes of x are removed. A general method of extracting the backbone of a
suffix or factor automaton is the operation of intersection.
Algorithm 3.16
Construction of the backbone of the suffix (factor) automaton.
Input: String x = a_1a_2...a_n, deterministic suffix (factor) automaton
M_1 = (Q_1, A, δ_1, q_01, F_1) for x, deterministic prefix automaton
M_2 = (Q_2, A, δ_2, q_02, F_2) for x.
Output: Backbone M = (Q, A, δ, q_0, F) of suffix (factor) automaton
M_1 = (Q_1, A, δ_1, q_01, F_1).
Method: Construct automaton M accepting the intersection
Suff(x) ∩ Pref(x) (Fact(x) ∩ Pref(x)) using Algorithm 1.44. □
Example 3.17
Let us construct the backbone of the suffix automaton for string x = abab.
Transition diagrams of input automata M_1 and M_2 and output automaton M
are depicted in Fig. 3.28. □
The resulting backbone is similar to the input suffix automaton M_1. The
only change is that the transition from state 0_S to state 2_S4_S for input
symbol b is removed. Moreover, we can see that the resulting backbone in
Example 3.17 is equivalent to the input prefix automaton M_2. The important
point of this construction is that the d-subsets of suffix automaton M_1
are preserved in the resulting automaton M. This fact will be useful in some
applications described in the next chapters.
The algorithm for the extraction of the backbone of the suffix or factor
automaton is very simple and straightforward. Nevertheless, it can be used
for the extraction of backbones of suffix and factor automata for sets of
strings and for approximate suffix and factor automata as well.
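Algorithm 3.16 relies on the standard product construction for intersection (Algorithm 1.44 in the text). A sketch (our naming), with the two input automata of Example 3.17 encoded by hand; the state names 13 and 24 stand for the d-subsets of the suffix automaton:

```python
def intersect(m1, m2):
    """Product construction: DFA accepting L(M1) ∩ L(M2).
    Each automaton is a triple (delta, start, finals) with delta[(state, sym)] -> state."""
    d1, s1, f1 = m1
    d2, s2, f2 = m2
    start = (s1, s2)
    delta, todo, seen = {}, [start], {start}
    while todo:
        p, q = todo.pop()
        common = {a for (s, a) in d1 if s == p} & {a for (s, a) in d2 if s == q}
        for a in common:
            t = (d1[(p, a)], d2[(q, a)])
            delta[((p, q), a)] = t
            if t not in seen:
                seen.add(t)
                todo.append(t)
    finals = {s for s in seen if s[0] in f1 and s[1] in f2}
    return delta, start, finals

# deterministic suffix automaton of abab (d-subsets written as 13, 24)
suffix = ({(0, "a"): 13, (0, "b"): 24, (13, "b"): 24, (24, "a"): 3, (3, "b"): 4},
          0, {0, 24, 4})
# deterministic prefix automaton of abab (Algorithm 3.4)
prefix = ({(0, "a"): 1, (1, "b"): 2, (2, "a"): 3, (3, "b"): 4},
          0, {0, 1, 2, 3, 4})
backbone = intersect(suffix, prefix)
```

The backbone accepts Suff(abab) ∩ Pref(abab) = {ε, ab, abab}; in particular the b-transition out of the initial state disappears, as observed after Example 3.17.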
Figure 3.28: Construction of the backbone of suffix automaton M_1 from
Example 3.17
3.4.2 Front end of suffix or factor automata
A front end of a suffix or factor automaton for string x is a finite
automaton accepting prefixes of factors of string x, having a limited length
which is strictly less than the length of string x.
For the construction of the front end part of a factor automaton we adapt
Algorithm 1.40 for the transformation of a nondeterministic finite automaton
to a deterministic finite automaton. The adaptation consists of two points:
1. Append information on the minimal distance of a state of the automaton
from its initial state. The minimal distance is, in this case, the minimal
number of transitions necessary to reach the state in question from the
initial state.
2. Stop the construction of the deterministic automaton as soon as all
states having the desired limited distance from the initial state are
constructed.
Definition 3.18
Let M = (Q, A, δ, q_0, F) be an acyclic finite automaton accepting language
L(M). The front end of automaton M for a given limit h is a minimal
deterministic finite automaton M_h accepting at least all prefixes of
strings from language L(M) having length less than or equal to h. This
language will be denoted by L_h. □
Algorithm 3.19
Transformation of an acyclic nondeterministic finite automaton to a
deterministic finite automaton with states having distance from the initial
state less than or equal to a given limit.
Input: Nondeterministic acyclic finite automaton M = (Q, A, δ, q_0, F),
limit h of the maximal distance.
Output: Deterministic finite automaton M_h = (Q_h, A, δ_h, q_0h, F_h) such
that L_h(M) = L_h(M_h).
Method:
1. Q_h = {(q_0, 0)} will be defined; state (q_0, 0) will be treated as
unmarked.
2. If each state in Q_h is marked, then continue with step 5.
3. If there is no unmarked state (q, l) in Q_h, where l is less than h, then
continue with step 5.
4. An unmarked state (q, l) will be chosen from Q_h and the following
operations will be executed for each a ∈ A:
(a) if δ_h((q, l), a) = (q′, l′) ∈ Q_h, then Q_h = Q_h ∪ {(q′, min(l + 1, l′))}
and δ_h((q, l), a) = (q′, min(l + 1, l′)),
(b) state (q, l) will be marked,
(c) continue with step 2.
5. q_0h = (q_0, 0).
6. F_h = {(q, l) : (q, l) ∈ Q_h, q ∩ F ≠ ∅}. □
Example 3.20
Let us have string x = abab. Construct the front end of the factor automaton
for string x for the limit h = 2. The transition diagram of the
nondeterministic factor automaton M for string x is depicted in Fig. 3.29.
The transition diagram of the front end of the deterministic factor
automaton M_h for h = 2 is depicted in Fig. 3.30. □
Figure 3.29: Transition diagram of nondeterministic factor automaton M
for string x = abab from Example 3.20
Figure 3.30: Transition diagram of the front end of deterministic factor
automaton M_2 from Example 3.20
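Algorithm 3.19 is an ordinary subset construction with a breadth-first ordering and a depth cutoff. A sketch for the factor automaton of a string (our naming; the input NFA is built inline):

```python
def factor_front_end(x, h):
    """Algorithm 3.19 sketch: subset construction of the factor automaton of x,
    stopped once the distance from the initial state reaches the limit h."""
    nd = {}
    for i, a in enumerate(x):
        nd.setdefault((i, a), set()).add(i + 1)
        nd.setdefault((0, a), set()).add(i + 1)   # a factor may start anywhere
    start = frozenset({0})
    depth = {start: 0}            # minimal distance from the initial state
    delta, queue = {}, [start]
    while queue:
        S = queue.pop(0)          # breadth-first, so depth[S] is minimal
        if depth[S] >= h:
            continue              # do not expand states at the limit
        for a in sorted(set(x)):
            T = frozenset(t for s in S for t in nd.get((s, a), ()))
            if T:
                delta[(S, a)] = T
                if T not in depth:
                    depth[T] = depth[S] + 1
                    queue.append(T)
    return start, delta           # every state of a factor automaton is final

def words(start, delta, h, alphabet):
    """All strings of length <= h accepted by the front end."""
    out, layer = {""}, {("", start)}
    for _ in range(h):
        layer = {(w + a, delta[(S, a)]) for (w, S) in layer
                 for a in alphabet if (S, a) in delta}
        out |= {w for (w, _) in layer}
    return out
```

For x = abab and h = 2 the accepted language L_2 is {ε, a, b, ab, ba}, matching Example 3.20.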
3.4.3 Multiple front end of suffix and factor automata
Let us recall the definition of a multiple state of a deterministic finite
automaton (see Def. 1.42). The multiple front end of a suffix or factor
automaton is the part of it containing multiple states only. For the
construction of such a part of the suffix or factor automaton we again adapt
Algorithm 1.40 for the transformation of a nondeterministic finite automaton
to a deterministic finite automaton. The adaptation is very simple:
if some state constructed during the determinization is a simple state, then
we omit it.
Algorithm 3.21
Construction of the multiple states part of a suffix or factor automaton.
Input: Nondeterministic suffix or factor automaton M = (Q, A, δ, q_0, F).
Output: Part M′ = (Q′, A, δ′, q_0′, F′) of the deterministic automaton
containing multiple states only.
Method:
1. Q′ = {q_0} will be defined; state q_0 will be treated as unmarked.
2. If all states in Q′ are marked, continue with step 4.
3. An unmarked state q will be chosen from Q′, and for each a ∈ A such that
the d-subset δ(q, a) is a multiple state:
(a) δ′(q, a) = δ(q, a),
(b) Q′ = Q′ ∪ δ(q, a),
(c) the state q ∈ Q′ will be marked,
(d) continue with step 2.
4. q_0′ = q_0.
5. F′ = {q : q ∈ Q′, q ∩ F ≠ ∅}. □
Example 3.22
Let us have string x = abab as in Example 3.20. Construct the multiple front
end of the factor automaton for string x. The transition diagram of the
nondeterministic factor automaton M for string x is depicted in Fig. 3.29.
The transition diagram of the multiple front end M′ is depicted in
Fig. 3.31. □
3.5 Subsequence automaton
Algorithm 3.23
Construction of the subsequence automaton.
Input: String x = a_1a_2...a_n.
Output: Deterministic finite automaton M accepting language Sub(x).
Method:
1. Construct finite automaton M_1 = (Q, A, δ_1, q_0, F_1) accepting string x
and all its prefixes:
δ_1(i − 1, a_i) = i, i = 1, 2, ..., n,
q_0 = 0, F_1 = {0, 1, 2, ..., n}.
2. Insert ε-transitions into automaton M_1 leading from each state to its
next state. The resulting automaton is M_2 = (Q, A, δ_2, q_0, F_1), where
δ_2 = δ_1 ∪ δ′, where δ′(i − 1, ε) = i, i = 1, 2, ..., n.
3. Replace all ε-transitions by non-ε-transitions. The resulting automaton
is M_3.
4. Construct deterministic finite automaton M equivalent to automaton M_3.
All its states will be final states. □
Example 3.24
Let us construct the deterministic finite automaton accepting the set
Sub(abba) = {ε, a, b, ab, ba, aa, bb, aba, bba, abb, abba}. Let us mention
that strings aa, aba are subsequences of x = abba but not its factors. Using
Algorithm 3.23 we construct the subsequence automaton. Automaton
M_1 = ({0, 1, 2, 3, 4}, {a, b}, δ_1, 0, {0, 1, 2, 3, 4}) accepting all
prefixes of string abba has the transition diagram depicted in Fig. 3.35.
Finite automaton M_2 with the inserted ε-transitions has the transition
diagram depicted in Fig. 3.32. Nondeterministic finite automaton M_3 after
the elimination of the ε-transitions has the transition diagram depicted in
Fig. 3.33. The final result of this construction is the deterministic finite
automaton (subsequence automaton) M. Its transition table is shown in
Table 3.2. The transition diagram of automaton M is depicted in
Fig. 3.34. □
Figure 3.32: Transition diagram of automaton M_2 with ε-transitions
accepting all subsequences of string abba from Example 3.24
Figure 3.33: Transition diagram of nondeterministic finite automaton M_3
accepting set Sub(abba) after the elimination of the ε-transitions from
Example 3.24
     a    b
0    14   23
14   4    23
23   4    3
3    4
4
Table 3.2: Transition table of automaton M from Example 3.24
Figure 3.34: Transition diagram of deterministic subsequence automaton M
accepting set Sub(abba) from Example 3.24
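For a single string, the deterministic subsequence automaton can also be obtained directly, without the ε-transition and determinization steps: the target of the transition from position i on symbol a is the first occurrence of a after position i. A sketch (our naming):

```python
def subsequence_automaton(x):
    """DFA for Sub(x): states are positions 0..n, all final;
    delta[i][a] = the smallest j > i with x[j-1] == a (if any)."""
    n = len(x)
    delta = [dict() for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):       # scan right to left
        delta[i] = dict(delta[i + 1])    # inherit later first occurrences
        delta[i][x[i]] = i + 1
    return delta

def is_subsequence(delta, w):
    state = 0
    for a in w:
        if a not in delta[state]:
            return False
        state = delta[state][a]
    return True                          # every state is final
```

For x = abba this accepts exactly the eleven strings of Sub(abba) listed in Example 3.24.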
3.6 Factor oracle automata
The factor oracle automaton for a given string x accepts all factors of x
and possibly some subsequences of x.
The factor oracle automaton is similar to the factor automaton, but it
always has n + 1 states, where n = |x|. It is possible to construct the
factor oracle automaton from the factor automaton. This construction is
based on the notion of corresponding states in a factor automaton.
Definition 3.25
Let M be the factor automaton for string x and let q_1, q_2 be different
states of M. Let there exist two sequences of transitions in M:
(q_0, x_1) ⊢* (q_1, ε), and
(q_0, x_2) ⊢* (q_2, ε).
If x_1 is a suffix of x_2 and x_2 is a prefix of x, then q_1 and q_2 are
corresponding states. □
The factor oracle automaton can be constructed by merging the corresponding
states.
Example 3.26
Let us construct the deterministic finite automaton accepting set
Fact(abba) = {ε, a, b, ab, bb, ba, abb, bba, abba} using Algorithm 3.7.
Automaton M_1 = ({0, 1, 2, 3, 4}, {a, b}, δ_1, 0, {0, 1, 2, 3, 4}) accepting
all prefixes of the string abba has the transition diagram depicted in
Fig. 3.35. Finite automaton M_2 with the inserted ε-transitions has the
transition diagram depicted in Fig. 3.36. Nondeterministic finite automaton
M_3 after the elimination of the ε-transitions has the transition diagram
depicted in Fig. 3.37.
Figure 3.35: Transition diagram of finite automaton M_1 accepting the set
of all prefixes of the string abba from Example 3.26
Figure 3.36: Transition diagram of automaton M_2 with ε-transitions
accepting all factors of string abba from Example 3.26
Figure 3.37: Transition diagram of nondeterministic factor automaton M_3
after the elimination of the ε-transitions from Example 3.26
The final result of this construction is the deterministic factor automaton
M. Its transition table is shown in Table 3.3.
     a    b
0    14   23
14        2
2         3
23   4    3
3    4
4
Table 3.3: The transition table of the automaton M from Example 3.26
The transition diagram of automaton M is depicted in Fig. 3.38. The
corresponding states in this automaton are 2 and 23. If we make these two
states equivalent, we obtain the factor oracle automaton Oracle(abba) with
the transition diagram depicted in Fig. 3.39. The language accepted by the
automaton is:
L(Oracle(abba)) = {ε, a, b, ab, bb, ba, abb, bba, abba, aba}
                = Fact(abba) ∪ {aba}.
String aba is not a factor of abba but it is its subsequence. □
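The merging of corresponding states above goes through the whole factor automaton first. In practice the factor oracle is usually built online, state by state, using supply (suffix) links, in the style of Allauzen, Crochemore and Raffinot; a sketch (our naming):

```python
def oracle(x):
    """Online factor oracle construction: states 0..n, every state final.
    S[i] is the supply (suffix) link of state i."""
    n = len(x)
    trans = [dict() for _ in range(n + 1)]
    S = [-1] * (n + 1)
    for i in range(1, n + 1):
        a = x[i - 1]
        trans[i - 1][a] = i          # internal transition
        k = S[i - 1]
        while k > -1 and a not in trans[k]:
            trans[k][a] = i          # external transition
            k = S[k]
        S[i] = 0 if k == -1 else trans[k][a]
    return trans

def oracle_accepts(trans, w):
    state = 0
    for a in w:
        if a not in trans[state]:
            return False
        state = trans[state][a]
    return True
```

For x = abba this yields the automaton of Fig. 3.39: the accepted language is Fact(abba) ∪ {aba}, matching Example 3.26.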
The approach used in Example 3.26 has this drawback: the intermediate
result is a factor automaton and its number of states is limited by 2n − 2,
while the number of states of the factor oracle automaton is always equal
to n + 1, where n is the length of string x. Fortunately, a factor oracle
automaton can be constructed directly during the determinization of the
nondeterministic factor automaton. In this case, it is necessary to fix the
identification of corresponding states.
Figure 3.38: Transition diagram of factor automaton M accepting set
Fact(abba) from Example 3.26
Figure 3.39: Transition diagram of the Oracle(abba) from Example 3.27
Interpretation A of Definition 3.25: Using our style of the numbering of states, the identification of corresponding states can be done as follows: two states
p = i₁, i₂, . . . , i_{n₁},  q = j₁, j₂, . . . , j_{n₂}
For T = gaccattctc (the contracted segment is shown in brackets):
C{(3,4)}       = ga[cc]attctc   = gacattctc (*)
C{(3,4),(7,9)} = ga[cc]at[tct]c = gacattc
C{(3,10)}      = ga[ccattctc]   = gac
C{(6,7)}       = gacca[tt]ctc   = gaccatctc (*)
C{(6,9)}       = gacca[ttct]c   = gaccatc
C{(3,4),(6,7)} = ga[cc]a[tt]ctc = gacatctc (*)
C{(3,4),(6,9)} = ga[cc]a[ttct]c = gacatc
C{(2,5)}       = g[acca]ttctc   = gattctc (*)
C{(2,5),(7,9)} = g[acca]t[tct]c = gattc
C{(2,5),(6,7)} = g[acca][tt]ctc = gatctc (*)
C{(2,5),(6,9)} = g[acca][ttct]c = gatc
C{(3,8)}       = ga[ccattc]tc   = gactc (*)
□
The set of strings SC(T) created by contractions can contain some strings which are substrings of other elements of the set SC(T). Such strings can be removed from SC(T) because they are redundant.
Example 3.32
The set SCO(T) for text T = gaccattctc from Example 3.30 contains after optimisation these strings (marked in Example 3.31 by *):
SCO(T) = {gaccattctc, gacattctc, gaccatctc, gacatctc, gattctc, gatctc, gactc}.
The language accepted by factor oracle M₀ for T is:
L(M₀) = Fact(SCO(T)). □
3.7 The complexity of automata for parts of strings
The maximum state and transition complexities of prefix, suffix, factor, subsequence, and factor oracle automata are summarized in Table 3.8. The length of the string is always equal to n. |A| is the size of alphabet A. The factor automaton having maximal state and transition complexities is depicted in Fig. 3.23. This complexity is reached for strings ab^{n−2}c. The complexity of the suffix automaton is in many cases the same as that of the factor automaton. But in some cases the factor automaton can have fewer states than the suffix automaton for the same string. The reason for this is the fact that some factor automata may be minimized because they have all states final.
Example 3.33
Let us have text T = abb. We will construct the suffix automaton accepting Suff(abb) and the factor automaton accepting Fact(abb). The nondeterministic and deterministic suffix automata have transition diagrams depicted in Fig. 3.43.

Type of automaton         No. of states   No. of transitions
Prefix automaton          n + 1           n
Suffix automaton          2n − 2          3n − 4
Factor automaton          2n − 2          3n − 4
Subsequence automaton     n + 1           |A| · n
Factor oracle automaton   n + 1           2n − 1

Table 3.8: Maximum state and transition complexities of automata accepting parts of a string

Figure 3.43: Transition diagrams of the suffix automata for string x = abb from Example 3.33
The nondeterministic and deterministic factor automata have transition diagrams depicted in Fig. 3.44. The deterministic factor automaton can be minimized, as states 2 and 23 are equivalent. This is not true for the deterministic suffix automaton, as 23 is a final state and 2 is not a final state. Therefore the minimal factor automaton accepting Fact(abb) has the transition diagram depicted in Fig. 3.45. □
The reader can verify that for string x = ab^n, n > 1, the factor automaton has fewer states than the suffix automaton.
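The maxima of Table 3.8 for the factor automaton can be checked mechanically. The sketch below (our own helper, reusing the subset construction over d-subsets of positions) counts states and transitions of the deterministic factor automaton for ab^{n−2}c and compares them with 2n − 2 and 3n − 4.

```python
from collections import deque

def factor_automaton_size(text):
    """Return (number of states, number of transitions) of the deterministic
    factor automaton of `text`, built by subset construction."""
    n = len(text)

    def step(d, c):
        return frozenset(p + 1 for p in d if p < n and text[p] == c)

    start = frozenset(range(n + 1))   # epsilon-closure of the initial state
    states, ntrans, queue = {start}, 0, deque([start])
    while queue:
        q = queue.popleft()
        for c in sorted(set(text)):
            r = step(q, c)
            if r:
                ntrans += 1
                if r not in states:
                    states.add(r)
                    queue.append(r)
    return len(states), ntrans

# strings ab^{n-2}c reach the maxima of Table 3.8
for n in range(4, 12):
    text = "a" + "b" * (n - 2) + "c"
    assert factor_automaton_size(text) == (2 * n - 2, 3 * n - 4)
```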
3.8 Automata for parts of more than one string
All finite automata constructed above in this chapter can be used as a basis for the construction of the same types of automata for a finite set of strings. The construction of a finite automaton accepting set Pref(S) (see Def. 1.6), where S is a finite set of strings from A⁺, is done in a similar way as for one string in Algorithm 3.4.
Algorithm 3.34
Construction of a finite automaton accepting set Pref(S), S ⊆ A⁺.
Input: A finite set of strings S = {x₁, x₂, . . . , x_{|S|}}.
Output: Prefix automaton M = (Q, A, δ, q₀, F) accepting set Pref(S).
Method:
1. Construct finite automata Mᵢ = (Qᵢ, Aᵢ, δᵢ, q₀ᵢ, Fᵢ) accepting sets Pref(xᵢ) for i = 1, 2, . . . , |S|, using the algorithm for the construction of the prefix automaton for one string.
2. Construct deterministic finite automaton M = (Q, A, δ, q₀, F) accepting set Pref(S) = Pref(x₁) ∪ Pref(x₂) ∪ . . . ∪ Pref(x_{|S|}). □
Example 3.35
Let us construct the prefix automaton for the set of strings S = {abab, abba}. Finite automata M₁ and M₂ have transition diagrams depicted in Fig. 3.46.
Figure 3.46: Transition diagrams of automata M₁ and M₂ from Example 3.35
Prefix automaton M accepting set Pref({abab, abba}) has the transition diagram depicted in Fig. 3.47. □
Figure 3.47: Transition diagram of prefix automaton accepting set Pref({abab, abba}) from Example 3.35
The construction of suffix and factor automata for a finite set of strings is done in a similar way as for one string (see Sections 3.2 and 3.3). One of the principles of their construction is formalized in the following algorithm for suffix and factor automata. An X-automaton, in the next algorithm, means either a suffix or a factor automaton.
Algorithm 3.36
Construction of an X-automaton for a finite set of strings.
Input: Finite set of strings S = {x₁, x₂, . . . , x_{|S|}}.
Output: Deterministic X-automaton M = (Q, A, δ, q₀, F) for set S.
Method:
1. Construct X-automata M₁, M₂, . . . , M_{|S|} with ε-transitions (see Fig. 3.8) for all strings x₁, x₂, . . . , x_{|S|}.
2. Construct automaton M_ε accepting language
L(M_ε) = L(M₁) ∪ L(M₂) ∪ . . . ∪ L(M_{|S|}).
3. Construct automaton M_N by removing ε-transitions.
4. Construct deterministic finite automaton M equivalent to automaton M_N. □
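The four steps above can be sketched for the factor-automaton case. The sketch below (our own naming; not the book's pseudocode) simulates the union NFA of step 2 directly: an NFA state is a pair (k, i) meaning "i symbols of strings[k] consumed at the current occurrence", and step 4 is the usual subset construction.

```python
from collections import deque

def det_factor_automaton_set(strings):
    """Deterministic factor automaton for a finite set of strings
    (Algorithm 3.36 with X = factor), via subset construction."""
    def step(d, c):
        return frozenset((k, i + 1) for (k, i) in d
                         if i < len(strings[k]) and strings[k][i] == c)

    alphabet = sorted({c for s in strings for c in s})
    # epsilon-closure of the initial state: a factor may start anywhere
    start = frozenset((k, i) for k, s in enumerate(strings)
                      for i in range(len(s) + 1))
    states, trans, queue = {start}, {}, deque([start])
    while queue:
        q = queue.popleft()
        for c in alphabet:
            r = step(q, c)
            if r:
                trans[(q, c)] = r
                if r not in states:
                    states.add(r)
                    queue.append(r)
    return start, states, trans

def accepts(aut, word):
    start, _, trans = aut
    q = start
    for c in word:
        q = trans.get((q, c))
        if q is None:
            return False
    return True   # all states are final
```

For S = {abab, abba} this yields nine states, matching the nine rows of Table 3.9, and the accepted language is exactly Fact(abab) ∪ Fact(abba).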
Example 3.37
Let us construct the factor automaton for the set of strings S = {abab, abba}. First, we construct factor automata M₁ and M₂ for both strings in S. Their transition diagrams are depicted in Figs 3.48 and 3.49, respectively.
Figure 3.48: Transition diagram of factor automaton M₁ accepting Fact(abab) from Example 3.37
Figure 3.49: Transition diagram of factor automaton M₂ accepting Fact(abba) from Example 3.37
In the second step we construct automaton M_ε accepting set Fact(abab) ∪ Fact(abba). The transition table of the resulting deterministic automaton M is shown in Table 3.9.

             a          b
 0          1₁1₂3₁4₂   2₁2₂3₂4₁
 1₁1₂3₁4₂              2₁2₂4₁
 2₁2₂3₂4₁   3₁4₂       3₂
 2₁2₂4₁     3₁         3₂
 3₁4₂                  4₁
 3₂         4₂
 3₁                    4₁
 4₁
 4₂

Table 3.9: Transition table of automaton M from Example 3.37
Figure 3.51: Transition diagram of factor automaton M_N accepting set Fact(abab) ∪ Fact(abba) from Example 3.37
114
Transition diagram of resulting deterministic factor automaton M is de-
picted in Fig. 3.52. 2
b
b
b a
a
b
b
b
a
a
START
multilpe front end
0 1 1 3 4
1 2 1 2
2 2 4
1 2 1
2 2 3 4
1 2 2 1
3 4
1 2
3
2
3
1
4
2
4
1
Figure 3.52: Transition diagram of deterministic factor automaton M ac-
cepting set Fact(abab) Fact(abba) from Example 3.37
3.9 Automata accepting approximate parts of a string
The finite automata constructed above in this chapter accept exact parts of a string (prefixes, suffixes, factors, subsequences). It is possible to use the lessons learned from their construction for the construction of automata accepting approximate parts of a string. The main principle of algorithms accepting approximate parts of string x is:
1. Construct a finite automaton accepting the set:
Approx(x) = {y : D(x, y) ≤ k}.
2. Construct a finite automaton accepting approximate parts of the string using principles similar to those for the construction of automata accepting the exact parts of a string.
We show this principle using the Hamming distance in the next example.
Definition 3.38
Set H_k(P) of all strings similar to string P is:
H_k(P) = {X : X ∈ A*, D_H(X, P) ≤ k},
where D_H(X, P) is the Hamming distance. □
Example 3.39
Let string x = abba. We construct the approximate prefix automaton for Hamming distance k = 1. For this purpose we use a modification of Algorithm 2.5. The modification consists in removing the self-loop in the initial state q₀. After that we make all states final states. The transition diagram of the resulting Hamming prefix automaton is depicted in Fig. 3.53.
Figure 3.53: Transition diagram of the Hamming prefix automaton accepting APref(abba) for Hamming distance k = 1 from Example 3.39
Example 3.40
Let string x = abba. We construct the approximate factor automaton using Hamming distance k = 1.
1. We use Algorithm 3.10 modified for the factor automaton (see Algorithm 3.14). For the construction of the finite automaton accepting string x and all strings with Hamming distance at most 1, we use a modification of Algorithm 2.5. The modification consists in removing the self-loop in the initial state q₀. The resulting finite automaton has the transition diagram depicted in Fig. 3.54.
Figure 3.54: Transition diagram of the Hamming automaton accepting H₁(abba) with Hamming distance k = 1 from Example 3.40
2. We use the principle of inserting ε-transitions from state 0 to states 1, 2, 3 and 4. Moreover, all states are fixed as final states. The transition diagram with inserted ε-transitions is depicted in Fig. 3.55.
3. We replace the ε-transitions by non-ε-transitions. The resulting automaton has the transition diagram depicted in Fig. 3.56.
4. The final operation is the construction of the equivalent deterministic finite automaton. Its transition table is shown in Table 3.10.
The transition diagram of the resulting deterministic approximate factor automaton is depicted in Fig. 3.57. All its states are final states. □
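The whole construction of Example 3.40 can be sketched compactly. In the sketch below (our own naming, and mismatch transitions are taken only over the alphabet of the text), an NFA state is a pair (i, e): i symbols consumed at the current occurrence, e mismatches used; the ε-transitions from the initial state become the start closure, and determinization is by subset construction.

```python
from collections import deque

def hamming_factor_automaton(text, k=1):
    """Deterministic Hamming factor automaton: accepts every string within
    Hamming distance k of some factor of `text` (all states final)."""
    n = len(text)

    def step(d, c):
        out = set()
        for (i, e) in d:
            if i < n:
                if text[i] == c:
                    out.add((i + 1, e))        # exact transition
                elif e < k:
                    out.add((i + 1, e + 1))    # mismatch transition
        return frozenset(out)

    start = frozenset((i, 0) for i in range(n + 1))  # occurrence may start anywhere
    states, trans, queue = {start}, {}, deque([start])
    while queue:
        q = queue.popleft()
        for c in sorted(set(text)):
            r = step(q, c)
            if r:
                trans[(q, c)] = r
                if r not in states:
                    states.add(r)
                    queue.append(r)
    return start, states, trans

def accepts(aut, word):
    start, _, trans = aut
    q = start
    for c in word:
        q = trans.get((q, c))
        if q is None:
            return False
    return True
```

A brute-force check against the definition of AFact (some equal-length window of the text within Hamming distance k) confirms the accepted language for x = abba, k = 1.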
Example 3.41
Let us construct the backbone of the Hamming factor automaton from Example 3.40. The construction of the backbone consists in the intersection of the Hamming factor automaton having the transition diagram depicted in Fig. 3.57 and the Hamming prefix automaton having the transition diagram depicted in Fig. 3.53.
Figure 3.55: Transition diagram of the Hamming factor automaton with ε-transitions inserted and final states fixed from Example 3.40
Figure 3.56: Transition diagram of the Hamming factor automaton after the removal of ε-transitions from Example 3.40
The result of this intersection is shown in Fig. 3.58. In this automaton the pairs of states
((2′4′, 2′_P), (32′4′, 2′_P)) and ((3′, 3′_P), (3′4′, 3′_P))
are equivalent.
The backbone equivalent to the one depicted in Fig. 3.58 can be obtained using the following approach. We can recognize that the sets of states of the Hamming factor automaton (see Fig. 3.57)
(2′4′, 32′4′) and (3′4, 3′, 3′4′)
are equivalent. After minimization we obtain the Hamming factor automaton depicted in Fig. 3.59. The backbone of the automaton is obtained by removal of the transition drawn by the dashed line. □
          a        b
 0       142′3′   231′4′
 142′3′  2′4′     23′
 231′4′  3′4      32′4′
 2′4′             3′
 23′     3′4′     3
 32′4′   4        3′4′
 3′4     4′
 3′      4′
 3′4′    4′
 3       4        4′
 4
 4′

Table 3.10: Transition table of the deterministic Hamming factor automaton from Example 3.40
Figure 3.57: Transition diagram of the deterministic Hamming approximate factor automaton for x = abba, Hamming distance k = 1, from Example 3.40
                  a               b
 (0, 0_P)        (142′3′, 1_P)   (231′4′, 1′_P)
 (142′3′, 1_P)   (2′4′, 2′_P)    (23′, 2_P)
 (231′4′, 1′_P)                  (32′4′, 2′_P)
 (23′, 2_P)      (3′4′, 3′_P)    (3, 3_P)
 (2′4′, 2′_P)                    (3′, 3′_P)
 (32′4′, 2′_P)                   (3′4′, 3′_P)
 (3′, 3′_P)      (4′, 4′_P)
 (3, 3_P)        (4, 4_P)        (4′, 4′_P)
 (3′4′, 3′_P)    (4′, 4′_P)

Table 3.11: Transition table of the backbone of the Hamming factor automaton for string x = abba
Figure 3.58: Transition diagram of the backbone of the Hamming factor automaton for string x = abba from Example 3.41
Figure 3.59: Minimized Hamming factor automaton for string x = abba from Example 3.41
4 Borders, repetitions and periods
4.1 Basic notions
Definition 4.1 (Proper prefix)
A proper prefix of x is any element of Pref(x) not equal to x. □
Definition 4.2 (Border)
A border of string x ∈ A⁺ is any proper prefix of x which is simultaneously its suffix. The set of all borders of string x is bord(x) = (Pref(x) \ {x}) ∩ Suff(x). The longest border of x is Border(x). □
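Definition 4.2 translates directly into a brute-force computation (a sketch with our own function names, quadratic in |x| and only for illustration):

```python
def bord(x):
    """bord(x) = (Pref(x) \\ {x}) ∩ Suff(x), straight from Definition 4.2."""
    proper_prefixes = {x[:i] for i in range(len(x))}
    suffixes = {x[i:] for i in range(len(x) + 1)}
    return proper_prefixes & suffixes

def border(x):
    """Border(x): the longest border of x."""
    return max(bord(x), key=len)
```

For example, bord(abab) = {ε, ab} and Border(abab) = ab, while abba has only the trivial borders ε and a.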
Definition 4.3 (Border of a finite set of strings)
A border of a set of strings S = {x₁, x₂, . . . , x_{|S|}} is any proper prefix of some xᵢ ∈ S which is a suffix of some xⱼ ∈ S, i, j ∈ <1, |S|>. The set of all borders of the set S is

mbord(S) = ⋃_{i=1}^{|S|} ⋃_{j=1}^{|S|} (Pref(xᵢ) \ {xᵢ}) ∩ Suff(xⱼ).

The longest border which is a suffix of xᵢ, i ∈ <1, |S|>, belongs to the set mBorder(S):

mBorder(S) = {uᵢ : uᵢ ∈ mbord(S), uᵢ is the longest suffix of xᵢ, i ∈ <1, |S|>}.
□
Definition 4.4 (Period)
Every string x ∈ A⁺ can be written in the form x = uʳv, where u ∈ A⁺ and v ∈ Pref(u). The length of the string u, p = |u|, is a period of the string x, r is an exponent of the string x, and u is a generator of x. The shortest period of string x is Per(x). The set of all periods of x is periods(x). String x is purely periodic if v = ε. □
Definition 4.5 (Normal form)
Every string x ∈ A⁺ can be written in the normal form x = uʳv, where p = |u| is the shortest period Per(x), therefore r is the highest exponent, and v ∈ Pref(u). □
Definition 4.6 (Primitive string)
If string x ∈ A⁺ has the shortest period equal to its length, then we call it a primitive string. □
Let us mention that for a primitive string x it holds that Border(x) = ε.
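Definitions 4.4–4.6 can also be sketched directly (our own function names; p is a period exactly when x[i] = x[i−p] for all valid i, which is the u^r v condition):

```python
def periods(x):
    """All periods of x per Definition 4.4, in increasing order."""
    return [p for p in range(1, len(x) + 1)
            if all(x[i] == x[i - p] for i in range(p, len(x)))]

def per(x):
    """Per(x): the shortest period."""
    return periods(x)[0]

def normal_form(x):
    """x = u^r v with p = |u| = Per(x) (Definition 4.5)."""
    p = per(x)
    r = len(x) // p
    return x[:p], r, x[r * p:]

def is_primitive(x):
    """Definition 4.6: shortest period equal to the length."""
    return per(x) == len(x)
```

For x = ababa we get periods(x) = [2, 4, 5] and the normal form (ab)²a; note also the classic relation Per(x) = |x| − |Border(x)| (here 5 − |aba| = 2), which ties periods to the borders above.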
Definition 4.7 (Border array)
The border array β[1..n] of string x ∈ A⁺ is a vector of the lengths of the longest borders of all prefixes of x:
β[i] = |Border(x[1..i])| for i = 1, 2, . . . , n. □
Definition 4.8 (Border array of a finite set of strings)
The mborder array mβ[1..n] of a set of strings S = {x₁, x₂, . . . , x_{|S|}} is a vector of the lengths of the longest borders of all prefixes of strings from S:
mβ[h] = |mBorder({x₁, x₂, . . . , x_{i−1}, xᵢ[1..j], x_{i+1}, . . . , x_{|S|}})| for i ∈ <1, |S|>, j ∈ <1, |xᵢ|>.
The values of variable h are used for the labelling of states of the finite automaton accepting set S, h ≤ Σ_{l=1}^{|S|} |x_l|.
□
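The border array of Definition 4.7 is computable in linear time with the classic failure-function recurrence (a sketch; the function name is ours): the longest border of x[1..i] is obtained by trying to extend the longest border of x[1..i−1], falling back through shorter borders on a mismatch.

```python
def border_array(x):
    """beta[i-1] = |Border(x[1..i])| for i = 1..n (Definition 4.7)."""
    n = len(x)
    beta = [0] * (n + 1)   # beta[i] for prefix length i; beta[0] = beta[1] = 0
    b = 0                  # length of the current longest border
    for i in range(2, n + 1):
        c = x[i - 1]
        while b > 0 and x[b] != c:
            b = beta[b]    # fall back to the next shorter border
        b = b + 1 if x[b] == c else 0
        beta[i] = b
    return beta[1:]
```

For example, border_array("abab") = [0, 0, 1, 2], matching Border(a) = ε, Border(ab) = ε, Border(aba) = a, Border(abab) = ab.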
Definition 4.9 (Exact repetition in one string)
Let T be a string, T = a₁a₂ . . . aₙ, and let aᵢ = aⱼ, a_{i+1} = a_{j+1}, . . . , a_{i+m} = a_{j+m}, i < j, m ≥ 0. String x₂ = aⱼa_{j+1} . . . a_{j+m} is an exact repetition of string x₁ = aᵢa_{i+1} . . . a_{i+m}. x₁ and x₂ are called repeating factors in text T.
□
Definition 4.10 (Exact repetition in a set of strings)
Let S be a set of strings, S = {x₁, x₂, . . . , x_{|S|}}, and let x_{p,i} = x_{q,j}, x_{p,i+1} = x_{q,j+1}, . . . , x_{p,i+m} = x_{q,j+m}, where p ≠ q, or p = q and i < j, m ≥ 0. String x_{q,j}x_{q,j+1} . . . x_{q,j+m} is an exact repetition of string x_{p,i}x_{p,i+1} . . . x_{p,i+m}.
□
Definition 4.11 (Approximate repetition in one string)
Let T be a string, T = a₁a₂ . . . aₙ, and D(aᵢa_{i+1} . . . a_{i+m}, aⱼa_{j+1} . . . a_{j+m′}) ≤ k, where m, m′ ≥ 0.

Output: mborder array mβ[1..n], n = Σ_{i=1}^{|S|} |xᵢ|.
Method:
1. Construct nondeterministic factor automaton M₁ for S.
2. Construct equivalent deterministic factor automaton M₂ and preserve d-subsets.
3. Extract the backbone of M₂, creating automaton M₃ = (Q, A, δ, q₀, F).
4. Set n := |Q| − 1.
5. Initialize all elements of mborder array mβ[1..n] by the value zero.
6. Do the analysis of multiple d-subsets of automaton M₃ from left to right (starting with states having minimal depth): if the d-subset has the form {i₁, i₂, . . . , i_h} (this sequence is ordered according to the depth of each state), then set mβ[j] := i₁ for j = i₂, i₃, . . . , i_h. □
          a       b
 0       13₁4₂   23₂4₁
 13₁4₂           24₁
 23₂4₁   3₁4₂    3₂
 24₁     3₁      3₂
 3₁4₂            4₁
 3₂      4₂
 3₁              4₁
 4₁
 4₂

Table 4.1: Transition table of deterministic factor automaton M₂ from Example 4.24
Note: We will use a labelling of states reflecting their depth instead of the running numbering as in Definition 4.8.
Example 4.24
Let us construct the mborder array for the set of strings S = {abab, abba}. In the first step, we construct nondeterministic factor automaton M₁ for set S. Its transition diagram is depicted in Fig. 4.8. Table 4.1 is the transition table of deterministic factor automaton M₂. Its transition diagram is depicted in Fig. 4.9.
Figure 4.8: Transition diagram of nondeterministic factor automaton M₁ for set S = {abab, abba} from Example 4.24
The dashed lines and circles show the part of the automaton outside the backbone. Therefore the backbone of M₃ is drawn by solid lines. Now we do the analysis of the d-subsets. The result is in the next table:

Analyzed state   Values of mborder array elements
13₁4₂            mβ(3₁) = mβ(4₂) = 1
24₁              mβ(4₁) = 2
Figure 4.9: Transition diagram of deterministic factor automaton M₂ for the set S = {abab, abba} from Example 4.24
The resulting mborder array mβ(S) is shown in the next table:

state       1   2   3₁   3₂   4₁   4₂
symbol      a   b   a    b    b    a
mβ[state]   0   0   1    0    2    1
□
4.4 Repetitions
4.4.1 Classification of repetitions
Problems of repetitions of factors in a string over a finite alphabet can be classified according to various criteria. We will use five criteria for the classification of repetition problems, leading to a five-dimensional space in which each point corresponds to a particular problem of repetition of a factor in a string. Let us make a list of all dimensions, including the possible values in each dimension:
1. Number of strings:
- one,
- finite number greater than one,
- infinite number.
2. Repetition of factors (see Definition 4.12):
- with overlapping,
- square,
- with gap.
3. Specification of the factor:
- repeated factor is given,
- repeated factor is not given,
- length l of the repeated factor is given exactly,
- length of the repeated factor is less than given l,
- length of the repeated factor is greater than given l,
- finding the longest repeated factor.
4. The way of finding repetitions:
- exact repetition,
- approximate repetition with Hamming distance (R-repetition),
- approximate repetition with Levenshtein distance (DIR-repetition),
- approximate repetition with generalized Levenshtein distance (DIRT-repetition),
- δ-approximate repetition,
- γ-approximate repetition,
- (δ, γ)-approximate repetition.
5. Importance of symbols in factor:
- take care of all symbols,
- don't care about some symbols.
The above classification is visualised in Figure 4.10. If we count the number of possible problems of finding repetitions in a string, we obtain
N = 3 · 3 · 2 · 7 · 2 = 252.
Figure 4.10: Classification of repetition problems
In order to facilitate references to a particular problem of repetition in a string, we will use abbreviations for all problems. These abbreviations are summarized in Table 4.2.

Dimension   1   2   3   4        5
            O   O   F   E        C
            F   S   N   R        D
            I   G       D
                        T
                        δ
                        γ
                        (δ, γ)

Table 4.2: Abbreviations of repetition problems

Using this method we can, for example, refer to the overlapping exact repetition in one string of a given factor where all symbols are considered as the OOFEC problem.
Instead of a single repetition problem we will use the notion of a family of repetition problems. In this case we will use the symbol ? instead of a particular symbol. For example, ?S??? is the family of all problems concerning square repetitions.
Each repetition problem can have several instances:
1. verify whether some factor is repeated in the text or not,
2. find the first repetition of some factor,
3. find the number of all repetitions of some factor,
4. find all repetitions of some factor and where they are.
If we take into account all possible instances, the number of repetition-in-string problems grows further.
4.4.2 Exact repetitions in one string
In this section we will show how to use a factor automaton for finding exact repetitions in one string (the O?NEC problem). The main idea is based on the construction of the deterministic factor automaton. First, we construct a nondeterministic factor automaton for a given string. The next step is to construct the equivalent deterministic factor automaton. During this construction, we memorize the d-subsets. The repetitions that we are looking for are obtained by analyzing these d-subsets. The next algorithm describes the computation of d-subsets of a deterministic factor automaton.
Algorithm 4.25
Computation of repetitions in one string.
Input: String T = a₁a₂ . . . aₙ.
Output: Deterministic factor automaton M_D accepting Fact(T) and d-subsets for all states of M_D.
Method:
1. Construct nondeterministic factor automaton M_N accepting Fact(T):
(a) Construct finite automaton M accepting string T = a₁a₂ . . . aₙ and all its prefixes:
M = ({q₀, q₁, q₂, . . . , qₙ}, A, δ, q₀, {q₀, q₁, . . . , qₙ}),
where δ(qᵢ, a_{i+1}) = q_{i+1} for all i ∈ <0, n − 1>.
(b) Construct finite automaton M_ε with ε-transitions (step 1.b); the factor automaton M_N obtained after the removal of the ε-transitions in step 1.c has the transition diagram depicted in Fig. 4.12.
Figure 4.12: Transition diagram of factor automaton M_N after the removal of ε-transitions in step 1.c of Algorithm 4.25
The next example shows the construction of the deterministic factor automaton and the analysis of the d-subsets.
Let us make a note concerning labelling: labels used as the names of states are selected in order to indicate positions in the string. This labelling will be useful later.
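The d-subset analysis just described can be sketched directly (our own naming; a compact stand-in for Algorithm 4.25 rather than its verbatim steps): for every factor we compute its d-subset, the set of end positions of its occurrences, by simulating the deterministic factor automaton, and keep exactly those factors whose d-subset has at least two elements.

```python
def repeating_factors(text):
    """Return {repeating factor: its d-subset of end positions}.
    A factor repeats iff its d-subset has >= 2 elements (cf. Lemma 4.29)."""
    n = len(text)
    result = {}
    factors = {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    for u in factors:
        d = frozenset(range(n + 1))            # epsilon-closure of state 0
        for c in u:
            d = frozenset(p + 1 for p in d if p < n and text[p] == c)
        if len(d) >= 2:
            result[u] = set(d)
    return result
```

For T = ababa the repeating factors are a, b, ab, ba and aba, with d-subsets such as {1, 3, 5} for a and {3, 5} for aba.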
Example 4.26
Let us use text T = ababa. At first, we construct nondeterministic factor automaton M_N(ababa) = (Q_N, A, δ_N, 0, Q_N).

δ_N(x + 1, a₂) = x + 2,    δ_N(y + 1, a₂) = y + 2,
δ_N(x + 2, a₃) = x + 3,    δ_N(y + 2, a₃) = y + 3,
. . .
δ_N(x + m − 1, aₘ) = x + m,    δ_N(y + m − 1, aₘ) = y + m.

Deterministic factor automaton M_D(T) = (Q_D, A, δ_D, D₀, Q_D) then contains states D₀, D₁, D₂, . . . , Dₘ having this property:
δ_D(D₀, a₁) = D₁,    x + 1, y + 1 ∈ D₁,
δ_D(D₁, a₂) = D₂,    x + 2, y + 2 ∈ D₂,
. . .
δ_D(D_{m−1}, aₘ) = Dₘ,    x + m, y + m ∈ Dₘ.
We can conclude that the d-subset Dₘ contains the pair x + m, y + m.
□
Lemma 4.29
Let T be a string and let M_D(T) be the deterministic factor automaton for T with states labelled by corresponding d-subsets. If a d-subset Dₘ contains two elements x + m and y + m, then there exists a factor u = a₁a₂ . . . aₘ, m ≥ 1, starting at both positions x and y in string T.
Proof
Let M_N(T) be the nondeterministic factor automaton for T. If a d-subset Dₘ contains elements x + m, y + m, then it holds for δ_N of M_N(T):
x + m, y + m ∈ δ_N(0, aₘ), and
δ_N(x + m − 1, aₘ) = x + m,    δ_N(y + m − 1, aₘ) = y + m for some aₘ ∈ A.
Then the d-subset D_{m−1} such that δ_D(D_{m−1}, aₘ) = Dₘ must contain x + m − 1, y + m − 1 such that x + m − 1, y + m − 1 ∈ δ_N(0, a_{m−1}),
δ_N(x + m − 2, a_{m−1}) = x + m − 1,    δ_N(y + m − 2, a_{m−1}) = y + m − 1,
and for the same reason d-subset D₁ must contain x + 1, y + 1 such that x + 1, y + 1 ∈ δ_N(0, a₁) and
δ_N(x, a₁) = x + 1,    δ_N(y, a₁) = y + 1.
Then there exists the sequence of transitions in M_D(T):
(D₀, a₁a₂ . . . aₘ) ⊢ (D₁, a₂ . . . aₘ) ⊢ (D₂, a₃ . . . aₘ) ⊢ . . . ⊢ (D_{m−1}, aₘ) ⊢ (Dₘ, ε),
Figure 4.17: Repeated factor u = a₁a₂ . . . aₘ in M_D(T)
where
x + 1, y + 1 ∈ D₁, . . . , x + m, y + m ∈ Dₘ.
This sequence of transitions corresponds to two different sequences of transitions in M_N(T) going through state x + 1:
(0, a₁a₂ . . . aₘ) ⊢ (x + 1, a₂ . . . aₘ) ⊢ (x + 2, a₃ . . . aₘ) ⊢ . . . ⊢ (x + m − 1, aₘ) ⊢ (x + m, ε),
(x, a₁a₂ . . . aₘ) ⊢ (x + 1, a₂ . . . aₘ) ⊢ (x + 2, a₃ . . . aₘ) ⊢ . . . ⊢ (x + m − 1, aₘ) ⊢ (x + m, ε).
Similarly, two sequences of transitions go through state y + 1:
(0, a₁a₂ . . . aₘ) ⊢ (y + 1, a₂ . . . aₘ) ⊢ (y + 2, a₃ . . . aₘ) ⊢ . . . ⊢ (y + m − 1, aₘ) ⊢ (y + m, ε),
(y, a₁a₂ . . . aₘ) ⊢ (y + 1, a₂ . . . aₘ) ⊢ (y + 2, a₃ . . . aₘ) ⊢ . . . ⊢ (y + m − 1, aₘ) ⊢ (y + m, ε).
It follows from this that the factor u = a₁a₂ . . . aₘ is present twice in string T, at the different positions x + 1 and y + 1. □
The following lemma is a simple consequence of Lemma 4.29.
Lemma 4.30
Let u be a repeating factor in string T. Then all factors of u are also repeating factors in T. □
Definition 4.31
If u is a repeating factor in text T and there is no longer factor of the form vuw, where v ≠ ε or w ≠ ε, which is also a repeating factor, then we call u a maximal repeating factor. □
Definition 4.32
Let M_D(T) be a deterministic factor automaton. The depth of each state D of M_D is the length of the longest sequence of transitions leading from the initial state to state D. □
If there exists a sequence of transitions from the initial state to state D which is shorter than the depth of D, it corresponds to a suffix of the maximal repeating factor.
Lemma 4.33
Let u be a maximal repeating factor in string T. The length of this factor is equal to the depth of the state in M_D(T) indicating the repetition of u.
□
Proof
The path for maximal repeating factor u = a₁a₂ . . . aₘ starts in the initial state, because states x + 1 and y + 1 of the nondeterministic factor automaton M_N(T) are direct successors of its initial state, and therefore δ_D(D₀, a₁) = D₁ and x + 1, y + 1 ∈ D₁. Therefore there exists a sequence of transitions in the deterministic factor automaton M_D(T):
(D₀, a₁a₂ . . . aₘ) ⊢ (D₁, a₂ . . . aₘ) ⊢ (D₂, a₃ . . . aₘ) ⊢ . . . ⊢ (D_{m−1}, aₘ) ⊢ (Dₘ, ε).
□
There follows one more observation from Example 4.26.
Lemma 4.34
If some state in M_D(T) has a corresponding d-subset containing one element only, then its successor also has a corresponding d-subset containing one element.
Proof
This follows from the construction of the deterministic factor automaton. The transition table of the nondeterministic factor automaton M_N(T) has more than one state only in the row for the initial state. All other states have at most one successor for a particular input symbol. Therefore in the equivalent deterministic factor automaton M_D(T) the state corresponding to a d-subset having one element may have only one successor for one symbol, and this state has a corresponding d-subset containing just one element. □
We can use this observation during the construction of deterministic factor automaton M_D(T) in order to find some repetition. It is enough to construct only the part of M_D(T) containing d-subsets with at least two elements. The rest of M_D(T) gives no information on repetitions.
Algorithm 4.35
Constructing a repetition table containing exact repetitions in a given string.
Input: String T = a₁a₂ . . . aₙ.
Output: Repetition table R for string T.
Method:
1. Construct deterministic factor automaton M_D(T) = (Q_D, A, δ_D, 0, Q_D) for given string T. Memorize for each state q ∈ Q_D:
(a) the d-subset D(q) = {r₁, r₂, . . . , r_p},
(b) d = depth(q),
(c) the maximal repeating factor for state q: maxfactor(q) = x, |x| = d.
2. Create rows in repetition table R for each state q having D(q) with more than one element:
(a) the row for maximal repeating factor x of state q has the form:
(r₁, r₂, . . . , r_p, x, (r₁, F), (r₂, X₂), (r₃, X₃), . . . , (r_p, X_p)),
where Xᵢ, 2 ≤ i ≤ p, is equal to
i. O, if rᵢ − r_{i−1} < d,
ii. S, if rᵢ − r_{i−1} = d,
iii. G, if rᵢ − r_{i−1} > d,
(b) for each suffix y of x (such that the row for y was not created before) create the row of the form:
(r₁, r₂, . . . , r_p, y, (r₁, F), (r₂, X₂), (r₃, X₃), . . . , (r_p, X_p)),
where Xᵢ, 2 ≤ i ≤ p, is deduced in the same manner. □
An example of the repetition table is shown in Example 4.26 for string
T = ababa.
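The classification step of Algorithm 4.35 can be sketched as follows. This simplified version (our own naming) emits one row per repeating factor directly instead of grouping by maximal repeating factor, but the F/O/S/G marks are computed exactly as in step 2: the distance of consecutive end positions is compared with the factor length d.

```python
def repetition_table(text):
    """{repeating factor: [(end position, mark), ...]} with marks
    F (first), O (overlapping), S (square), G (gap) as in Algorithm 4.35."""
    n = len(text)
    table = {}
    for u in {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}:
        # 1-based end positions of all occurrences of u (the d-subset)
        ends = sorted(i + len(u) for i in range(n) if text.startswith(u, i))
        if len(ends) < 2:
            continue                     # not a repeating factor
        d = len(u)
        row = [(ends[0], "F")]
        for prev, r in zip(ends, ends[1:]):
            gap = r - prev
            row.append((r, "O" if gap < d else "S" if gap == d else "G"))
        table[u] = row
    return table
```

For T = ababa this reproduces the repetition structure of Example 4.26: a repeats with gaps, ab and ba form squares, and aba repeats with overlapping.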
140
4.4.3 Complexity of computation of exact repetitions
The time and space complexity of the computation of exact repetitions in
string is treated in this Chapter.
The time complexity is composed of two parts:
1. The complexity of the construction of the deterministic factor automa-
ton. If we take the number of states and transitions of the resulting
factor automaton then the complexity is linear. More exactly the
number of its states is
NS 2n 2,
and the number of transitions is
NT 3n 4.
2. The second part of the overall complexity is the construction of repeti-
tion table. The number of rows of this table is the number of dierent
multiple d-subsets. The highest number of multiple d-subsets has the
factor automaton for text T = a
n
. Repeating factors of this text are
a, a
2
, . . . , a
n1
.
There is necessary, for the computation of repetitions using factor automata
approach, to construct the part of deterministic factor automaton containing
only all multiple states. It is the matter of fact, that a simple state has
at most one next state and it is simple one, too. Therefore, during the
construction of deterministic factor automaton, we can stop construction of
the next part of this automaton as soon as we reach a simple state.
Example 4.36
Let us have text T = a^n, n > 0. Let us construct the deterministic factor automaton M_D(a^n) for text T. The transition diagram of this automaton is depicted in Fig. 4.18. Automaton M_D(a^n) has n + 1 states and n transitions.
Figure 4.18: Transition diagram of deterministic factor automaton M_D(a^n) for text T = a^n from Example 4.36
The number of multiple states is n − 1.
To construct this automaton in order to find all repetitions, we must construct the whole automaton, including the initial state and the state n (the terminal state). Repetition table R has the form shown in Table 4.6. □
The opposite case to the previous one is a text composed of symbols which are all different. The length of such a text is limited by the size of the alphabet.
d-subset          Factor    List of repetitions
1, 2, . . . , n   a         (1, F), (2, S), (3, S), . . . , (n, S)
2, . . . , n      aa        (2, F), (3, O), (4, O), . . . , (n, O)
. . .
n − 1, n          a^{n−1}   (n − 1, F), (n, O)

Table 4.6: Repetition table R for text T = a^n from Example 4.36
Example 4.37
Let the alphabet be A = {a, b, c, d} and text T = abcd. The deterministic factor automaton M_D(abcd) for text T has the transition diagram depicted in Fig. 4.19. Automaton M_D(abcd) has n + 1 states and 2n − 1 transitions.
Figure 4.19: Transition diagram of deterministic factor automaton M_D(abcd) for text T = abcd from Example 4.37
All respective d-subsets are simple. To construct this automaton in order to find all repetitions, we must construct all next states of the initial state for all symbols of the text. The number of these states is just n. The repetition table is empty. □
Now, after the presentation of both limit cases, we will try to find some case in between with the maximal complexity. We guess that the next example shows it. The text is selected in such a way that all proper suffixes of the prefix of the text appear in it, and therefore they are repeating.
Example 4.38
Let the text be T = abcdbcdcdd. Deterministic factor automaton M_D(T) has the transition diagram depicted in Fig. 4.20. Automaton M_D has 17 states and 25 transitions, while text T has 10 symbols. The number of multiple d-subsets is 6. To construct this automaton in order to find all repetitions, we must construct all multiple states and moreover the states corresponding to the simple d-subsets 0, 1, 5, 8, A.
The result is that we must construct 11 states from the total number of 17 states. Repetition table R is shown in Table 4.7. □
Figure 4.20: Transition diagram of deterministic factor automaton M_D(T) for text T = abcdbcdcdd from Example 4.38

d-subset   Factor   List of repetitions
25         b        (2, F), (5, G)
36         bc       (3, F), (6, G)
47         bcd      (4, F), (7, S)
368        c        (3, F), (6, G), (8, G)
479        cd       (4, F), (7, G), (9, G)
479A       d        (4, F), (7, G), (9, G), (10, S)

Table 4.7: Repetition table R for text T = abcdbcdcdd from Example 4.38

It is known that the maximal state and transition complexity of the factor automaton is reached for text T = ab^{n−2}c. Let us show such a factor automaton in this context.
Example 4.39
Let the text be T = ab⁴c. Deterministic factor automaton M_D(T) has the transition diagram depicted in Fig. 4.21. Automaton M_D(T) has 10 (= 2·6 − 2) states and 14 (= 3·6 − 4) transitions, while the text has 6 symbols. The number of multiple states is 3 (= 6 − 3). To construct this automaton in order to find all repetitions, we must construct the 3 multiple states and moreover 3 simple states. Therefore we must construct 6 states from the total number of 10 states. Repetition table R is shown in Table 4.8. □
Figure 4.21: Transition diagram of deterministic factor automaton M_D(T) for text T = ab⁴c from Example 4.39
d-subset List of repetitions
2345 (b, F)(2, F), (3, S), (4, S), (5, S)
345 (bb, F)(3, F), (4, O), (5, O)
45 (bbb, F)(4, F), (5, O)
Table 4.8: Repetition table R for text T = ab
4
c from Example 4.39
We have used, in the previous examples, three measures of complexity:
1. The number of multiple states of the deterministic factor automaton.
This number is equal to the number of rows in the resulting repetition
table, because each row of the repetition table corresponds to one
multiple d-subset. Moreover, it corresponds to the number of repeating
factors.
2. The number of states which it is necessary to construct in order to
get all information on repetitions. We must reach a simple state on
all paths starting in the initial state. We already know that there
is at most one successor of a simple state and it is a simple state,
too (Lemma 4.34). The number of such states which it is necessary to
construct is therefore greater than the number of multiple states.
3. The total number of repetitions (occurrences) of all repeating factors
in the text. This number corresponds to the number of items in the last
column of the repetition table, headed List of repetitions.
The results concerning the measures of complexity from the previous examples
are summarized in Table 4.9.
Text                              No. of multiple  No. of necessary  No. of repetitions
                                  states           states
a^n                               n − 1            n + 1             (n² + n − 2)/2
a_1 a_2 … a_n                     0                n + 1             0
(all symbols unique)
a_1 a_2 … a_m a_2 … a_m …         (m² − m)/2       (m² + m)/2        Σ_{i=1}^{m−1} i(m − i + 1)
  a_{m−1} a_m a_m
ab^{n−2} c                        n − 3            n                 (n² − 3n)/2

Table 4.9: Measures of complexity from Examples 4.36, 4.37, 4.38, 4.39
Let us show how the complexity measures from Table 4.9 have been
computed.
Example 4.40
Text T = a^n has been used in Example 4.36. The number of multiple
states is n − 1, which is the number of repeating factors. The number of
necessary states is n + 1 because the initial and the terminal states must be
constructed. The number of repetitions is given by the sum:
n + (n − 1) + (n − 2) + … + 2 = (n² + n − 2)/2. □
Example 4.41
Text T = abcd has been used in Example 4.37. This automaton has no
multiple state. The number of necessary states is n + 1. This means that in
order to recognize that no repetition exists in such a text, all states of this
automaton must be constructed. □
Example 4.42
Text T = abcdbcdcdd used in Example 4.38 has a very special form. It consists
of prefix abcd followed by all its proper suffixes. It is possible to construct
such a text only for some n. Length n of the text must satisfy the condition:
n = Σ_{i=1}^{m} i = (m² + m)/2,
where m is the length of the prefix in question. It follows that
m = (−1 + √(1 + 8n))/2
and therefore m = O(√n).
The number of multiple states is
(m − 1) + (m − 2) + … + 1 = (m² − m)/2.
The number of necessary states we must increase by m, which is the
number of simple states being next states of the multiple states and the
initial state. Therefore the number of necessary states is (m² + m)/2. The
number of repetitions is
m + 2(m − 1) + 3(m − 2) + … + (m − 1) · 2 = Σ_{i=1}^{m−1} i(m − i + 1).
Therefore this number is O(m²) = O(n). □
Example 4.43
Text T = ab^{n−2}c used in Example 4.39 leads to the factor automaton having
the maximal number of states and transitions. The number of multiple states is
equal to n − 3 and the number of necessary states is equal to n. The number
of repetitions is
(n − 2) + (n − 3) + … + 2 = (n² − 3n)/2. □
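The counts derived in Examples 4.40 to 4.43 can be cross-checked by brute force for small texts. The following sketch (not part of the lectures; it enumerates all factors, so it is only suitable for short texts) counts the distinct end-position sets of repeating factors, which is what the multiple states of the deterministic factor automaton represent, and the total number of occurrences of all repeating factors:

```python
def repetition_measures(t):
    """Brute-force check of two measures from Table 4.9:
    - number of multiple states (distinct end-position sets shared
      by repeating factors),
    - total number of occurrences of all repeating factors."""
    n = len(t)
    endpos = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            endpos.setdefault(t[i:j], set()).add(j)
    repeating = [e for e in endpos.values() if len(e) > 1]
    multiple_states = len({frozenset(e) for e in repeating})
    repetitions = sum(len(e) for e in repeating)
    return multiple_states, repetitions

# a^n: n - 1 multiple states, (n^2 + n - 2)/2 repetitions (n = 5)
print(repetition_measures("aaaaa"))       # (4, 14)
# text of Example 4.38
print(repetition_measures("abcdbcdcdd"))  # (6, 16)
# ab^{n-2}c: n - 3 states, (n^2 - 3n)/2 repetitions (n = 6)
print(repetition_measures("abbbbc"))      # (3, 9)
```

The three printed pairs reproduce the corresponding rows of Table 4.9.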
It follows from the described experiments that the complexity of determin-
istic factor automata, and therefore the complexity of the computation of
repetitions for a text of length n, has these properties:
1. The number of multiple states is linear. It means that the repetition
table has O(n) rows. This is the space complexity of the computation of
all repeated factors.
2. The number of necessary states is again linear. It means that the time
complexity of the computation of all repeated factors is O(n).
3. The number of repetitions is O(n²), which is the time and space com-
plexity of the computation of all occurrences of repeated factors.
4.4.4 Exact repetitions in a nite set of strings
The idea of using a factor automaton for finding exact repetitions in one
string can also be used for finding exact repetitions in a finite set of strings
(F?NEC problem). The next algorithm is an extension of Algorithm 4.25 for a
finite set of strings.
Algorithm 4.44
Computation of repetitions in a finite set of strings.
Input: Set of strings S = {x₁, x₂, …, x_|S|}, x_i ∈ A⁺, i = 1, 2, …, |S|.
Output: Multiple front end MFE(M_D) of deterministic factor automaton
M_D accepting Fact(S) and d-subsets for all states of MFE(M_D).
Method:
1. Construct nondeterministic factor automata M_i for all strings x_i, i =
1, 2, …, |S|:
(a) Construct finite automaton M_i accepting string x_i and all its
prefixes for all i = 1, 2, …, |S|.
(b) Construct finite automaton M′_i from automaton M_i by inserting
ε-transitions from the initial state to all other states for all i =
1, 2, …, |S|.
2. Construct automaton M′ accepting set Fact(abab) ∪ Fact(abba). Its
transition diagram is depicted in Fig. 4.24.
In the third step we construct automaton M_N by removing ε-transitions
from automaton M′.

Figure 4.24: Transition diagram of automaton M′ accepting set
Fact(abab) ∪ Fact(abba) from Example 4.45
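The determinization of this example can be replayed with a small subset construction. In this sketch (not from the lectures) each NFA position is a pair (string index, end position), and the initial state stands for the ε-closure of state 0, i.e. the set of all positions:

```python
def det_factor_automaton(strings):
    """Subset construction of the deterministic factor automaton for
    a finite set of strings. A d-subset is a frozenset of
    (string index, end position) pairs; the initial state contains
    every position, mirroring the inserted epsilon-transitions."""
    alphabet = sorted({c for x in strings for c in x})
    initial = frozenset((i, j) for i, x in enumerate(strings)
                        for j in range(len(x) + 1))

    def step(state, a):
        return frozenset((i, j + 1) for (i, j) in state
                         if j < len(strings[i]) and strings[i][j] == a)

    delta, seen, todo = {}, {initial}, [initial]
    while todo:
        s = todo.pop()
        for a in alphabet:
            t = step(s, a)
            if t:
                delta[(s, a)] = t
                if t not in seen:
                    seen.add(t)
                    todo.append(t)
    return initial, delta, seen

initial, delta, states = det_factor_automaton(["abab", "abba"])
print(len(states))  # 9 states, as in Table 4.10
# multiple d-subsets (more than one element, initial state excluded)
print(sum(1 for s in states if s != initial and len(s) > 1))  # 4
```

The four multiple d-subsets found here correspond to the four rows of the repetition table for S = {abab, abba}.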
Figure 4.25: Transition diagram of nondeterministic factor automaton M_N
accepting set Fact(abab) ∪ Fact(abba) from Example 4.45
            a          b
0           1₁1₂3₁4₂   2₁2₂3₂4₁
1₁1₂3₁4₂               2₁2₂4₁
2₁2₂3₂4₁    3₁4₂       3₂
2₁2₂4₁      3₁         3₂
3₁4₂                   4₁
3₂          4₂
3₁                     4₁
4₁
4₂

Table 4.10: Transition table of deterministic factor automaton M_D from
Example 4.45
The last step is to construct deterministic factor automaton M_D. Its
transition table is shown in Table 4.10. The transition diagram of the re-
sulting deterministic factor automaton M_D is depicted in Fig. 4.26.

Figure 4.26: Transition diagram of deterministic factor automaton M_D ac-
cepting set Fact(abab) ∪ Fact(abba) from Example 4.45; the multiple front
end is marked

Now we do the analysis of d-subsets of the resulting automaton M_D. The
result of this analysis is the repetition table shown in Table 4.11 for set
S = {abab, abba}. □
Definition 4.46
Let S be a set of strings, S = {x₁, x₂, …, x_|S|}. The repetition table for S
contains the following items:
1. d-subset,
2. corresponding factor,
d-subset    Factor  List of repetitions
1₁1₂3₁4₂    a       (1, 1, F), (2, 1, F), (1, 3, G), (2, 4, G)
2₁2₂4₁      ab      (1, 2, F), (2, 2, F), (1, 4, S)
2₁2₂3₂4₁    b       (1, 2, F), (2, 2, F), (2, 3, S), (1, 4, G)
3₁4₂        ba      (1, 3, F), (2, 4, F)

Table 4.11: Repetition table for set S = {abab, abba} from Example 4.45
3. list of repetitions of the factor containing elements of the form (i, j, X_ij),
where i is the index of the string in S,
j is the position in string x_i,
X_ij is the type of repetition:
F - the first occurrence of the factor in string x_i,
O - repetition of the factor in x_i with overlapping,
S - repetition as a square in x_i,
G - repetition with a gap in x_i.
□
Let us suppose that each element of a d-subset constructed by Algorithm 4.44
keeps two kinds of information:
- the index of the string in S to which it belongs,
- the depth (position) in this string.
Moreover, we suppose that it is possible to identify the longest factor (max-
imal repeating factor) to which the d-subset belongs.
Algorithm 4.47
Constructing a repetition table containing exact repetitions in a finite set of
strings.
Input: Multiple front end of the factor automaton for set of strings S =
{x₁, x₂, …, x_|S|}, x_i ∈ A⁺, i = 1, 2, …, |S|.
Output: Repetition table R for set S.
Method: Let us suppose that the d-subset of multiple state q has the form
r₁, r₂, …, r_p. Create a row in repetition table R for each multiple state q;
the row for maximal repeating factor x of state q has the form:
(r₁, r₂, …, r_p, x, (i₁, j₁, F), (i₂, j₂, X_{i₂,j₂}), …, (i_p, j_p, X_{i_p,j_p})),
where i_l is the index of the string in S, l = 1, 2, …, p,
j_l is the position in string x_{i_l}, l = 1, 2, …, p,
X_{i_l,j_l} is the type of repetition:
i: O, if j_l − j_{l−1} < |x|,
ii: S, if j_l − j_{l−1} = |x|,
iii: G, if j_l − j_{l−1} > |x|. □
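The classification of occurrences into F/O/S/G in the method above depends only on the gap between consecutive positions and the factor length. A minimal sketch (not from the lectures):

```python
def classify_repetitions(positions, factor_len):
    """Annotate the occurrence end positions of one factor with the
    repetition types of Algorithm 4.47: F (first occurrence),
    O (overlapping, gap < |x|), S (square, gap = |x|),
    G (gap, gap > |x|)."""
    result = [(positions[0], "F")]
    for prev, cur in zip(positions, positions[1:]):
        gap = cur - prev
        kind = "O" if gap < factor_len else "S" if gap == factor_len else "G"
        result.append((cur, kind))
    return result

# rows of Table 4.7 (text abcdbcdcdd)
print(classify_repetitions([4, 7], 3))         # bcd: [(4, 'F'), (7, 'S')]
print(classify_repetitions([4, 7, 9, 10], 1))  # d:   F, G, G, S
```

For a set of strings, the same test would be applied per string, with gaps only compared between positions in the same string.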
4.4.5 Computation of approximate repetitions
We have used the factor automata for finding exact repetitions in either
one string or in a finite set of strings. A similar approach can be used for
finding approximate repetitions as well. We will use approximate factor
automata for this purpose. As before, a part of the deterministic approximate
factor automaton will be useful for repetition finding. We will call this part
the mixed multiple front end.
Let us suppose that finite automaton M accepts the set
Approx(x) = {y : D(x, y) ≤ k},
where D is a metric and k is the maximum distance. We can divide au-
tomaton M into two parts:
- the exact part, which is used for accepting string x,
- the approximate part, which is used when other strings in Approx(x) are
accepted.
For the next algorithm we need to distinguish states in the exact part from
states in the approximate part. Let us do this by labelling the states in
question.
Definition 4.48
A mixed multiple front end of a deterministic approximate factor automaton
is the part of this automaton containing only multiple states with:
a) d-subsets containing only states from the exact part of the nondeterministic
automaton, or
b) d-subsets containing a mix of states from the exact part and the approximate
part of the nondeterministic automaton, with at least one state from the exact
part.
Let us call such states mixed multiple states. □
We can construct the mixed multiple front end by a slightly modified
Algorithm 3.21. The modification consists in changing point 3(b) in this
way:
(b) if … = Q′(q, a).
4.4.6 Approximate repetitions – Hamming distance
In this Section, we show how to find approximate repetitions using the
Hamming distance (O?NRC problem).
Example 4.49
Let string x = abba. We construct an approximate factor automaton using
Hamming distance k = 1.
1. We construct a finite automaton accepting string x and all strings
with Hamming distance equal to 1. The set of these strings is denoted
H₁(abba). The resulting finite automaton has the transition diagram
depicted in Fig. 4.27.
Figure 4.27: Transition diagram of the Hamming automaton accepting
H₁(abba) with Hamming distance k = 1 from Example 4.49
2. We use the principle of inserting ε-transitions from state 0 to
states 1, 2, 3 and 4, and we fix all states as final states. The transition
diagram with inserted ε-transitions and fixed final states is depicted in
Fig. 4.28.
Figure 4.28: Transition diagram of the Hamming factor automaton with
fixed final states and inserted ε-transitions from Example 4.49
3. We replace ε-transitions by non-ε-transitions. The resulting automa-
ton has the transition diagram depicted in Fig. 4.29.
4. The final operation is the construction of the equivalent deterministic
finite automaton. Its transition table is Table 4.12.
The transition diagram of the resulting deterministic approximate factor
automaton is depicted in Fig. 4.30.
Figure 4.29: Transition diagram of the nondeterministic Hamming factor
automaton after removal of ε-transitions from Example 4.49
          a        b
0         142′3′   231′4′
142′3′    2′4′     23′
231′4′    43′      32′4′
23′       3′4′     3
2′4′               3′
43′       4′
32′4′     4        3′4′
3′4′      4′
3         4        4′
3′        4′
4
4′

Table 4.12: Transition table of the deterministic Hamming factor automa-
ton from Example 4.49
Figure 4.30: Transition diagram of the deterministic Hamming factor au-
tomaton for x = abba, Hamming distance k = 1 from Example 4.49
Construction of the repetition table is based on the following observation con-
cerning mixed multiple states:
1. If only exact states are in the respective d-subset then exact repetitions
take place.
2. Let us suppose that a mixed multiple state corresponding to factor x
has a d-subset containing pair (r₁, r₂), where r₁ is an exact state and r₂
is an approximate state. It means that state r₁ corresponds to exact
factor x. There is a sequence of transitions for factor x also to the
state r₂. Therefore in the exact part there must be a factor y such that
the distance of x and y is at most k. It means that factor y is an
approximate repetition of x.
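This observation can be tested directly by brute force. The following sketch (not from the lectures) lists all pairs of distinct equal-length factors of x within Hamming distance k:

```python
def hamming(u, v):
    """Hamming distance of two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

def approx_factor_pairs(x, k=1, min_len=2):
    """Pairs of distinct factors of x of equal length >= min_len
    whose Hamming distance is at most k."""
    facts = sorted({x[i:j] for i in range(len(x))
                    for j in range(i + 1, len(x) + 1)})
    return {(u, v) for u in facts for v in facts
            if u < v and len(u) == len(v) >= min_len
            and hamming(u, v) <= k}

print(approx_factor_pairs("abba"))  # {('ab', 'bb'), ('ba', 'bb')}
```

These are exactly the factor pairs of length at least two that the mixed multiple states of the Hamming factor automaton for x = abba, k = 1, must reveal.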
Now we construct the approximate repetition table. We take into account
the repetitions of factors which are longer than k. In this case k = 1 and
therefore we select repetitions of factors having length greater than or equal to
two. The next table contains information on the approximate repetitions of
factors of the string x = abba.
d-subset  Factor  Approximate repetitions
Δ₁(abbc). This finite automaton has the transition diagram depicted
in Fig. 4.35.
d-subset  Factor  Approximate repetitions
21234     ab      (2, ab, F), (3, abb, O)
323       abb     (3, abb, F), (2, ab, O), (3, bb, O)
434       abba    (4, abba, F), (3, abb, O), (4, bba, O)
3234      bb      (3, bb, F), (2, ab, O), (3, abb, O), (4, ba, O), (4, bba, O)
41234     ba      (4, ba, F), (3, bb, S), (4, bba, O)
434       bba     (4, bba, F), (3, bb, O), (4, ba, O)

Table 4.15: Approximate repetition table R for string x = abba with Leven-
shtein distance k = 1 from Example 4.50
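Table 4.15 pairs factors whose Levenshtein distance is at most 1; the distance itself is the usual dynamic-programming recurrence. A sketch (not from the lectures):

```python
def levenshtein(u, v):
    """Levenshtein (edit) distance via the rolling-row DP."""
    if len(u) < len(v):
        u, v = v, u
    row = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        prev, row[0] = row[0], i
        for j, b in enumerate(v, 1):
            prev, row[j] = row[j], min(row[j] + 1,      # deletion
                                       row[j - 1] + 1,  # insertion
                                       prev + (a != b)) # substitution
    return row[-1]

# factor pairs reported in Table 4.15
print(levenshtein("ab", "abb"))  # 1
print(levenshtein("bb", "ba"))   # 1
```

Each (position, factor, type) entry in a row of Table 4.15 names a factor at Levenshtein distance at most 1 from the factor in the second column.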
Figure 4.35: Transition diagram of the automaton accepting Δ₁(abbc)
with Δ-distance k = 1 from Example 4.51
2. We use the principle of inserting ε-transitions from state 0 to states 1,
2, 3, and 4 and making all states final states. The transition diagram
of the automaton with inserted ε-transitions and fixed final states is
depicted in Fig. 4.36.
3. We replace ε-transitions by non-ε-transitions. The resulting automa-
ton has the transition diagram depicted in Fig. 4.37.
159
Figure 4.36: Transition diagram of the factor automaton with nal
states xed and transitions inserted from Example 4.51
Figure 4.37: Transition diagram of the nondeterministic factor automa-
ton after the removal of ε-transitions from Example 4.51
4. The final operation is to construct the equivalent deterministic factor
automaton. Table 4.16 is its transition table and its transition diagram
is depicted in Fig. 4.38.
Figure 4.38: Transition diagram of the deterministic factor automaton
for x = abbc, Δ-distance k = 1, from Example 4.51; the mixed multiple
front end is marked
Now we construct the repetition table. We take into account the repetitions
of factors longer than the allowed distance. In this case, the distance is equal
to 1 and therefore we select repetitions of factors having length greater than
or equal to two. Table 4.17 contains information on the approximate
repetitions of factors of the string x = abbc. □
4.4.9 Approximate repetitions – Γ distance
The Γ-distance is defined by Def. 1.18. This distance is defined as a global
distance, which means that the local errors are cumulated. In this section
we show the solution of the O?NΓC problem.
Example 4.52
Let string x = abbc over ordered alphabet A = {a, b, c}. We construct an
approximate factor automaton using Γ-distance equal to two.
1. We construct a finite automaton accepting string x and all strings
having Γ-distance at most two. The set of all these strings is denoted
Γ₂(abbc). This finite automaton has the transition diagram depicted
in Fig. 4.39.
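Assuming Def. 1.18 defines the Γ-distance of equal-length strings over an ordered alphabet as the sum of the absolute differences of symbol positions (the cumulated local errors mentioned above), it can be written as:

```python
def gamma_distance(u, v, order="abc"):
    """Γ-distance: sum of absolute rank differences of corresponding
    symbols; defined for equal-length strings over an ordered
    alphabet (assumed reading of Def. 1.18)."""
    assert len(u) == len(v)
    rank = {c: i for i, c in enumerate(order)}
    return sum(abs(rank[a] - rank[b]) for a, b in zip(u, v))

# strings within Γ-distance 2 of abbc
print(gamma_distance("abbc", "abbc"))  # 0
print(gamma_distance("abbc", "bbbb"))  # 2 (a->b and c->b, one step each)
```

In contrast to the Δ-distance, which bounds the largest single-position error, the Γ-distance lets several small errors accumulate toward the bound.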
161
a b c
0 12
231
42
12
23
23
34
231
32
42
32
43
34
4
43
4
42
, 2
, 3
, 3
, 3
, 43
, 4
, 3
, 3
, 43
, 3
.
Only states 43
and 43
231
42
12
23
231
32
43
23
34
32
43
34
4
42
43
43
4
2
231
42
12
23
231
32
43
23
34
32
43
34
4
42
43
43
4
2
, 3
, 3
, 3
, 3
, 43
, 4
, 43
, 3
.
We replace all equivalent states by the respective sets. Then we obtain the
transition diagram of the optimized deterministic (Δ, Γ) factor automaton
depicted in Fig. 4.46. Now we construct the repetition table. We take
into account the repetitions of factors longer than two (the allowed distance).
Table 4.21 contains information on the approximate repetition of one factor. □
4.4.11 Exact repetitions in one string with don't care symbols
The don't care symbol (◦) is defined by Def. 1.14. The next example shows the
principle of finding repetitions in the presence of don't care symbols
(O?NED problem).
Figure 4.46: Transition diagram of the optimized deterministic (Δ, Γ)
factor automaton for the string x = abbc from Example 4.53; the mixed
multiple front end is marked
d-subset  Factor  Approximate repetitions
34        abb     (3, abb, F), (4, bbc, O)
43        bbc     (4, bbc, F), (3, abb, O)

Table 4.21: Approximate repetition table for string x = abbc, (Δ, Γ) distance
equal to (1, 2), from Example 4.53
Example 4.54
Let string x = a◦aab over alphabet A = {a, b, c}. Symbol ◦ is the don't care
symbol. We construct a don't care factor automaton.
1. We construct a finite automaton accepting the set of strings described by
string x with the don't care symbol. This set is DC(x) = {aaaab, abaab,
acaab}. This finite automaton has the transition diagram depicted in
Fig. 4.47.

Figure 4.47: Transition diagram of the DC automaton accepting DC(x)
from Example 4.54
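Enumerating DC(x) is a direct Cartesian product over the don't care positions. A sketch (not from the lectures), with '*' standing in for the don't care symbol ◦:

```python
from itertools import product

def dc_strings(x, alphabet, dont_care="*"):
    """Expand a pattern with don't care symbols into the set DC(x)
    of all plain strings it describes."""
    slots = [alphabet if c == dont_care else c for c in x]
    return {"".join(combo) for combo in product(*slots)}

print(sorted(dc_strings("a*aab", "abc")))
# ['aaaab', 'abaab', 'acaab']  -- the set DC(x) of Example 4.54
```

The size of DC(x) grows as |A| to the power of the number of don't care symbols, which is why the automaton-based treatment matters.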
2. We insert ε-transitions from state 0 to states 1, 2, 3, 4, and 5 and we
make all states final. The transition diagram of DC
is an unread part
of the text. This configuration means that the pattern was found and its
position is in the text just before w. The searching automaton performs
forward and backward transitions. Transition relation
⊢ ⊂ (Q × A*) × (Q × A*)
is defined in this way:
1. if δ(q, a) = p, then (q, aw) ⊢ (p, w) is a forward transition,
2. if φ(q, x) = p, then (q, w) ⊢ (p, w) is a backward transition, where x is
the suffix of the part of the text read before reaching state q. □
Just one input symbol is read during a forward transition. If δ(q, a) = fail
then a backward transition is performed and no symbol is read. Forward and
backward transition functions δ and φ have the following properties:
1. δ(q₀, a) ≠ fail for all a ∈ A,
2. if φ(q, x) = p then the depth of p is strictly less than the depth of q,
where the depth of state q is the length of the shortest sequence of
forward transitions from state q₀ to state q.
The first condition ensures that no backward transition is performed in
the initial state. The second condition ensures that the total number of
backward transitions is less than the number of forward transitions. It
follows that the total number of performed transitions is less than 2n, where
n is the length of the text.
5.2 MP and KMP algorithms
The MP and KMP algorithms are simulators of the SFOECO automaton
(see Section 2.2.1) for exact matching of one pattern. We will show both
algorithms in the following example. Let us mention that the backward
transition function is simplified:
φ: (Q − {q₀}) → Q.
Example 5.2
Let us construct MP and KMP searching automata for pattern P = ababb
and compare them with the pattern matching automaton for P. The construction
of both the deterministic pattern matching automaton and the MP searching au-
tomaton is shown in Fig. 5.1. For the resulting MP and KMP searching
automata we can construct Table 5.1, containing the forward transition
function δ, the backward transition function φ for the MP algorithm, and
the optimized backward transition function φ_opt for the KMP algorithm.
The reason for the introduction of the optimized backward transition function
φ_opt will follow from the next example. □
Example 5.3
The MP searching automaton for pattern P = ababb and text T = abaababbb
performs the following sequence of transitions (backward transition function
φ is used):
      a     b     φ     φ_opt
0     1     0
1     fail  2     0     0
2     3     fail  0     0
3     fail  4     1     0
4     fail  5     2     2
5     fail  fail  0     0

Table 5.1: Forward transition function δ, backward transition functions φ
and φ_opt from Example 5.2
(0, abaababbb) ⊢ (1, baababbb)
               ⊢ (2, aababbb)
               ⊢ (3, ababbb)   fail
               ⊢ (1, ababbb)   fail
               ⊢ (0, ababbb)
               ⊢ (1, babbb)
               ⊢ (2, abbb)
               ⊢ (3, bbb)
               ⊢ (4, bb)
               ⊢ (5, b)
The pattern is found in state 5 at position 8. □
Let us mention one important observation. We can see that two subsequent
backward transitions are performed from state 3 for the input
symbol a, leading to states 1 and 0. The reason is that δ(3, a) = δ(1, a) =
fail. This variant of the searching automaton is called the MP (Morris-Pratt)
automaton and the related algorithm shown below is called the MP algorithm.
It is possible to compute an optimized backward transition function φ_opt
having in this situation value φ_opt(3) = 0. The result is that each such
sequence of backward transitions is replaced by just one backward transi-
tion. The algorithm using the optimized backward transition function φ_opt is
the KMP (Knuth-Morris-Pratt) algorithm. After this informal explanation, we
show the direct construction of the MP searching automaton and the computation
of the backward and optimized backward transition functions. We start with the
MP searching algorithm.
Figure 5.1: SFOECO and MP searching automata for pattern P = ababb
from Example 5.2
var TEXT: array[1..N] of char;
    PATTERN: array[1..M] of char;
    I, J: integer;
    FOUND: boolean;
...
I := 1; J := 1;
while (I <= N) and (J <= M) do
begin
  while (J > 0) and (TEXT[I] <> PATTERN[J]) do
    J := PHI[J];
  J := J + 1;
  I := I + 1
end;
FOUND := J > M;
...
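A 0-based transcription of the search loop above into Python (this sketch is not from the lectures; PHI is computed inline as the border array):

```python
def mp_search(text, pattern):
    """Return the 0-based position of the first occurrence of pattern
    in text, or -1: the MP scheme with backward function
    phi[j] = |Border(p1...pj)|."""
    m = len(pattern)
    phi = [0] * (m + 1)
    k = 0
    for j in range(2, m + 1):            # preprocessing: border array
        while k > 0 and pattern[k] != pattern[j - 1]:
            k = phi[k]
        if pattern[k] == pattern[j - 1]:
            k += 1
        phi[j] = k
    j = 0                                # number of symbols matched
    for i, c in enumerate(text):
        while j > 0 and pattern[j] != c:
            j = phi[j]                   # backward transition
        if pattern[j] == c:
            j += 1                       # forward transition
        if j == m:
            return i - m + 1
    return -1

print(mp_search("abaababbb", "ababb"))   # 3 (the match of Example 5.3)
```

The returned 0-based start position 3 corresponds to the match that Example 5.3 reports as ending in state 5 at position 8 (1-based).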
Variables used in the MP searching algorithm are shown in Figure 5.2.

Figure 5.2: Variables used in the MP searching algorithm, pos = I − J + 1

The computation of the backward transition function is based on the notion of
repetitions of prefixes of the pattern in the pattern itself. The situation is
depicted in Fig. 5.3.

Figure 5.3: Repetition of prefix v in prefix u of the pattern

If prefix u = u₁u₂…u_{j−1} of the pattern matches the substring of
the text u = t_{i−j+1}t_{i−j+2}…t_{i−1} and u_j ≠ t_i, then it is not
necessary to compare prefix v of the pattern with substring
t_{i−j+2}t_{i−j+3}… of the text string at the next position. Instead of this
comparison we can shift the pattern to the right. The length of this shift is
the length of Border(u).
Example 5.4
Let us show repetitions and periods of prefixes of pattern P = ababb in
Fig. 5.4.

Figure 5.4: Repetitions and periods of P = ababb

The shift represented by the value of the backward transition function
for position j in the pattern is
φ(j) = |Border(p₁p₂…p_j)|.
If there is no repetition of the prefix of the pattern in itself, then the shift
is equal to j, because the period is equal to zero. □
For the computation of the function φ for the pattern P we will use the fact
that the value of φ(j) is equal to the element β[j] of the border array for
the pattern P.
Example 5.5
Let us compute the border array for pattern P = ababb. The transition diagram
of the nondeterministic factor automaton is depicted in Fig. 5.5.

Figure 5.5: Transition diagram of the nondeterministic factor automaton for
pattern P = ababb from Example 5.5

Table 5.2 is the transition table of the equivalent deterministic factor
automaton. The transition diagram of the deterministic factor automaton is
depicted in Fig. 5.6.
      a    b
0     13   245
13         24
24    3    5
245   3    5
3          4
4          5
5

Table 5.2: Transition table of the deterministic factor automaton from Ex-
ample 5.5.
Figure 5.6: Transition diagram of the deterministic factor automaton for
pattern P = ababb from Example 5.5

The analysis of d-subsets on the backbone of the deterministic factor automaton
is shown in this table:

Analyzed state   Value of border array element
13               β[3] = 1
24               β[4] = 2

Values of elements of the border array are shown in this table:

j        1  2  3  4  5
symbol   a  b  a  b  b
β[j]     0  0  1  2  0

Let us recall that φ(j) = β[j]. □
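The border array can also be computed without the factor automaton, by the standard linear-time recurrence (a sketch; this is not the automaton-based method of the lectures):

```python
def border_array(p):
    """beta[j] = length of the longest border of the prefix p[0:j],
    for j = 1..len(p); computed in O(len(p))."""
    m = len(p)
    beta = [0] * (m + 1)
    k = 0
    for j in range(2, m + 1):
        # shrink the candidate border until it can be extended
        while k > 0 and p[k] != p[j - 1]:
            k = beta[k]
        if p[k] == p[j - 1]:
            k += 1
        beta[j] = k
    return beta[1:]

print(border_array("ababb"))    # [0, 0, 1, 2, 0], as in Example 5.5
print(border_array("abaabaa"))  # [0, 0, 1, 1, 2, 3, 4]
```

Both outputs agree with the values obtained from the d-subset analysis.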
The next algorithm constructs the MP searching automaton.
Algorithm 5.6
Construction of MP searching automaton.
Input: Pattern P = p₁p₂…p_m.
Output: MP searching automaton.
Method:
1. The initial state is q₀.
2. Each state q of the MP searching automaton corresponds to a prefix
p₁p₂…p_j of the pattern. δ(q, p_{j+1}) = q′, where q′ corresponds to
prefix p₁p₂…p_j p_{j+1}.
3. The state corresponding to the complete pattern p₁p₂…p_m is the final
state.
4. Define δ(q₀, a) = q₀ for all a for which no transition was defined in
step 2.
5. δ(q, a) = fail for all a ∈ A and q ∈ Q for which δ(q, a) was not defined
in steps 2 and 3.
6. Function φ is the backward transition function. It is equal to the
border array for pattern P. □
The next algorithm computes the optimized backward transition function φ_opt
on the basis of the backward transition function φ.
Algorithm 5.7
Computation of optimized backward transition function φ_opt.
Input: Backward transition function φ.
Output: Optimized backward transition function φ_opt.
Method:
1. φ_opt = φ for all states of depth equal to one.
2. Let us suppose that the function φ_opt has been computed for all states
having depth less than or equal to d. Let q have depth equal to d + 1.
Let us suppose that φ(q) = p. If δ(q, a) = fail and δ(p, a) = fail then
φ_opt(q) = φ_opt(φ(q)) else φ_opt(q) = φ(q). □
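For a one-pattern automaton, the test in step 2 reduces to comparing the single forward symbols of q and φ(q): a sketch of Algorithm 5.7 on that assumption (not the lectures' formulation):

```python
def optimize_phi(pattern, phi):
    """phi_opt per Algorithm 5.7: state j is skipped when its only
    forward symbol pattern[j] is also the forward symbol of phi[j],
    so falling back to phi[j] would fail on the same input."""
    m = len(pattern)
    opt = phi[:]
    for j in range(2, m + 1):            # states by increasing depth
        if j < m and pattern[j] == pattern[phi[j]]:
            opt[j] = opt[phi[j]]
    return opt

# phi for P = abaabaa (the border array of Table 5.4), states 0..7
phi = [0, 0, 0, 1, 1, 2, 3, 4]
print(optimize_phi("abaabaa", phi))  # [0, 0, 0, 1, 0, 0, 1, 4]
```

The printed values agree with the φ_opt column of Table 5.5.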
Example 5.8
Let us construct the KMP searching automaton for pattern P = abaabaa and
alphabet A = {a, b}. First, we construct the forward transition function.
It is depicted in Fig. 5.7.

Figure 5.7: Forward transition function of the KMP searching automaton for
pattern P = abaabaa from Example 5.8

Now we will construct both the nonoptimized backward transition function φ
and the optimized backward transition function φ_opt using Algorithms 4.19
and 5.7. To construct the nonoptimized backward transition function, we
construct the border array for pattern P = abaabaa. The nondeterministic
factor automaton has the transition diagram depicted in Fig. 5.8.

Figure 5.8: Transition diagram of the nondeterministic factor automaton for
pattern P = abaabaa from Example 5.8

The transition diagram of the useful part of the deterministic factor
automaton is depicted in Fig. 5.9.

Figure 5.9: Part of the transition diagram of the deterministic factor au-
tomaton for pattern P = abaabaa from Example 5.8

The analysis of d-subsets of the deterministic factor automaton is summarized
in Table 5.3.
d-subset  Values of elements of border array
13467     β[3] = 1, β[4] = 1, β[5] = 1, β[6] = 1, β[7] = 1
25        β[5] = 2
36        β[6] = 3
47        β[7] = 4

Table 5.3: Computation of the border array for pattern P = abaabaa from
Example 5.8
The resulting border array is in Table 5.4.

Index    1  2  3  4  5  6  7
Symbol   a  b  a  a  b  a  a
β        0  0  1  1  2  3  4

Table 5.4: The border array for pattern P = abaabaa from Example 5.8

The values β[1] and β[2] are equal to zero because strings a and ab have
borders of zero length. This border array is equal to the backward transition
function φ. The optimized backward transition function φ_opt differs for some
indices from function φ. Both functions φ and φ_opt have values according to
Table 5.5.
I   φ[I]  φ_opt[I]
1   0     0
2   0     0
3   1     1
4   1     0
5   2     0
6   3     1
7   4     4

Table 5.5: Functions φ and φ_opt for pattern P = abaabaa from Example 5.8
The complete MP and KMP automata are depicted in Fig. 5.10.

Figure 5.10: KMP searching automaton for pattern P = abaabaa from
Example 5.8; nontrivial values of φ are shown by dashed lines, nontrivial
values of φ_opt are shown by dotted lines, trivial values of both functions
lead to the initial state 0

Let us have a text starting with prefix T = abaabac…. We show the behaviour
of both variants of the KMP automaton. The first one is the MP variant and
it uses φ.
(0, abaabac…) ⊢ (1, baabac…)
              ⊢ (2, aabac…)
              ⊢ (3, abac…)
              ⊢ (4, bac…)
              ⊢ (5, ac…)
              ⊢ (6, c…)   fail
              ⊢ (3, c…)   fail
              ⊢ (1, c…)   fail
              ⊢ (0, c…)
              ⊢ (0, …)
              …
The second one is the KMP variant using φ_opt.
(0, abaabac…) ⊢ (1, baabac…)
              ⊢ (2, aabac…)
              ⊢ (3, abac…)
              ⊢ (4, bac…)
              ⊢ (5, ac…)
              ⊢ (6, c…)   fail
              ⊢ (1, c…)   fail
              ⊢ (0, c…)
              ⊢ (0, …)
              …
We can see from these two sequences of transitions that the MP variant com-
pares symbol c 4 times and the KMP variant compares symbol c 3 times. □
Both variants of the KMP algorithm have linear time and space complexities.
If we have text T = t₁t₂…t_n and pattern P = p₁p₂…p_m then the KMP
algorithm requires [Cro97]:
- 2m − 3 symbol comparisons during the preprocessing phase (computation
of function φ or φ_opt),
- 2n − 1 symbol comparisons during the searching phase,
- m elements of memory to store the values of function φ or φ_opt.
The final result is that the time complexity is O(n + m) and the space
complexity is O(m). Let us note that this complexity is not influenced by
the size of the alphabet, in contrast with deterministic finite automata.
5.3 AC algorithm
The AC (Aho-Corasick) algorithm is the simulator of the SFFECO au-
tomaton (see Section 2.2.5) for exact matching of a finite set of patterns.
It is based on the same principle as the KMP algorithm and uses a searching
automaton with restricted backward transition function:
φ: (Q − {q₀}) → Q.
We start with the construction of the forward transition function.
Algorithm 5.9
Construction of the forward transition function of the AC automaton.
Input: Finite set of patterns S = {P₁, P₂, …, P_|S|}, where P_i ∈ A⁺,
1 ≤ i ≤ |S|.
Output: Deterministic finite automaton M = (Q, A, δ, q₀, F) accepting the
set S.
Method:
1. Q = {q₀}, q₀ is the initial state.
2. Create all possible states. Each new state q of the AC automaton cor-
responds to some prefix a₁a₂…a_j of one or more patterns. Define
δ(q, a_{j+1}) = q′, where q′ corresponds to prefix a₁a₂…a_j a_{j+1} of one or
more patterns. Q = Q ∪ {q′}.
3. For state q₀ define δ(q₀, a) = q₀ for all such a that δ(q₀, a) was not
defined in step 2.
4. δ(q, a) = fail for all q and a for which δ(q, a) was not defined in steps
2 or 3.
5. Each state corresponding to a complete pattern will be a final
state. This holds also in the case when one pattern is either a prefix or a
substring of another pattern. □
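Algorithm 5.9 builds a trie with a partial self loop in the initial state. A dictionary-based sketch (not the lectures' notation; a missing entry plays the role of fail):

```python
def ac_goto(patterns, alphabet):
    """Forward (goto) function of the AC automaton: a trie plus a
    self loop on state 0 for symbols that start no pattern."""
    goto, final = {0: {}}, set()
    state_of = {"": 0}
    for pat in patterns:
        q, pref = 0, ""
        for a in pat:
            pref += a
            if pref not in state_of:
                new = len(state_of)
                state_of[pref] = new
                goto[q][a] = new
                goto[new] = {}
            q = state_of[pref]
        final.add(q)                 # state of the complete pattern
    for a in alphabet:
        goto[0].setdefault(a, 0)     # self loop in the initial state
    return goto, final

goto, final = ac_goto(["ab", "babb", "bb"], "abc")
print(len(goto), len(final))  # 8 states, 3 final states
print(goto[0]["c"])           # 0: self loop, 'c' starts no pattern
```

With S = {ab, babb, bb}, the eight states correspond to the prefixes ε, a, ab, b, ba, bab, babb, bb, as in the forward function of Example 5.12.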
The same result as by Algorithm 5.9 can be obtained by a modification of the
SFFECO automaton. This modification consists of two steps:
1. Removing some of the self loops in the initial state: the self loops for
those symbols for which there exist transitions from the initial state
to other states.
2. Determinization of the automaton resulting from step 1.
Step 2 must be used only in the case when some patterns in set S = {P₁, P₂,
…, P_|S|} have equal prefixes. Otherwise the resulting automaton is deter-
ministic after step 1.
The next algorithm constructs the backward transition function. Let us note
that the depth of state q is the number of forward transitions from state q₀
to state q.
Algorithm 5.10
Construction of the backward transition function φ of the AC automaton for
set S = {P₁, P₂, …, P_|S|}.
Input: Deterministic finite automaton M accepting set S.
Output: Complete AC automaton with backward transition function φ.
Method:
1. Let Q = Q₀ ∪ Q₁ ∪ … ∪ Q_maxd, where elements of Q_d are states with
depth d. Let us note:
(a) Q₀ = {q₀},
(b) Q_i ∩ Q_j = ∅ for i ≠ j, 0 ≤ i, j ≤ maxd.
2. φ(q) = q₀ for all q ∈ Q₁.
3. For d = 2, …, maxd and all states in Q_d construct the mborder array
mβ[2..n]. □
The computation of the backward transition function is depicted in Fig. 5.11.

Figure 5.11: Computation of φ(q′)
This backward transition function is not optimized, similarly to the case
of the MP searching automaton. The optimized version φ_opt of function φ is
constructed by the next algorithm.
Algorithm 5.11
Construction of the optimized backward transition function φ_opt of the AC
searching automaton.
Input: AC searching automaton with backward transition function φ (not
optimized).
Output: Optimized backward transition function φ_opt for the input AC
searching automaton.
Method:
1. Let Q = {q₀} ∪ Q₁ ∪ … ∪ Q_maxd, where elements of Q_d are states with
depth d.
2. φ_opt(q) = φ(q) = q₀ for all q ∈ Q₁.
3. For d = 2, 3, …, maxd and all states q ∈ Q_d do:
(a) X = {a : δ(q, a) = p, a ∈ A},
(b) Y = {a : δ(φ(q), a) = r, a ∈ A},
(c) if Y ⊂ X or X = ∅ then φ_opt(q) = φ_opt(φ(q)) else φ_opt(q) = φ(q).
□
The principle of the construction of the optimized backward transition function φ_opt is depicted in Fig. 5.12.

Figure 5.12: Construction of φ_opt(q) (here X = {a, b}, Y = {a}, Y ⊆ X)
Example 5.12
Let us construct the AC searching automaton for the set of patterns S = {ab, babb, bb} and the alphabet A = {a, b, c}. First we construct the forward transition function; it is depicted in Fig. 5.13.

Figure 5.13: Forward transition function of the AC searching automaton for the set of patterns S = {ab, babb, bb} and A = {a, b, c} from Example 5.12

Further we will construct both backward transition functions φ and φ_opt. To construct the non-optimized backward transition function φ we construct the mborder array m for the set of patterns S = {ab, babb, bb}. The nondeterministic factor automaton has the transition diagram depicted in Fig. 5.14. Table 5.6 is the transition table of the deterministic factor automaton for the set of patterns S = {ab, babb, bb}. The transition diagram of this factor automaton is depicted in Fig. 5.15.
Figure 5.14: Transition diagram of the nondeterministic factor automaton for the set of patterns S = {ab, babb, bb} from Example 5.12
State       | a     | b
0           | 1₁2₂  | 1₂₃2₁2₃34
1₁2₂        |       | 2₁3
2₁3         |       | 4
1₂₃2₁2₃34   | 2₂    | 2₃4
2₂          |       | 3
3           |       | 4
4           |       |
2₃4         |       |

Table 5.6: Transition table of the deterministic factor automaton for set S = {ab, babb, bb} from Example 5.12
The analysis of the d-subsets of the deterministic factor automaton is summarized in Table 5.7. The next table shows the mborder array:
State  | 1₁ | 1₂₃ | 2₁  | 2₂ | 2₃  | 3  | 4
Symbol | a  | b   | b   | a  | b   | b  | b
m      | 0  | 0   | 1₂₃ | 1₁ | 1₂₃ | 2₁ | 2₃
The values of m[1₁] and m[1₂₃] are equal to zero because strings a and b have borders of zero length. This mborder array is equal to the backward transition function φ. The optimized backward transition function φ_opt differs from φ for some states. The forward transition function and both functions φ and φ_opt have values according to Table 5.8. The complete AC searching automaton is depicted in Fig. 5.16. The backward transition function φ is shown by dashed lines; the optimized backward transition function is shown by dotted lines in the cases when φ ≠ φ_opt.
Figure 5.15: Transition diagram of the deterministic factor automaton for the set of patterns S = {ab, babb, bb} from Example 5.12
d-subset    | Values of elements of border array
1₂₃2₁2₃34   | m[2₁] = 1₂₃, m[2₃] = 1₂₃, m[3] = 1₂₃, m[4] = 1₂₃
1₁2₂        | m[2₂] = 1₁
2₁3         | m[3] = 2₁
2₃4         | m[4] = 2₃

Table 5.7: Computation of the mborder array for the set of patterns S = {ab, babb, bb} from Example 5.12
Fig. 5.17 shows the transition diagram of the deterministic finite automaton for the nondeterministic SFFECO automaton for S = {ab, babb, bb}. It is included in order to enable a comparison of the deterministic finite automaton and the AC searching automaton.
Let us show the sequence of transitions of the resulting AC searching automaton for the input string bbabb:
(0, bbabb) ⊢ (1₂₃, babb)
           ⊢ (2₃, abb)    bb found, fail
           ⊢ (1₂₃, abb)
           ⊢ (2₂, bb)
           ⊢ (3, b)       ab found
           ⊢ (4, ε)       bb and babb found.
For comparison we show the sequence of transitions of the deterministic finite automaton:
(0, bbabb) ⊢ (01₂₃, babb)
           ⊢ (01₂₃2₃, abb)    bb found
           ⊢ (01₁2₂, bb)
           ⊢ (01₂₃2₁3, b)     ab found
           ⊢ (01₂₃2₃4, ε)     bb and babb found.
□
State | a    | b    | c    | φ    | φ_opt
0     | 1₁   | 1₂₃  | 0    |      |
1₁    | fail | 2₁   | fail | 0    | 0
1₂₃   | 2₂   | 2₃   | fail | 0    | 0
2₁    | fail | fail | fail | 1₂₃  | 1₂₃
2₂    | fail | 3    | fail | 1₁   | 0
2₃    | fail | fail | fail | 1₂₃  | 1₂₃
3     | fail | 4    | fail | 2₁   | 1₂₃
4     | fail | fail | fail | 2₃   | 1₂₃

Table 5.8: The forward and both backward transition functions for the set of patterns S = {ab, babb, bb} from Example 5.12
Figure 5.16: Complete AC searching automaton for the set of patterns S = {ab, babb, bb} from Example 5.12
Similarly to the KMP algorithm, the AC algorithm has linear time and space complexities. If we have alphabet A, text T = t_1t_2...t_n and a set of patterns S = {P_1, P_2, ..., P_|S|}, where |S| = Σ_{i=1}^{|S|} |P_i|, then the AC algorithm needs [CH97b]:
- O(|S|) time for the preprocessing phase (construction of the AC searching automaton),
- O(n) time for the searching phase,
- O(|S| · |A|) space to store the AC searching automaton.
The space requirement can be reduced to O(|S|). In this case the searching phase has time complexity O(n log |A|). See more details in Crochemore and Hancart [CH97b].
Figure 5.17: Transition diagram of the deterministic finite automaton for S = {ab, babb, bb} from Example 5.12; transitions for symbol c from all states lead to state 0
6 Simulation of nondeterministic finite automata: dynamic programming and bit parallelism
In the case when the space complexity or the preprocessing time of the DFA makes it unusable, we can use one of the deterministic simulation methods for the corresponding NFA. At the beginning of this section we describe a basic simulation method, which is the base of the other simulation methods presented further in this section. This method can be used for any general NFA. The other simulation methods improve the complexity of the simulation, but on the other hand they impose some requirements on the NFAs in order to be more efficient.
The simulation methods will be presented on NFAs for exact and approximate string matching. In this section we will use another version of the NFAs for approximate string matching, where all transitions for the edit operations replace and insert are labeled by the whole alphabet instead of the complement of the matching symbol. This simplifies the formulae for the simulation while the behavior of the NFAs practically does not change.
6.1 Basic simulation method
In Algorithm 6.1 we show the basic algorithm, which is very similar to the transformation of an NFA to the equivalent DFA. In each step of the run of the NFA a new set of active states is computed by evaluating all transitions from all states of the previous set of active states. This ensures that all possible paths in the NFA labeled by the input string are considered. The simulation finishes when the end of the input text is reached or when there is no active state (i.e., no accepting path exists).
Algorithm 6.1 (Simulation of the run of an NFA: basic method)
Input: NFA M = (Q, A, δ, q_0, F), input text T = t_1t_2...t_n.
Output: Output of the run of the NFA.
Method: A set S of active states is used.
S := ε-CLOSURE(q_0)
i := 1
while i ≤ n and S ≠ ∅ do
    S := ∪_{q ∈ S} ε-CLOSURE(δ(q, t_i))   /* transitions are performed for all elements of S */
    if S ∩ F ≠ ∅ then
        write(information associated with each final state in S ∩ F)
    endif
    i := i + 1
endwhile
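The basic method of Algorithm 6.1 can be sketched in Python. This is a minimal illustrative sketch, assuming the NFA is given as a dictionary `delta` mapping (state, symbol) pairs to sets of target states, with ε-transitions stored under the symbol `None`; the names `closure` and `run_nfa` are ours, not from the source.

```python
def closure(states, delta):
    """epsilon-CLOSURE: all states reachable via epsilon (None) transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def run_nfa(delta, q0, final, text):
    """Simulate the run of the NFA; yield (position, final state) on matches."""
    S = closure({q0}, delta)                     # S := CLOSURE(q0)
    for i, t in enumerate(text, start=1):
        # transitions are performed for all elements of S
        S = closure({r for q in S for r in delta.get((q, t), ())}, delta)
        if not S:                                # no active state left
            break
        for q in S & final:                      # report active final states
            yield i, q

# NFA for the exact matching of pattern "aba" (self loop in the initial state).
delta = {(0, a): {0} for a in "abc"}
delta[(0, "a")] = {0, 1}
delta[(1, "b")] = {2}
delta[(2, "a")] = {3}
print(list(run_nfa(delta, 0, {3}, "accabcaaba")))   # -> [(10, 3)]
```

Each yielded pair corresponds to one execution of the write() statement of the algorithm.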
In the transformation of an NFA to the equivalent DFA, all the possible configurations of the set S of active states are evaluated, as well as all the possible transitions among these configurations, and a deterministic state is assigned to each such configuration of S. Using the simulation method from Algorithm 6.1, only the current configuration of set S is evaluated in each step of the simulation of the NFA: only the configurations actually used are evaluated during the simulation.
It is also possible to combine the simulation and the transformation. When processing an input text, we can store the used configurations of S in a state-cache and assign deterministic states to them. In this way we transform the NFA to the DFA incrementally, but we evaluate only the used states and transitions. If the state-cache is full, we can use one of the usual cache techniques for making room in the state-cache, e.g., removing the least recently used states. This solution has the following advantages: we can control the amount of used memory (the size of the state-cache), and for the most frequent configurations we do not always need to recompute the most frequent transitions, since we can use directly the information stored in the state-cache (using a DFA transition has better time complexity than computing a new configuration).
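The combination of simulation with a state-cache can be sketched as follows; the keying of configurations by frozenset and the least-recently-used eviction via OrderedDict are illustrative assumptions, not details prescribed by the text.

```python
from collections import OrderedDict

def make_cached_step(delta, cache_size=1024):
    """Return a transition step that caches used configurations of S.

    Each computed (configuration, symbol) pair is stored; frequent
    configurations reuse the stored result instead of recomputing it,
    and the least recently used entry is evicted when the cache is full.
    """
    cache = OrderedDict()            # (frozenset S, symbol) -> frozenset S'
    def step(S, t):
        key = (S, t)
        if key in cache:
            cache.move_to_end(key)   # mark as recently used
            return cache[key]
        S_new = frozenset(r for q in S for r in delta.get((q, t), ()))
        cache[key] = S_new
        if len(cache) > cache_size:
            cache.popitem(last=False)   # evict the least recently used entry
        return S_new
    return step

# Exact matching of "ab": state 0 carries a self loop, state 2 is final.
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
step = make_cached_step(delta)
S, hits = frozenset({0}), []
for i, t in enumerate("abab", start=1):
    S = step(S, t)
    if 2 in S:
        hits.append(i)
print(hits)   # -> [2, 4]
```

Repeated configurations (here the pair ({0, 1}, "b")) hit the cache the second time, which is exactly the incremental-determinization effect described above.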
Theorem 6.2
The basic simulation method shown in Algorithm 6.1 simulates the run of an NFA.
Proof
Let M = (Q, A, δ, q_0, F) be an NFA and T = t_1t_2...t_n be an input text. Algorithm 6.1 really considers all paths (i.e., sequences of configurations) leading from q_0.
At the beginning of the algorithm, the set S of active states contains ε-CLOSURE(q_0): each path must start in q_0 and then some of the ε-transitions leading from q_0 can be used. In this way all configurations (q_j, T), q_j ∈ Q, (q_0, T) ⊢*_M (q_j, T), are considered.
In each i-th step of the algorithm, 1 ≤ i ≤ n, all transitions (relevant to T) leading from all states of S are evaluated, i.e., both the transitions labeled by t_i and the ε-transitions. At first all states reachable by transitions labeled by t_i from each state of S are inserted in the new set S, and then the ε-CLOSURE of the new set S is also inserted in the new set S. In this way all configurations (q_j, t_{i+1}t_{i+2}...t_n), q_j ∈ Q, (q_0, T) ⊢*_M (q_j, t_{i+1}t_{i+2}...t_n), are considered in the i-th step of the algorithm.
In each step of the algorithm, the set S of active states is tested for final states, and for each such active final state found, the information associated with that state is reported. □
6.1.1 Implementation
6.1.1.1 NFA without ε-transitions
This basic method can be implemented using bit vectors. Let M = (Q, A, δ, q_0, F) be an NFA without ε-transitions and let T = t_1t_2...t_n be an input text of M. We implement the transition function δ as a transition table T̄ (of size |Q| × |A|) of bit vectors:

T̄[i, a] = [t_0, t_1, ..., t_{|Q|-1}]^T,   (5)

where a ∈ A and bit t_j = 1 if q_j ∈ δ(q_i, a), or t_j = 0 otherwise. Then we also implement the set F of final states as a bit vector F̄:

F̄ = [f_0, f_1, ..., f_{|Q|-1}]^T,   (6)

where bit f_j = 1 if q_j ∈ F, or f_j = 0 otherwise. In each step i, 0 ≤ i ≤ n (i.e., after reading symbol t_i), of the simulation of the run of the NFA, the set S of active states is represented by a bit vector S̄_i:

S̄_i = [s_{0,i}, s_{1,i}, ..., s_{|Q|-1,i}]^T,   (7)

where bit s_{j,i} = 1 if state q_j is active (i.e., q_j ∈ S) in the i-th simulation step, or s_{j,i} = 0 otherwise.
When constructing S̄_{i+1}, we can evaluate all transitions labeled by t_{i+1} leading from a state q_j, which is active in the i-th step, at once: just by using the bitwise operation or on bit vector S̄_{i+1} and bit vector T̄[j, t_{i+1}]. This implementation is used in Algorithm 6.3.
Note that the first for loop of Algorithm 6.3 is a multiplication of the vector S̄_{i-1} by the matrix T̄[·, t_i]. This approach is also used in quantum automata [MC97], where quantum computation is used.
Theorem 6.4
The simulation of the run of a general NFA runs in time O(n|Q|⌈|Q|/w⌉) and space¹ O(|A||Q|⌈|Q|/w⌉), where n is the length of the input text, |Q| is the number of states of the NFA, A is the input alphabet, and w is the size of the used computer word in bits.

¹ For the space complexity of this theorem we expect a complete transition table for the representation of the transition function.
Algorithm 6.3 (Simulation of the run of an NFA: bit-vector implementation of the basic method)
Input: Transition table T̄ and bit vector F̄ of final states of the NFA, input text T = t_1t_2...t_n.
Output: Output of the run of the NFA.
Method:
S̄_0 := [100...0]   /* only q_0 is active at the beginning */
i := 1
while i ≤ n and S̄_{i-1} ≠ [00...0] do
    S̄_i := [00...0]
    for j := 0, 1, ..., |Q|-1 do
        if s_{j,i-1} = 1 then   /* q_j is active in the (i-1)-th step */
            S̄_i := S̄_i or T̄[j, t_i]   /* evaluate transitions for q_j */
        endif
    endfor
    for j := 0, 1, ..., |Q|-1 do
        if s_{j,i} = 1 and f_j = 1 then   /* if q_j is an active final state */
            write(information associated with final state q_j)
        endif
    endfor
    i := i + 1
endwhile
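Algorithm 6.3 can be sketched in Python with unbounded integers as the bit vectors (bit j corresponds to state q_j), so one integer stands for the ⌈|Q|/w⌉ machine words; the concrete table layout is an assumption made for the illustration.

```python
def run_nfa_bits(table, final_mask, nstates, text):
    """table[symbol][j] is the bit mask of delta(q_j, symbol)."""
    S = 1                               # S_0 := [100...0]: only q_0 is active
    matches = []
    for i, t in enumerate(text, start=1):
        row, S_new = table[t], 0
        for j in range(nstates):
            if S >> j & 1:              # q_j was active in step i-1
                S_new |= row[j]         # evaluate transitions for q_j at once
        S = S_new
        if S & final_mask:              # an active final state was reached
            matches.append(i)
        if not S:
            break
    return matches

# NFA for the exact matching of P = "aba": states 0..3, final state q_3.
table = {x: [0b0001, 0, 0, 0] for x in "abc"}   # self loop of q_0
table["a"] = [0b0011, 0, 0b1000, 0]             # q_0 -a-> {q_0,q_1}, q_2 -a-> q_3
table["b"] = [0b0001, 0b0100, 0, 0]             # q_1 -b-> q_2
print(run_nfa_bits(table, 0b1000, 4, "accabcaaba"))   # -> [10]
```

The inner or-loop is the vector-by-matrix multiplication mentioned above; for |Q| larger than the word size, Python's big integers transparently take the role of the split bit vectors.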
Proof
See the basic simulation method in Algorithm 6.3. The main while loop is performed at most n times and both inner for loops are performed exactly |Q| times. If |Q| > w (i.e., more than one computer word must be used to implement a bit vector over all states of the NFA), each elementary bitwise operation must be split into ⌈|Q|/w⌉ bitwise operations. This gives time complexity O(n|Q|⌈|Q|/w⌉). The space complexity is given by the implementation of the transition function δ by the transition table T̄, which contains |Q| · |A| bit vectors, each of size ⌈|Q|/w⌉ computer words. □
Example 6.5
Let M be an NFA for the exact string matching for pattern P = aba and let T = accabcaaba be an input text. The transition function δ and its bit-vector representation T̄ are as follows:

δ_M | a      | b    | A \ {a, b}
0   | {0, 1} | {0}  | {0}
1   |        | {2}  |
2   | {3}    |      |
3   |        |      |

T̄_M | a    | b    | A \ {a, b}
0   | 1100 | 1000 | 1000
1   | 0000 | 0010 | 0000
2   | 0001 | 0000 | 0000
3   | 0000 | 0000 | 0000

(each bit vector T̄[i, x] is written as t_0 t_1 t_2 t_3).
The process of the simulation of M over T is displayed by the set S of active states and by its bit-vector representation S̄ (written as s_0 s_1 s_2 s_3; the first column is the initial configuration):

    -     a      c     c     a      b      c     a      a      b      a
S̄   1000  1100   1000  1000  1100   1010   1000  1100   1100   1010   1101
S   q_0   q_0q_1 q_0   q_0   q_0q_1 q_0q_2 q_0   q_0q_1 q_0q_1 q_0q_2 q_0q_1q_3
6.1.1.2 NFA with ε-transitions
If we have an NFA with ε-transitions, we can transform it to an equivalent NFA without ε-transitions using Algorithm 6.6. There are also other algorithms for removing ε-transitions, but this algorithm does not change the states of the NFA.
Algorithm 6.6 ensures that all transitions labeled by symbols (not ε-transitions) leading from all states of ε-CLOSURE(q_0) also lead from q_0. Then for each state q ∈ Q and symbol a ∈ A, all states accessible from δ(q, a) (i.e., ε-CLOSURE(δ(q, a))) are inserted into δ'(q, a).
Algorithm 6.6 (Removing ε-transitions)
Input: NFA M = (Q, A, δ, q_0, F) with ε-transitions.
Output: Equivalent NFA M' = (Q, A, δ', q_0, F') without ε-transitions.
Method:
S := ε-CLOSURE(q_0)
for each a ∈ A do
    δ'(q_0, a) := ∪_{q ∈ S} ε-CLOSURE(δ(q, a))
endfor
for each q ∈ Q \ {q_0} do
    for each a ∈ A do
        δ'(q, a) := ε-CLOSURE(δ(q, a))
    endfor
endfor
if ε-CLOSURE(q_0) ∩ F ≠ ∅ then
    F' := F ∪ {q_0}
else
    F' := F
endif
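The construction can be sketched in Python (a sketch under the same dictionary representation as before, with ε-moves stored under the symbol None; since parts of Algorithm 6.6 are reconstructed, the code follows the standard construction, not a verbatim transcription):

```python
def closure(states, delta):
    """epsilon-CLOSURE over the None-labeled transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def remove_epsilon(states, alphabet, delta, q0, final):
    """Build delta' and F' of the equivalent NFA without epsilon transitions."""
    d2 = {}
    S0 = closure({q0}, delta)
    for a in alphabet:     # symbol transitions of CLOSURE(q0) also lead from q0
        d2[(q0, a)] = set().union(
            *(closure(delta.get((q, a), ()), delta) for q in S0))
    for q in states - {q0}:
        for a in alphabet:
            d2[(q, a)] = closure(delta.get((q, a), ()), delta)
    final2 = final | {q0} if S0 & final else final   # q0 may become final
    return d2, final2

# q0 --eps--> q1 --a--> q2 (final): the a-transition now leads from q0 too.
delta = {(0, None): {1}, (1, "a"): {2}}
d2, F2 = remove_epsilon({0, 1, 2}, {"a"}, delta, 0, {2})
print(sorted(d2[(0, "a")]))   # -> [2]
```

Note that the set of states is left unchanged, exactly as the text requires of Algorithm 6.6.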
The ε-transitions are implemented by a table c of bit vectors:

c[i] = [e_0, e_1, ..., e_{|Q|-1}]^T,   (8)

where bit e_j = 1 if q_j ∈ ε-CLOSURE(q_i), or e_j = 0 otherwise.
This implementation is used in Algorithm 6.7, where ε-CLOSURE(S) is computed in each step of the simulation. The time and space complexities are asymptotically the same as in Algorithm 6.3, therefore Theorem 6.4 holds for all NFAs (with as well as without ε-transitions).
Lemma 6.8
Let the NFA M = (Q, A, δ, q_0, F) be implemented by bit vectors as shown in the previous subsection. Then Algorithm 6.6 runs in time O(|A||Q|²⌈|Q|/w⌉) and space O(|A||Q|⌈|Q|/w⌉), where w is the length of the used computer word in bits.
Proof
Let the ε-transitions be implemented by table c as shown above (see Formula 8). The statement in the first for loop of Algorithm 6.6 contains a cycle performed for all states of S (O(|Q|)), in which another for loop performed for all states of δ(q, a) (O(|Q|)) is nested.
Algorithm 6.7 (Simulation of the run of an NFA with ε-transitions: bit-vector implementation of the basic method)
Input: Transition tables T̄ and c, bit vector F̄ of final states of the NFA, input text T = t_1t_2...t_n.
Output: Output of the run of the NFA.
Method:
S̄_0 := [100...0] or c[0]   /* only ε-CLOSURE(q_0) is active at the beginning */
i := 1
while i ≤ n and S̄_{i-1} ≠ [00...0] do
    S̄_i := [00...0]
    for j := 0, 1, ..., |Q|-1 do
        if s_{j,i-1} = 1 then   /* q_j is active in the (i-1)-th step */
            S̄_i := S̄_i or T̄[j, t_i]   /* evaluate transitions for q_j */
        endif
    endfor
    for j := 0, 1, ..., |Q|-1 do   /* construct ε-CLOSURE(S̄_i) */
        if s_{j,i} = 1 then
            S̄_i := S̄_i or c[j]
        endif
    endfor
    for j := 0, 1, ..., |Q|-1 do
        if s_{j,i} = 1 and f_j = 1 then   /* if q_j is an active final state */
            write(information associated with final state q_j)
        endif
    endfor
    i := i + 1
endwhile
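Algorithm 6.7 can be sketched like the bit-vector simulation of Algorithm 6.3, with one extra pass that or-s in the closure table c (here c[j] is assumed to be the bit mask of the full, transitively closed ε-CLOSURE of q_j, so a single pass suffices):

```python
def run_nfa_eps_bits(table, c, final_mask, nstates, text):
    """table[symbol][j]: mask of delta(q_j, symbol); c[j]: mask of CLOSURE(q_j)."""
    S = 1 | c[0]                  # only CLOSURE(q_0) is active at the beginning
    matches = []
    for i, t in enumerate(text, start=1):
        row = table.get(t, [0] * nstates)
        S_new = 0
        for j in range(nstates):
            if S >> j & 1:            # q_j was active in step i-1
                S_new |= row[j]
        for j in range(nstates):      # construct CLOSURE of the new set
            if S_new >> j & 1:
                S_new |= c[j]         # c[j] is already transitively closed
        S = S_new
        if S & final_mask:
            matches.append(i)
        if not S:
            break
    return matches

# q0 -a-> q1 -eps-> q2 (final): reading "a" activates q2 through the closure.
table = {"a": [0b010, 0, 0]}
c = [0b001, 0b110, 0b100]             # closures of q0, q1, q2 (incl. themselves)
print(run_nfa_eps_bits(table, c, 0b100, 3, "a"))   # -> [1]
```

Because each c[j] already contains the whole ε-CLOSURE of q_j, one or-pass per step is enough; no fixpoint iteration is needed.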
The statement inside the next two nested for loops (O(|Q|) and O(|A|)) also contains a cycle (the ε-CLOSURE() over all states of δ(q, a): O(|Q|)). Therefore the total time complexity is O(|A||Q|²⌈|Q|/w⌉). The space complexity is given by the size of the new copy M' of the automaton, i.e., O(|A||Q|⌈|Q|/w⌉). □
d_{j,i} − d_{j-1,i-1} ∈ {0, 1}, 0 < i ≤ n, 0 < j ≤ m. For each diagonal of the matrix D they store only the positions in which the value increases.
Let us remark that the size of the input alphabet can be reduced to a reduced alphabet A', A' ⊆ A, |A'| ≤ m + 1.
In the bit-parallel algorithms below, the l-th level of states of the NFA in step i is represented by a bit vector R^l_i, and the mask matrix D is stored by columns D[x]:

R^l_i = [r^l_{1,i}, r^l_{2,i}, ..., r^l_{m,i}]^T and D[x] = [d_{1,x}, d_{2,x}, ..., d_{m,x}]^T, 0 ≤ i ≤ n, 0 ≤ l ≤ k, x ∈ A.   (13)
6.3.2 String matching
6.3.2.1 Exact string matching
In the exact string matching, vectors R^0_i, 0 ≤ i ≤ n, are computed as follows [BYG92]:

r^0_{j,0} := 1, 0 < j ≤ m
R^0_i := shl(R^0_{i-1}) or D[t_i], 0 < i ≤ n   (14)

In Formula 14 the operation shl() is the bitwise left shift that inserts 0 at the beginning of the vector, and the operation or is the bitwise or. The term shl(R^0_{i-1}) or D[t_i] represents matching: position i in text T is increased, the position in pattern P is increased by the operation shl(), and the positions corresponding to the input symbol t_i are selected by the term or D[t_i]. Pattern P is found at position t_{i-m+1}...t_i if r^0_{m,i} = 0, 0 < i ≤ n.
An example of the mask matrix D for pattern P = adbbca is shown in Table 6.4 and an example of matrix R^0 for the exact searching for pattern P = adbbca in text T = adcabcaabadbbca is shown in Table 6.5.
D | a | b | c | d | A \ {a, b, c, d}
a | 0 | 1 | 1 | 1 | 1
d | 1 | 1 | 1 | 0 | 1
b | 1 | 0 | 1 | 1 | 1
b | 1 | 0 | 1 | 1 | 1
c | 1 | 1 | 0 | 1 | 1
a | 0 | 1 | 1 | 1 | 1

Table 6.4: Matrix D for pattern P = adbbca
R^0   - a d c a b c a a b a d b b c a
a     1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d     1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
b     1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
b     1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c     1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

Table 6.5: Matrix R^0 for the exact string matching (for pattern P = adbbca and text T = adcabcaabadbbca)
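Formula 14 maps directly onto machine words. A minimal Python sketch (bit j−1 of the integer R holds r_{j,i}; the function name and the dictionary form of the mask matrix D are ours):

```python
def shift_or(pattern, text):
    m = len(pattern)
    ones = (1 << m) - 1
    D = {}                       # mask matrix D: bit j-1 is 0 iff p_j = symbol
    for j, p in enumerate(pattern):
        D[p] = D.get(p, ones) & ~(1 << j)
    R = ones                     # R^0_0 = [11...1]: no pattern state active yet
    found = []
    for i, t in enumerate(text, start=1):
        # shl() inserts 0 (the always-active initial state); or D[t] selects
        # the positions corresponding to the input symbol t
        R = ((R << 1) & ones) | D.get(t, ones)
        if (R >> (m - 1)) & 1 == 0:          # r^0_{m,i} = 0: match ends at i
            found.append(i)
    return found

print(shift_or("adbbca", "adcabcaabadbbca"))   # -> [15], cf. Table 6.5
```

A 0 in the top bit of R signals that the final state of the NFA is active, which is exactly the zero in the last row of Table 6.5.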
Theorem 6.13
The Shift-Or algorithm described by Formula 14 simulates a run of the NFA for the exact string matching.
Proof
In the Shift-Or algorithm for the exact string matching there is one bit vector R^0_i, 0 ≤ i ≤ n, which represents the set of active states of the NFA. In the vector, 0 represents an active state and 1 represents a non-active state of the simulated NFA. So in the Shift-Or algorithm, the set of active states from Section 6.1 is implemented by a bit vector.
In Formula 14, the term shl(R^0_{i-1}) or D[t_i] represents the matching transition: each active state is moved to the next position to the right² in the same level. All active states are moved at once and only the transitions corresponding to the read symbol t_i are selected by the mask vector D[t_i], which changes 0 to 1 in each state whose incoming matching transition is not labeled by t_i. The initial state of the NFA is not in vector R^0 and it is implemented by inserting 0 at the beginning of the vector in the operation shl(): the initial state is always active because of its self loop.
At the beginning only the initial state is active, therefore R^0_0 = 1^m (the vector of m ones). If r^0_{m,i} = 0, 0 < i ≤ n, then we report that the final state is active and thus the pattern is found ending at position i in text T. □
6.3.2.2 Approximate string matching using Hamming distance
In the approximate string matching using the Hamming distance, vectors R^l_i, 0 ≤ l ≤ k, 0 ≤ i ≤ n, are computed as follows:

r^l_{j,0} := 1, 0 < j ≤ m, 0 ≤ l ≤ k
R^0_i := shl(R^0_{i-1}) or D[t_i], 0 < i ≤ n
R^l_i := (shl(R^l_{i-1}) or D[t_i]) and shl(R^{l-1}_{i-1}), 0 < i ≤ n, 0 < l ≤ k   (15)
In Formula 15 the operation and is the bitwise operation and. The term shl(R^l_{i-1}) or D[t_i] represents matching and the term shl(R^{l-1}_{i-1}) represents the edit operation replace: position i in text T is increased, the position in pattern P is increased, and the edit distance l is increased. Pattern P is found with at most k mismatches at position t_{i-m+1}...t_i if r^k_{m,i} = 0, 0 < i ≤ n. The maximum number of mismatches of the found string is D_H(P, t_{i-m+1}...t_i) = l, where l is the minimum number such that r^l_{m,i} = 0.
An example of matrices R^l for searching for pattern P = adbbca in text T = adcabcaabadbbca with at most k = 3 mismatches is shown in Table 6.6.
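Formula 15 adds one vector per error level. An illustrative Python sketch, reusing the Shift-Or conventions above (bit j−1 of R[l] holds r^l_{j,i}):

```python
def shift_or_hamming(pattern, text, k):
    m = len(pattern)
    ones = (1 << m) - 1
    D = {}                            # same mask matrix D as in the exact case
    for j, p in enumerate(pattern):
        D[p] = D.get(p, ones) & ~(1 << j)
    R = [ones] * (k + 1)              # r^l_{j,0} = 1 for all levels
    found = []
    for i, t in enumerate(text, start=1):
        prev = R[:]                   # vectors R^l_{i-1}
        d = D.get(t, ones)
        R[0] = ((prev[0] << 1) & ones) | d
        for l in range(1, k + 1):
            # matching on level l, and-ed with the replace term from level l-1
            R[l] = ((((prev[l] << 1) & ones) | d)
                    & ((prev[l - 1] << 1) & ones))
        if (R[k] >> (m - 1)) & 1 == 0:    # final state of level k is active
            found.append(i)
    return found

print(shift_or_hamming("adbbca", "adcabcaabadbbca", 3))   # -> [7, 15]
```

The two reported positions are exactly the zeros in the last row of R^3 in Table 6.6.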
² In the Shift-Or algorithm, transitions from states in our figures are implemented by the operation shl() (left shift) because of easier implementation in the case when the number of states of the NFA is greater than the length of the computer word and the vectors R have to be divided into two or more bit vectors.
R^0   - a d c a b c a a b a d b b c a
a     1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d     1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
b     1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
b     1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c     1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

R^1   - a d c a b c a a b a d b b c a
a     1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d     1 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1
b     1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1
b     1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c     1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

R^2   - a d c a b c a a b a d b b c a
a     1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d     1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b     1 1 1 0 1 0 0 1 1 0 0 1 0 0 1 1
b     1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1
c     1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

R^3   - a d c a b c a a b a d b b c a
a     1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d     1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b     1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
b     1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 1
c     1 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1
a     1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0

Table 6.6: Matrices R^l for the approximate string matching using the Hamming distance (P = adbbca, k = 3, and T = adcabcaabadbbca)
Figure 6.7: Bit parallelism uses one bit vector R^l for each level of states of the NFA
Theorem 6.14
The Shift-Or algorithm described by Formula 15 simulates a run of the NFA for the approximate string matching using the Hamming distance.
Proof
In the Shift-Or algorithm for the approximate string matching there is one bit vector R^l_i, 0 ≤ i ≤ n, for each level l, 0 ≤ l ≤ k, of states of the NFA. So in the Shift-Or algorithm, the set of active states from Section 6.1 is implemented by bit vectors: one vector for each level of states.
In Formula 15 the term shl(R^l_{i-1}) or D[t_i] represents the matching transition (see the proof of Theorem 6.13). The term shl(R^{l-1}_{i-1}) represents the transition replace: each active state of level l-1 is moved to the next depth in level l.
The self loop of the initial state is implemented by inserting 0 at the beginning of vector R^0 within the operation shl(). A 0 is inserted also at the beginning of each vector R^l, 0 < l ≤ k. Since the first state of the l-th level is connected with the initial state by a sequence of l transitions labeled by A, each of these first states is active from the l-th step of the simulation till the end. The impact of the 0s inserted at the beginning of vectors R^l does not appear before the l-th step, therefore the vectors R^l also simulate the NFA correctly.
If r^l_{m,i} = 0, 0 < i ≤ n, the final state of the l-th level is active and we can report that the pattern is found with at most l errors ending at position i in the text. In fact, we report just the minimum l in each step. □
6.3.2.3 Approximate string matching using Levenshtein distance
In the approximate string matching using the Levenshtein distance, vectors R^l_i, 0 ≤ l ≤ k, 0 ≤ i ≤ n, are computed by Formula 16. To prevent insert transitions leading into final states we use an auxiliary vector V defined by Formula 17.

r^l_{j,0} := 0, 0 < j ≤ l, 0 < l ≤ k
r^l_{j,0} := 1, l < j ≤ m, 0 ≤ l ≤ k
R^0_i := shl(R^0_{i-1}) or D[t_i], 0 < i ≤ n
R^l_i := (shl(R^l_{i-1}) or D[t_i])
         and shl(R^{l-1}_{i-1} and R^{l-1}_i)
         and (R^{l-1}_{i-1} or V), 0 < i ≤ n, 0 < l ≤ k   (16)

V = [v_1, v_2, ..., v_m]^T, where v_m = 1 and v_j = 0 for all j, 1 ≤ j < m.   (17)
In Formula 16 the term shl(R^l_{i-1}) or D[t_i] represents matching, the term shl(R^{l-1}_{i-1}) represents the edit operation replace, and the term shl(R^{l-1}_i) represents the edit operation delete: the position in pattern P is increased, the position in text T is not increased, and the edit distance l is increased. The term R^{l-1}_{i-1} represents the edit operation insert: the position in pattern P is not increased, the position in text T is increased, and the edit distance l is increased. The term or V provides that no insert transition leads from any final state.
Pattern P is found with at most k differences ending at position i if r^k_{m,i} = 0, 0 < i ≤ n. The maximum number of differences of the found string is l, where l is the minimum number such that r^l_{m,i} = 0.
An example of matrices R^l for searching for pattern P = adbbca in text T = adcabcaabadbbca with at most k = 3 errors is shown in Table 6.7.
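Formula 16 can be sketched the same way; note that the delete term uses the already updated vector R^{l-1}_i of the current step, and that V blocks insert transitions into the final state. An illustrative Python sketch:

```python
def shift_or_levenshtein(pattern, text, k):
    m = len(pattern)
    ones = (1 << m) - 1
    D = {}
    for j, p in enumerate(pattern):
        D[p] = D.get(p, ones) & ~(1 << j)
    V = 1 << (m - 1)                       # v_m = 1: no insert into final state
    R = [(ones << l) & ones for l in range(k + 1)]   # r^l_{j,0} = 0 for j <= l
    found = []
    for i, t in enumerate(text, start=1):
        prev = R[:]                        # vectors R^l_{i-1}
        d = D.get(t, ones)
        R[0] = ((prev[0] << 1) & ones) | d
        for l in range(1, k + 1):
            R[l] = ((((prev[l] << 1) & ones) | d)           # matching
                    & (((prev[l - 1] & R[l - 1]) << 1) & ones)  # replace, delete
                    & (prev[l - 1] | V))                    # insert, blocked by V
        if (R[k] >> (m - 1)) & 1 == 0:
            found.append(i)
    return found

print(shift_or_levenshtein("adbbca", "adcabcaabadbbca", 3))
# the reported positions are the zeros of the last row of R^3 in Table 6.7
```

Because R[l-1] has already been updated when level l is computed, the single and-ed shl term covers both the replace term shl(R^{l-1}_{i-1}) and the delete term shl(R^{l-1}_i) of Formula 16.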
Theorem 6.15
The Shift-Or algorithm described by Formula 16 simulates a run of the NFA for the approximate string matching using the Levenshtein distance.
Proof
The proof is similar to the proof of Theorem 6.14; we only have to add the simulation of the insert and delete transitions.
In Formula 16 the term shl(R^l_{i-1}) or D[t_i] represents the matching transition and the term shl(R^{l-1}_{i-1}) represents the transition replace (see the proof of Theorem 6.14). The term R^{l-1}_{i-1} represents the transition insert: each active state of level l-1 is moved into level l within the same depth. The term shl(R^{l-1}_i) represents the transition delete: each active state of level l-1 is moved to the next depth in level l while no symbol is read from the input (position i in text T is the same).
R^0   - a d c a b c a a b a d b b c a
a     1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d     1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
b     1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
b     1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c     1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

R^1   - a d c a b c a a b a d b b c a
a     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d     1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0
b     1 1 0 0 1 0 1 1 1 0 1 0 0 0 1 1
b     1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1
c     1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
a     1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0

R^2   - a d c a b c a a b a d b b c a
a     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b     1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b     1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
c     1 1 1 0 1 1 0 1 1 1 1 1 0 0 0 0
a     1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 0

R^3   - a d c a b c a a b a d b b c a
a     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b     1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c     1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
a     1 1 1 0 0 1 0 0 0 1 0 1 0 0 0 0

Table 6.7: Matrices R^l for the approximate string matching using the Levenshtein distance (P = adbbca, k = 3, and T = adcabcaabadbbca)
At the beginning, only the initial state q_0 and all states located on the same ε-diagonal as q_0 are active (i.e., all states of ε-CLOSURE(q_0) are active), therefore the l-th bit of vector R^l, 0 < l ≤ k, is 0 in the initial setting of the vector. The bits in front of the l-th bit in vector R^l can also be 0, since they have no impact (the l-th bit is always 0, since all states of ε-CLOSURE(q_0) are always active due to the self loop of q_0). Therefore the bits behind the l-th bit are set in the initial setting while the initial setting of the bits in front of the l-th bit can be arbitrary.
In the case of the Levenshtein distance, the situation with inserting 0s at the beginning of vectors R^l, 0 < l ≤ k, during the shl() operation is slightly different. Since all the states of the ε-diagonal leading from q_0 are always active, these 0 insertions have no impact. □
6.3.2.4 Approximate string matching using generalized Levenshtein distance
In order to construct the Shift-Or algorithm for the approximate string matching using the generalized Levenshtein distance [Hol97], we modify Formula 16 for the approximate string matching using the Levenshtein distance by adding the term representing the edit operation transpose. Since the NFA for the approximate string matching using the generalized Levenshtein distance has one auxiliary state on each transition transpose (see Figure 6.6), we have to introduce new bit vectors S^l_i, 0 ≤ l < k, 0 ≤ i < n, as follows:

S^l_i = [s^l_{1,i}, s^l_{2,i}, ..., s^l_{m,i}]^T, 0 ≤ l < k, 0 ≤ i < n.   (18)

Vectors R^l_i, 0 ≤ l ≤ k, 0 ≤ i ≤ n, and S^l_i, 0 ≤ l < k, 0 ≤ i < n, are then computed as follows:
r^l_{j,0} := 0, 0 < j ≤ l, 0 < l ≤ k
r^l_{j,0} := 1, l < j ≤ m, 0 ≤ l ≤ k
R^0_i := shl(R^0_{i-1}) or D[t_i], 0 < i ≤ n
R^l_i := (shl(R^l_{i-1}) or D[t_i])
         and shl(R^{l-1}_{i-1} and R^{l-1}_i and (S^{l-1}_{i-1} or D[t_i]))
         and (R^{l-1}_{i-1} or V), 0 < i ≤ n, 0 < l ≤ k
s^l_{j,0} := 1, 0 < j ≤ m, 0 ≤ l < k
S^l_i := shl(R^l_{i-1}) or shr(D[t_i]), 0 < i < n, 0 ≤ l < k   (19)
The term shl(R^l_{i-1}) or D[t_i] represents matching, the term shl(R^{l-1}_{i-1}) represents the edit operation replace, the term shl(R^{l-1}_i) represents the edit operation delete, and the term R^{l-1}_{i-1} represents the edit operation insert.
The term (S^{l-1}_{i-1} or D[t_i]) represents the edit operation transpose: the position in pattern P is increased by 2, the position in text T is also increased by 2, but the edit distance is increased just by 1. The increase of both positions by 2 is provided using vector S^l_i.
Pattern P is found with at most k differences ending at position i if r^k_{m,i} = 0, 0 < i ≤ n. The maximum number of differences of the found string is l, where l is the minimum number such that r^l_{m,i} = 0.
An example of matrices R^l and S^l for searching for pattern P = adbbca in text T = adbcbaabadbbca with at most k = 3 errors is shown in Table 6.8.
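Formula 19 can be sketched by extending the Levenshtein version with the auxiliary vectors S^l; shr() is the right shift of the mask vector required by the transpose transition. An illustrative Python sketch:

```python
def shift_or_gen_levenshtein(pattern, text, k):
    m = len(pattern)
    ones = (1 << m) - 1
    D = {}
    for j, p in enumerate(pattern):
        D[p] = D.get(p, ones) & ~(1 << j)
    V = 1 << (m - 1)
    R = [(ones << l) & ones for l in range(k + 1)]
    S = [ones] * k                          # auxiliary vectors S^0 ... S^{k-1}
    found = []
    for i, t in enumerate(text, start=1):
        prevR, prevS = R[:], S[:]
        d = D.get(t, ones)
        R[0] = ((prevR[0] << 1) & ones) | d
        for l in range(k):                  # S^l_i := shl(R^l_{i-1}) or shr(D[t_i])
            S[l] = ((prevR[l] << 1) & ones) | (d >> 1)
        for l in range(1, k + 1):
            R[l] = ((((prevR[l] << 1) & ones) | d)
                    & ((((prevR[l-1] & R[l-1]) & (prevS[l-1] | d)) << 1) & ones)
                    & (prevR[l-1] | V))
        if (R[k] >> (m - 1)) & 1 == 0:
            found.append(i)
    return found

print(shift_or_gen_levenshtein("abc", "abc", 0))    # -> [3]
print(shift_or_gen_levenshtein("abc", "zbacz", 1))  # -> [4]
```

With k = 0 the sketch degenerates to the exact Shift-Or; with k = 1 the transposed pair in "bac" is accepted as a single error.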
Theorem 6.16
The Shift-Or algorithm described by Formula 19 simulates a run of the NFA for the approximate string matching using the generalized Levenshtein distance.
Proof
In Formula 19 the term shl(R^l_{i-1}) or D[t_i] represents the matching transition, the term shl(R^{l-1}_{i-1}) represents the transition replace, the term R^{l-1}_{i-1} represents the transition insert, and the term shl(R^{l-1}_i) represents the transition delete (see the proof of Theorem 6.15).
The term (S^{l-1}_{i-1} or D[t_i]) represents the edit operation transpose. In this transition all states of level l are first moved to the next position to the right (shl(R^l_{i-1})) into the auxiliary states of level l; from there they are moved to the next position to the right of level l+1, and only the transitions corresponding to the input symbol t_{i+1} are selected by the mask vector D[t_{i+1}]. Since each transition leading to a state of depth j+1 is labeled by symbol p_j, we have to shift the mask vector D[t_{i+1}] in the same direction in which vector S^l_i is shifted. Therefore we insert the term (S^{l-1}_{i-1} or D[t_i]) inside the shl() of Formula 16, as well as the formula for the computation of vector S^l_i. □
6.3.3 Other methods of bit parallelism
In the previous sections we described only the Shift-Or algorithm. If we exchange the meaning of 0s and 1s and the usage of ands and ors in the formulae presented, we get the Shift-And algorithm [WM92].
Another method of bit parallelism is the Shift-Add algorithm [BYG92]. In this algorithm we have one bit vector, which contains m blocks of bits of size b = ⌈log₂ m⌉ (one block for each depth of the NFA for the approximate string matching using the Hamming distance). The formula for computing such a vector then consists of shifting the vector by b bits and adding it to the mask vector for the current input symbol t. This mask vector contains 1 in the (b·j)-th position if t ≠ p_{j+1}, or 0 otherwise. This algorithm runs in time O(⌈mb/w⌉n).
We can also use the Shift-Add algorithm for the weighted approximate string matching using the Hamming distance. In such a case each block of bits in the mask vector contains the binary representation of the weight of the corresponding edit operation replace. We also have to enlarge the length of the block of bits to prevent a carry to the next block of bits. Note that the Shift-Add algorithm can also be considered as an implementation of dynamic programming.
In [BYN96a] the authors improve the approximate string matching using the Levenshtein distance in such a way that they search for any of the first (k+1) symbols of the pattern. If they find any of them, they start the NFA simulation, and if the simulation then reaches the initial situation (i.e., only the states located on the ε-diagonal leading from the initial state are active), they again start searching for any of the first (k+1) symbols of the pattern, which is faster than the simulation.
The Shift-Or algorithm can also be used for the exact string matching with don't care symbols, classes of symbols, and complements, as shown in [BYG92]. In [WM92] the authors extend this method to unlimited wild cards and apply the above features to the approximate string matching. They also use bit parallelism for regular expressions, for the weighted approximate string matching, for sets of patterns, and for the situations when errors are not allowed in some parts of the pattern (another mask vector is used).
All the previous cases have good time complexity if the NFA simulation fits in one computer word (each vector in one computer word). If the NFA is larger, it is necessary to partition the NFA or the pattern as described in [BYN96b, BYN96a, BYN97, BYN99, NBY98, WM92].
The Shift-Or algorithm can also be used for multiple pattern matching [BYG92, BYN97] or for the distributed pattern matching [HIMM99].
6.3.4 Time and space complexity
The time and space analysis of the Shift-Or algorithm is as follows [WM92].
Denote the computer word size by w. The preprocessing requires O(m|A|)
time plus O(k⌈m/w⌉) time to initialize the k vectors. The running time is
O(nk⌈m/w⌉). The space complexity is O(|A|⌈m/w⌉) for the mask matrix D
plus O(k⌈m/w⌉) for the k vectors.
References
[Abr87] K. Abrahamson. Generalized string matching. SIAM J. Com-
put., 16(6):1039–1051, 1987.
[BYG92] R. A. Baeza-Yates and G. H. Gonnet. A new approach to text
searching. Commun. ACM, 35(10):74–82, 1992.
[BYN96a] R. A. Baeza-Yates and G. Navarro. A fast heuristic for ap-
proximate string matching. In N. Ziviani, R. Baeza-Yates, and
K. Guimarães, editors, Proceedings of the 3rd South American
Workshop on String Processing, pages 47–63, Recife, Brazil,
1996. Carleton University Press.
[BYN96b] R. A. Baeza-Yates and G. Navarro. A faster algorithm for ap-
proximate string matching. In D. S. Hirschberg and E. W. Myers,
editors, Proceedings of the 7th Annual Symposium on Combina-
torial Pattern Matching, number 1075 in Lecture Notes in Com-
puter Science, pages 1–23, Laguna Beach, CA, 1996. Springer-
Verlag, Berlin.
[BYN97] R. A. Baeza-Yates and G. Navarro. Multiple approximate string
matching. In F. K. H. A. Dehne, A. Rau-Chaplin, J.-R. Sack,
and R. Tamassia, editors, Proceedings of the 5th Workshop on Al-
gorithms and Data Structures, number 1272 in Lecture Notes in
Computer Science, pages 174–184, Halifax, Nova Scotia, Canada,
1997. Springer-Verlag, Berlin.
[BYN99] R. A. Baeza-Yates and G. Navarro. Faster approximate string
matching. Algorithmica, 23(2):127–158, 1999.
[CH97a] R. Cole and R. Hariharan. Tighter upper bounds on the exact
complexity of string matching. SIAM J. Comput., 26(3):803–856,
1997.
[CH97b] M. Crochemore and C. Hancart. Automata for matching pat-
terns. In G. Rozenberg and A. Salomaa, editors, Handbook of
Formal Languages, volume 2, Linear Modeling: Background and
Application, chapter 9, pages 399–462. Springer-Verlag, Berlin,
1997.
[Döm64] B. Dömölki. An algorithm for syntactical analysis. Computa-
tional Linguistics, (3):29–46, 1964.
[GL89] R. Grossi and F. Luccio. Simple and efficient string matching
with k mismatches. Inf. Process. Lett., 33(3):113–120, 1989.
[GP89] Z. Galil and K. Park. An improved algorithm for approximate
string matching. In G. Ausiello, M. Dezani-Ciancaglini, and
S. Ronchi Della Rocca, editors, Proceedings of the 16th Interna-
tional Colloquium on Automata, Languages and Programming,
number 372 in Lecture Notes in Computer Science, pages 394–
404, Stresa, Italy, 1989. Springer-Verlag, Berlin.
[HIMM99] J. Holub, C. S. Iliopoulos, B. Melichar, and L. Mouchard. Dis-
tributed string matching using finite automata. In R. Raman
and J. Simpson, editors, Proceedings of the 10th Australasian
Workshop on Combinatorial Algorithms, pages 114–128, Perth,
WA, Australia, 1999.
[Hol97] J. Holub. Simulation of NFA in approximate string and sequence
matching. In J. Holub, editor, Proceedings of the Prague Stringol-
ogy Club Workshop '97, pages 39–46, Czech Technical University,
Prague, Czech Republic, 1997. Collaborative Report DC-97-03.
[LV88] G. M. Landau and U. Vishkin. Fast string matching with k
differences. J. Comput. Syst. Sci., 37(1):63–78, 1988.
[MC97] C. Moore and J. P. Crutchfield. Quantum automata and quantum
grammars. 1997. http://xxx.lanl.gov/abs/quant-ph/9707031.
[Mel95] B. Melichar. Approximate string matching by finite automata.
In V. Hlaváč and R. Šára, editors, Computer Analysis of Images
and Patterns, number 970 in Lecture Notes in Computer Science,
pages 342–349. Springer-Verlag, Berlin, 1995.
[NBY98] G. Navarro and R. Baeza-Yates. Improving an algorithm for
approximate pattern matching. Technical Report TR/DCC-98-5,
Dept. of Computer Science, University of Chile, 1998.
[Sel80] P. H. Sellers. The theory and computation of evolutionary dis-
tances: Pattern recognition. J. Algorithms, 1(4):359–373, 1980.
[Shy76] R. K. Shyamasundar. A simple string matching algorithm. Tech-
nical report, Tata Institute of Fundamental Research, 1976.
[Ukk85] E. Ukkonen. Finding approximate patterns in strings. J. Algo-
rithms, 6(1–3):132–137, 1985.
[WF74] R. A. Wagner and M. Fischer. The string-to-string correction
problem. J. Assoc. Comput. Mach., 21(1):168–173, 1974.
[WM92] S. Wu and U. Manber. Fast text searching allowing errors. Com-
mun. ACM, 35(10):83–91, 1992.