Tsa Lectures 1

This document provides an overview of algorithms for processing text strings, with a focus on string matching algorithms. It discusses different types of string matching problems, including exact string matching, approximate string matching allowing for errors, and matching subsets or parts of strings. It also covers the use of finite automata to model these string matching problems and efficiently solve them. The document serves as a tutorial on stringology and string matching algorithms, with chapters on topics like pattern matching automata, computing repetitions in strings, and automata that recognize parts of strings.


Czech Technical University in Prague

Faculty of Electrical Engineering


Department of Computer Science and Engineering
TEXT SEARCHING ALGORITHMS
VOLUME I: FORWARD STRING MATCHING
Bořivoj Melichar, Jan Holub, Tomáš Polcar
November 2005
Preface
Text is the simplest and most natural representation of information in a range of areas. A text is a linear sequence of symbols from some alphabet. Text is manipulated in many application areas: processing of text in natural and formal languages, the study of sequences in molecular biology, music analysis, etc.
The design of algorithms that process texts goes back at least thirty years. In particular, the 1990s produced many new results. This progress is due in part to genome research, where text algorithms are often used.
The basic problem of text processing concerns string matching. It is used to access information, and this operation is performed very frequently.
We have recognized while working in this area that finite automata are very useful tools for understanding and solving many text processing problems. We have found in some cases that well-known algorithms are in fact simulators of non-deterministic finite automata serving as models of these algorithms. For this reason the material used in this course is based mainly on results from the theory of finite automata.
Because the string is a central notion in this area, Stringology has become the nickname of this subfield of algorithmic research.
We suppose that you, the reader of this tutorial, have basic knowledge in the following areas:
- Finite and infinite sets, operations with sets.
- Relations, operations with relations.
- Basic notions from the theory of oriented graphs.
- Regular languages, regular expressions, finite automata, operations with finite automata.
The material included in this tutorial corresponds to our point of view
on the respective aspects of Stringology. Some parts of the tutorial are the
results of our research and some of the principles described here have not
been published before.
Prague, November 2004 Authors
Contents

1 Text retrieval systems
1.1 Basic notions and notations
1.2 Classification of pattern matching problems
1.3 Two ways of pattern matching
1.4 Finite automata
1.5 Regular expressions
1.5.1 Definition of regular expressions
1.5.2 The relation between regular expressions and finite automata
2 Forward pattern matching
2.1 Elementary algorithm
2.2 Pattern matching automata
2.2.1 Exact string and sequence matching
2.2.2 Substring and subsequence matching
2.2.3 Approximate string matching - general alphabet
2.2.3.1 Hamming distance
2.2.3.2 Levenshtein distance
2.2.3.3 Generalized Levenshtein distance
2.2.4 Approximate string matching - ordered alphabet
2.2.4.1 Δ-distance
2.2.4.2 Γ-distance
2.2.4.3 (Δ, Γ)-distance
2.2.5 Approximate sequence matching
2.2.6 Matching of finite and infinite sets of patterns
2.2.7 Pattern matching with don't care symbols
2.2.8 Matching a sequence of patterns
2.3 Some deterministic pattern matching automata
2.3.1 String matching
2.3.2 Matching of a finite set of patterns
2.3.3 Regular expression matching
2.3.4 Approximate string matching - Hamming distance
2.3.5 Approximate string matching - Levenshtein distance
2.4 The state complexity of the deterministic pattern matching automata
2.4.1 Construction of a dictionary matching automaton
2.4.2 Approximate string matching
2.4.2.1 Hamming distance
2.4.2.2 Levenshtein distance
2.4.2.3 Generalized Levenshtein distance
2.4.2.4 Δ-distance
2.5 (Δ, Γ)-distance
2.6 Γ-distance
3 Finite automata accepting parts of a string
3.1 Prefix automaton
3.2 Suffix automaton
3.3 Factor automaton
3.4 Parts of suffix and factor automata
3.4.1 Backbone of suffix and factor automata
3.4.2 Front end of suffix or factor automata
3.4.3 Multiple front end of suffix and factor automata
3.5 Subsequence automata
3.6 Factor oracle automata
3.7 The complexity of automata for parts of strings
3.8 Automata for parts of more than one string
3.9 Automata accepting approximate parts of a string
4 Borders, repetitions and periods
4.1 Basic notions
4.2 Borders and periods
4.2.1 Computation of borders
4.2.2 Computation of periods
4.3 Border arrays
4.4 Repetitions
4.4.1 Classification of repetitions
4.4.2 Exact repetitions in one string
4.4.3 Complexity of computation of exact repetitions
4.4.4 Exact repetitions in a finite set of strings
4.4.5 Computation of approximate repetitions
4.4.6 Approximate repetitions - Hamming distance
4.4.7 Approximate repetitions - Levenshtein distance
4.4.8 Approximate repetitions - Δ-distance
4.4.9 Approximate repetitions - Γ-distance
4.4.10 Approximate repetitions - (Δ, Γ)-distance
4.4.11 Exact repetitions in one string with don't care symbols
4.5 Computation of periods revisited
5 Simulation of nondeterministic pattern matching automata - fail function
5.1 Searching automata
5.2 MP and KMP algorithms
5.3 AC algorithm
6 Simulation of nondeterministic finite automata - dynamic programming and bit parallelism
6.1 Basic simulation method
6.1.1 Implementation
6.1.1.1 NFA without ε-transitions
6.1.1.2 NFA with ε-transitions
6.2 Dynamic programming
6.2.1 Algorithm
6.2.2 String matching
6.2.2.1 Exact string matching
6.2.2.2 Approximate string matching using Hamming distance
6.2.2.3 Approximate string matching using Levenshtein distance
6.2.2.4 Approximate string matching using generalized Levenshtein distance
6.2.3 Time and space complexity
6.3 Bit parallelism
6.3.1 Algorithm
6.3.2 String matching
6.3.2.1 Exact string matching
6.3.2.2 Approximate string matching using Hamming distance
6.3.2.3 Approximate string matching using Levenshtein distance
6.3.2.4 Approximate string matching using generalized Levenshtein distance
6.3.3 Other methods of bit parallelism
6.3.4 Time and space complexity
References
1 Text retrieval systems
Text retrieval systems deal with the representation, storage, organization of,
and access to text documents. The organization of text documents should
provide the user with easy access to documents of interest. The database of
a retrieval system contains a collection of documents. The user can ask for
some documents. He must formulate his needs in the form of a query. The
query is processed by the search engine and the result is a set of selected
documents. This process is depicted in Fig. 1.1.

Figure 1.1: Retrieval system

The query is an expression containing keywords as basic elements. The simplest operation of the search
engine is the selection of documents containing some keywords of a given
query. In this text, we will concentrate on the algorithms used by search
engines. The simplest task can be formulated in this way:
Given text string T = t_1 t_2 ... t_n and pattern (keyword) P = p_1 p_2 ... p_m, verify whether string P is a substring of text T, where t_i and p_i are symbols of the alphabet.
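The elementary approach checks every position of the text as a possible start of an occurrence of the pattern. A minimal sketch in Python (the function name is ours, not from the tutorial):

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return 0-based positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        # compare the pattern with the text window starting at position i
        if text[i:i + m] == pattern:
            positions.append(i)
    return positions
```

For example, naive_match("abcabcab", "abc") yields [0, 3]. In the worst case this performs on the order of n · m symbol comparisons; the algorithms discussed in later chapters improve on this.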
This task is very simple but it is used very frequently. Very fast algorithms are therefore necessary for this task. The design of algorithms that process texts goes back at least to the 1970s. In particular, the last decade has produced many new results. We have recognized that finite automata are very useful tools for understanding and solving many text processing problems. We have found in some cases that well-known algorithms are in fact simulators of nondeterministic finite automata serving as models of these algorithms.
Because the string is the central notion in this area, stringology has become the nickname of this subfield of algorithmic research. To achieve
fast text searching, we can prepare either the pattern or the text or both.
This preparation is called preprocessing. We can use preprocessing as the
criterion for a general classication of text searching approaches. There are
four categories in this classication:
1. Neither the pattern nor the text is preprocessed. Elementary algorithms belong in this category.
2. The pattern is preprocessed. Pattern matching automata belong in this category.
3. The text is preprocessed. Factor automata and index methods belong in this category.
4. Both the text and the pattern are preprocessed. Signature methods, pattern matching automata and factor automata belong in this category.
This classification is represented in Fig. 1.2.

                                 Text preprocessing
                        NO                      YES
 Pattern        NO  Elementary algorithms   Factor automata,
 preprocessing                              index methods
                YES Pattern matching        Pattern matching automata,
                    automata                factor automata,
                                            signature methods

Figure 1.2: Classification of text searching approaches
1.1 Basic notions and notations
Some basic notions will be used in the following chapters. This section collects their definitions.
An alphabet is a nonempty finite set of symbols. The alphabet can be either ordered or unordered. The ordering is supposed to be total. Most operations can be used for either ordered or unordered alphabets (general alphabets). Some specific operations can be used only for totally ordered alphabets.
A string over a given alphabet is a finite sequence of symbols. The empty string ε is the empty sequence of symbols. We denote by A* the set of all strings over alphabet A (including the empty string ε). This set is always infinite. The set of nonempty strings over alphabet A is denoted by A+. It holds that A* = A+ ∪ {ε}. The complement of alphabet A for some set of symbols B, B ⊂ A, is denoted B̄ = A \ B. Notation ā means A \ {a}. The operation of concatenation is defined on the set of strings in this way: if x and y are strings over A, then the concatenation of these strings is xy. This operation is associative, i.e. (xy)z = x(yz). On the other hand, it is not commutative, i.e. xy ≠ yx. The empty string ε is the neutral element: εx = xε = x. The set of strings A* over alphabet A is a free monoid with ε as the neutral element. The length |x| of string x is the number of symbols of x. It holds that |x| ≥ 0 and |ε| = 0. We will use integer exponents for a string with repetitions: a^0 = ε, a^1 = a, a^2 = aa, a^3 = aaa, ..., for a ∈ A, and x^0 = ε, x^1 = x, x^2 = xx, x^3 = xxx, ..., for x ∈ A*.
Definition 1.1
Set Pref(x), x ∈ A*, is the set of all prefixes of string x:
Pref(x) = {y : x = yu, u, x, y ∈ A*}. □
Definition 1.2
Set Suff(x), x ∈ A*, is the set of all suffixes of string x:
Suff(x) = {y : x = uy, u, x, y ∈ A*}. □
Definition 1.3
Set Fact(x), x ∈ A*, is the set of all substrings (factors) of string x:
Fact(x) = {y : x = uyv, u, v, x, y ∈ A*}. □
Definition 1.4
Set Sub(x), x ∈ A*, is the set of all subsequences of string x:
Sub(x) = {a_1 a_2 ... a_m : x = y_0 a_1 y_1 a_2 ... a_m y_m,
y_i ∈ A*, i = 0, 1, 2, ..., m, a_j ∈ A, j = 1, 2, ..., m, m ≥ 0}. □
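For short strings, the four sets above can be enumerated by brute force. A sketch in Python (our own helper names; Sub is exponential in |x|, so this is for illustration only):

```python
from itertools import combinations

def pref(x: str) -> set[str]:
    # Pref(x): all prefixes, including the empty string and x itself
    return {x[:i] for i in range(len(x) + 1)}

def suff(x: str) -> set[str]:
    # Suff(x): all suffixes
    return {x[i:] for i in range(len(x) + 1)}

def fact(x: str) -> set[str]:
    # Fact(x): all substrings (factors)
    return {x[i:j] for i in range(len(x) + 1)
                   for j in range(i, len(x) + 1)}

def sub(x: str) -> set[str]:
    # Sub(x): all subsequences (choose any positions, keep their order)
    return {"".join(t) for m in range(len(x) + 1)
                       for t in combinations(x, m)}
```

For x = "abb": Pref(x) = {ε, a, ab, abb}, Suff(x) = {ε, b, bb, abb}, and "ab" is in Sub(x) while "ba" is not, since a subsequence must preserve the order of symbols.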
Definition 1.5
The terms proper prefix, proper suffix, proper factor and proper subsequence are used for a prefix, suffix, factor or subsequence of string x which is not equal to x. □
The definitions of the sets Pref, Suff, Fact and Sub can be extended to finite and infinite sets of strings.
Definition 1.6
Set Pref(X), X ⊆ A*, is the set of all prefixes of all strings x ∈ X:
Pref(X) = {y : x = yu, x ∈ X, u, y ∈ A*}. □
Definition 1.7
Set Suff(X), X ⊆ A*, is the set of all suffixes of all strings x ∈ X:
Suff(X) = {y : x = uy, x ∈ X, u, y ∈ A*}. □
Definition 1.8
Set Fact(X), X ⊆ A*, is the set of all substrings (factors) of all strings x ∈ X:
Fact(X) = {y : x = uyv, x ∈ X, u, v, y ∈ A*}. □
Definition 1.9
Set Sub(X), X ⊆ A*, is the set of all subsequences of all strings x ∈ X:
Sub(X) = {a_1 a_2 ... a_m : x = y_0 a_1 y_1 a_2 ... a_m y_m, x ∈ X, y_i ∈ A*,
i = 0, 1, 2, ..., m, a_j ∈ A, j = 1, 2, ..., m, m ≥ 0}. □
The definitions of the abovementioned sets can also be extended to approximate cases. In the following definitions D is a metric and k is the maximum allowed distance.
Definition 1.10
The set of approximate prefixes APref of string x is:
APref(x) = {u : v ∈ Pref(x), D(u, v) ≤ k}. □
Definition 1.11
The set of approximate suffixes ASuff of string x is:
ASuff(x) = {u : v ∈ Suff(x), D(u, v) ≤ k}. □
Definition 1.12
The set of approximate factors AFact of string x is:
AFact(x) = {u : v ∈ Fact(x), D(u, v) ≤ k}. □
Definition 1.13
The set of approximate subsequences ASub of string x is:
ASub(x) = {u : v ∈ Sub(x), D(u, v) ≤ k}. □
The term pattern matching is used for both string matching and sequence matching. The term subpattern matching is used for matching substrings or subsequences of a pattern.
Definition 1.14
The "don't care" symbol is a special universal symbol that matches any other symbol, including itself. □
Definition 1.15 (Basic pattern matching problems)
Given text T = t_1 t_2 ... t_n and pattern P = p_1 p_2 ... p_m, we may define:
1. String matching: verify whether string P is a substring of text T.
2. Sequence matching: verify whether sequence P is a subsequence of text T.
3. Subpattern matching: verify whether a subpattern of P (a substring or a subsequence) occurs in text T.
4. Approximate pattern matching: verify whether string x occurs in text T so that distance D(P, x) ≤ k for a given k < m.
5. Pattern matching with don't care symbols: verify whether pattern P containing don't care symbols occurs in text T. □
Definition 1.16 (Matching a sequence of patterns)
Given text T = t_1 t_2 ... t_n and a sequence of patterns (strings and/or sequences) P_1, P_2, ..., P_s, matching of the sequence of patterns P_1, P_2, ..., P_s is a verification whether an occurrence of pattern P_i in text T is followed by an occurrence of P_{i+1}, 1 ≤ i < s. □
Definitions 1.15 and 1.16 define pattern matching problems as decision problems, because the output is a Boolean value. A modified version of these problems consists in searching for the first, the last, or all occurrences of a pattern; moreover, the result may be the set of positions of the pattern in the text.
Instead of just one pattern, one can consider a finite or infinite set of patterns.
Definition 1.17 (Distances of strings - general alphabets)
Three variants of distances between two strings x and y are defined as the minimum number of editing operations:
1. replace (Hamming distance, R-distance),
2. delete, insert and replace (Levenshtein distance, DIR-distance),
3. delete, insert, replace and transpose of neighbour symbols (Damerau distance, generalized Levenshtein distance, DIRT-distance),
needed to convert string x into string y. □
The Hamming distance is a metric on a set of strings of equal length. The Levenshtein distance and the generalized Levenshtein distance are metrics on a set of strings not necessarily of equal length.
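The three distances can be computed with the standard dynamic-programming recurrences; these recurrences are textbook material, not taken from this tutorial. A sketch in Python:

```python
def hamming(x: str, y: str) -> int:
    # R-distance: number of positions where equal-length strings differ
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def levenshtein(x: str, y: str) -> int:
    # DIR-distance: minimum number of delete, insert and replace operations
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j] + 1,                       # delete
                          d[i][j-1] + 1,                       # insert
                          d[i-1][j-1] + (x[i-1] != y[j-1]))    # replace
    return d[m][n]

def damerau(x: str, y: str) -> int:
    # DIRT-distance: additionally allows transposing neighbour symbols
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (x[i-1] != y[j-1]))
            if i > 1 and j > 1 and x[i-1] == y[j-2] and x[i-2] == y[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)        # transpose
    return d[m][n]
```

Note that the DIRT-distance never exceeds the DIR-distance: transposing two neighbour symbols costs one operation instead of two replacements, e.g. damerau("ab", "ba") is 1 while levenshtein("ab", "ba") is 2.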
Definition 1.18 (Distances of strings - ordered alphabet)
Let A = {a_1, a_2, ..., a_p} be an ordered alphabet and let a_i, a_j be symbols of alphabet A. The Δ-distance of the symbols a_i, a_j is defined as
Δ(a_i, a_j) = |i - j|.
1. Δ-distance:
Let x, y be strings over alphabet A such that |x| = |y|; then the Δ-distance of x, y is defined as
Δ(x, y) = max_{i ∈ {1..|x|}} Δ(x_i, y_i).
2. Γ-distance:
Let x, y be strings over alphabet A such that |x| = |y|; then the Γ-distance of x, y is defined as
Γ(x, y) = Σ_{i ∈ {1..|x|}} Δ(x_i, y_i). □
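For an ordered alphabet these distances can be sketched in Python as follows; here the ordering is given by each symbol's position in an alphabet string, and the helper names are ours:

```python
def delta_sym(alphabet: str, a: str, b: str) -> int:
    # Δ(a_i, a_j) = |i - j|: distance of two symbols in the ordered alphabet
    return abs(alphabet.index(a) - alphabet.index(b))

def delta_dist(alphabet: str, x: str, y: str) -> int:
    # Δ-distance of equal-length strings: maximum per-position distance
    assert len(x) == len(y)
    return max((delta_sym(alphabet, a, b) for a, b in zip(x, y)), default=0)

def gamma_dist(alphabet: str, x: str, y: str) -> int:
    # Γ-distance of equal-length strings: sum of per-position distances
    assert len(x) == len(y)
    return sum(delta_sym(alphabet, a, b) for a, b in zip(x, y))
```

Since Γ(x, y) is the sum and Δ(x, y) the maximum of the same per-symbol distances, Γ(x, y) ≥ Δ(x, y) always holds, which is consistent with the requirement l ≤ k in (Δ, Γ)-matching below.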
Definition 1.19 (Approximate pattern matching - ordered alphabet)
Given text T = t_1 t_2 ... t_n and pattern P = p_1 p_2 ... p_m over a given ordered alphabet A, we define:
1. Δ-matching:
Find all occurrences of string x in text T so that |x| = |P| and Δ(P, x) ≤ k, where k is a given positive integer.
2. Γ-matching:
Find all occurrences of string x in text T so that |x| = |P| and Γ(P, x) ≤ k, where k is a given positive integer.
3. (Δ, Γ)-matching:
Find all occurrences of string x in text T so that |x| = |P|, Δ(P, x) ≤ l and Γ(P, x) ≤ k, where k, l are given positive integers such that l ≤ k. □
1.2 Classification of pattern matching problems
One-dimensional pattern matching problems for a finite alphabet can be classified according to several criteria. We will use six criteria for a classification leading to a six-dimensional space in which one point corresponds to a particular pattern matching problem.
Let us make a list of all dimensions, including the possible values in each dimension:
1. Nature of the pattern:
- string,
- sequence.
2. Integrity of the pattern:
- full pattern,
- subpattern.
3. Number of patterns:
- one,
- a finite number greater than one,
- an infinite number.
4. Way of matching:
- exact,
- approximate matching with Hamming distance (R-matching),
- approximate matching with Levenshtein distance (DIR-matching),
- approximate matching with generalized Levenshtein distance (DIRT-matching),
- Δ-approximate matching,
- Γ-approximate matching,
- (Δ, Γ)-approximate matching.
5. Importance of symbols in a pattern:
- take care of all symbols,
- don't care about some symbols.
6. Sequences of patterns:
- one,
- a finite sequence.
The above classification is represented in Figure 1.3. If we count the number of possible pattern matching problems, we obtain N = 2 · 2 · 3 · 7 · 2 · 2 = 336.
In order to facilitate references to a particular pattern matching problem, we will use abbreviations for all problems. These abbreviations are summarized in Table 1.1 (D means DIR-matching and G means DIRT-matching, i.e. matching with the generalized Levenshtein distance).
Figure 1.3: Classification of pattern matching problems
Using this method, we can, for example, refer to exact string matching
of one string as an SFOECO problem.
Instead of a single pattern matching problem we will also use the notion of a family of pattern matching problems. In this case we will use the symbol "?" instead of a particular letter. For example, SFO??? is the family of all problems concerning matching of one full string.
Each pattern matching problem has several instances. For example, an SFOECO problem has the following instances:
1. verify whether a given string occurs in the text or not,
2. find the first occurrence of a given string,
3. find the number of all occurrences of a given string,
4. find all occurrences of a given string and their positions.
If we take into account all possible instances, the number of pattern matching
problems grows further.
Dimension   1    2    3    4        5    6
            S    F    O    E        C    O
            Q    S    F    R        D    S
                      I    D
                           G
                           Δ
                           Γ
                           (Δ, Γ)

Table 1.1: Abbreviations for pattern matching problems
1.3 Two ways of pattern matching
There are two dierent ways in which matching of patterns can be per-
formed:
- forward pattern matching,
- backward pattern matching.
The basic principle of forward pattern matching is depicted in Fig. 1.4.

Figure 1.4: Forward pattern matching

The text and the pattern are matched in the forward direction. This means that
the comparison of symbols is performed from left to right. All algorithms
for forward pattern matching must compare each symbol of the text at least
once. Therefore the lowest time complexity is equal to the length of the
text.
The basic principle of backward pattern matching is depicted in Fig. 1.5.

Figure 1.5: Backward pattern matching
The comparison of symbols is performed from right to left. There are three main principles of backward pattern matching:
- looking for a repeated suffix of the pattern,
- looking for a prefix of the pattern,
- looking for an antifactor (a string which is not a factor) of the pattern.
Algorithms for backward pattern matching allow us to skip some part of the text, and therefore the number of comparisons can be lower than the length of the text.
1.4 Finite automata
We will use nite automata in all subsequent Chapters as a formalism for
the description of various aspects of pattern matching. In this Section we
introduce basic notions from the theory of nite automata and we also show
some basic algorithms concerning them. The material included is not ex-
haustive and we recommend using the special literature covering this area
in detail.
Definition 1.20 (Deterministic finite automaton)
A deterministic finite automaton (DFA) is a quintuple M = (Q, A, δ, q_0, F), where
Q is a finite set of states,
A is a finite input alphabet,
δ is a mapping from Q × A to Q (δ: Q × A → Q),
q_0 ∈ Q is an initial state,
F ⊆ Q is the set of final states.
Definition 1.21 (Configuration of FA)
Let M = (Q, A, δ, q_0, F) be a finite automaton. A pair (q, w) ∈ Q × A* is a configuration of the finite automaton M. Configuration (q_0, w) is called an initial configuration; configuration (q, ε), where q ∈ F, is called a final (accepting) configuration of the finite automaton M.
Definition 1.22 (Transition in DFA)
Let M = (Q, A, δ, q_0, F) be a deterministic finite automaton. Relation ⊢_M ⊂ (Q × A*) × (Q × A*) is called a transition in automaton M. If δ(q, a) = p, then (q, aw) ⊢_M (p, w) for each w ∈ A*. The k-th power of the relation ⊢_M will be denoted by ⊢_M^k. Symbols ⊢_M^+ and ⊢_M^* denote the transitive and the transitive and reflexive closure of relation ⊢_M, respectively.
Definition 1.23 (Language accepted by DFA)
We will say that input string w ∈ A* is accepted by deterministic finite automaton M = (Q, A, δ, q_0, F) if (q_0, w) ⊢_M^* (q, ε) for some q ∈ F.
Language L(M) = {w : w ∈ A*, (q_0, w) ⊢^* (q, ε), q ∈ F} is the language accepted by finite automaton M. String w ∈ L(M) if it consists only of symbols from the input alphabet and there is a sequence of transitions leading from the initial configuration (q_0, w) to a final configuration (q, ε), q ∈ F.
Definition 1.24 (Complete DFA)
A deterministic finite automaton M = (Q, A, δ, q_0, F) is said to be complete if the mapping δ(q, a) is defined for each pair of state q ∈ Q and input symbol a ∈ A.
Definition 1.25 (Nondeterministic finite automaton)
A nondeterministic finite automaton (NFA) is a quintuple M = (Q, A, δ, q_0, F), where
Q is a finite set of states,
A is a finite input alphabet,
δ is a mapping from Q × A into the set of subsets of Q,
q_0 ∈ Q is an initial state,
F ⊆ Q is the set of final states.
Definition 1.26 (Transition in NFA)
Let M = (Q, A, δ, q_0, F) be a nondeterministic finite automaton. Relation ⊢_M ⊂ (Q × A*) × (Q × A*) will be called a transition in automaton M: if p ∈ δ(q, a), then (q, aw) ⊢_M (p, w), for each w ∈ A*.
Definition 1.27 (Language accepted by NFA)
String w ∈ A* is said to be accepted by nondeterministic finite automaton M = (Q, A, δ, q_0, F) if there exists a sequence of transitions (q_0, w) ⊢* (q, ε) for some q ∈ F. Language L(M) = {w : w ∈ A*, (q_0, w) ⊢* (q, ε) for some q ∈ F} is then the language accepted by nondeterministic finite automaton M.
Definition 1.28 (NFA with ε-transitions)
A nondeterministic finite automaton with ε-transitions is a quintuple M = (Q, A, δ, q_0, F), where
Q is a finite set of states,
A is a finite input alphabet,
δ is a mapping from Q × (A ∪ {ε}) into the set of subsets of Q,
q_0 ∈ Q is an initial state,
F ⊆ Q is the set of final states.
Definition 1.29 (Transition in NFA with ε-transitions)
Let M = (Q, A, δ, q_0, F) be a nondeterministic finite automaton with ε-transitions. Relation ⊢_M ⊂ (Q × A*) × (Q × A*) will be called a transition in automaton M: if p ∈ δ(q, a), a ∈ A ∪ {ε}, then (q, aw) ⊢_M (p, w), for each w ∈ A*.
Definition 1.30 (CLOSURE)
Function CLOSURE for finite automaton M = (Q, A, δ, q_0, F) is defined as:
CLOSURE(q) = {p : (q, ε) ⊢* (p, ε), p ∈ Q}.
Definition 1.31 (NFA with a set of initial states)
A nondeterministic finite automaton M with a set of initial states I is a quintuple M = (Q, A, δ, I, F), where:
Q is a finite set of states,
A is a finite input alphabet,
δ is a mapping from Q × A into the set of subsets of Q,
I ⊆ Q is the nonempty set of initial states,
F ⊆ Q is the set of final states.
Definition 1.32 (Accessible state)
Let M = (Q, A, δ, q_0, F) be a finite automaton. State q ∈ Q is called accessible if there exists a string w ∈ A* such that there exists a sequence of transitions from initial state q_0 into state q:
(q_0, w) ⊢_M^* (q, ε).
A state which is not accessible is called inaccessible.
Definition 1.33 (Useful state)
Let M = (Q, A, δ, q_0, F) be a finite automaton. State q ∈ Q is called useful if there exists a string w ∈ A* such that there exists a sequence of transitions from state q into some final state:
(q, w) ⊢_M^* (p, ε), p ∈ F.
A state which is not useful is called useless.
Definition 1.34 (Finite automaton)
A finite automaton (FA) is a DFA or an NFA.
Definition 1.35 (Equivalence of finite automata)
Finite automata M_1 and M_2 are said to be equivalent if they accept the same language, i.e. L(M_1) = L(M_2).
Definition 1.36 (Sets of states)
Let M = (Q, A, δ, q_0, F) be a finite automaton. Let us define, for arbitrary a ∈ A, the set Q(a) ⊆ Q as follows:
Q(a) = {q : q ∈ δ(p, a), a ∈ A, p, q ∈ Q}.
Definition 1.37 (Homogeneous automaton)
Let M = (Q, A, δ, q_0, F) be a finite automaton and let Q(a) be the sets of states for all symbols a ∈ A. If for all pairs of symbols a, b ∈ A, a ≠ b, it holds that Q(a) ∩ Q(b) = ∅, then automaton M is called homogeneous. The collection of sets {Q(a) : a ∈ A} is, for a homogeneous finite automaton, a decomposition into classes having one of these two forms:
1. Q = ⋃_{a ∈ A} Q(a) ∪ {q_0} in the case that q_0 ∉ δ(q, a) for all q ∈ Q and all a ∈ A,
2. Q = ⋃_{a ∈ A} Q(a) in the case that q_0 ∈ δ(q, a) for some q ∈ Q, a ∈ A. In this case q_0 ∈ Q(a).
Algorithm 1.38
Construction of a nondeterministic finite automaton without ε-transitions equivalent to a nondeterministic finite automaton with ε-transitions.
Input: Finite automaton M = (Q, A, δ, q_0, F) with ε-transitions.
Output: Finite automaton M' = (Q, A, δ', q_0, F') without ε-transitions equivalent to M.
Method:
1. δ'(q, a) = ⋃_{p ∈ CLOSURE(q)} δ(p, a).
2. F' = {q : CLOSURE(q) ∩ F ≠ ∅, q ∈ Q}. □
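Algorithm 1.38, together with the CLOSURE function of Definition 1.30, can be sketched in Python over a dictionary-based NFA representation; the representation (transitions as a dict keyed by state and symbol, with "" playing the role of ε) is our choice, not from the text:

```python
def closure(delta, q):
    # CLOSURE(q): all states reachable from q by ε-transitions alone
    result, stack = {q}, [q]
    while stack:
        p = stack.pop()
        for r in delta.get((p, ""), set()):
            if r not in result:
                result.add(r)
                stack.append(r)
    return result

def remove_epsilon(states, alphabet, delta, final):
    # Alg. 1.38: δ'(q, a) is the union of δ(p, a) over p in CLOSURE(q);
    # F' contains every state whose ε-closure meets F
    delta2 = {}
    for q in states:
        for a in alphabet:
            targets = set()
            for p in closure(delta, q):
                targets |= delta.get((p, a), set())
            if targets:
                delta2[(q, a)] = targets
    final2 = {q for q in states if closure(delta, q) & final}
    return delta2, final2
```

For instance, with transitions {(0, ""): {1}, (1, "a"): {2}} and final state 2, the result has a direct transition δ'(0, "a") = {2} and the set of final states stays {2}.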
Algorithm 1.39
Construction of a nondeterministic finite automaton with a single initial state equivalent to a nondeterministic finite automaton with several initial states.
Input: Finite automaton M = (Q, A, δ, I, F) with a nonempty set I of initial states.
Output: Finite automaton M' = (Q', A, δ', q_0, F) with a single initial state q_0.
Method: Automaton M' will be constructed using the following two steps:
1. Q' = Q ∪ {q_0}, q_0 ∉ Q,
2. δ'(q_0, ε) = I,
δ'(q, a) = δ(q, a) for all q ∈ Q and all a ∈ A. □
The next algorithm constructs a deterministic finite automaton equivalent to a given nondeterministic finite automaton. The construction used is called the subset construction.
Algorithm 1.40
Transformation of a nondeterministic finite automaton to a deterministic finite automaton.
Input: Nondeterministic finite automaton M = (Q, A, δ, q_0, F).
Output: Deterministic finite automaton M' = (Q', A, δ', q'_0, F') such that L(M) = L(M').
Method:
1. Set Q' = {{q_0}} will be defined; state q'_0 = {q_0} will be treated as unmarked. (Please note that each state of the deterministic automaton consists of a set of states of the nondeterministic automaton.)
2. If each state in Q' is marked, then continue with step 4.
3. An unmarked state q' will be chosen from Q' and the following operations will be executed:
(a) δ'(q', a) = ⋃_{p ∈ q'} δ(p, a) for all a ∈ A,
(b) Q' = Q' ∪ {δ'(q', a)} for all a ∈ A,
(c) state q' will be marked,
(d) continue with step 2.
4. q'_0 = {q_0}.
5. F' = {q' : q' ∈ Q', q' ∩ F ≠ ∅}. □
Note: Let us mention that all states of the resulting deterministic finite automaton M' are accessible states (see Def. 1.32).
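The subset construction of Algorithm 1.40 can be sketched in Python; each DFA state is a frozenset of NFA states, i.e. a d-subset. The dictionary-based representation is ours:

```python
def determinize(alphabet, delta, q0, final):
    # Alg. 1.40: subset construction; each DFA state is the set of NFA
    # states (a d-subset), represented as a frozenset.
    start = frozenset({q0})
    dfa_delta, dfa_states = {}, {start}
    unmarked = [start]
    while unmarked:                      # steps 2-3: process unmarked states
        S = unmarked.pop()
        for a in alphabet:
            # (a) union of NFA transitions over all states of the d-subset
            T = frozenset(t for p in S for t in delta.get((p, a), set()))
            dfa_delta[(S, a)] = T
            if T not in dfa_states:      # (b) a new d-subset was created
                dfa_states.add(T)
                unmarked.append(T)
    # step 5: a d-subset is final iff it contains an NFA final state
    dfa_final = {S for S in dfa_states if S & final}
    return dfa_states, dfa_delta, start, dfa_final
```

For the NFA with transitions {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}} and final state 2 (a pattern matching NFA for "ab"), the construction produces d-subsets {0}, {0, 1} and {0, 2}, the last one being final.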
Definition 1.41 (d-subset)
Let M_1 = (Q_1, A, δ_1, q_01, F_1) be a nondeterministic finite automaton. Let M_2 = (Q_2, A, δ_2, q_02, F_2) be the deterministic finite automaton equivalent to automaton M_1. Automaton M_2 is constructed using the standard determinization algorithm based on the subset construction (see Alg. 1.40). Every state q ∈ Q_2 corresponds to some subset d of Q_1. This subset will be called a d-subset (deterministic subset). □
Notational convention:
A d-subset created during the determinization of a nondeterministic finite automaton has the form {q_i1, q_i2, ..., q_in}. If no confusion arises, we will write such a d-subset as q_i1 q_i2 ... q_in.
Definition 1.42
A d-subset is simple if it contains just one element; the corresponding state is called a simple state. A d-subset is multiple if it contains more than one element; the corresponding state will be called a multiple state. □
Algorithm 1.43
Construction of a finite automaton for the union of two languages.
Input: Two finite automata M1 and M2.
Output: Finite automaton M accepting the language L(M) = L(M1) ∪ L(M2).
Method:
1. Let M1 = (Q1, A, δ1, q01, F1), M2 = (Q2, A, δ2, q02, F2), Q1 ∩ Q2 = ∅.
2. The resulting automaton M = (Q, A, δ, q0, F) is constructed using the
   following steps:
   (a) Q = Q1 ∪ Q2 ∪ {q0}, q0 ∉ Q1 ∪ Q2,
   (b) δ(q0, ε) = {q01, q02},
       δ(q, a) = δ1(q, a) for all q ∈ Q1 and all a ∈ A,
       δ(q, a) = δ2(q, a) for all q ∈ Q2 and all a ∈ A.
3. F = F1 ∪ F2. □
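A minimal sketch of Algorithm 1.43, assuming each NFA is encoded as a tuple (states, delta, initial, finals), where delta maps (state, symbol) to a set of states and the empty string stands for ε; the fresh state name "u0" is an assumption.

```python
def union_nfa(nfa1, nfa2):
    """Alg. 1.43: NFA for L(M1) ∪ L(M2) -- a new initial state with
    ε-transitions into both original automata, whose state sets are
    assumed to be disjoint."""
    q1, d1, s1, f1 = nfa1
    q2, d2, s2, f2 = nfa2
    q0 = "u0"                       # fresh initial state (assumed unused)
    delta = {**d1, **d2}            # step 2: keep both original mappings
    delta[(q0, "")] = {s1, s2}      # step 2(b): ε-moves to q01 and q02
    return q1 | q2 | {q0}, delta, q0, f1 | f2   # step 3: F = F1 ∪ F2
```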
Algorithm 1.44
Construction of a finite automaton for the intersection of two languages.
Input: Two finite automata M1 = (Q1, A, δ1, q01, F1), M2 = (Q2, A, δ2, q02, F2).
Output: Finite automaton M = (Q, A, δ, q0, F) accepting the language
L(M) = L(M1) ∩ L(M2).
Method:
1. Let Q = {(q01, q02)}. State (q01, q02) will be treated as unmarked.
2. If all states in Q are marked, go to step 4.
3. Take any unmarked state q = (qn1, qm2) from Q and perform these
   operations:
   (a) determine δ((qn1, qm2), a) = (δ1(qn1, a), δ2(qm2, a)) for all a ∈ A,
   (b) if both transitions δ1(qn1, a) and δ2(qm2, a) are defined then
       Q = Q ∪ {(δ1(qn1, a), δ2(qm2, a))}, and state (δ1(qn1, a), δ2(qm2, a))
       will be treated as unmarked only if it is a new state in Q,
   (c) state (qn1, qm2) in Q will be treated as marked,
   (d) go to step 2.
4. q0 = (q01, q02).
5. F = {q : q ∈ Q, q = (qn1, qm2), qn1 ∈ F1, qm2 ∈ F2}. □
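Algorithm 1.44 restricted to deterministic automata can be sketched as follows; the tuple and dictionary encoding is again an illustrative assumption.

```python
def intersect_dfa(d1, s1, f1, d2, s2, f2, alphabet):
    """Alg. 1.44 (product construction) for deterministic automata:
    delta maps (state, symbol) -> state; an absent key means the
    transition is undefined."""
    start = (s1, s2)
    states = {start}
    unmarked = [start]
    delta = {}
    while unmarked:                          # steps 2-3
        p, q = unmarked.pop()
        for a in alphabet:
            # step 3(b): defined only if both component moves are defined
            if (p, a) in d1 and (q, a) in d2:
                t = (d1[(p, a)], d2[(q, a)])
                delta[((p, q), a)] = t
                if t not in states:
                    states.add(t)
                    unmarked.append(t)
    finals = {(p, q) for (p, q) in states
              if p in f1 and q in f2}        # step 5
    return states, delta, start, finals
```

Intersecting an automaton for "strings ending with a" with one for "strings of even length" illustrates how only pairs reachable from (q01, q02) are built.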
1.5 Regular expressions
1.5.1 Definition of regular expressions
Definition 1.45
A regular expression V over alphabet A is defined as follows:
1. ∅, ε, a are regular expressions for all a ∈ A.
2. If x, y are regular expressions over A then:
   (a) (x + y) (union)
   (b) (x · y) (concatenation)
   (c) (x)* (closure)
   are regular expressions over A.
Definition 1.46
The value h(x) of a regular expression x is defined as follows:
1. h(∅) = ∅, h(ε) = {ε}, h(a) = {a},
2. h(x + y) = h(x) ∪ h(y),
   h(x · y) = h(x) · h(y),
   h(x*) = (h(x))*.
The value of any regular expression is a regular language, and each reg-
ular language can be represented by some regular expression. Unnecessary
parentheses in regular expressions can be avoided by the convention that
assigns precedences to the regular operations: the closure operator has the
highest precedence, and the union operator has the lowest precedence.
The following axioms are defined for regular expressions:
A1:  x + (y + z) = (x + y) + z  (union associativity),
A2:  x·(y·z) = (x·y)·z  (concatenation associativity),
A3:  x + y = y + x  (union commutativity),
A4:  (x + y)·z = x·z + y·z  (distributivity from the right),
A5:  x·(y + z) = x·y + x·z  (distributivity from the left),
A6:  x + x = x  (union idempotence),
A7:  ε·x = x·ε = x  (ε is a unit element for the operation concatenation),
A8:  ∅·x = x·∅ = ∅  (∅ is a zero element for the operation concatenation),
A9:  x + ∅ = x  (∅ is a zero element for the operation union),
A10: x* = ε + x*·x,
A11: x* = (ε + x)*,
A12: x = α·x + β  ⇒  x = α*·β  (solution of the left regular equation),
A13: x = x·α + β  ⇒  x = β·α*  (solution of the right regular equation).
It has been proven that all other equalities between regular expressions
can be derived from these axioms.
1.5.2 The relation between regular expressions and finite automata
It is possible to construct for each regular expression V an equivalent finite
automaton M, which means such an automaton that h(V) = L(M). There
are several techniques for building up a finite automaton for a given regular
expression. The method shown here is based on the notion of adjacent
symbols.
Algorithm 1.47
Construction of an equivalent finite automaton for a given regular expression.
Input: Regular expression V.
Output: Finite automaton M = (Q, A, δ, q0, F) such that h(V) = L(M).
Method: Let A be the alphabet over which expression V is defined.
1. Number all occurrences of symbols from A in expression V by numbers
   1, 2, ..., n so that any two occurrences of the same symbol are numbered
   by different numbers. The resulting regular expression will be denoted
   by V'.
2. Build up the set of start symbols:
   Z = {xi : x ∈ A, some string from h(V') may start with symbol xi}.
3. Construct the set P of adjacent symbols:
   P = {xi yj : symbols xi and yj may be adjacent in some string from
   h(V')}.
4. Construct the set of final symbols F in the following way:
   F = {xi : some string from h(V') may end with symbol xi}.
5. The set of states of the finite automaton is
   Q = {q0} ∪ {xi : x ∈ A, i ∈ <1, n>}.
6. Mapping δ will be constructed in the following way:
   (a) δ(q0, x) includes xi for each xi ∈ Z such that xi was created by
       the numbering of x.
   (b) δ(xi, y) includes yj for each pair xi yj ∈ P such that yj was
       created by the numbering of y.
   (c) Set F is the set of final states. □
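To illustrate Algorithm 1.47, the sets Z, P and F below are hand-derived for the sample expression a·b*·c (an assumed example, not one taken from the text), and the resulting automaton is simulated on input words. Symbols are single characters here, so z[0] recovers the unnumbered symbol.

```python
# Alg. 1.47 worked by hand for the sample expression V = a.b*.c,
# numbered as V' = a1.b2*.c3 (an assumed example).
Z = {"a1"}                                            # start symbols
P = {("a1", "b2"), ("b2", "b2"), ("b2", "c3"),
     ("a1", "c3")}                                    # adjacent symbol pairs
F = {"c3"}                                            # final symbols

def run(word):
    """Simulate the automaton built by Alg. 1.47 (steps 5 and 6)."""
    current = {"q0"}
    for ch in word:
        nxt = set()
        for q in current:
            if q == "q0":        # step 6(a): transitions from q0 via Z
                nxt |= {z for z in Z if z[0] == ch}
            else:                # step 6(b): transitions via the pairs in P
                nxt |= {y for (x, y) in P if x == q and y[0] == ch}
        current = nxt
    return bool(current & F)     # accept if a final symbol is active
```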
2 Forward pattern matching
The basic principles of the forward pattern matching approach are discussed
in this Chapter. We will discuss two approaches from the classification
shown in Fig. 1.2:
1. Neither the pattern nor the text is preprocessed. We will introduce
   two programs implementing elementary algorithms for exact and ap-
   proximate pattern matching using Hamming distance, in both cases
   for a single pattern.
2. The pattern is preprocessed but the text is not preprocessed.
The preprocessing of the pattern is divided into several steps. The first step
is the construction of a nondeterministic finite automaton which will serve
as a model of the solution of the pattern matching problem in question.
Using this model as a basis for the next step, we can construct either a
deterministic finite automaton or a simulator, both equivalent to the basic
model. If the result of the preprocessing is a deterministic finite automaton,
then the pattern matching is performed so that the text is read as the
input of the automaton. Both the nondeterministic finite automaton as a
model and the equivalent deterministic finite automaton are constructed as
automata that are able to read any text. An occurrence of the pattern in
the text is found when the automaton reaches a final state. Finding the
pattern is then reported, and reading of the text continues in order to find
all following occurrences of the pattern, including overlapping cases.
This approach using deterministic finite automata has one advantage:
each symbol of the text is read just once. If we take the number of steps
performed by the automaton as a measure of the time complexity of the
forward pattern matching, then it is equal to the length of the text. On
the other hand, the use of deterministic finite automata may bring problems
with space complexity. The number of states of a deterministic finite
automaton may in some cases be very large in comparison with the length
of the pattern. Approximate pattern matching is an example of this case.
It is a limitation of this approach. A solution of this space problem is the
use of simulators of nondeterministic finite automata. We will show three
types of such simulators in the next Chapters:
1. use of the fail function,
2. dynamic programming, and
3. bit parallelism.
The space complexity of all of these simulators is acceptable. The time
complexity is greater than for deterministic finite automata in almost all
cases. It is even quadratic for dynamic programming.
Let us recall that text T = t1 t2 ... tn and pattern P = p1 p2 ... pm, and
all symbols of both the text and the pattern are from alphabet A.
2.1 Elementary algorithm
The elementary algorithm compares the symbols of the pattern with the
symbols of the text. The principle of this approach is shown in Fig. 2.1.

var TEXT: array[1..N] of char;
    PATTERN: array[1..M] of char;
    I,J: integer;
begin
  I:=0;
  while I ≤ N-M do
  begin
    J:=0;
    while (J<M) and (PATTERN[J+1]=TEXT[I+J+1]) do J:=J+1;
    if J=M then output(I+1);
    I:=I+1; { length of shift=1 }
  end;
end;

Figure 2.1: Elementary algorithm for exact matching of one pattern

The program presented here implements the algorithm performing the
exact matching of one pattern. The meanings of the variables used in the
program are represented in Fig. 2.2.

Figure 2.2: Meaning of variables in the program from Fig. 2.1

When the pattern is found, the value of variable I is the index of the
position just before the first symbol of the occurrence of the pattern in the
text. The comment length of shift=1 means that the pattern is shifted one
position to the right after each mismatch or after an occurrence of the
pattern is found. The term shift will be used later.
We will use the number of symbol comparisons (see the expression
PATTERN[J+1]=TEXT[I+J+1]) as the measure of the complexity of the
algorithm. The maximum number of symbol comparisons for the elementary
algorithm is
    NC = (n - m + 1) · m,                                   (1)
where n is the length of the text and m is the length of the pattern. We
assume that n ≥ m. The time complexity is O(n·m). The maximum number
of comparisons NC is reached for text T = a^(n-1)b and pattern P = a^(m-1)c,
where a, b, c ∈ A, c ≠ a. The elementary algorithm has no extra space
requirements.
The experimental measurements show that the number of comparisons
performed by the elementary algorithm for texts written in natural languages
is linear with respect to the length of the text. It has been observed that a
mismatch of the symbols is reached very soon (at the first or second symbol
of the pattern). The number of comparisons in this case is:
    NCnat = CL · (n - m + 1),                               (2)
where CL is a constant obtained by experiments for a given language L.
The value of this constant for English is CE = 1.07. Thus, the elementary
algorithm has linear time complexity (O(n)) for pattern matching in natural
language texts.
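The program of Fig. 2.1 can be restated in Python; the comparison counter is added only so that the bound NC of equation (1) can be observed.

```python
def elementary_match(text, pattern):
    """Elementary exact matching (Fig. 2.1) with shift length 1.
    Returns the 1-based occurrence positions (as output(I+1) does)
    and the number of symbol comparisons performed."""
    n, m = len(text), len(pattern)
    occurrences, comparisons = [], 0
    for i in range(n - m + 1):           # I = 0 .. N-M
        j = 0
        while j < m:
            comparisons += 1             # PATTERN[J+1] = TEXT[I+J+1]
            if pattern[j] != text[i + j]:
                break
            j += 1
        if j == m:
            occurrences.append(i + 1)    # output(I+1)
    return occurrences, comparisons
```

On the worst-case input T = a^(n-1)b, P = a^(m-1)c with n = 10 and m = 4, the counter reaches (10 - 4 + 1)·4 = 28, matching equation (1).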
The elementary algorithm can be used for matching a finite set of pat-
terns. In this case, the algorithm is used for each pattern separately. The
time complexity is
    O(n · Σ(i=1..s) mi),
where s is the number of patterns in the set and mi is the length of the i-th
pattern, i = 1, 2, ..., s.
The next variant of the elementary algorithm is for approximate pattern
matching of one pattern using Hamming distance. It is shown in Fig. 2.3.
2.2 Pattern matching automata
In this Section, we will show basic models of pattern matching algorithms.
Moreover, we will show how to construct models for more complicated prob-
lems using models of simple problems.
Notational convention:
We replace the names of states qi, qij by i, ij, respectively, in the subsequent
transition diagrams. The reason for this is to improve readability.
2.2.1 Exact string and sequence matching
The model of the algorithm for exact string matching (SFOECO problem)
for pattern P = p1 p2 p3 p4 is shown in Fig. 2.4. The SFOECO nondetermin-
istic finite automaton is constructed in this way:
1. Create the automaton accepting pattern P.
2. Insert a self-loop for all symbols from alphabet A in the initial state.
var TEXT: array[1..N] of char;
    PATTERN: array[1..M] of char;
    I,J,K,NERR: integer;
begin
  K:=number of errors allowed;
  I:=0;
  while I ≤ N-M do
  begin
    J:=0;
    NERR:=0;
    while (J<M) and (NERR≤K) do
    begin
      if PATTERN[J+1]≠TEXT[I+J+1] then NERR:=NERR+1;
      J:=J+1
    end;
    if (J=M) and (NERR≤K) then output(I+1);
    I:=I+1; { length of shift=1 }
  end;
end;

Figure 2.3: Elementary algorithm for approximate matching of one pattern
using Hamming distance
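The algorithm of Fig. 2.3 can be restated in Python, reporting every position where the pattern occurs with at most k mismatches.

```python
def hamming_match(text, pattern, k):
    """Elementary approximate matching (Fig. 2.3): 1-based positions
    where the pattern occurs with at most k mismatches."""
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):
        nerr = 0
        for j in range(m):
            if pattern[j] != text[i + j]:
                nerr += 1
                if nerr > k:          # more than k errors: give up early
                    break
        else:
            occurrences.append(i + 1)
    return occurrences
```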
[Transition diagram: states 0, 1, 2, 3, 4 in a row, transitions labelled
p1, p2, p3, p4, and a self-loop labelled A in state 0.]
Figure 2.4: Transition diagram of NFA for exact string matching (SFOECO
automaton) for pattern P = p1 p2 p3 p4
Algorithm 2.1 describes the construction of the SFOECO automaton in
detail.
Algorithm 2.1
Construction of the SFOECO automaton.
Input: Pattern P = p1 p2 ... pm.
Output: SFOECO automaton M.
Method: NFA M = ({q0, q1, ..., qm}, A, δ, q0, {qm}), where mapping δ is
constructed in the following way:
1. q_{i+1} ∈ δ(qi, p_{i+1}) for 0 ≤ i < m,
2. q0 ∈ δ(q0, a) for all a ∈ A. □
The SFOECO automaton has m + 1 states for a pattern of length m.
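The SFOECO model can be used for searching directly by maintaining the set of active states; the sketch below is that straightforward simulation, not one of the simulation methods (fail function, dynamic programming, bit parallelism) discussed later.

```python
def sfoeco_search(text, pattern):
    """Search by simulating the SFOECO NFA (Alg. 2.1) with a set of
    active states; state i means 'p1...pi has just been matched'.
    Returns 1-based end positions of all (overlapping) occurrences."""
    m = len(pattern)
    active = {0}
    ends = []
    for pos, ch in enumerate(text, start=1):
        nxt = {0}                        # self-loop: q0 ∈ δ(q0, a)
        for q in active:
            if q < m and pattern[q] == ch:
                nxt.add(q + 1)           # q(i+1) ∈ δ(qi, p(i+1))
        active = nxt
        if m in active:                  # final state qm is active
            ends.append(pos)
    return ends
```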
A model of the algorithm for exact sequence matching (QFOECO prob-
lem) for pattern P = p1 p2 p3 p4 is shown in Fig. 2.5.

[Transition diagram: as in Fig. 2.4, with additional self-loops in states 1, 2
and 3 for all symbols except p2, p3 and p4, respectively.]
Figure 2.5: Transition diagram of NFA for exact sequence matching
(QFOECO automaton) for pattern P = p1 p2 p3 p4

The QFOECO nondeterministic finite automaton is constructed as the
SFOECO automaton with the addition of some new self-loops. The new
self-loops are added in all states but the initial and final ones, for all
symbols except the symbol for which there is already a transition to the
next state. Algorithm 2.2 describes the construction of the QFOECO
automaton in detail.
Algorithm 2.2
Construction of the QFOECO automaton.
Input: Pattern P = p1 p2 ... pm.
Output: QFOECO automaton M.
Method: NFA M = ({q0, q1, ..., qm}, A, δ, q0, {qm}), where mapping δ is
constructed in the following way:
1. qi ∈ δ(qi, a) for 0 < i < m and all a ∈ A, a ≠ p_{i+1},
2. q0 ∈ δ(q0, a) for all a ∈ A,
3. q_{i+1} ∈ δ(qi, p_{i+1}) for 0 ≤ i < m. □
The QFOECO automaton has m + 1 states for a pattern of length m.
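The QFOECO automaton of Algorithm 2.2 can be simulated the same way as the SFOECO automaton; the only difference is the self-loop kept in the inner states.

```python
def qfoeco_search(text, pattern):
    """Simulate the QFOECO NFA (Alg. 2.2): the pattern is matched as a
    sequence, i.e. other symbols may occur between its symbols.
    Returns 1-based end positions."""
    m = len(pattern)
    active = {0}
    ends = []
    for pos, ch in enumerate(text, start=1):
        nxt = {0}                        # self-loop in the initial state
        for q in active:
            if q < m and pattern[q] == ch:
                nxt.add(q + 1)           # advance: q(i+1) ∈ δ(qi, p(i+1))
            elif 0 < q < m:
                nxt.add(q)               # inner self-loop for a ≠ p(i+1)
        active = nxt
        if m in active:
            ends.append(pos)
    return ends
```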
2.2.2 Substring and subsequence matching
A model of the algorithm for exact substring matching (SSOECO problem)
for pattern P = p1 p2 p3 p4 is shown in Fig. 2.6.
Notational convention:
The following nondeterministic finite automata have a regular structure. For
clarity of explanation, we will use the following terminology:
State qij is at depth i (a position in the pattern) and on level j.
The SSOECO nondeterministic finite automaton is constructed by com-
posing a collection of m copies of the SFOECO automaton. The composition
is done by inserting ε-transitions. These ε-transitions are inserted in the
diagonal direction. They start from the initial state on level zero and are
directed to the next level. Each following ε-transition starts from the end
state of the previous ε-transition. As the final step, the inaccessible states
are removed. Algorithm 2.3 describes the construction of the SSOECO
automaton in detail.
Figure 2.6: Transition diagram of NFA for exact substring matching
(SSOECO automaton) for pattern P = p1 p2 p3 p4
Algorithm 2.3
Construction of the SSOECO automaton.
Input: Pattern P = p1 p2 ... pm, SFOECO automaton M' = (Q', A, δ', q0', F')
for P.
Output: SSOECO automaton M.
Method:
1. Create a sequence of m instances of SFOECO automata for pattern P:
   Mj' = (Qj', A, δj', q0j, Fj') for j = 0, 1, 2, ..., m-1. Let the states in Qj'
   be q0j, q1j, ..., qmj.
2. Construct automaton M = (Q, A, δ, q0, F) as follows:
   Q = ∪(j=0..m-1) Qj',
   δ(q, a) = δj'(q, a) for all q ∈ Qj', a ∈ A, j = 0, 1, 2, ..., m-1,
   δ(q00, ε) = {q11},
   δ(q11, ε) = {q22},
   ...
   δ(q_{m-2,m-2}, ε) = {q_{m-1,m-1}},
   q0 = q00,
   F = Q \ {q00, q11, ..., q_{m-1,m-1}}.
3. Remove all states which are inaccessible from state q0. □
The SSOECO automaton has (m+1) + m + (m-1) + ... + 2 = m(m+3)/2
states. The SSOECO automaton can be minimized. The direct construction
of the main part of the minimized version of this automaton is described by
Algorithm 3.14 (construction of the factor automaton) and shown in Exam-
ple 3.15. It is enough to add self-loops in the initial state for all symbols of
the alphabet in order to obtain the SSOECO automaton. The advantage of
the non-minimized SSOECO automaton is that a unique state corresponds
to each substring of the pattern.
A model of the algorithm for exact subsequence matching (QSOECO
problem) for pattern P = p1 p2 p3 p4 is shown in Fig. 2.7. Construction of
the QSOECO nondeterministic finite automaton starts in the same way as
for the SSOECO automaton. The final part of this construction is the
addition of the ε-transitions. The diagonal ε-transitions start in all states
having transitions to following states, on levels from 0 to m-1. Algorithm 2.4
describes the construction of the QSOECO automaton in detail.
Algorithm 2.4
Construction of the QSOECO automaton.
Input: Pattern P = p1 p2 ... pm, QFOECO automaton M' = (Q', A, δ', q0', F')
for P.
Output: QSOECO automaton M.
Method:
1. Create a sequence of m instances of QFOECO automata for pattern P:
   Mj' = (Qj', A, δj', q0j, Fj') for j = 0, 1, 2, ..., m-1. Let the states in Qj'
   be q0j, q1j, ..., qmj.
2. Construct automaton M = (Q, A, δ, q0, F) as follows:
   Q = ∪(j=0..m-1) Qj',
   δ(q, a) = δj'(q, a) for all q ∈ Qj', a ∈ A, j = 0, 1, 2, ..., m-1,
   δ(qij, ε) = {q_{i+1,j+1}} for i = 0, 1, ..., m-1, j = 0, 1, 2, ..., m-2,
   q0 = q00,
   F = Q \ {q00, q11, ..., q_{m-1,m-1}}.
3. Remove the states which are inaccessible from state q0. □

Figure 2.7: Transition diagram of NFA for exact subsequence matching
(QSOECO automaton) for pattern P = p1 p2 p3 p4

The QSOECO automaton has (m+1) + m + (m-1) + ... + 2 = m(m+3)/2
states.
2.2.3 Approximate string matching - general alphabet
We will discuss three variants of approximate string matching corresponding
to the three definitions of distances between strings over a general alphabet:
Hamming distance, Levenshtein distance, and generalized Levenshtein dis-
tance.
Note:
The notion level of a state corresponds to the number of errors in the
nondeterministic finite automata for approximate pattern matching.
2.2.3.1 Hamming distance  Let us recall that the Hamming distance
(R-distance) between strings x and y is equal to the minimum number of
editing operations replace which are necessary to convert string x into string
y (see Def. 1.17). This type of string matching using R-distance is called
string R-matching.
A model of the algorithm for string R-matching (SFORCO problem) for
string P = p1 p2 p3 p4 is shown in Fig. 2.8. The construction of the SFORCO
nondeterministic finite automaton again uses a composition of SFOECO
automata, similarly to the construction of the SSOECO or QSOECO auto-
mata. The composition is done in this case by inserting diagonal transitions
starting in all states having transitions to next states, on levels from 0 to
k-1. The diagonal transitions are labelled by all symbols for which no
transition to the next state exists. These transitions represent replace
operations. Algorithm 2.5 describes the construction of the SFORCO
automaton in detail.
Figure 2.8: Transition diagram of NFA for string R-matching (SFORCO
automaton) for pattern P = p1 p2 p3 p4, k = 3
Algorithm 2.5
Construction of the SFORCO automaton.
Input: Pattern P = p1 p2 ... pm, k, SFOECO automaton M' = (Q', A, δ',
q0', F') for P.
Output: SFORCO automaton M.
Method:
1. Create a sequence of k + 1 instances of SFOECO automata
   Mj' = (Qj', A, δj', q0j', Fj') for j = 0, 1, 2, ..., k. Let the states in Qj'
   be q0j, q1j, ..., qmj.
2. Construct SFORCO automaton M = (Q, A, δ, q0, F) as follows:
   Q = ∪(j=0..k) Qj',
   δ(q, a) = δj'(q, a) for all q ∈ Qj', a ∈ A, j = 0, 1, 2, ..., k,
   δ(qij, a) ∋ q_{i+1,j+1} for all i = 0, 1, ..., m-1, j = 0, 1, 2, ..., k-1,
   a ∈ A \ {p_{i+1}},
   q0 = q00,
   F = ∪(j=0..k) Fj'.
3. Remove all states which are inaccessible from state q0. □
The SFORCO automaton has (m+1) + m + (m-1) + ... + (m-k+1) =
(m+1)(k+1) - k(k+1)/2 states.
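A state-set simulation of the SFORCO automaton; an active pair (i, j) stands for state qij, i.e. p1...pi matched with j replacements.

```python
def sforco_search(text, pattern, k):
    """Simulate the SFORCO NFA (Alg. 2.5): occurrences with Hamming
    distance at most k."""
    m = len(pattern)
    active = {(0, 0)}
    ends = []
    for pos, ch in enumerate(text, start=1):
        nxt = {(0, 0)}                      # self-loop in the initial state
        for i, j in active:
            if i < m:
                if pattern[i] == ch:
                    nxt.add((i + 1, j))     # horizontal: matching symbol
                elif j < k:
                    nxt.add((i + 1, j + 1)) # diagonal: replace operation
        active = nxt
        if any(i == m for i, _ in active):  # some final state qmj active
            ends.append(pos)
    return ends
```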
2.2.3.2 Levenshtein distance  Let us recall that the Levenshtein dis-
tance (DIR-distance) between strings x and y is equal to the minimum
number of editing operations delete, insert and replace which are necessary
to convert string x into string y (see Def. 1.17). This type of string matching
using DIR-distance is called string DIR-matching. A model of the algorithm
for string DIR-matching (SFODCO problem) for the pattern P = p1 p2 p3 p4
is shown in Fig. 2.9. Construction of the SFODCO nondeterministic finite
automaton is performed by an extension of the SFORCO automaton. The
extension is done by the following two operations:
1. Adding ε-transitions parallel to the diagonal transitions of the
   SFORCO automaton. These represent delete operations.
2. Adding vertical transitions. The labelling of the added vertical tran-
   sitions is the same as for the diagonal transitions. The vertical tran-
   sitions represent insert operations.
Algorithm 2.6 describes the construction of the SFODCO automaton in
detail.
Algorithm 2.6
Construction of the SFODCO automaton.
Input: Pattern P = p1 p2 ... pm, k, SFORCO automaton M' = (Q, A, δ',
q0, F) for P.
Output: SFODCO automaton M for P.
Method: Let the states in Q be
q00, q10, q20, ..., qm0,
     q11, q21, ..., qm1,
     ...
          qkk, ..., qmk.
Construct SFODCO automaton M = (Q, A, δ, q0, F) as follows:
δ(q, a) = δ'(q, a) for all q ∈ Q, a ∈ A,
δ(qij, ε) ∋ q_{i+1,j+1} for i = 0, 1, ..., m-1, j = 0, 1, 2, ..., k-1,
δ(qij, a) ∋ q_{i,j+1} for all i = 1, 2, ..., m-1, j = 0, 1, 2, ..., k-1,
a ∈ A \ {p_{i+1}}. □

Figure 2.9: Transition diagram of NFA for string DIR-matching (SFODCO
automaton) for pattern P = p1 p2 p3 p4, k = 3
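A state-set simulation of the SFODCO automaton under the index ranges of Algorithm 2.6; the ε-closure step realizes the diagonal ε-transitions (delete operations).

```python
def sfodco_search(text, pattern, k):
    """Simulate the SFODCO NFA: occurrences within Levenshtein
    distance k.  A pair (i, j) stands for state qij."""
    m = len(pattern)

    def close(states):
        stack = list(states)
        while stack:
            i, j = stack.pop()
            if i < m and j < k and (i + 1, j + 1) not in states:
                states.add((i + 1, j + 1))     # ε-transition: delete p(i+1)
                stack.append((i + 1, j + 1))
        return states

    active = close({(0, 0)})
    ends = []
    for pos, ch in enumerate(text, start=1):
        nxt = {(0, 0)}
        for i, j in active:
            if i < m and pattern[i] == ch:
                nxt.add((i + 1, j))            # horizontal: match
            if i < m and j < k and pattern[i] != ch:
                nxt.add((i + 1, j + 1))        # diagonal: replace
            if 0 < i < m and j < k and pattern[i] != ch:
                nxt.add((i, j + 1))            # vertical: insert
        active = close(nxt)
        if any(i == m for i, _ in active):
            ends.append(pos)
    return ends
```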
2.2.3.3 Generalized Levenshtein distance  Let us recall that the gen-
eralized Levenshtein distance (DIRT-distance) between strings x and y is
equal to the minimum number of editing operations delete, insert, replace
and transpose which are necessary to convert string x into string y (see
Def. 1.17). This type of string matching using DIRT-distance is called string
DIRT-matching.
A model of the algorithm for string DIRT-matching (SFOGCO prob-
lem) for pattern P = p1 p2 p3 p4 is shown in Fig. 2.10. Construction of the
SFOGCO nondeterministic finite automaton is performed by an extension of
the SFODCO automaton. The extension is done by adding the flat diagonal
transitions representing transpose operations of neighbouring symbols. Algo-
rithm 2.7 describes the construction of the SFOGCO automaton in detail.
Algorithm 2.7
Construction of the SFOGCO automaton.
Input: Pattern P = p1 p2 ... pm, k, SFODCO automaton
M' = (Q', A, δ', q0, F) for P.
Output: SFOGCO automaton M for P.
Method: Let the states in Q' be

Figure 2.10: Transition diagram of NFA for string DIRT-matching
(SFOGCO automaton) for pattern P = p1 p2 p3 p4, k = 3
q00, q10, q20, ..., qm0,
     q11, q21, ..., qm1,
     ...
          qkk, ..., qmk.
Construct SFOGCO automaton M = (Q, A, δ, q0, F) as follows:
Q = Q' ∪ {rij : j = 0, 1, ..., k-1, i = j, j+1, ..., m-2},
δ(q, a) = δ'(q, a) for all q ∈ Q', a ∈ A ∪ {ε},
δ(qij, a) ∋ rij for j = 0, 1, ..., k-1, i = j, j+1, ..., m-2,
   if q_{i+2,j} ∈ δ'(q_{i+1,j}, a),
δ(rij, a) ∋ q_{i+2,j+1} for j = 0, 1, ..., k-1, i = j, j+1, ..., m-2,
   if q_{i+1,j} ∈ δ'(qij, a). □
2.2.4 Approximate string matching - ordered alphabet
We will discuss three variants of approximate string matching corresponding
to the three definitions of distances between strings over an ordered alphabet:
Δ-distance, Γ-distance, and (Δ,Γ)-distance. The notation introduced in the
following definition will be used in this Section.
Definition 2.8
Let A be an ordered alphabet, A = {a1, a2, ..., a|A|}. We denote the following
sets in this way:
   ai^j  = {a_{i-j}, a_{i+j}},
   ai^j+ = {a_{i-1}, a_{i-2}, ..., a_{i-j}, a_{i+1}, a_{i+2}, ..., a_{i+j}},
   ai^j* = ai^j+ ∪ {ai}.
Some elements of these sets may be missing when ai is close either to the
beginning or to the end of the ordered alphabet A. □
The definition is presented in Fig. 2.11.

[Figure: the row a_{i-3}, a_{i-2}, a_{i-1}, ai, a_{i+1}, a_{i+2}, a_{i+3}
shown three times, with the elements of ai^2, ai^2+ and ai^2* highlighted
in turn.]
Figure 2.11: Visualisation of ai^j, ai^j+, ai^j* from Definition 2.8, j = 2
2.2.4.1 Δ-distance  Let us note that two strings x and y have Δ-distance
equal to k if the symbols at corresponding positions have maximum mutual
distance equal to k (see Def. 1.18). This type of string matching using
Δ-distance is called string Δ-matching (see Def. 1.19).
A model of the algorithm for string Δ-matching (SFOΔCO problem)
for string P = p1 p2 p3 p4 is shown in Fig. 2.12.
The construction of the SFOΔCO nondeterministic finite automaton is
based on the composition of k + 1 copies of the SFOECO automaton with
a simple modification. The modification consists in changing the horizontal
transitions in all copies but the first: in copy j, the transition for symbol pi
becomes a transition for all symbols in pi^j*, i = 2, 3, ..., m. The composi-
tion is done by inserting diagonal transitions having different angles and
starting in all non-final states of all copies but the last one. They lead to
all following copies. The inserted transitions represent replace operations.
Algorithm 2.9 describes the construction of the SFOΔCO automaton in
detail.
Algorithm 2.9
Construction of the SFOΔCO automaton.

Figure 2.12: Transition diagram of NFA for string Δ-matching (SFOΔCO
problem) for pattern P = p1 p2 p3 p4
Input: Pattern P = p1 p2 ... pm, k, SFOECO automaton
M' = (Q', A, δ', q0', F') for P.
Output: SFOΔCO automaton M.
Method:
1. Create a sequence of k + 1 instances of SFOECO automata
   Mj' = (Qj', A, δj', q0j', Fj') for j = 0, 1, 2, ..., k. Let the states in Qj'
   be q0j, q1j, ..., qmj.
2. Construct SFOΔCO automaton M = (Q, A, δ, q0, F) as follows:
   Q = ∪(j=0..k) Qj',
   δ(q, a) = δ0'(q, a) for all q ∈ Q0', a ∈ A,
   δ(qij, b) = δj'(qij, p_{i+1}) for all b ∈ p_{i+1}^{j*}, i = 1, 2, ..., m-1,
   j = 1, 2, ..., k,
   δ(qij, a) ∋ q_{i+1,d} for all i = 0, 1, ..., m-1, j = 0, 1, 2, ..., k-1,
   a ∈ p_{i+1}^{k+}, where d is the distance of a from p_{i+1}, d > j,
   q0 = q00,
   F = ∪(j=0..k) Fj'.
3. Remove all states which are inaccessible from state q0. □
The SFOΔCO automaton has m(k + 1) + 1 states.
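Because the Δ-distance bounds every position independently, Δ-matching can also be checked directly from the definition; using character codes (ord) as the alphabet ordering is an assumption of this sketch.

```python
def delta_match(text, pattern, k):
    """String Δ-matching checked directly from the definition: report
    1-based end positions of substrings whose symbols each differ from
    the corresponding pattern symbol by at most k positions in the
    alphabet order (here: character codes)."""
    m = len(pattern)
    ends = []
    for i in range(len(text) - m + 1):
        if all(abs(ord(text[i + j]) - ord(pattern[j])) <= k
               for j in range(m)):
            ends.append(i + m)
    return ends
```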
2.2.4.2 Γ-distance  Let us note that two strings x and y have Γ-distance
equal to k if the symbols at corresponding positions have distance less than
or equal to k and the sum of all these distances is less than or equal to k.
The Γ-distance may be equal to the Δ-distance (see Def. 1.18). This type of
string matching using Γ-distance is called string Γ-matching (see Def. 1.19).
A model of the algorithm for string Γ-matching (SFOΓCO problem) for
string P = p1 p2 p3 p4 is shown in Fig. 2.13. Construction of the SFOΓCO
nondeterministic finite automaton is based on the composition of k + 1
copies of the SFOECO automaton. The composition is done by inserting
diagonal transitions having different angles and starting in all non-final
states of all copies but the last one. They lead to all following copies. The
inserted transitions represent replace operations. Algorithm 2.10 describes
the construction of the SFOΓCO automaton in detail.

Figure 2.13: Transition diagram of NFA for string Γ-matching (SFOΓCO
problem) for pattern P = p1 p2 p3 p4, k = 3
Algorithm 2.10
Construction of the SFOΓCO automaton.
Input: Pattern P = p1 p2 ... pm, k, SFOECO automaton
M' = (Q', A, δ', q0', F') for P.
Output: SFOΓCO automaton M.
Method:
1. Create a sequence of k + 1 instances of SFOECO automata
   Mj' = (Qj', A, δj', q0j', Fj') for j = 0, 1, 2, ..., k. Let the states in Qj'
   be q0j, q1j, ..., qmj.
2. Construct SFOΓCO automaton M = (Q, A, δ, q0, F) as follows:
   Q = ∪(j=0..k) Qj',
   δ(q, a) = δj'(q, a) for all q ∈ Qj', a ∈ A, j = 0, 1, 2, ..., k,
   δ(qij, a) ∋ q_{i+1,j+d} for all i = 0, 1, ..., m-1, j = 0, 1, 2, ..., k-1,
   a ∈ p_{i+1}^{(k-j)+}, where d is the distance of a from p_{i+1},
   1 ≤ d ≤ k-j,
   q0 = q00,
   F = ∪(j=0..k) Fj'.
3. Remove all states which are inaccessible from state q0. □
The SFOΓCO automaton has m(k + 1) + 1 states.
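Γ-matching admits the same kind of direct definition-based check, with the sum of the per-position distances bounded by k (character codes again stand in for the alphabet ordering).

```python
def gamma_match(text, pattern, k):
    """String Γ-matching checked directly: report 1-based end positions
    of substrings whose per-position distances to the pattern sum to
    at most k."""
    m = len(pattern)
    ends = []
    for i in range(len(text) - m + 1):
        if sum(abs(ord(text[i + j]) - ord(pattern[j]))
               for j in range(m)) <= k:
            ends.append(i + m)
    return ends
```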
2.2.4.3 (Δ,Γ)-distance  Let us note that two strings x and y have (Δ,Γ)-
distance (l, k) if the symbols at corresponding positions have distance less
than or equal to l and the sum of these distances is less than or equal to k.
The Δ-distance is strictly less than the Γ-distance (see Def. 1.18). This type
of string matching using (Δ,Γ)-distance is called string (Δ,Γ)-matching (see
Def. 1.19).
A model of the algorithm for string (Δ,Γ)-matching (SFO(Δ,Γ)CO prob-
lem) for the string P = p1 p2 p3 p4 is shown in Fig. 2.14. Construction of the
SFO(Δ,Γ)CO nondeterministic finite automaton is similar to the construc-
tion of the SFOΓCO automaton. The only difference is that the number of
diagonal transitions is limited by l. Algorithm 2.11 describes the construc-
tion of the SFO(Δ,Γ)CO automaton in detail.

Figure 2.14: Transition diagram of NFA for string (Δ,Γ)-matching
(SFO(Δ,Γ)CO problem) for pattern P = p1 p2 p3 p4, l = 2, k = 3

Algorithm 2.11
Construction of the SFO(Δ,Γ)CO automaton.
Input: Pattern P = p1 p2 ... pm, k, l, SFOECO automaton
M' = (Q', A, δ', q0', F') for P.
Output: SFO(Δ,Γ)CO automaton M.
Method:
1. Create a sequence of k + 1 instances of SFOECO automata
   Mj' = (Qj', A, δj', q0j', Fj') for j = 0, 1, 2, ..., k. Let the states in Qj'
   be q0j, q1j, ..., qmj.
2. Construct the SFO(Δ,Γ)CO automaton M = (Q, A, δ, q0, F) as follows:
   Q = ∪(j=0..k) Qj',
   δ(q, a) = δj'(q, a) for all q ∈ Qj', a ∈ A, j = 0, 1, 2, ..., k,
   δ(qij, a) ∋ q_{i+1,j+d} for all i = 0, 1, ..., m-1, j = 0, 1, 2, ..., k-1,
   a ∈ p_{i+1}^{l+}, where d is the distance of a from p_{i+1}, d ≤ l and
   j + d ≤ k,
   q0 = q00,
   F = ∪(j=0..k) Fj'.
3. Remove all states which are inaccessible from state q0.
The SFO(Δ,Γ)CO automaton has fewer than m(k + 1) + 1 states. □
2.2.5 Approximate sequence matching

Figure 2.15: Transition diagram of NFA for sequence R-matching
(QFORCO automaton)

Figure 2.16: Transition diagram of NFA for sequence DIR-matching
(QFODCO automaton)

Here we discuss six variants of approximate sequence matching in which
the following distances are used: Hamming distance, Levenshtein distance,
generalized Levenshtein distance, Δ-distance, Γ-distance, and (Δ,Γ)-distance.
There are two ways of constructing nondeterministic finite automata for ap-
proximate sequence matching:
- The first way is to construct a QFO?CO nondeterministic finite au-
  tomaton by the corresponding algorithm for approximate string match-
  ing, to which we give the QFOECO automaton as the input.
- The second way is to transform an SFO?CO automaton into a QFO?CO
  automaton by adding self-loops for all symbols to all non-initial states
  that have at least one outgoing transition, as shown in Algorithm 2.12.
Algorithm 2.12
Transformation of an SFO?CO automaton to a QFO?CO automaton.
Input: SFO?CO automaton M' = (Q', A, δ', q0', F').
Output: QFO?CO automaton M.
Method: Construct automaton M = (Q, A, δ, q0, F) as follows:
1. Q = Q'.
2. δ(q, a) = δ'(q, a) for each q ∈ Q, a ∈ A ∪ {ε}.

Figure 2.17: Transition diagram of NFA for sequence DIRT-matching
(QFOGCO automaton)

3. δ(q, a) = δ'(q, a) ∪ {q} for all q ∈ Q such that δ(q, a) ≠ ∅ and
   a ∈ A \ {pi}, where pi is the label of the outgoing transition from
   state q to the next state at the same level.
4. q0 = q0'.
5. F = F'. □
Transition diagrams of the resulting automata are depicted in Figs. 2.15,
2.16 and 2.17 for pattern P = p1 p2 p3 p4, k = 3.
2.2.6 Matching of finite and infinite sets of patterns
A model of the algorithm for matching a finite set of patterns is constructed
as the union of the NFAs for matching the individual patterns.
As an example we show the model for exact matching of the set of
patterns P = {p1p2p3, p4p5, p6p7p8} (SFFECO problem). This is shown in
Fig. 2.18. This automaton is in some contexts called a dictionary matching
automaton. A dictionary is a finite set of strings.

Figure 2.18: Transition diagram of the nondeterministic finite automaton
for exact matching of the finite set of strings P = {p1p2p3, p4p5, p6p7p8}
(SFFECO automaton)

The operation union of nondeterministic finite automata is the general
approach for the whole family of matching of a finite set of patterns (??F???
family). Moreover, the way of matching of each individual pattern must be
defined.
The next algorithm describes this approach and assumes that the way
of matching is fixed (exact, approximate, ...).
Algorithm 2.13
Construction of the ??F??? automaton.
Input: A set of patterns with a specification of the way of matching each:
P = {P1(w1), P2(w2), ..., Pr(wr)}, where P1, P2, ..., Pr are patterns and
w1, w2, ..., wr are specifications of the ways of matching them.
Output: The ??F??? automaton.
Method:
1. Construct an NFA for each pattern Pi, 1 ≤ i ≤ r, with respect to the
   specification of matching wi.
2. Create the NFA for the language which is the union of all input lan-
   guages of the automata constructed in step 1. The resulting automaton
   is the ??F??? automaton. □
Example 2.14
Let the input to Algorithm 2.13 be P = {abc(SFOECO), def(QFOECO),
xyz(SSOECO)}. The result of step 1 of Algorithm 2.13 is shown in Fig. 2.19.
The transition diagram of the final automaton (the union of the SFOECO,
QFOECO and SSOECO automata) is shown in Fig. 2.20. □
The model of the algorithm for matching an infinite set of patterns is
based on a finite automaton accepting this set. The infinite set of patterns
is in this case defined by a regular expression. Let us present the exact
matching of an infinite set of strings (SFIECO problem). The construction
of the SFIECO nondeterministic finite automaton is performed in two steps.
Figure 2.19: Transition diagrams of the nondeterministic finite automata for
the individual patterns from Example 2.14
In the first step, a finite automaton accepting the language defined by the
given regular expression is constructed. The self loop in its initial state for
all symbols from the alphabet is added in the second step. Algorithm 2.15
describes the construction of the SFIECO automaton in detail.
Algorithm 2.15
Construction of the SFIECO automaton.
Input: Regular expression R describing a set of strings over alphabet A.
Output: SFIECO automaton M.
Method:
1. Construct finite automaton M′ = (Q, A, δ′, q0, F) such that L(M′) =
h(R), where h(R) is the value of the regular expression R.
2. Construct nondeterministic finite automaton M = (Q, A, δ, q0, F), where
δ(q, a) = δ′(q, a) for all q ∈ Q \ {q0}, a ∈ A,
δ(q0, a) = δ′(q0, a) ∪ {q0} for all a ∈ A. □
Figure 2.20: Transition diagram of the resulting finite automaton for the set
of patterns P from Example 2.14
Example 2.16
Let the regular expression R = ab*c + bc be given over alphabet A = {a, b, c}.
The result of step 1 of Algorithm 2.15 is shown in Fig. 2.21a. The final result
of Algorithm 2.15 is shown in Fig. 2.21b. □
The construction of the QFIECO nondeterministic finite automaton
starts, as for the SFIECO automaton, with the construction of the finite
automaton accepting the language defined by the given regular expression.
Then self loops for all symbols of the alphabet are added to the initial
state and to all states having outgoing transitions for more than one symbol.
In each state having only one outgoing transition (for symbol a), self loops
are added for all symbols but symbol a. Algorithm 2.17 describes the
construction of the QFIECO automaton in detail.
Algorithm 2.17
Construction of the QFIECO automaton.
Input: Regular expression R describing a set of strings over alphabet A.
Output: The QFIECO automaton.
Method:
1. Construct finite automaton M′ = (Q, A, δ′, q0, F) such that L(M′) =
h(R), where h(R) is the value of the regular expression R.
2. Construct nondeterministic finite automaton M = (Q, A, δ, q0, F), where
δ(q, a) = δ′(q, a) for all q ∈ Q, a ∈ A,
δ(q0, a) = δ′(q0, a) ∪ {q0} for all a ∈ A,
δ(q, b) = δ′(q, b) ∪ {q} for all q ∈ Q having an outgoing transition for
exactly one symbol a ∈ A, and for all b ∈ A \ {a},
δ(q, a) = δ′(q, a) ∪ {q} for all a ∈ A and all q ∈ Q having outgoing
transitions for more than one symbol. □
Figure 2.21: Transition diagrams of the finite automata of Example 2.16
Example 2.18
Let a regular expression be R = ab*c + bc over alphabet A = {a, b, c}.
The result of Algorithm 2.17 is the automaton having the transition diagram
depicted in Fig. 2.22. □
2.2.7 Pattern matching with don't care symbols
Let us recall that the don't care symbol ◦ is the symbol matching any
other symbol from alphabet A including itself. The transition diagram of
the nondeterministic finite automaton for exact string matching with the
don't care symbol (SFOEDO problem) for pattern P = p1p2◦p4 is shown
in Fig. 2.23.
Figure 2.22: Transition diagram of the resulting QFIECO automaton from
Example 2.18
Figure 2.23: Transition diagram of the nondeterministic finite automaton for
exact string matching with the don't care symbol (SFOEDO automaton)
for pattern P = p1p2◦p4
An interesting point of this automaton is the transition from state 2
to state 3 corresponding to the don't care symbol. This is in fact a set of
transitions for all symbols of alphabet A. The rest of the automaton is the
same as for the SFOECO automaton.
The transition diagram of the nondeterministic finite automaton for exact
sequence matching with the don't care symbol (QFOEDO problem) for
the pattern P = p1p2◦p4 is shown in Fig. 2.24.
Figure 2.24: Transition diagram of the nondeterministic finite automaton
for exact sequence matching with the don't care symbol (QFOEDO
automaton) for pattern P = p1p2◦p4
The transition for the don't care symbol is the same as for string
matching. However, the rest of the automaton is a slightly changed QFOECO
automaton. The self loop in state 2 is missing. The reason for this is that
the symbol following symbol p2 is always taken as the third element of the
given sequence, because we do not care which symbol it is.
The construction of automata for other problems with don't care symbols
uses the principle of inserting sets of transitions for all symbols of
the alphabet at the places corresponding to the positions of the don't care
symbols.
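The effect of the don't care symbol in the SFOEDO automaton can be checked against a naive position-by-position comparison. The sketch below is our own illustration (the function name and the use of "?" as the don't care symbol are assumptions, not notation from the text); it reports, as the automata do, the position of the last symbol of each occurrence:

```python
def match_with_dont_care(pattern, text, dc="?"):
    # The don't care symbol dc matches any symbol of the alphabet,
    # all other pattern symbols must match exactly.
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        if all(p == dc or p == t for p, t in zip(pattern, text[i:i + m])):
            hits.append(i + m)  # end position of the occurrence
    return hits
```

For pattern ab?d this finds both abcd and abed in a text, since the third position matches any symbol.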
Example 2.19
Let pattern P = p1p2◦p4 be given. We construct the SFORDO automaton
(approximate string matching of one full pattern using Hamming distance)
for Hamming distance k = 3. The transition diagram of the SFORDO
automaton is depicted in Fig. 2.25. Let us note that transitions labelled by
A \ A refer to transitions for an empty set of symbols, and may be removed.
Figure 2.25: Transition diagram of the nondeterministic finite automaton
for the SFORDO problem for pattern P = p1p2◦p4 (transitions labelled by
A \ A may be removed)
2.2.8 Matching a sequence of patterns
Matching a sequence of patterns is defined by Definition 1.16. The
nondeterministic finite automaton for matching a sequence of patterns is
constructed by making a cascade of automata for the patterns in the given
sequence. The construction of the nondeterministic finite automaton for a
sequence of patterns starts by constructing finite automata for matching all
elements of the sequence. The next operation (making a cascade) is the
insertion of ε-transitions from all final states of each automaton in the
sequence to the initial state of the next automaton in the sequence, if it
exists. The following algorithm describes this construction.
Algorithm 2.20
Construction of the ?????S automaton.
Input: Sequence of patterns P1(w1), P2(w2), . . . , Ps(ws), where P1, P2, . . . , Ps
are patterns and w1, w2, . . . , ws are specifications of their matching.
Output: ?????S automaton.
Method:
1. Construct an NFA Mi = (Qi, Ai, δi, q0i, Fi) for each pattern Pi(wi),
1 ≤ i ≤ s, s > 1, with respect to the specification wi.
2. Create automaton M = (Q, A, δ, q0, F) as a cascade of the automata in
this way:
Q = Q1 ∪ Q2 ∪ . . . ∪ Qs,
A = A1 ∪ A2 ∪ . . . ∪ As,
δ(q, a) = δi(q, a) for all q ∈ Qi, a ∈ Ai, i = 1, 2, . . . , s,
δ(q, ε) = {q0,i+1} for all q ∈ Fi, 1 ≤ i ≤ s − 1,
q0 = q01,
F = Fs. □
The main point of Algorithm 2.20 is the insertion of ε-transitions from all
final states of automaton Mi to the initial state of automaton Mi+1.
Example 2.21
Let the input to Algorithm 2.20 be the sequence abc, def, xyz. The resulting
SFOECS automaton is shown in Fig. 2.26. □
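The cascade operation of Algorithm 2.20 can be sketched as follows. The triple representation (δ, initial state, final states) of each automaton is our own assumption; the ε-transitions of the cascade are kept in a separate table, as is usual when δ is indexed by alphabet symbols only:

```python
def cascade(automata):
    # automata: list of (delta, initial, finals) triples with pairwise
    # disjoint state sets; delta maps (state, symbol) -> set of states.
    # Insert an epsilon-transition from every final state of automaton i
    # to the initial state of automaton i+1.
    eps = {}
    for (_, _, finals), (_, start2, _) in zip(automata, automata[1:]):
        for f in finals:
            eps.setdefault(f, set()).add(start2)
    delta = {}
    for d, _, _ in automata:
        delta.update(d)
    # initial state of the first automaton, final states of the last one
    return delta, automata[0][1], automata[-1][2], eps
```

For a two-element sequence the result has the shape required by step 2 of the algorithm: q0 = q01, F = Fs, and one ε-transition per final state of the first automaton.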
2.3 Some deterministic pattern matching automata
In this Section we will show some deterministic finite automata obtained
by determinising the nondeterministic finite automata of the previous
Section. A deterministic pattern matching automaton needs at most n steps
for pattern matching in the text T = t1t2 . . . tn. This means that the time
complexity of searching is linear (O(n)) for all problems in the classification
described in Section 1.2.
Figure 2.26: Transition diagram of the nondeterministic finite automaton
for matching the sequence of patterns P = abc(SFOECO), def(SFOECO),
xyz(SFOECO) (SFOECS automaton)
On the other hand, the use of deterministic finite automata has two
drawbacks:
1. The size of the pattern matching automaton depends on the cardinality
of the alphabet. Therefore this approach is suitable primarily for small
alphabets.
2. The number of states of a deterministic pattern matching automaton
can be much greater than the number of states of its nondeterministic
equivalent.
These drawbacks, the time and space complexity of the construction of a
deterministic finite automaton, are the price we have to pay for fast
searching. Several methods for simulating the original nondeterministic pattern
matching automata have been designed to overcome these drawbacks. They
will be discussed in the following Chapters.
2.3.1 String matching
The deterministic SFOECO finite automaton for the pattern P = p1p2 . . . pm
is the result of determinisation of the nondeterministic SFOECO automaton.
Example 2.22
Let us have pattern P = abab over alphabet A = {a, b}. Transition
diagrams of the SFOECO(abab) automaton and its deterministic equivalent
are depicted in Fig. 2.27. Transition tables of both automata are shown in
Table 2.1. The deterministic SFOECO automaton is a complete automaton
and has just m + 1 states for a pattern of length m. The number of steps
(number of transitions) made during pattern matching in a text of length n
is just n. □
Figure 2.27: Transition diagrams of the nondeterministic and deterministic
SFOECO automata for pattern P = abab from Example 2.22
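The determinisation of the SFOECO automaton can be reproduced by the subset construction directly on the pattern positions. The sketch below is our own illustration (state sets are frozensets of NFA states 0..m, with the self loop in state 0 built into the step function); it reconstructs exactly the deterministic table of Table 2.1:

```python
def sfoeco_dfa(pattern, alphabet):
    # Subset construction over the nondeterministic SFOECO automaton:
    # NFA state q means "the first q symbols of the pattern were matched";
    # state 0 carries the self loop for all symbols.
    m = len(pattern)
    def step(S, a):
        nxt = {0}  # self loop in the initial state
        for q in S:
            if q < m and pattern[q] == a:
                nxt.add(q + 1)
        return frozenset(nxt)
    start = frozenset({0})
    table, stack = {}, [start]
    while stack:
        S = stack.pop()
        if S in table:
            continue
        table[S] = {a: step(S, a) for a in alphabet}
        stack.extend(table[S].values())
    return table, start
```

For P = abab the construction yields the five states 0, 01, 02, 013, 024 of Table 2.1, confirming that the deterministic automaton has m + 1 states.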
2.3.2 Matching of a finite set of patterns
The deterministic SFFECO automaton for matching a set of patterns
S = {P1, P2, . . . , Ps} is the result of the determinisation of the nondeterministic
SFFECO automaton.
Example 2.23
Let us have the set of patterns S = {ab, bb, babb} over alphabet A = {a, b}.
Transition diagrams of the SFFECO(S) automaton and its deterministic
equivalent are depicted in Fig. 2.28. Transition tables of both automata are
shown in Table 2.2. □
The deterministic pattern matching automaton for a finite set of patterns
has at most |S| + 1 states, where |S| = Σ_{i=1}^{s} |Pi| for S = {P1, P2, . . . , Ps}.
The maximum number of states (equal to |S| + 1) is reached in the case
when no two patterns have a common prefix. In Example 2.23 it holds that
|S| + 1 = 9, and patterns bb and babb have the common prefix b. Therefore
the number of states is 8, which is less than 9.
      a     b
0    0,1    0
1     -     2
2     3     -
3     -     4
4     -     -
a) Nondeterministic SFOECO(abab) automaton

        a     b
0      01     0
01     01    02
02     013    0
013    01    024
024    013    0
b) Deterministic SFOECO(abab) automaton

Table 2.1: Transition tables of SFOECO(abab) automata from Example 2.22
      a      b
0    0,1   0,3,7
1     -      2
2     -      -
3     4      -
4     -      5
5     -      6
6     -      -
7     -      8
8     -      -
a) Nondeterministic SFFECO({ab, bb, babb}) automaton

          a      b
0        01     037
01       01     0237
037      014    0378
0237     014    0378
014      01     02357
0378     014    0378
02357    014    03678
03678    014    0378
b) Deterministic SFFECO({ab, bb, babb}) automaton

Table 2.2: Transition tables of SFFECO({ab, bb, babb}) automata from
Example 2.23
Figure 2.28: Transition diagrams of the nondeterministic and deterministic
SFFECO automata for S = {ab, bb, babb} from Example 2.23
2.3.3 Regular expression matching
The deterministic finite automaton for matching an infinite set of patterns
is the result of the determinisation of the SFIECO automaton.
Example 2.24
Let us have regular expression R = ab*c + bc over alphabet A = {a, b, c}
(see also Example 2.16). Transition diagrams of the SFIECO(R) automaton
and its deterministic equivalent are depicted in Fig. 2.29.
Figure 2.29: Transition diagrams of the nondeterministic and deterministic
SFIECO automata for R = ab*c + bc from Example 2.24
Transition tables of both automata are shown in Table 2.3. □
The space complexity of matching a regular expression can vary from
linear to exponential. An example of linear space complexity is matching
of a regular expression describing a language containing one string. The
exponential space complexity is reached, for example, for the expression
R = a(a + b)^(m−1).
      a     b     c
0    0,1   0,2    0
1     -     1     3
2     -     -     3
3     -     -     -
a) Nondeterministic SFIECO(ab*c + bc) automaton

        a     b     c
0      01    02     0
01     01    012   03
02     01    02    03
012    01    012   03
03     01    02     0
b) Deterministic SFIECO(ab*c + bc) automaton

Table 2.3: Transition tables of SFIECO(ab*c + bc) automata from
Example 2.24
      a     b
0    0,1    0
1     2     2
2     3     3
3     -     -
a) Nondeterministic SFIECO(a(a + b)(a + b)) automaton

         a      b
0       01      0
01      012    02
012     0123   023
0123    0123   023
02      013    03
023     013    03
013     012    02
03      01      0
b) Deterministic SFIECO(a(a + b)(a + b)) automaton

Table 2.4: Transition tables of SFIECO(a(a + b)(a + b)) automata from
Example 2.25
Example 2.25
Let us show an example of a deterministic regular expression matching
automaton for R = a(a + b)(a + b) over alphabet A = {a, b}. Transition
diagrams of the SFIECO(a(a + b)(a + b)) automaton and its deterministic
equivalent are depicted in Fig. 2.30. Transition tables of both automata are
shown in Table 2.4. We can see that the resulting deterministic automaton
has 2^3 = 8 states. □
Figure 2.30: Transition diagrams of the nondeterministic and deterministic
SFIECO automata for R = a(a + b)(a + b) from Example 2.25
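The exponential blowup for R = a(a + b)^(m−1) can be verified experimentally. The sketch below is our own illustration: it encodes the matching NFA for this expression (states 0..m with the self loop in state 0) and counts the states produced by the subset construction:

```python
def count_dfa_states(m, alphabet=("a", "b")):
    # NFA for R = a(a+b)^(m-1): state 0 moves to 1 only on 'a',
    # states 1..m-1 advance on any symbol; self loop in state 0.
    def step(S, c):
        nxt = {0}  # self loop in the initial state
        for q in S:
            if q < m and (q > 0 or c == "a"):
                nxt.add(q + 1)
        return frozenset(nxt)
    seen, stack = set(), [frozenset({0})]
    while stack:
        S = stack.pop()
        if S in seen:
            continue
        seen.add(S)
        for c in alphabet:
            stack.append(step(S, c))
    return len(seen)
```

For m = 3 this reproduces the 2^3 = 8 states of Example 2.25; each NFA state i ∈ {1, . . . , m} is active exactly when the i-th last text symbol is a, and these m conditions are independent, which is why all 2^m subsets arise.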
2.3.4 Approximate string matching: Hamming distance
The deterministic finite automaton for approximate string matching using
Hamming distance is the result of the determinisation of the SFORCO
automaton.
Example 2.26
Let us have pattern P = aba over alphabet A = {a, b} and Hamming
distance k = 1. Transition diagrams of the SFORCO(aba, 1) automaton
and its deterministic equivalent are depicted in Fig. 2.31. Transition tables
of both automata are shown in Table 2.5. □
Figure 2.31: Transition diagrams of the nondeterministic and deterministic
SFORCO(aba, 1) automata from Example 2.26
      a     b
0    0,1   0,4
1     5     2
2     3     6
3     -     -
4     -     5
5     6     -
6     -     -
a) Nondeterministic SFORCO(aba, 1) automaton

         a      b
0       01     04
01      015    024
024     013    0456
013     015    024
04      01     045
015     0156   024
045     016    045
0456    016    045
0156    0156   024
016     015    024
b) Deterministic SFORCO(aba, 1) automaton

Table 2.5: Transition tables of the nondeterministic and deterministic
SFORCO(aba, 1) automata from Example 2.26
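Instead of determinising, the nondeterministic SFORCO automaton can also be simulated directly by maintaining the set of active states. In the sketch below (our own illustration) an NFA state is encoded as a pair (depth, errors), matching the levels of the SFORCO automaton; the function reports the end positions of approximate occurrences:

```python
def hamming_nfa_search(pattern, text, k):
    # Simulate the SFORCO NFA: state (d, e) means d pattern symbols
    # consumed with e mismatches so far; (0, 0) carries the self loop.
    m = len(pattern)
    active = {(0, 0)}
    out = []
    for i, c in enumerate(text, 1):
        nxt = {(0, 0)}  # self loop in the initial state
        for (d, e) in active:
            if d < m:
                if c == pattern[d]:
                    nxt.add((d + 1, e))      # match transition
                elif e < k:
                    nxt.add((d + 1, e + 1))  # replace transition
        active = nxt
        if any(d == m for (d, e) in active):
            out.append(i)  # a final state is active
    return out
```

For pattern aba and k = 1 this finds abb (one mismatch) but rejects aab (two mismatches), in agreement with the automaton of Example 2.26.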
2.3.5 Approximate string matching: Levenshtein distance
The deterministic finite automaton for approximate string matching using
Levenshtein distance is the result of the determinisation of the SFODCO
automaton.
Example 2.27
Let us have pattern P = aba over alphabet A = {a, b} and Levenshtein
distance k = 1. Transition diagrams of the SFODCO(aba, 1) automaton
and its deterministic equivalent are depicted in Fig. 2.32. Transition tables
of both automata are shown in Table 2.6. □
2.4 The state complexity of the deterministic pattern matching
automata
The states of a SFFECO automaton (dictionary matching automaton, finite
set of strings automaton) correspond to the prefixes of the strings in a finite
set U. To formalize this automaton, a mapping hU is defined for each
language U as follows:
hU(v) = the longest suffix of v that belongs to Pref(U),
for each v ∈ A∗.
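The mapping hU is easy to compute directly from its definition. The sketch below is our own illustration (the function names pref and h are assumptions, not notation from the text); it scans the suffixes of v from the longest to the shortest and returns the first one found in Pref(U):

```python
def pref(U):
    # all prefixes of all words of U, including the empty string
    P = {""}
    for w in U:
        for i in range(1, len(w) + 1):
            P.add(w[:i])
    return P

def h(U, v):
    # h_U(v): the longest suffix of v that belongs to Pref(U);
    # always defined, since the empty string is a prefix of every word
    P = pref(U)
    for i in range(len(v) + 1):
        if v[i:] in P:
            return v[i:]
```

For U = {aba, aab, bab}, reading abab gives h(U, "abab") = bab: the state of the dictionary matching automaton after abab corresponds to the prefix bab.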
Figure 2.32: Transition diagrams of the nondeterministic and deterministic
SFODCO(aba, 1) automata from Example 2.27
Definition 2.28
Strings v and w are denoted by uw−1 and v−1u, respectively, when u = vw.
The following Theorem 2.29, adopted from [CH97b], is necessary for the
dictionary matching automata construction.
Theorem 2.29
Let U ⊆ A∗. Then
1. for each v ∈ A∗: v ∈ A∗U if and only if hU(v) ∈ A∗U,
2. hU(ε) = ε,
3. for each v ∈ A∗, a ∈ A: hU(va) = hU(hU(v)a).
      a     b     ε
0    0,1   0,4    4
1    4,5   2,4    5
2    3,5   5,6    6
3     -     -     -
4     -     5     -
5     6     -     -
6     -     -     -
a) Nondeterministic SFODCO(aba, 1) automaton

          a       b
0        01      045
01       01456   0245
045      016     045
01456    01456   0245
0245     01356   0456
01356    01456   0245
0456     016     045
016      01456   0245
b) Deterministic SFODCO(aba, 1) automaton

Table 2.6: Transition tables of SFODCO(aba, 1) automata from
Example 2.27
Proof
If v ∈ A∗U, then v is of the form wu, where w ∈ A∗ and u ∈ U. By the
definition of hU, u is necessarily a suffix of hU(v); therefore hU(v) ∈ A∗U.
Conversely, if hU(v) ∈ A∗U, we have also v ∈ A∗U, because hU(v) is a suffix
of v. This proves (1).
Property (2) clearly holds.
It remains to prove (3). Both words hU(va) and hU(v)a are suffixes of
va, and therefore one of them is a suffix of the other. Two cases are
distinguished according to which word is a suffix of the other.
First case: hU(v)a is a proper suffix of hU(va) (hence hU(va) ≠ ε).
Consider the word w defined by w = hU(va)a−1. Thus we have: hU(v)
is a proper suffix of w, w is a suffix of v, and w ∈ Pref(U). Since w is a
suffix of v that belongs to Pref(U) but is strictly longer than hU(v), there
is a contradiction with the maximality of |hU(v)|, so this case is impossible.
Second case: hU(va) is a suffix of hU(v)a. Then hU(va) is a suffix of
hU(hU(v)a). Since hU(v)a is a suffix of va, hU(hU(v)a) is a suffix of hU(va).
Both properties imply hU(va) = hU(hU(v)a), and the expected result
follows. □
Now the dictionary matching automaton can be constructed according
to Theorem 2.30, borrowed from [CH97b].
Theorem 2.30
Let X be a finite language. Then the automaton M = (Q, A, δ, q0, F), where
Q = {qx | x ∈ Pref(X)},
q0 = qε,
δ(qp, a) = q_{hX(pa)} for p ∈ Pref(X), a ∈ A,
F = {qx | x ∈ Pref(X) ∩ A∗X},
recognizes the language A∗X. This automaton is deterministic and complete.
Proof
Let v ∈ A∗. It follows from properties (2) and (3) of Theorem 2.29 that
after reading v the automaton will be in the state q_{hX(v)}. If v ∈ A∗X, it
must hold hX(v) ∈ A∗X by (1) of Theorem 2.29, which shows that q_{hX(v)}
is a final state, and finally that v is recognized by the automaton.
Conversely, if v is recognized by the automaton, we have hX(v) ∈ A∗X
by definition of the automaton. This implies that v ∈ A∗X by (1) of
Theorem 2.29 again. □
The transition diagram of an example of a dictionary matching automaton
is shown in Figure 2.33.
Figure 2.33: Transition diagram of the dictionary matching automaton for
language {aba, aab, bab}
2.4.1 Construction of a dictionary matching automaton
The first method of dictionary matching automata construction follows
directly from Theorem 2.30. But, as will be shown in this section, dictionary
matching automata can be built using standard algorithms. This method
is described in Algorithm 2.31.
Algorithm 2.31
Construction of a dictionary matching automaton for a given finite language.
Input: Finite language X.
Output: Dictionary matching automaton accepting language A∗X.
Method:
1. Create a tree-like finite automaton M accepting language X (using
Algorithm 2.32).
2. Add a self loop δ(q0, a) = δ(q0, a) ∪ {q0} for each a ∈ A to M.
3. Using the subset construction (see Algorithm 1.40), make a deterministic
dictionary matching automaton accepting language A∗X. □
Algorithm 2.32
Construction of a deterministic automaton accepting a set of strings.
Input: Finite language X.
Output: Deterministic finite automaton M = (Q, A, δ, q0, F) accepting
language X.
Method:
1. Q = {qx | x ∈ Pref(X)},
2. q0 = qε,
3. δ(qp, a) = qpa in case pa ∈ Pref(X), and is undefined otherwise,
4. F = {qp | p ∈ X}. □
Example 2.33
Let us create a dictionary matching automaton for language {aba,
aab, bab} to illustrate this algorithm. The outcome of step (2) of
Algorithm 2.31 is shown in Figure 2.34, and the result of the whole
algorithm is shown in Figure 2.35. □
What is the result of this algorithm? As shown in Theorem 2.34,
automata created according to Theorem 2.30 are equivalent to automata
created according to Algorithm 2.31.
Theorem 2.34
Given a finite language X, the finite automaton M1 = (Q1, A, δ1, qε, F1)
accepting language A∗X created by Algorithm 2.31 is equivalent to the finite
automaton M2 = (Q2, A, δ2, qε, F2) accepting the same language but created
by Theorem 2.30.
Figure 2.34: Transition diagram of the nondeterministic dictionary matching
automaton for the language {aba, aab, bab}
Proof
The first step is to show that after reading a string w automaton M1 will
be in state q = δ∗1(qε, w) = {qx | x ∈ Suff(w) ∩ Pref(X)}. Let us remind
the reader that the deterministic automaton M1 was created from the
nondeterministic automaton M′1 = (Q′1, A, δ′1, qε, F′1) by the subset
construction. As follows from the subset construction algorithm, after reading
string w automaton M1 will be in state q = δ∗1(qε, w) = δ′1∗(qε, w). Thus
it is enough to show that δ′1∗(qε, w) = {qx | x ∈ Suff(w) ∩ Pref(X)}. Given
a string v ∈ Suff(w) ∩ Pref(X), w can be written as uv, u ∈ A∗. Thus,
there is a sequence of moves (qε, uv) ⊢|u| (qε, v) ⊢|v| (qv, ε) in M′1, so
qv ∈ δ′1∗(qε, w). Conversely, consider qv ∈ δ′1∗(qε, w). Since M′1 can read
arbitrarily long words only by using the self loop for all a ∈ A in the initial
state, the sequence of moves must be as follows: (qε, w) ⊢∗ (qε, v) ⊢|v| (qv, ε)
in M′1. Consequently v ∈ Suff(w) ∩ Pref(X).
Since the set of states Q1 ⊆ P({qx | x ∈ Pref(X)}), it is possible to
define an isomorphism f|Q1 : P({qx | x ∈ Pref(X)}) → {qx | x ∈ Pref(X)}
as follows:
f(q) = qw, where qw ∈ q and |w| is maximal.
Now it is necessary to show that
1. ∀q1, q2 ∈ Q1: q1 ≠ q2 ⇒ f(q1) ≠ f(q2),
2. ∀p ∈ Q2 ∃q ∈ Q1: f(q) = p,
3. f(q10) = q20,
4. f(δ1(q, a)) = δ2(f(q), a),
5. f(F1) = F2.
Figure 2.35: Transition diagram of the deterministic dictionary matching
automaton for language {aba, aab, bab} created from the nondeterministic
one by the subset construction
Let us suppose p = δ∗1(qε, u), q = δ∗1(qε, v), p ≠ q, and f(p) = f(q).
But from the definition of f, it must hold that
p = {qx | x ∈ Suff(u) ∩ Pref(X)} = {qx | qy = f(p), x ∈ Suff(y) ∩ Pref(X)}
and
q = {qx | x ∈ Suff(v) ∩ Pref(X)} = {qx | qy = f(q), x ∈ Suff(y) ∩ Pref(X)},
which implies p = q. This is a contradiction, which proves (1).
Property (2) holds because for all u ∈ Pref(X), qu ∈ δ∗1(qε, u) =
{qx | x ∈ Suff(u) ∩ Pref(X)}. Thus f(δ∗1(qε, u)) = qu.
Property (3) clearly holds.
It will be shown that f(δ∗1(qε, u)) = δ∗2(qε, u), which implies property
(4). It is known that f(δ∗1(qε, u)) = f({qx | x ∈ Suff(u) ∩ Pref(X)})
and δ∗2(qε, u) = q_{hX(u)}. Thus, the proposition holds from the definitions
of f and hX.
Property (5) clearly holds from the previous properties, and thus both
automata accept the same language. □
The main consequence of the previous Theorem is that during the
transformation of a nondeterministic tree-like automaton with the self loop
for all a ∈ A in the initial state to the deterministic one, the number of
states does not increase.
But it is possible to show more. It is easy to see that a deterministic
dictionary matching automaton accepting language A∗X can be built from
any acyclic automaton accepting language X by the last two steps of
Algorithm 2.31. As can be seen in the next Theorem, the number of states
of an automaton created in such a way cannot be greater than the number
of states of a finite automaton accepting the same language and created by
Algorithm 2.31 from scratch.
Theorem 2.35
Given an acyclic automaton M = (Q, A, δ, q0, F) accepting language X, the
finite automaton M1 = (Q1, A, δ1, q10, F1) accepting language A∗X created
by the last two steps of Algorithm 2.31 contains at most the same number
of states as the finite automaton M2 = (Q2, A, δ2, q20, F2) accepting the same
language and created by Algorithm 2.31 from scratch.
Proof
Let us remind that the deterministic automaton M1 was created from the
nondeterministic automaton M′ = (Q, A, δ′, q0, F) by the subset
construction. The set of active states of automaton M′ after reading u ∈ A∗ is
δ′∗(q0, u), which is equal to δ′∗(q0, hX(u)). Let us denote the active state of
automaton M1 after reading u by q. The subset construction ensures that
δ′∗(q0, u) = q. So, in the worst case for all q ∈ Q1 it holds that q = δ′∗(q0, u),
u ∈ Pref(X), which completes the proof. □
2.4.2 Approximate string matching
In order to prove the upper bound of the state complexity of deterministic
finite automata for approximate string matching, it is necessary to limit the
number of states of the dictionary matching automata accepting language
A∗X with respect to the size of language X.
Theorem 2.36
Given an acyclic finite automaton M accepting language X, the number of
states of the deterministic dictionary matching automaton created from M
is O(Σ_{w∈X} |w|).
Proof
Because of Theorem 2.35, the number of states of the deterministic
dictionary matching automaton created in this way is at most the same as the
number of states of a tree-like finite automaton accepting X, whose number
of states is in the worst case equal to 1 + Σ_{w∈X} |w|. □
2.4.2.1 Hamming distance  At first, it is necessary to define the finite
automaton for approximate string matching using Hamming distance. It
is the Hamming automaton M(A∗Hk(p)) accepting language A∗Hk(p),
where Hk(p) = {u | u ∈ A∗, DH(u, p) ≤ k}, for the given pattern p ∈ A∗ and
the number of allowed errors k ≥ 1.
Since Hk(p) is a finite language, it is possible to estimate the number
of states of the deterministic automaton M(A∗Hk(p)) using Theorem 2.36.
The only concern is to compute the size of the language Hk(p).
Theorem 2.37
The number of strings generated by at most k replace operations from the
pattern p = p1p2 . . . pm is O(|A|^k m^k).
Proof
The set of strings created by exactly i (0 ≤ i ≤ k) replace operations is
made by replacing exactly i symbols of p by other symbols. There are C(m, i)
possibilities for choosing i symbols from m. Each chosen symbol can be
replaced by |A| − 1 symbols, so the number of generated strings is at most
C(m, i)(|A| − 1)^i = (|A| − 1)^i O(m^i) = O(|A|^i m^i),
because C(m, i) = O(m^i). The set of strings created by at most k replace
operations is the union of the above mentioned sets of strings. Thus, the
cardinality of this set is
Σ_{i=0}^{k} O(|A|^i m^i) = O(|A|^k m^k). □
Since the number of strings generated by the replace operation is now
known, it is possible to estimate the number of states of the deterministic
Hamming automaton.
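The counting argument of Theorem 2.37 can be checked by brute force for small parameters. The sketch below (our own illustration) enumerates Hk(p) exactly and compares its size with the sum Σ_{i=0}^{k} C(m, i)(|A| − 1)^i used in the proof:

```python
from itertools import combinations, product
from math import comb

def hamming_ball(p, alphabet, k):
    # all strings u with |u| = |p| and Hamming distance D_H(u, p) <= k
    out = set()
    m = len(p)
    for i in range(k + 1):
        for pos in combinations(range(m), i):
            # at each chosen position, replace by one of the |A| - 1
            # symbols different from the pattern symbol
            choices = [[c for c in alphabet if c != p[j]] for j in pos]
            for repl in product(*choices):
                u = list(p)
                for j, c in zip(pos, repl):
                    u[j] = c
                out.add("".join(u))
    return out
```

For p = aba over A = {a, b} and k = 1 the ball contains exactly C(3, 0) + C(3, 1)·1 = 4 strings: aba, bba, aaa, abb.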
Theorem 2.38
The number of states of deterministic finite automaton M(A∗Hk(p)), p =
p1p2 . . . pm, is O(|A|^k m^(k+1)).
Proof
As shown in Theorem 2.36, the number of states of this automaton is at
most the same as the size of language Hk(p). As for all u ∈ Hk(p) it holds
that |u| = m, the size of language Hk(p) is
O(Σ_{u∈Hk(p)} |u|) = O(m |A|^k m^k) = O(|A|^k m^(k+1)). □
2.4.2.2 Levenshtein distance  The same approach as in Section 2.4.2.1
can be used to bound the number of states of a deterministic automaton
for approximate string matching using Levenshtein distance. It is the
Levenshtein automaton M(A∗Lk(p)) accepting language A∗Lk(p), where
Lk(p) = {u | u ∈ A∗, DL(u, p) ≤ k}, for the given pattern p ∈ A∗ and the
number of allowed errors k ≥ 1.
So, the number of strings generated from the pattern by the insert and
delete operations is to be found.
Theorem 2.39
The number of strings generated by at most k insert operations from the
pattern p = p1p2 . . . pm is O(|A|^k m^k).
Proof
Imagine that all the above mentioned strings are generated from the empty
string by sequential symbol addition. Then the number of strings generated
by exactly i insert operations can be transformed to the number of tours
that can be found in a chessboard-like oriented graph from position (0, 0)
to position (m, i), multiplied by |A|^i (it is possible to insert an arbitrary
symbol from the alphabet). Position (0, 0) represents the empty string,
position (m, i) represents pattern p with i inserted symbols, and each move
of the tour represents the addition of one symbol to the generated string.
Move (x, y) → (x + 1, y) represents the addition of the (x + 1)-st symbol of
the pattern, while move (x, y) → (x, y + 1) represents the addition of an
inserted symbol to the pattern. An example of this graph is shown in
Figure 2.36. The number in each node represents the number of tours
leading to the node.
Figure 2.36: Chessboard-like oriented graph representing strings generated
by the insert operation
The number of tours c_{x,y} can be defined by the recursive formula
c_{x,0} = 1 for 0 ≤ x,
c_{0,y} = 1 for 0 ≤ y,
c_{x,y} = c_{x−1,y} + c_{x,y−1} for 1 ≤ x, 1 ≤ y.
Because of the Pascal triangle method of binomial number computation, it
is clear that
c_{x,y} = C(x + y, min(x, y)).
In order to continue, it is possible to use the following equation:
C(x + z, z) = ((x + 1)(x + 2) · · · (x + z − 1)(x + z)) / (1 · 2 · · · (z − 1) · z)
= ((x + 1)/1) · ((x + 2)/2) · · · ((x + (z − 1))/(z − 1)) · ((x + z)/z)
= (x/1 + 1) · (x/2 + 1) · · · (x/(z − 1) + 1) · (x/z + 1).
In the case that x ≥ 2, all factors but the first one are smaller than (or equal
to) x. Thus
C(x + z, z) ≤ (x + 1) x^(z−1) = x^z + x^(z−1) = O(x^z).
As the number of allowed errors is smaller than the length of the pattern,
i < m,
c_{m,i} = C(m + i, min(m, i)) = C(m + i, i) = O(m^i).
Thus, the number of strings generated by exactly i insert operations is
O(|A|^i m^i). Since the set of strings generated by at most k insert
operations is the union of the above mentioned sets of strings, the cardinality of
this set is
Σ_{i=0}^{k} O(|A|^i m^i) = O(|A|^k m^k). □
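The recurrence for the number of tours c_{x,y} can be computed by a small dynamic program, which also confirms the closed form c_{x,y} = C(x + y, min(x, y)) used in the proof. The sketch below is our own illustration:

```python
from math import comb

def tours(m, i):
    # c[x][y] = number of monotone lattice paths from (0,0) to (x,y),
    # computed by the recurrence c[x][y] = c[x-1][y] + c[x][y-1]
    # with c[x][0] = c[0][y] = 1.
    c = [[1] * (i + 1) for _ in range(m + 1)]
    for x in range(1, m + 1):
        for y in range(1, i + 1):
            c[x][y] = c[x - 1][y] + c[x][y - 1]
    return c[m][i]
```

For instance, tours(4, 2) = C(6, 2) = 15, matching the Pascal-triangle observation in the proof.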
Theorem 2.40
The number of strings generated by at most k delete operations from pattern
p = p1p2 . . . pm is O(m^k).
Proof
The sets of strings generated by exactly i delete operations consist of strings
that are made by deleting exactly i symbols from p. There are C(m, i)
possibilities for choosing i symbols from m, so the number of such strings is
at most C(m, i) = O(m^i). Since the set of strings generated by at most k
delete operations is the union of the above mentioned sets, the number of
strings within this set is
Σ_{i=0}^{k} O(m^i) = O(m^k). □
Now the number of strings generated by each edit operation is known,
so it is possible to estimate the number of strings generated by these
operations all at once.
Theorem 2.41
The number of strings generated by at most k replace, insert, and delete
operations from pattern p = p1p2 . . . pm is O(|A|^k m^k).
Proof
The number of strings generated by at most k replace, insert, and delete
operations can be computed as a sum of the numbers of strings generated
by these operations for exactly i errors allowed, 0 ≤ i ≤ k. Such strings are
generated by a combination of the above mentioned operations, so the
number of generated strings is
Σ_{x=0}^{k} Σ_{y=0}^{k−x} Σ_{z=0}^{k−x−y} O(|A|^x m^x) O(|A|^y m^y) O(m^z)
(the three factors count the replacement of x symbols, the insertion of y
symbols, and the deletion of z symbols, respectively)
= Σ_{x=0}^{k} Σ_{y=0}^{k−x} Σ_{z=0}^{k−x−y} O(|A|^(x+y) m^(x+y+z)) = O(|A|^k m^k). □
The last step is to bound the number of states of the deterministic
Levenshtein automaton.
Theorem 2.42
The number of states of deterministic finite automaton M(A∗Lk(p)), p =
p1p2 . . . pm, is O(|A|^k m^(k+1)).
Proof
The proof is the same as for Theorem 2.38. Since for all u ∈ Lk(p) it holds
that |u| ≤ m + k, the size of language Lk(p) is
O(Σ_{u∈Lk(p)} |u|) = O((m + k)|A|^k m^k) = O(|A|^k m^(k+1)). □
2.4.2.3 Generalized Levenshtein distance  It is clear that the concepts
from Sections 2.4.2.1 and 2.4.2.2 can be used also for the generalized
Levenshtein distance.
The finite automaton for approximate string matching using the
generalized Levenshtein distance M(A∗Gk(p)) will be called the finite automaton
accepting language A∗Gk(p), where Gk(p) = {u | u ∈ A∗, DG(u, p) ≤ k}, for
the given pattern p ∈ A∗ and the number of allowed errors k ≥ 1.
The only unexplored operation is the transpose operation.
Theorem 2.43
The number of strings generated by at most k transpose operations from
pattern p = p1p2 . . . pm is O(m^k).
Proof
The strings generated by exactly i transpose operations are made by
transposing exactly i pairs of adjacent symbols of p. There are fewer than
C(m − 1, i) possibilities for choosing i pairs from m symbols for the
transpose operation, when each symbol can participate in only one pair. The
number of such generated strings is at most C(m − 1, i) = O(m^i). Since the
set of strings generated by at most k transpose operations is the union of
the above mentioned sets of strings, its cardinality is
Σ_{i=0}^{k} O(m^i) = O(m^k). □
2
Now, the number of strings generated by all operations dened by gen-
eralized Levenshtein distance is to be found.
Theorem 2.44
The number of strings generated by at most k replace, insert, delete, and
transpose operations from the pattern p = p
1
p
2
. . . p
m
is O
_
[A[
k
m
k
_
.
Proof
The number of strings generated by at most k replace, insert, delete, and
transpose operations can be computed as a sum of the numbers of strings
generated by these operations for exactly i errors allowed for 0 i k. Such
strings are generated by a combination of above mentioned operations. So
the number of generated strings is
k

w=0
kw

x=0
kwx

y=0
kwxy

z=0
O([A[
w
m
w
)
. .
replace w
symbols
O([A[
x
m
x
)
. .
insert x
symbols
O(m
y
)
. .
delete y
symbols
O(m
z
)
. .
transpose z
pairs
=
=
k

w=0
kw

x=0
kwx

y=0
kwxy

z=0
O
_
[A[
w+x
m
w+x+y+z
_
=
= O
_
[A[
k
m
k
_
2
Finally, the number of states of the deterministic finite automaton will be found.

Theorem 2.45
The number of states of the deterministic finite automaton M(A*G_k(p)), p = p_1 p_2 ... p_m, is O(|A|^k m^{k+1}).

Proof
The proof is the same as for Theorem 2.38. Since |u| ≤ m + k holds for all u ∈ G_k(p), the size of the language G_k(p) is
$O(\sum_{u \in G_k(p)} |u|) = O((m+k) |A|^k m^k) = O(|A|^k m^{k+1})$. □
2.4.2.4 Γ distance
The finite automaton for approximate string matching using the Γ distance, M(A*Γ_k(p)), will be called the finite automaton accepting the language A*Γ_k(p), where A is an ordered alphabet and Γ_k(p) = {u | u ∈ A*, D_Γ(u, p) ≤ k} for the given pattern p ∈ A* and the number of allowed errors k ≥ 1.
Since the same approach as in the previous sections will be used, it is necessary to compute the cardinality of the set Γ_k(p). In order to do that, two auxiliary lemmas must first be proved.
Theorem 2.46
For all i ≥ 1, j ≥ 1: (j−1)^{i+1} + (j−1)^i + j^i ≤ j^{i+1}.

Proof
By induction on i. The claim holds for i = 1:
(j−1)^2 + (j−1) + j = j^2 − 2j + 1 + j − 1 + j = j^2 ≤ j^2.
Assume the claim holds for some i ≥ 1. Then for i + 1:
(j−1)^{i+2} + (j−1)^{i+1} + j^{i+1}
= (j−1)(j−1)^{i+1} + (j−1)(j−1)^i + j·j^i
≤ j(j−1)^{i+1} + j(j−1)^i + j·j^i        (as j−1 ≤ j)
= j((j−1)^{i+1} + (j−1)^i + j^i)
≤ j·j^{i+1} = j^{i+2},                   (by the induction assumption)
which completes the proof. □
Theorem 2.47
For all i ≥ 3, j ≥ 2: $\sum_{x=0}^{i} (j-1)^x + \sum_{x=0}^{i-1} (j-1)^x \le j^i$.

Proof
By induction on i. The claim holds for i = 3:
$\sum_{x=0}^{3} (j-1)^x + \sum_{x=0}^{2} (j-1)^x = 2(j-1)^0 + 2(j-1)^1 + 2(j-1)^2 + (j-1)^3$
= 2 + 2j − 2 + 2j^2 − 4j + 2 + j^3 − 3j^2 + 3j − 1
= j^3 − j^2 + j + 1
≤ j^3.                                   (since j ≥ 2)
Assume the claim holds for some i ≥ 3. Then for i + 1:
$\sum_{x=0}^{i+1} (j-1)^x + \sum_{x=0}^{i} (j-1)^x = (j-1)^{i+1} + (j-1)^i + \sum_{x=0}^{i} (j-1)^x + \sum_{x=0}^{i-1} (j-1)^x$
≤ (j−1)^{i+1} + (j−1)^i + j^i            (by the induction assumption)
≤ j^{i+1},                               (by Theorem 2.46)
which completes the proof. □
Theorem 2.48
The number of strings generated from the pattern p = p_1 p_2 ... p_m with at most k allowed errors in the Γ distance is O(m^k).

Proof
The number of strings generated from the pattern p of length m for exactly k errors can be computed as the number of different paths in the transition diagram of an automaton M(Γ_k(p)) leading from the initial state q_{0,0} to the final state q_{k,m}, where the transition diagram of the automaton M(Γ_k(p)) is the same as the transition diagram of the automaton M(A*Γ_k(p)) (shown in Figure REF) without the self loop for the whole alphabet in the initial state q_{0,0} (δ(q_{0,0}, a), a ∈ A). The number of these paths can be computed by the following recurrent formula:
c_{i,j} = 1, if i = 0 and j = 0,
c_{i,j} = 0, if i > 0 and j = 0,
c_{i,j} = c_{i,j−1} + 2 $\sum_{x=0}^{i-1}$ c_{x,j−1}, otherwise.
Let us show, in several steps, that c_{i,j} ≤ 2j^i.
i = 0, j > 0: c_{0,j} = c_{0,j−1} = c_{0,j−2} = ... = 1 ≤ 2·j^0.
i = 1, j > 0: By induction on j. The claim holds for j = 1 because c_{1,1} = 2 ≤ 2·1^1. Assume it holds for some j > 0. Then for j + 1:
c_{1,j+1} = c_{1,j} + 2·c_{0,j} ≤ 2j + 2 = 2(j+1) ≤ 2(j+1)^1,
using the induction assumption and the fact that c_{0,j} = 1.
i = 2, j > 0: By induction on j. The claim holds for j = 1 because c_{2,1} = 2 ≤ 2·1^2. Assume it holds for some j ≥ 1. Then for j + 1:
c_{2,j+1} = c_{2,j} + 2·c_{1,j} + 2·c_{0,j} ≤ 2j^2 + 2·2j + 2·1 = 2(j^2 + 2j + 1) = 2(j+1)^2,
by the induction assumption.
i ≥ 3, j = 1: c_{i,1} = c_{i,0} + 2·c_{i−1,0} + 2·c_{i−2,0} + ... + 2·c_{0,0} = 2.
i ≥ 3, j ≥ 2: By induction on j. The previous step shows that the claim holds for j − 1 = 1. Then for j:
c_{i,j} = c_{i,j−1} + 2 $\sum_{x=0}^{i-1}$ c_{x,j−1}
≤ 2(j−1)^i + 2 $\sum_{x=0}^{i-1}$ 2(j−1)^x         (by the induction assumption)
= 2($\sum_{x=0}^{i}$ (j−1)^x + $\sum_{x=0}^{i-1}$ (j−1)^x)
≤ 2j^i.                                            (by Theorem 2.47)
Thus the number of strings generated from a pattern of length m with exactly i allowed errors in the Γ distance is O(m^i). Since the set of strings generated from a pattern of length m with at most k allowed errors is the union of the above mentioned sets of strings, its cardinality is
$\sum_{i=0}^{k} O(m^i) = O(m^k)$. □
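The recurrence from the proof is easy to evaluate mechanically. The sketch below (illustrative Python, not part of the original text; the function name path_count is an assumption) tabulates c_{i,j} and verifies the bound c_{i,j} ≤ 2j^i established in the proof:

```python
def path_count(k, m):
    """Tabulate c[i][j], the number of paths in the Gamma-distance
    automaton grid, following the recurrence from the proof of
    Theorem 2.48: c[i][j] = c[i][j-1] + 2 * sum_{x<i} c[x][j-1]."""
    c = [[0] * (m + 1) for _ in range(k + 1)]
    c[0][0] = 1                      # the only path of length 0
    for j in range(1, m + 1):
        for i in range(k + 1):
            c[i][j] = c[i][j - 1] + 2 * sum(c[x][j - 1] for x in range(i))
    return c

c = path_count(4, 10)
for i in range(5):
    for j in range(1, 11):
        assert c[i][j] <= 2 * j ** i   # the bound proved above
print(c[2][5])
```

For i = 2 the bound is in fact tight: c_{2,j} works out to exactly 2j^2, which is why the constant 2 in the bound cannot be improved.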
Finally, the number of states of the deterministic finite automaton will be found.

Theorem 2.49
The number of states of the deterministic finite automaton M(A*Γ_k(p)), p = p_1 p_2 ... p_m, is O(m^{k+1}).

Proof
The proof is the same as for Theorem 2.38. Since |u| = m holds for all u ∈ Γ_k(p), the size of the language Γ_k(p) is
$O(\sum_{u \in \Gamma_k(p)} |u|) = O(m \cdot m^k) = O(m^{k+1})$. □
2.5 (Δ, Γ) distance
The finite automaton for approximate string matching using the (Δ, Γ) distance, M(A*(Δ_l Γ_k)(p)), will be called the finite automaton accepting the language A*(Δ_l Γ_k)(p), where A is an ordered alphabet and (Δ_l Γ_k)(p) = {u | u ∈ A*, D_Δ(u, p) ≤ l, D_Γ(u, p) ≤ k} for the given pattern p ∈ A* and the numbers of allowed errors k, l ∈ N.
It is obvious that the same approach as in the previous sections will be used. Thus it is necessary to compute the cardinality of the set (Δ_l Γ_k)(p).
Let us start with the special case l = 1. In order to do that, it is necessary to prove one auxiliary lemma.
Theorem 2.50
For all i ≥ 2, j ≥ 1: (j−1)^i + 2(j−1)^{i−1} ≤ j^i.

Proof
By induction on i. The claim holds for i = 2:
(j−1)^2 + 2(j−1) = j^2 − 2j + 1 + 2j − 2 = j^2 − 1 ≤ j^2.
Assume the claim holds for some i ≥ 2. Then for i + 1:
(j−1)^{i+1} + 2(j−1)^i = (j−1)((j−1)^i + 2(j−1)^{i−1})
≤ (j−1)·j^i = j^{i+1} − j^i ≤ j^{i+1},   (by the induction assumption)
which completes the proof. □
Theorem 2.51
The number of strings generated from a pattern p of length m with at most k allowed errors in the Γ distance and at most 1 allowed error in the Δ distance is O(m^k).

Proof
The number of strings generated from the pattern p of length m for exactly k errors can be computed as the number of different paths in the transition diagram of an automaton M((Δ_l Γ_k)(p)) leading from the initial state q_{0,0} to the final state q_{k,m}, where the transition diagram of the automaton M((Δ_l Γ_k)(p)) is the same as the transition diagram of the automaton M(A*(Δ_l Γ_k)(p)) (shown in Figure REF) without the self loop for the whole alphabet in the initial state q_{0,0} (δ(q_{0,0}, a), a ∈ A). The number of these paths can be computed by the following recurrent formula:
c_{i,j} = 1, if i = 0 and j = 0,
c_{i,j} = c_{i,j−1} + 2c_{i−1,j−1}, if 0 ≤ i ≤ j and 0 < j,
c_{i,j} = 0, otherwise.
Let us show in several steps that c_{i,j} ≤ 2j^i.
i = 0, j > 0: c_{0,j} = c_{0,j−1} = ... = c_{0,0} = 1 ≤ 2j^0.
i = 1, j > 0: By induction on j. The claim holds for j = 1 because c_{1,1} = c_{1,0} + 2c_{0,0} = 2 ≤ 2. Assume it holds for some j > 0. Then for j + 1:
c_{1,j+1} = c_{1,j} + 2c_{0,j} ≤ 2j + 2 = 2(j+1) ≤ 2(j+1)^1,
using the induction assumption and the fact that c_{0,j} = 1.
i ≥ 2, j ≥ 1: By induction on j. The claim holds for j = 1 because c_{i,1} = c_{i,0} + 2c_{i−1,0} = 0. Assume it holds for j − 1 ≥ 1. Then for j:
c_{i,j} = c_{i,j−1} + 2c_{i−1,j−1}
≤ 2(j−1)^i + 2·2(j−1)^{i−1}              (by the induction assumption)
≤ 2j^i.                                  (by Theorem 2.50)
Thus the number of strings generated from a pattern of length m with exactly i allowed errors in the Γ distance and 1 error in the Δ distance is O(m^i). Since the set of strings generated from a pattern of length m with at most k allowed errors in the Γ distance and 1 error in the Δ distance is the union of the above mentioned sets of strings, its cardinality is
$\sum_{i=0}^{k} O(m^i) = O(m^k)$. □
The other special case is l ≥ k. It is quite easy to see that this is the same case as when just the Γ distance is used. The number of strings generated in this case, which was given by Theorem 2.48, is O(m^k).
Since the asymptotic number of strings generated by the combined (Δ, Γ) distance is the same in both cases (l = 1 and l ≥ k), the number of allowed errors in the Δ distance does not affect the asymptotic number of generated strings.
Now it is possible to estimate the number of states of the deterministic finite automaton.
Theorem 2.52
The number of states of the deterministic finite automaton M(A*(Δ_l Γ_k)(p)), p = p_1 p_2 ... p_m, is O(m^{k+1}).

Proof
The proof is the same as for Theorem 2.38. Since |u| = m holds for all u ∈ (Δ_l Γ_k)(p), the size of the language (Δ_l Γ_k)(p) is
$O(\sum_{u \in (\Delta_l \Gamma_k)(p)} |u|) = O(m \cdot m^k) = O(m^{k+1})$. □
2.6 Δ distance
The finite automaton for approximate string matching using the Δ distance, M(A*Δ_k(p)), will be called the finite automaton accepting the language A*Δ_k(p), where A is an ordered alphabet and Δ_k(p) = {u | u ∈ A*, D_Δ(u, p) ≤ k} for the given pattern p ∈ A* and the number of allowed errors k ∈ N.
Since the same approach as in the previous sections will be used, it is necessary to compute the cardinality of the set Δ_k(p).
Theorem 2.53
The number of strings generated from the pattern p = p_1 p_2 ... p_m with at most k allowed errors in the Δ distance is O((2k+1)^m).

Proof
Since the Δ distance is computed as the maximum of the distances of the individual symbols at corresponding positions, each symbol can be replaced by at most 2k different symbols (the k symbols that are smaller and the k symbols that are bigger in the alphabet ordering). Therefore, at each position there can be at most 2k + 1 different symbols. As the length of the pattern is m, the Δ distance generates at most (2k+1)^m different strings. □
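The estimate can be illustrated directly: for a small ordered alphabet the whole Δ-neighbourhood of a pattern can be enumerated. The sketch below (illustrative Python, not part of the original text; the alphabet and the function name delta_neighbourhood are assumptions) builds all strings whose Δ distance from the pattern is at most k and checks the (2k+1)^m bound:

```python
from itertools import product

def delta_neighbourhood(p, k, alphabet):
    """All strings u with |u| == |p| whose Delta distance from p is <= k,
    i.e. the maximum alphabet-order difference per position is <= k."""
    pos = {c: i for i, c in enumerate(alphabet)}
    choices = []
    for c in p:
        i = pos[c]
        lo, hi = max(0, i - k), min(len(alphabet) - 1, i + k)
        choices.append(alphabet[lo:hi + 1])   # at most 2k + 1 symbols
    return {''.join(t) for t in product(*choices)}

alphabet = 'abcdefgh'
n = delta_neighbourhood('cfc', 1, alphabet)
print(len(n), len(n) <= (2 * 1 + 1) ** 3)
```

For an interior pattern symbol all 2k + 1 choices exist, so the bound is attained; near the ends of the alphabet ordering fewer symbols are available and the neighbourhood is smaller.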
Finally, let us compute the number of states of the deterministic finite automaton.

Theorem 2.54
The number of states of the deterministic finite automaton M(A*Δ_k(p)), p = p_1 p_2 ... p_m, is O(m(2k+1)^m).

Proof
The proof is the same as for Theorem 2.38. Since |u| = m holds for all u ∈ Δ_k(p), the size of the language Δ_k(p) is
$O(\sum_{u \in \Delta_k(p)} |u|) = O(m (2k+1)^m)$. □
3 Finite automata accepting parts of a string
In this chapter we explain how to construct finite automata accepting all prefixes, suffixes, factors, and subsequences of a given string. At the end we show the construction of the factor oracle automaton, which accepts all factors of a given string and moreover some of its subsequences.
3.1 Prefix automaton
Having a string x = a_1 a_2 ... a_n, we can express the set Pref(x) (see Def. 1.1) using the following two forms of regular expressions:
R_Pref = ε + a_1 + a_1 a_2 + ... + a_1 a_2 ... a_n
       = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n) ... )).
Using the first form of the regular expression R_Pref, we can construct the finite automaton accepting the set Pref(x) using Algorithm 3.1.
Algorithm 3.1
Construction of the prefix automaton I (union of prefixes).
Input: String x = a_1 a_2 ... a_n.
Output: Finite automaton M accepting the language Pref(x).
Method: We use the description of the language Pref(x) by the regular expression
R_Pref = ε + a_1 + a_1 a_2 + ... + a_1 a_2 ... a_n.
1. Construct n + 1 finite automata M_i accepting the strings a_1 a_2 ... a_i for all i = 0, 1, ..., n.
2. Construct the automaton M accepting the union of the languages L(M_i), i = 0, 1, ..., n:
L(M) = L(M_0) ∪ L(M_1) ∪ L(M_2) ∪ ... ∪ L(M_n). □
The transition diagram of the prefix automaton constructed by Algorithm 3.1 is depicted in Fig. 3.1.
Figure 3.1: Transition diagram of the finite automaton accepting the language Pref(a_1 a_2 ... a_n), constructed for the regular expression R_Pref = ε + a_1 + a_1 a_2 + ... + a_1 a_2 ... a_n by Algorithm 3.1

If we use the second form of the regular expression:
R_Pref = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n) ... )),
we can construct the finite automaton using Algorithm 3.2.

Algorithm 3.2
Construction of the prefix automaton II (set of neighbours).
Input: String x = a_1 a_2 ... a_n.
Output: Finite automaton M accepting the language Pref(x).
Method: We use the description of the language Pref(x) by the regular expression
R_Pref = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n) ... )).
1. We will use the method of neighbours.
(a) The set of initial symbols: IS = {a_1}.
(b) The set of neighbours: NS = {a_1 a_2, a_2 a_3, ..., a_{n−1} a_n}.
(c) The set of final symbols: FS = {a_1, a_2, ..., a_n}.
2. Construct the automaton
M = ({q_0, q_1, q_2, ..., q_n}, A, δ, q_0, F),
where δ(q_0, a_1) = q_1, because IS = {a_1},
δ(q_i, a_{i+1}) = q_{i+1} for all i = 1, 2, ..., n−1, because
a) each state q_i, i = 1, 2, ..., n−1, corresponds to the prefix a_1 a_2 ... a_i,
b) a_i a_{i+1} ∈ NS,
F = {q_0, q_1, ..., q_n}, because the set of final symbols is FS = {a_1, a_2, ..., a_n} and ε ∈ h(R_Pref). □
The transition diagram of the resulting prefix automaton M is depicted in Fig. 3.2.
Figure 3.2: Transition diagram of the finite automaton accepting the language Pref(a_1 a_2 ... a_n), constructed for the regular expression R_Pref = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n) ... )) by Algorithm 3.2
Example 3.3
Let us have the string x = abab. Construct automata accepting Pref(x) using both methods of their construction. Using Algorithm 3.1, we obtain prefix automaton M_1 having the transition diagram depicted in Fig. 3.3. Algorithm 3.2 yields prefix automaton M_2 having the transition diagram depicted in Fig. 3.4. Both automata M_1 and M_2 accept the language Pref(abab) and therefore they should be equivalent. Let us show it:
As prefix automaton M_1 is nondeterministic, let us construct its deterministic equivalent M′_1. Its transition diagram is depicted in Fig. 3.5, and it is obvious that automata M′_1 and M_2 are equivalent. □
Figure 3.3: Transition diagram of prefix automaton M_1 accepting Pref(abab) from Example 3.3
Figure 3.4: Transition diagram of prefix automaton M_2 accepting Pref(abab) from Example 3.3
Figure 3.5: Transition diagram of deterministic automaton M′_1 from Example 3.3
The second variant of the construction of the prefix automaton is more straightforward than the first one. Therefore we will simplify it for practical use in the following algorithm. As the states of this automaton correspond to the lengths of the respective prefixes, we will use integer numbers as labels of the states.

Algorithm 3.4
Construction of a finite automaton accepting the set Pref(x).
Input: String x = a_1 a_2 ... a_n.
Output: Deterministic finite automaton M = (Q, A, δ, q_0, F) accepting the set Pref(x).
Method:
Q = {0, 1, 2, ..., n},
A is the set of all different symbols in x,
δ(i − 1, a_i) = i, i = 1, 2, ..., n,
q_0 = 0,
F = {0, 1, 2, ..., n}. □
Example 3.5
Let us construct the deterministic finite automaton accepting Pref(abba) = {ε, a, ab, abb, abba} using Algorithm 3.4. The resulting automaton is M = ({0, 1, 2, 3, 4}, {a, b}, δ, 0, {0, 1, 2, 3, 4}). Its transition diagram is depicted in Fig. 3.6. □
Figure 3.6: Transition diagram of the finite automaton M accepting the set Pref(abba) from Example 3.5
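Algorithm 3.4 translates almost directly into code. The following sketch (illustrative Python, not part of the original text) builds the transition function δ of the prefix automaton and uses it to test membership in Pref(x); since every state is final, a string is a prefix exactly when the run does not fail:

```python
def prefix_automaton(x):
    """Deterministic prefix automaton of Algorithm 3.4: states are the
    prefix lengths 0..n, with delta(i-1, a_i) = i; all states are final."""
    delta = {}
    for i, a in enumerate(x, start=1):
        delta[(i - 1, a)] = i
    return delta

def accepts_prefix(x, w):
    """Run w through the prefix automaton of x."""
    delta = prefix_automaton(x)
    state = 0
    for a in w:
        state = delta.get((state, a))
        if state is None:          # undefined transition: w is not a prefix
            return False
    return True                    # every state is final

print(accepts_prefix('abba', 'ab'), accepts_prefix('abba', 'ba'))
```

The automaton has exactly n + 1 states and n transitions, mirroring the components Q, δ, and F of the algorithm.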
3.2 Suffix automaton
Having a string x = a_1 a_2 ... a_n, we can express the set Suf(x) (see Def. 1.2) using the following two forms of regular expressions:
R_Suf(x) = a_1 a_2 ... a_n + a_2 a_3 ... a_n + ... + a_n + ε
         = (... ((a_1 + ε)a_2 + ε)a_3 + ... + ε)a_n + ε.
Using the first form of the regular expression R_Suf, we can construct the finite automaton accepting the set Suf(x) using Algorithm 3.6. Let us call it the suffix automaton for string x.
Algorithm 3.6
Construction of the suffix automaton I (union of suffixes).
Input: String x = a_1 a_2 ... a_n.
Output: Finite automaton M accepting the language Suf(x).
Method: We use the description of the language Suf(x) by the regular expression
R_Suf = a_1 a_2 ... a_n + a_2 ... a_n + ... + a_n + ε.
1. Construct n finite automata M_i accepting the strings a_i a_{i+1} ... a_n for i = 1, 2, ..., n. Construct an automaton M_0 accepting the empty string.
2. Construct the automaton M_N accepting the union of the languages L(M_i), i = 0, 1, 2, ..., n, i.e. L(M_N) = L(M_0) ∪ L(M_1) ∪ L(M_2) ∪ ... ∪ L(M_n).
3. Construct the deterministic automaton M equivalent to the automaton M_N. □
The transition diagram of the suffix automaton constructed by Algorithm 3.6 is depicted in Fig. 3.7.
Figure 3.7: Transition diagram of the finite automaton accepting the language Suf(a_1 a_2 ... a_n), constructed for the regular expression R_Suf = a_1 a_2 ... a_n + a_2 ... a_n + ... + a_n + ε
If we use the second form of the regular expression:
R_Suf(x) = (... ((a_1 + ε)a_2 + ε)a_3 + ... + ε)a_n + ε,
we can construct the finite automaton using Algorithm 3.7.

Algorithm 3.7
Construction of the suffix automaton II (use of ε-transitions).
Input: String x = a_1 a_2 ... a_n.
Output: Finite automaton M accepting the language Suf(x).
Method: We use the description of the language Suf(x) by the regular expression
R_Suf(x) = (... ((a_1 + ε)a_2 + ε)a_3 + ... + ε)a_n + ε.
1. Construct the finite automaton M_1 accepting the string x = a_1 a_2 ... a_n:
M_1 = ({q_0, q_1, ..., q_n}, A, δ, q_0, {q_n}),
where δ(q_i, a_{i+1}) = q_{i+1} for all i = 0, 1, ..., n−1.
2. Construct the finite automaton M_2 = ({q_0, q_1, ..., q_n}, A, δ′, q_0, {q_n}) from the automaton M_1 by inserting the ε-transitions:
δ′(q_0, ε) = {q_1, q_2, ..., q_{n−1}, q_n}.
3. Replace all ε-transitions in M_2 by non-ε-transitions. The resulting automaton is M_3.
4. Construct the deterministic finite automaton M equivalent to the automaton M_3. □
Suffix automaton M_2 constructed by Algorithm 3.7 has, after step 2, the transition diagram depicted in Fig. 3.8. Suffix automaton M_3 has, after step 3 of Algorithm 3.7, the transition diagram depicted in Fig. 3.9.
Figure 3.8: Transition diagram of suffix automaton M_2 with ε-transitions constructed in step 2 of Algorithm 3.7
Figure 3.9: Transition diagram of suffix automaton M_3 after the removal of ε-transitions in step 3 of Algorithm 3.7
We can use an alternative method for the construction of the suffix automaton described by the second form of the regular expression:
R_Suf(x) = (... ((a_1 + ε)a_2 + ε)a_3 + ... + ε)a_n + ε.
Algorithm 3.8 uses this method.
Algorithm 3.8
Construction of the suffix automaton III (using more initial states).
Input: String x = a_1 a_2 ... a_n.
Output: Finite automaton M accepting the language Suf(x).
Method: We use the description of the language Suf(x) by the regular expression
R_Suf(x) = (... ((a_1 + ε)a_2 + ε)a_3 + ... + ε)a_n + ε.
1. Construct the finite automaton M_1 accepting the string x = a_1 a_2 ... a_n:
M_1 = ({q_0, q_1, q_2, ..., q_n}, A, δ, q_0, {q_n}),
where δ(q_i, a_{i+1}) = q_{i+1} for all i = 0, 1, ..., n−1.
2. Construct the finite automaton M_2 = ({q_0, q_1, q_2, ..., q_n}, A, δ, I, {q_n}) from the automaton M_1 having this set of initial states:
I = {q_0, q_1, ..., q_{n−1}, q_n}.
3. Construct the deterministic automaton M equivalent to the automaton M_2. We use the following steps for the construction:
(a) Using Algorithm 1.39, construct the automaton M_3 equivalent to the automaton M_2 having just one initial state and ε-transitions from the initial state to all other states.
(b) Using Algorithm 1.38, construct the automaton M_4 without ε-transitions equivalent to the automaton M_3.
(c) Using Algorithm 1.40, construct the deterministic automaton M equivalent to the automaton M_4. □
The transition diagram of suffix automaton M_2 constructed in step 2 of Algorithm 3.8 is depicted in Fig. 3.10.
Figure 3.10: Transition diagram of suffix automaton M_2 accepting the language Suf(a_1 a_2 ... a_n), constructed in step 2 of Algorithm 3.8
Example 3.9
Let us have the string x = abab. Construct automata accepting Suf(x) using all three methods of their construction. Using Algorithm 3.6 we obtain (after step 2) suffix automaton M_1 having the transition diagram depicted in Fig. 3.11. Algorithm 3.7 yields, after step 2, suffix automaton M_2 with the transition diagram depicted in Fig. 3.12. Algorithm 3.8 yields, after step 2, suffix automaton M_3 with the transition diagram depicted in Fig. 3.13. All the automata M_1, M_2 and M_3 accept the language Suf(abab) and therefore they should be equivalent. Let us show it.
Figure 3.11: Transition diagram of suffix automaton M_1 accepting Suf(abab) from Example 3.9
Figure 3.12: Transition diagram of suffix automaton M_2 accepting Suf(abab) from Example 3.9
Figure 3.13: Transition diagram of nondeterministic suffix automaton M_3 accepting Suf(abab) from Example 3.9
As suffix automaton M_1 is nondeterministic, let us construct the equivalent deterministic automaton M′_1. The transition table and transition diagram of suffix automaton M′_1 are depicted in Fig. 3.14. Automaton M′_1 can be minimized, as it contains several groups of equivalent states. The transition diagram of the minimized suffix automaton M′_1 is depicted in Fig. 3.15.
Figure 3.14: Transition table and transition diagram of deterministic suffix automaton M′_1 from Example 3.9
Figure 3.15: Transition diagram of deterministic suffix automaton M′_1 after minimization from Example 3.9
Suffix automaton M_2 is also nondeterministic, and after the determinization we obtain automaton M′_2 having the transition diagram depicted in Fig. 3.16.
Figure 3.16: Transition diagram of deterministic suffix automaton M′_2 from Example 3.9
Suffix automaton M_3 has five initial states. The construction of the equivalent automaton M′_3 is shown step by step in Fig. 3.17.
Suffix automata M′_1, M′_2, and M′_3 (see Figs 3.15, 3.16, 3.17) are obviously equivalent. □
Figure 3.17: Three steps of the construction of deterministic suffix automaton M′_3 from Example 3.9: a) transition diagram of the suffix automaton with one initial state and ε-transitions, b) transition diagram of the suffix automaton after the removal of ε-transitions, c) transition diagram of deterministic suffix automaton M′_3

We can use the experience from the possible constructions of the suffix automaton in the following practical algorithm.

Algorithm 3.10
Construction of a finite automaton accepting the set Suf(x).
Input: String x = a_1 a_2 ... a_n.
Output: Deterministic finite automaton M = (Q, A, δ, q_0, F) accepting the set Suf(x).
Method:
1. Construct the finite automaton M_1 = (Q_1, A, δ_1, q_0, F_1) accepting the string x and the empty string:
Q_1 = {0, 1, 2, ..., n},
A is the set of all different symbols in x,
δ_1(i − 1, a_i) = i, i = 1, 2, ..., n,
q_0 = 0,
F_1 = {0, n}.
2. Insert additional transitions into the automaton M_1 leading from the initial state 0 to the states 2, 3, ..., n:
δ(0, a) = i if δ(i − 1, a) = i, for all a ∈ A, i = 2, 3, ..., n.
The resulting automaton is M_2.
3. Construct the deterministic finite automaton M equivalent to the automaton M_2. □
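The three steps of Algorithm 3.10 can be sketched as follows (illustrative Python, not part of the original text): step 1 builds the spine for x, step 2 adds the extra transitions out of state 0, and step 3 is a standard subset construction:

```python
def suffix_automaton(x):
    """Sketch of Algorithm 3.10: spine automaton for x, additional
    transitions from state 0, then subset construction. Deterministic
    states are frozensets of spine states."""
    n = len(x)
    nd = {}
    for i, a in enumerate(x, start=1):
        nd.setdefault((i - 1, a), set()).add(i)   # step 1: the spine
        nd.setdefault((0, a), set()).add(i)       # step 2: start anywhere
    start = frozenset({0})
    final = {0, n}                 # M_1 accepts x and the empty string
    states, delta, stack = {start}, {}, [start]
    while stack:                   # step 3: subset construction
        q = stack.pop()
        for a in set(x):
            t = frozenset(s for p in q for s in nd.get((p, a), ()))
            if t:
                delta[(q, a)] = t
                if t not in states:
                    states.add(t)
                    stack.append(t)
    accepting = {q for q in states if q & final}
    return start, delta, accepting

def is_suffix_by_automaton(x, w):
    start, delta, accepting = suffix_automaton(x)
    q = start
    for a in w:
        q = delta.get((q, a))
        if q is None:
            return False
    return q in accepting

print([w for w in ['abba', 'bba', 'ba', 'a', '', 'ab', 'bb']
       if is_suffix_by_automaton('abba', w)])
```

For x = abba the reachable subsets are exactly the d-subsets 0, 14, 2, 23, 3, 4 listed in Example 3.13 below (written there without braces).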
Definition 3.11 (Terminal state of the suffix automaton)
The final state of the suffix automaton having no outgoing transition is called the terminal state. □

Definition 3.12 (Backbone of the suffix automaton)
The backbone of the suffix automaton M for a string x is the longest continuous sequence of states and transitions leading from the initial state to the terminal state of M. □
Example 3.13
Let us construct the deterministic finite automaton accepting Suf(abba) = {abba, bba, ba, a, ε} using Algorithm 3.10. Automaton M_1 = ({0, 1, 2, 3, 4}, {a, b}, δ_1, 0, {0, 4}) accepting the strings ε and abba has the transition diagram depicted in Fig. 3.18. Finite automaton M_2 = ({0, 1, 2, 3, 4}, {a, b}, δ_2, 0, {0, 4}) with the additional transitions has the transition diagram depicted in Fig. 3.19. The final result of this construction is the deterministic finite automaton M = ({0, 14, 2, 23, 3, 4}, {a, b}, δ, 0, {0, 14, 4}), whose transition table is shown in Table 3.1. The d-subsets of automaton M are: 0, 14, 2, 23, 3, 4. The transition diagram of automaton M is depicted in Fig. 3.20. □
Figure 3.18: Transition diagram of finite automaton M_1 accepting the string abba from Example 3.13
Figure 3.19: Transition diagram of nondeterministic finite automaton M_2 with additional transitions from Example 3.13
Figure 3.20: Transition diagram of deterministic finite automaton M accepting the set Suf(abba) = {abba, bba, ba, a, ε} from Example 3.13
3.3 Factor automaton
The factor automaton is in some sources called a Directed Acyclic Word Graph (DAWG).
Having a string x = a_1 a_2 ... a_n, we can express the set Fact(x) (see Def. 1.3) using regular expressions:
1. R_Fac1(x) = (... ((a_1 + ε)a_2 + ε)a_3 + ... + ε)a_n + ε
             + (... ((a_1 + ε)a_2 + ε)a_3 + ... + ε)a_{n−1} + ε
             + ...
             + a_1 + ε,
2. R_Fac2(x) = ε + a_1(ε + a_2(ε + ... + a_{n−1}(ε + a_n) ... ))
             + ε + a_2(ε + ... + a_{n−1}(ε + a_n) ... )
             + ...
             + ε + a_n.
      a    b
 0    14   23
 14   -    2
 2    -    3
 23   4    3
 3    4    -
 4    -    -
Table 3.1: Transition table of deterministic finite automaton M from Example 3.13
The first variant of the regular expression corresponds to the fact that the set Fact(x) is exactly the set of all suffixes of all prefixes of x. The second variant corresponds to the fact that the set Fact(x) is exactly the set of all prefixes of all suffixes of x. It follows from these two readings of the regular expressions that combinations of the methods for constructing prefix and suffix automata can be used for the construction of factor automata.
For the first variant of the regular expression we can use Algorithms 3.6, 3.7, 3.8 and 3.10 as a base for the construction of suffix automata and modify them in order to accept all prefixes of all suffixes accepted by the suffix automata. Algorithm 3.14 makes this modification by setting all states final.
Algorithm 3.14
Construction of the factor automaton.
Input: String x = a_1 a_2 ... a_n.
Output: Finite automaton M accepting the language Fact(x).
Method:
1. Construct the suffix automaton M_1 for the string x = a_1 a_2 ... a_n using any of Algorithms 3.6, 3.7, 3.8 or 3.10.
2. Construct the automaton M_2 by making all states of the automaton M_1 final states.
3. Perform the minimization of the automaton M_2. The resulting automaton is the automaton M. □
The resulting deterministic automaton need not be minimal, and therefore the minimization takes place as the final operation.
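The effect of Algorithm 3.14 can be illustrated without building the automaton explicitly: if every nonempty subset of states reached during an on-the-fly determinization of the suffix automaton counts as final, the simulation accepts exactly the factors. A sketch (illustrative Python, not part of the original text):

```python
def accepts_factor(x, w):
    """Sketch of the idea behind Algorithm 3.14: simulate the subset
    construction of the suffix automaton and treat every nonempty
    subset as a final state, so prefixes of suffixes are accepted."""
    nd = {}
    for i, a in enumerate(x, start=1):
        nd.setdefault((i - 1, a), set()).add(i)
        nd.setdefault((0, a), set()).add(i)   # a suffix may start anywhere
    q = {0}
    for a in w:
        q = {s for p in q for s in nd.get((p, a), ())}
        if not q:
            return False      # the run died: w is not a factor of x
    return True               # all states are final

x = 'abbbc'
print([w for w in ['bbb', 'abb', 'bc', 'cb', 'abbbc', '']
       if accepts_factor(x, w)])
```

The empty string is accepted as well, since the initial subset {0} is already final after step 2 of the algorithm.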
Example 3.15
Let us have the string x = abbbc. We construct the factor automaton accepting the set Fact(x) using all four possible ways of its construction. The first method is based on Algorithm 3.6. Factor automaton M_1 has, after step 2 of Algorithm 3.6, the transition diagram depicted in Fig. 3.21.
Figure 3.21: Transition diagram of factor automaton M_1 accepting the set Fact(abbbc) from Example 3.15
As factor automaton M_1 is nondeterministic, we perform its determinization. The transition table and transition diagram of the deterministic factor automaton M′_1 are depicted in Fig. 3.22. This automaton is not minimal, because it contains sets of mutually equivalent states. The transition diagram of the minimal factor automaton M′_1 is depicted in Fig. 3.23. □
The second method of the construction of the factor automaton is based on Algorithm 3.7. Factor automaton M_2 has, after step 2 of Algorithm 3.7, the transition diagram depicted in Fig. 3.24. Factor automaton M_2 is nondeterministic and therefore we perform its determinization. The resulting deterministic factor automaton M′_2 is minimal, and its transition table and transition diagram are depicted in Fig. 3.25.
Figure 3.22: Transition table and transition diagram of deterministic factor automaton M′_1 accepting the set Fact(abbbc) from Example 3.15
Figure 3.23: Transition diagram of minimal factor automaton M′_1 accepting the set Fact(abbbc) from Example 3.15
Figure 3.24: Transition diagram of nondeterministic factor automaton M_2 accepting the set Fact(abbbc) from Example 3.15
      a    b    c
 0    1    234  5
 1    -    2    -
 2    -    3    -
 3    -    4    -
 4    -    -    5
 234  -    34   5
 34   -    4    5
 5    -    -    -
Figure 3.25: Transition table and transition diagram of deterministic factor automaton M′_2 accepting the set Fact(abbbc) from Example 3.15
The third method of the construction of the factor automaton is based on Algorithm 3.8. Factor automaton M_3 has, after step 2 of Algorithm 3.8, the transition diagram depicted in Fig. 3.26. Automaton M_3 has more than one initial state, therefore we transform it to the automaton M′_3 having just one initial state. Its transition table and transition diagram are depicted in Fig. 3.27. □
Figure 3.26: Transition diagram of factor automaton M_3 accepting the set Fact(abbbc) from Example 3.15
         a    b    c
 012345  1    234  5
 1       -    2    -
 2       -    3    -
 3       -    4    -
 4       -    -    5
 234     -    34   5
 34      -    4    5
 5       -    -    -
Figure 3.27: Transition table and transition diagram of factor automaton M′_3 with just one initial state accepting the set Fact(abbbc) from Example 3.15
Note: We keep the notions of the terminal state and the backbone (see Defs. 3.11 and 3.12) also for the factor automaton.
3.4 Parts of suffix and factor automata
In some applications we will need only some parts of suffix and factor automata instead of their complete forms. We have identified three cases of useful parts of both automata:
1. The backbone of the suffix or factor automaton.
2. The front end of the suffix or factor automaton.
3. The multiple front end of the suffix or factor automaton.

3.4.1 Backbone of suffix and factor automata
The backbone of the suffix or factor automaton for a string x is the part of the automaton in which all states and transitions that do not correspond to prefixes of x are removed. A general method for the extraction of the backbone of a suffix or factor automaton is the operation of intersection.
Algorithm 3.16
Construction of the backbone of the suffix (factor) automaton.
Input: String x = a_1 a_2 ... a_n, deterministic suffix (factor) automaton M_1 = (Q_1, A, δ_1, q_01, F_1) for x, deterministic prefix automaton M_2 = (Q_2, A, δ_2, q_02, F_2) for x.
Output: Backbone M = (Q, A, δ, q_0, F) of the suffix (factor) automaton M_1 = (Q_1, A, δ_1, q_01, F_1).
Method: Construct the automaton M accepting the intersection Suf(x) ∩ Pref(x) (or Fact(x) ∩ Pref(x)) using Algorithm 1.44. □
Example 3.17
Let us construct the backbone of the suffix automaton for the string x = abab. The transition diagrams of the input automata M_1 and M_2 and of the output automaton M are depicted in Fig. 3.28. □
The resulting backbone is similar to the input suffix automaton M_1. The only change is that the transition from state 0_S to state 2_S4_S for the input symbol b is removed. Moreover, we can see that the resulting backbone in Example 3.17 is equivalent to the input prefix automaton M_2. The important point of this construction is that the d-subsets of suffix automaton M_1 are preserved in the resulting automaton M. This fact will be useful in some applications described in the next chapters.
The algorithm for the extraction of the backbone of the suffix or factor automaton is very simple and straightforward. Nevertheless, it can be used for the extraction of backbones of suffix and factor automata for sets of strings, and for approximate suffix and factor automata as well.
Figure 3.28: Construction of the backbone of suffix automaton M_1 from Example 3.17

3.4.2 Front end of suffix or factor automata
A front end of a suffix or factor automaton for a string x is a finite automaton accepting those prefixes of factors of the string x whose length is limited by a bound strictly less than the length of the string x.
For the construction of the front end part of a factor automaton we adapt Algorithm 1.40 for the transformation of a nondeterministic finite automaton to a deterministic finite automaton. The adaptation consists of two points:
1. Append the information on the minimal distance of a state of the automaton from its initial state. The minimal distance is, in this case, the minimal number of transitions necessary to reach the state in question from the initial state.
2. Stop the construction of the deterministic automaton as soon as all states having the desired limited distance from the initial state are constructed.

Definition 3.18
Let M = (Q, A, δ, q_0, F) be an acyclic finite automaton accepting the language L(M). The front end of the automaton M for a given limit h is a minimal deterministic finite automaton M_h accepting at least all prefixes of strings from the language L(M) having length less than or equal to h. This language will be denoted by L_h. □
Algorithm 3.19
Transformation of an acyclic nondeterministic nite automaton to a deter-
ministic nite automaton with states having distance from the initial state
less or equal to given limit.
Input: Nondeterministic acyclic finite automaton M = (Q, A, δ, q0, F), limit h of maximal distance.
Output: Deterministic finite automaton Mh = (Qh, A, δh, q0h, Fh) such that Lh(M) = Lh(Mh).
Method:
1. Set Qh = {(q0, 0)}; state (q0, 0) will be treated as unmarked.
2. If each state in Qh is marked then continue with step 5.
3. If there is no unmarked state (q, l) in Qh, where l is less than h, then continue with step 5.
4. An unmarked state (q, l) will be chosen from Qh and the following operations will be executed:
   (a) δh((q, l), a) = (q′, l + 1) for all a ∈ A, where q′ = ∪ δ(p, a) for all p ∈ q; each new state (q′, l + 1) is added to Qh unmarked,
   (b) if some (q′, l′) is already in Qh then Qh = (Qh \ {(q′, l′)}) ∪ {(q′, min(l + 1, l′))} and δh((q, l), a) = (q′, min(l + 1, l′)),
   (c) state (q, l) will be marked,
   (d) continue with step 2.
5. q0h = (q0, 0).
6. Fh = {(q, l) : (q, l) ∈ Qh, q ∩ F ≠ ∅}. □
Example 3.20
Let us have string x = abab. Construct the front end of the factor automaton for string x for the limit h = 2. The transition diagram of nondeterministic factor automaton M for string x is depicted in Fig. 3.29. The transition diagram of the front end of deterministic factor automaton Mh for h = 2 is depicted in Fig. 3.30. □
Figure 3.29: Transition diagram of nondeterministic factor automaton M for string x = abab from Example 3.20
Figure 3.30: Transition diagram of the front end of deterministic factor automaton M2 from Example 3.20
3.4.3 Multiple front end of suffix and factor automata
Let us recall the definition of a multiple state of a deterministic finite automaton (see Def. 1.42). The multiple front end of a suffix or factor automaton is the part of it containing multiple states only. For the construction of such a part of a suffix or factor automaton we again adapt Algorithm 1.40 for the transformation of a nondeterministic finite automaton to a deterministic finite automaton. The adaptation is very simple: if some state constructed during the determinisation is a simple state then we omit it.
Algorithm 3.21
Construction of the multiple states part of a suffix or factor automaton.
Input: Nondeterministic suffix or factor automaton M = (Q, A, δ, q0, F).
Output: Part M′ = (Q′, A, δ′, q′0, F′) of the deterministic factor automaton for M containing only multiple states (with the exception of the initial state).
Method:
1. Set Q′ = {q0}; state q0 will be treated as unmarked.
2. If all states in Q′ are marked then continue with step 4.
3. An unmarked state q will be chosen from Q′ and the following operations will be executed:
   (a) δ′(q, a) = ∪ δ(p, a) for all p ∈ q and for all a ∈ A,
   (b) if δ′(q, a) is a multiple state then Q′ = Q′ ∪ {δ′(q, a)},
   (c) the state q ∈ Q′ will be marked,
   (d) continue with step 2.
4. q′0 = q0.
5. F′ = {q : q ∈ Q′, q ∩ F ≠ ∅}. □
Example 3.22
Let us have string x = abab as in Example 3.20. Construct the multiple front end of the factor automaton for string x. The transition diagram of nondeterministic factor automaton M for string x is depicted in Fig. 3.29. The transition diagram of the multiple front end M′ of factor automaton M is depicted in Fig. 3.31. □
Figure 3.31: Transition diagram of the multiple states part of deterministic factor automaton M′ from Example 3.22
3.5 Subsequence automata
The set of subsequences of string x = a1a2...an (see Def. 1.4) can be described by the following regular expression:
RSub(x) = (a1 + ε)(a2 + ε) ... (an + ε).
Therefore the next algorithm is based on the insertion of ε-transitions.
Algorithm 3.23
Construction of a subsequence automaton accepting set Sub(x).
Input: String x = a1a2...an.
Output: Deterministic subsequence automaton M = (Q, A, δ, q0, F) accepting set Sub(x).
Method:
1. Construct finite automaton M1 = (Q1, A, δ1, q0, F1) accepting all prefixes of string x:
   Q1 = {0, 1, 2, ..., n},
   A is the set of all different symbols in x,
   δ1(i − 1, ai) = i, i = 1, 2, ..., n,
   q0 = 0, F1 = {0, 1, 2, ..., n}.
2. Insert ε-transitions into automaton M1 leading from each state to its next state. The resulting automaton is M2 = (Q1, A, δ2, q0, F1), where δ2 = δ1 ∪ δ′ and δ′(i − 1, ε) = i, i = 1, 2, ..., n.
3. Replace all ε-transitions by non-ε-transitions. The resulting automaton is M3.
4. Construct deterministic finite automaton M equivalent to automaton M3. All its states will be final states. □
Example 3.24
Let us construct the deterministic finite automaton accepting set Sub(abba) = {ε, a, b, ab, ba, aa, bb, aba, bba, abb, abba}. Let us mention that strings aa, aba are subsequences of x = abba but not its factors. Using Algorithm 3.23 we construct the subsequence automaton. Automaton M1 = ({0, 1, 2, 3, 4}, {a, b}, δ1, 0, {0, 1, 2, 3, 4}) accepting all prefixes of string abba has the transition diagram depicted in Fig. 3.35.
Finite automaton M2 with inserted ε-transitions has the transition diagram depicted in Fig. 3.32. Nondeterministic finite automaton M3 after the elimination of ε-transitions has the transition diagram depicted in Fig. 3.33.
Figure 3.32: Transition diagram of automaton M2 with ε-transitions accepting all subsequences of string abba from Example 3.24
Figure 3.33: Transition diagram of nondeterministic finite automaton M3 accepting set Sub(abba) after elimination of the ε-transitions from Example 3.24
The final result of this construction is the deterministic finite automaton (subsequence automaton) M. Its transition table is shown in Table 3.2.

       a    b
  0    14   23
  14   4    23
  23   4    3
  3    4
  4

Table 3.2: Transition table of automaton M from Example 3.24
The transition diagram of automaton M is depicted in Fig. 3.34. □
Figure 3.34: Transition diagram of deterministic subsequence automaton M accepting set Sub(abba) from Example 3.24
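The determinised automaton of Algorithm 3.23 can also be obtained directly: in the deterministic subsequence automaton it suffices to remember the smallest position of each d-subset, so a transition on symbol a leads just behind the next occurrence of a. A sketch (our own illustration, not taken from the book):

```python
def subsequence_automaton(x):
    """Deterministic subsequence automaton for x (sketch of Algorithm 3.23).

    State i means: i symbols of x have been consumed; on symbol a the
    automaton jumps just behind the next occurrence of a in x[i:].
    Every state is final, so the accepted language is exactly Sub(x).
    """
    n = len(x)
    delta = [dict() for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(i, n):
            if x[j] not in delta[i]:
                delta[i][x[j]] = j + 1
    return delta

def accepts(delta, w):
    """Simulate the automaton; all states are final."""
    q = 0
    for a in w:
        if a not in delta[q]:
            return False
        q = delta[q][a]
    return True
```

For x = abba this reproduces the transitions of Table 3.2 with each d-subset identified by its smallest element (state 14 behaves as 1, state 23 as 2).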
3.6 Factor oracle automata
The factor oracle automaton for a given string x accepts all factors of x and possibly some subsequences of x.
The factor oracle automaton is similar to the factor automaton, but it always has n + 1 states, where n = |x|. It is possible to construct the factor oracle automaton from the factor automaton. This construction is based on the notion of corresponding states in a factor automaton.
Definition 3.25
Let M be the factor automaton for string x and q1, q2 be different states of M. Let there exist two sequences of transitions in M:
(q0, x1) ⊢* (q1, ε), and
(q0, x2) ⊢* (q2, ε).
If x1 is a suffix of x2 and x2 is a prefix of x, then q1 and q2 are corresponding states. □
The factor oracle automaton can be constructed by merging the corresponding states.
Example 3.26
Let us construct the deterministic finite automaton accepting set Fact(abba) = {ε, a, b, ab, bb, ba, abb, bba, abba} using Algorithm 3.7. Automaton M1 = ({0, 1, 2, 3, 4}, {a, b}, δ1, 0, {0, 1, 2, 3, 4}) accepting all prefixes of the string abba has the transition diagram depicted in Fig. 3.35.
Figure 3.35: Transition diagram of finite automaton M1 accepting the set of all prefixes of the string abba from Example 3.26
Finite automaton M2 with inserted ε-transitions has the transition diagram depicted in Fig. 3.36. Nondeterministic finite automaton M3 after the elimination of ε-transitions has the transition diagram depicted in Fig. 3.37.
Figure 3.36: Transition diagram of automaton M2 with ε-transitions accepting all factors of string abba from Example 3.26
Figure 3.37: Transition diagram of nondeterministic factor automaton M3 after the elimination of ε-transitions from Example 3.26
The final result of this construction is deterministic factor automaton M. Its transition table is shown in Table 3.3.

       a    b
  0    14   23
  14        2
  2         3
  23   4    3
  3    4
  4

Table 3.3: The transition table of the automaton M from Example 3.26
The transition diagram of automaton M is depicted in Fig. 3.38. The corresponding states in this automaton are 2 and 23. If we make these two states equivalent, then we obtain factor oracle automaton Oracle(abba) with the transition diagram depicted in Fig. 3.39. The language accepted by the automaton is:
L(Oracle(abba)) = {ε, a, b, ab, bb, ba, abb, bba, abba, aba}
                = Fact(abba) ∪ {aba}.
String aba is not a factor of abba but it is its subsequence. □
Figure 3.38: Transition diagram of factor automaton M accepting set Fact(abba) from Example 3.26
Figure 3.39: Transition diagram of the Oracle(abba) from Example 3.27
The approach used in Example 3.26 has this drawback: the intermediate result is a factor automaton and its number of states is limited by 2n − 2, while the number of states of the factor oracle automaton is always equal to n + 1, where n is the length of string x. Fortunately, a factor oracle automaton can be constructed directly during the determinization of the nondeterministic factor automaton. In this case, it is necessary to fix the identification of corresponding states.
Interpretation A of Definition 3.25: Using our style of the numbering of states, the identification of corresponding states can be done as follows. Two states
p = {i1, i2, ..., in1}, q = {j1, j2, ..., jn2}
are corresponding states provided that the set of states of the nondeterministic factor automaton is ordered and their lowest states are equal, which means that i1 = j1.
Example 3.27
During the determinisation of nondeterministic factor automaton M3 (see Fig. 3.37) we identify states 2 and 23 as corresponding states, and the factor oracle automaton having the transition diagram depicted in Fig. 3.39 can be constructed directly. □
A problem can appear during the construction of the factor oracle automaton by merging of corresponding states of the respective factor automaton. The problem is that the resulting factor oracle automaton can be nondeterministic. Let us show this problem in the next example.
Example 3.28
Let us have text T = abbcabcd. The construction of the factor oracle automaton using merging of states of the respective factor automaton is shown step by step in Fig. 3.40.
We can see in Fig. 3.40 d) that the factor oracle automaton resulting from merging of the corresponding states (236, 26) and (47, 4) of the factor automaton having the transition diagram depicted in Fig. 3.40 c) is nondeterministic. The nondeterminism is caused by transition δ(236, c) = {47, 7}. The transition table of the nondeterministic factor oracle automaton depicted in Fig. 3.40 d) is Table 3.4.

          a    b     c      d
  0       15   236   47     8
  15           236
  236          3     47, 7
  3                  47
  47      5                 8
  5            6
  6                  7
  7                         8
  8

Table 3.4: Transition table of the nondeterministic factor oracle automaton having the transition diagram depicted in Fig. 3.40 d) (see Example 3.28)
The result of the determinization of this factor oracle automaton is the deterministic factor oracle automaton having the transition diagram depicted in Fig. 3.40 e) and Table 3.5 as the transition table. We can see in this table that states 47 and 47,7 are equivalent. This fact is expressed in Fig. 3.40 e). □

          a    b     c      d
  0       15   236   47     8
  15           236
  236          3     47, 7
  3                  47
  47      5                 8
  47, 7   5                 8
  5            6
  6                  7
  7                         8
  8

Table 3.5: Transition table of the deterministic factor oracle automaton from Example 3.28; let us note that states 47 and 47,7 are equivalent
Let us discuss the problem of the nondeterminism of the factor oracle automaton after merging of states.
Figure 3.40: Construction of the factor oracle automaton from Example 3.28:
a) transition diagram of the nondeterministic factor automaton with ε-transitions
b) transition diagram of the nondeterministic factor automaton after removal of ε-transitions
c) transition diagram of the deterministic factor automaton
d) transition diagram of the nondeterministic factor oracle automaton
e) transition diagram of the deterministic factor oracle automaton
Let δ(q, a) = {q1, q2}, for some a ∈ A, be the nondeterministic part of the transition function, where state q2 has a greater depth than state q1. The factor oracle automaton is a homogenous automaton, and due to its construction the d-subset (see Def. 1.41) of q2 is a subset of the d-subset of q1. Therefore it holds that d-subset(q1) = d-subset(q1) ∪ d-subset(q2). It follows from this that
δ(q1, a) = δ({q1, q2}, a) for all a ∈ A.
The practical consequence of this reasoning is that, in the case of nondeterminism shown above, it is enough to remove the longer transition. More precisely: δ(q, a) = q1.
For some texts it is possible to construct a factor oracle automaton having fewer transitions than the factor oracle automaton constructed by merging corresponding states according to Definition 3.25. Let us show the possibility using an example.
Example 3.29
Let the text be T = abcacdace. The construction of the factor oracle automaton using merging of states according to Definition 3.25 is shown step by step in Fig. 3.41. Moreover, another principle of merging states of the respective factor automaton (see Fig. 3.41c) can be used. This is merging of states (3, 58, 358). □
The principle used in Example 3.29 is based on the following interpretation B of Definition 3.25:
States having d-subsets
p = {i1, i2, ..., in1}, q = {j1, j2, ..., jn2}
are corresponding states for |p| > |q| when q ⊆ p.
From this it follows that states 358 and 58 are corresponding states, and states 3 and 358 are corresponding states according to interpretation A of Definition 3.25.
The factor oracle accepts language L(Oracle(x)) for string x. This language contains the set Fact(x) and moreover some subsequences. Let us characterise the language accepted by factor oracles.
The main idea behind the method of this characterisation is an operation called contraction. A contraction consists in the removal of some string from x, starting with a repeating factor and continuing by the gap up to some of its repetitions. Let us show this principle using an example.
Figure 3.41: Construction of the factor oracle automaton and its optimised variant from Example 3.29:
a) transition diagram of the nondeterministic factor automaton with ε-transitions
b) transition diagram of the nondeterministic factor automaton after removal of ε-transitions
c) transition diagram of the deterministic factor automaton
d) transition diagram of the factor oracle automaton after merging the pairs of states (3, 358) and (5, 58); it has 8 external transitions
e) transition diagram of the deterministic factor oracle automaton after merging states (3, 58, 358); it has 7 external transitions
Example 3.30
Let the text be T = gaccattctc. We start by constructing the factor automaton for text T. The transition diagram of nondeterministic factor automaton MN is depicted in Fig. 3.42a. The transition table of deterministic factor automaton MD is shown in Table 3.6 and its transition diagram is depicted in Fig. 3.42b. The corresponding states are these pairs of states:
(2, 25), (3, 348A), (6, 679), (8, 8A).
On the basis of this correspondence we can construct factor oracle automaton MO having the transition diagram depicted in Fig. 3.42c. Using factor automaton MD we can construct the repetition table R shown in Table 3.7. □
Figure 3.42: Transition diagrams of automata MN, MD and MO for text T = gaccattctc from Example 3.30:
a) transition diagram of nondeterministic factor automaton MN
b) transition diagram of deterministic factor automaton MD
c) transition diagram of factor oracle automaton MO

          a     c      g    t
  0       25    348A   1    679
  1       2
  25            3           6
  2             3
  3             4
  348A    5     4           9
  4       5
  5                         6
  6                         7
  679           8A          7
  7             8
  8                         9
  8A                        9
  9             A
  A

Table 3.6: Transition table of deterministic factor automaton MD for text T = gaccattctc from Example 3.30

  d-subset   Factor   Repetitions
  25         a        (2, F), (5, G)
  348A       c        (3, F), (4, S), (8, G), (A, G)
  679        t        (6, F), (7, S), (9, G)
  8A         tc       (8, F), (A, S)

Table 3.7: Repetition table R for text T = gaccattctc from Example 3.30
Using the repetition table we can construct a set of contractions. The set Contr of contractions is a set of pairs (i, j) where
i is the starting position of the contraction,
j is the first position behind the contraction.
The contractions are closely related to the repetitions of factors in text T. A contraction consists in removing the first occurrence of some repeating factor together with the gap starting behind it and ending just before some next occurrence of it. Contractions can be combined provided that the repetition of the factors involved is a repetition with a gap. They cannot be combined in the case of overlapping, or if one contraction contains another as a substring.
Using the contractions described above, we obtain the set of pairs Contr(T) for text T.
Example 3.31
Let the text be T = gaccattctc as in Example 3.30. The set of contractions based on repetition table R is:
Contr(T) = {(2, 5), (3, 4), (3, 8), (3, A), (6, 7), (6, 9), (7, 9)}.
Let us list the set SC(T) of all strings created from T using these contractions, containing also the string T itself:

  T                      = gaccattctc   (*)
  C{(7, 9)}              = gaccattc
  C{(3, 4)}              = gacattctc    (*)
  C{(3, 4), (7, 9)}      = gacattc
  C{(3, A)}              = gac
  C{(6, 7)}              = gaccatctc    (*)
  C{(6, 9)}              = gaccatc
  C{(3, 4), (6, 7)}      = gacatctc     (*)
  C{(3, 4), (6, 9)}      = gacatc
  C{(2, 5)}              = gattctc      (*)
  C{(2, 5), (7, 9)}      = gattc
  C{(2, 5), (6, 7)}      = gatctc       (*)
  C{(2, 5), (6, 9)}      = gatc
  C{(3, 8)}              = gactc        (*)
□
The set of strings SC(T) created by contractions can contain some strings which are substrings of other elements of the set SC(T). Such strings can be removed from SC(T) because they are redundant.
Example 3.32
The set SCO(T) for text T = gaccattctc from Example 3.30 contains after this optimisation these strings (marked in Example 3.31 by *):
SCO(T) = {gaccattctc, gacattctc, gaccatctc, gacatctc, gattctc, gatctc, gactc}.
The language accepted by factor oracle MO for T is:
L(MO) = Fact(SCO(T)). □
3.7 The complexity of automata for parts of strings
The maximum state and transition complexities of prefix, suffix, factor, subsequence, and factor oracle automata are summarized in Table 3.8. The length of the string is always equal to n; |A| is the size of alphabet A. The factor automaton having maximal state and transition complexities is depicted in Fig. 3.23. This complexity is reached for strings ab^(n−2)c. The complexity of the suffix automaton is in many cases the same as that of the factor automaton. But in some cases the factor automaton can have fewer states than the suffix automaton for the same string. The reason for this is the fact that some factor automata may be minimized because they have all states final.

  Type of automaton         No. of states   No. of transitions
  Prefix automaton          n + 1           n
  Suffix automaton          2n − 2          3n − 4
  Factor automaton          2n − 2          3n − 4
  Subsequence automaton     n + 1           |A|·n
  Factor oracle automaton   n + 1           2n − 1

Table 3.8: Maximum state and transition complexities of automata accepting parts of a string
Example 3.33
Let us have text T = abb. We will construct the suffix automaton accepting Suff(abb) and the factor automaton accepting Fact(abb). The nondeterministic and deterministic suffix automata have the transition diagrams depicted in Fig. 3.43. The nondeterministic and deterministic factor automata have the transition diagrams depicted in Fig. 3.44. The deterministic factor automaton can be minimized, as states 2 and 23 are equivalent. This is not true for the deterministic suffix automaton, as 23 is a final state and 2 is not a final state. Therefore the minimal factor automaton accepting Fact(abb) has the transition diagram depicted in Fig. 3.45. □
Figure 3.43: Transition diagrams of the suffix automata for string x = abb from Example 3.33
The reader can verify that for string x = ab^n, n > 1, the factor automaton has fewer states than the suffix automaton.
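This claim can be checked by brute force: for a finite language, the number of states of the minimal (partial) deterministic automaton equals the number of distinct left quotients of its prefixes. The following sketch (our own illustration) counts them for Suff(x) and Fact(x):

```python
def factors(x):
    """Fact(x): all factors of x, including the empty string."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}

def suffixes(x):
    """Suff(x): all suffixes of x, including the empty string."""
    return {x[i:] for i in range(len(x) + 1)}

def min_dfa_states(lang):
    """Number of states of the minimal partial DFA for a finite language,
    counted as the number of distinct left quotients of its prefixes
    (Myhill-Nerode classes, dead state excluded)."""
    prefixes = {w[:i] for w in lang for i in range(len(w) + 1)}
    quotients = {frozenset(w[len(p):] for w in lang if w.startswith(p))
                 for p in prefixes}
    return len(quotients)
```

For x = abb this gives 5 states for the suffix automaton against 4 for the minimal factor automaton, matching Example 3.33, and the gap persists for all strings ab^n with n > 1.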
Figure 3.44: Transition diagrams of the factor automata for string x = abb from Example 3.33
Figure 3.45: Transition diagram of the minimal factor automaton accepting Fact(abb) from Example 3.33
3.8 Automata for parts of more than one string
All finite automata constructed above in this chapter can be used as a base for the construction of the same types of automata for a finite set of strings. The construction of a finite automaton accepting set Pref(S) (see Def. 1.6), where S is a finite set of strings from A+, will be done in a similar way as for one string in Algorithm 3.4.
Algorithm 3.34
Construction of a finite automaton accepting set Pref(S), S ⊆ A+.
Input: A finite set of strings S = {x1, x2, ..., x|S|}.
Output: Prefix automaton M = (Q, A, δ, q0, F) accepting set Pref(S).
Method:
1. Construct finite automata Mi = (Qi, Ai, δi, q0i, Fi) accepting sets Pref(xi) for i = 1, 2, ..., |S| using Algorithm 3.1.
2. Construct deterministic finite automaton M = (Q, A, δ, q0, F) accepting set Pref(S) = Pref(x1) ∪ Pref(x2) ∪ ... ∪ Pref(x|S|). □
Example 3.35
Let us construct the prefix automaton for the set of strings S = {abab, abba}. Finite automata M1 and M2 have the transition diagrams depicted in Fig. 3.46.
Figure 3.46: Transition diagrams of automata M1 and M2 from Example 3.35
Prefix automaton M accepting set Pref({abab, abba}) has the transition diagram depicted in Fig. 3.47. □
Figure 3.47: Transition diagram of the prefix automaton accepting set Pref({abab, abba}) from Example 3.35
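Since Pref(S) is closed under taking prefixes, the deterministic automaton of Algorithm 3.34 is simply a trie of the strings of S with all states final. A small sketch (the encoding is our own):

```python
def prefix_automaton(strings):
    """Trie accepting Pref(S) (sketch of Algorithm 3.34).

    State 0 is initial; every state is final, because every prefix of a
    prefix is again a prefix.  trans[q] maps a symbol to the next state.
    """
    trans = [dict()]
    for x in strings:
        q = 0
        for a in x:
            if a not in trans[q]:
                trans[q][a] = len(trans)
                trans.append(dict())
            q = trans[q][a]
    return trans

def accepts_prefix(trans, w):
    q = 0
    for a in w:
        if a not in trans[q]:
            return False
        q = trans[q][a]
    return True
```

For S = {abab, abba} this yields seven states, one per distinct prefix; note that a subsequent minimisation could still merge the two terminal states.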
The construction of suffix and factor automata for a finite set of strings will be done in a similar way as for one string (see Sections 3.2 and 3.3). One of the principles of their construction is formalized in the following algorithm for suffix and factor automata. An X-automaton, in the next algorithm, means a suffix or factor automaton.
Algorithm 3.36
Construction of an X-automaton for a finite set of strings.
Input: Finite set of strings S = {x1, x2, ..., x|S|}.
Output: Deterministic X-automaton M = (Q, A, δ, q0, F) for set S.
Method:
1. Construct X-automata M1, M2, ..., M|S| with ε-transitions (see Fig. 3.8) for all strings x1, x2, ..., x|S|.
2. Construct automaton M′ accepting language L(M′) = L(M1) ∪ L(M2) ∪ ... ∪ L(M|S|).
3. Construct automaton MN by removing ε-transitions.
4. Construct deterministic finite automaton M equivalent to automaton MN. □
Example 3.37
Let us construct the factor automaton for the set of strings S = {abab, abba}. First, we construct factor automata M1 and M2 for both strings in S. Their transition diagrams are depicted in Figs 3.48 and 3.49, respectively.
Figure 3.48: Transition diagram of factor automaton M1 accepting Fact(abab) from Example 3.37
Figure 3.49: Transition diagram of factor automaton M2 accepting Fact(abba) from Example 3.37
In the second step we construct automaton M′ accepting language L(M′) = Fact(abab) ∪ Fact(abba). Its transition diagram is depicted in Fig. 3.50.
Figure 3.50: Transition diagram of factor automaton M′ accepting set Fact(abab) ∪ Fact(abba) from Example 3.37
In the third step we construct automaton MN by removing ε-transitions. Its transition diagram is depicted in Fig. 3.51.
Figure 3.51: Transition diagram of factor automaton MN accepting set Fact(abab) ∪ Fact(abba) from Example 3.37
The last step is the construction of deterministic factor automaton M. Its transition table is shown in Table 3.9. The states are labelled by d-subsets; subscript 1 refers to states of M1 and subscript 2 to states of M2.

             a           b
  0          1₁1₂3₁4₂    2₁2₂3₂4₁
  1₁1₂3₁4₂               2₁2₂4₁
  2₁2₂3₂4₁   3₁4₂        3₂
  2₁2₂4₁     3₁          3₂
  3₁4₂                   4₁
  3₂         4₂
  3₁                     4₁
  4₁
  4₂

Table 3.9: Transition table of automaton M from Example 3.37
The transition diagram of the resulting deterministic factor automaton M is depicted in Fig. 3.52. □
Figure 3.52: Transition diagram of deterministic factor automaton M accepting set Fact(abab) ∪ Fact(abba) from Example 3.37 (the multiple front end is marked)
3.9 Automata accepting approximate parts of a string
The finite automata constructed above in this chapter accept exact parts of a string (prefixes, suffixes, factors, subsequences). It is possible to use the lessons learned from their constructions for the construction of automata accepting approximate parts of a string. The main principle of algorithms accepting approximate parts of string x is:
1. Construct a finite automaton accepting the set Approx(x) = {y : D(x, y) ≤ k}.
2. Construct a finite automaton accepting approximate parts of the string using principles similar to those for the construction of automata accepting the exact parts of the string.
We show this principle using the Hamming distance in the next examples.
Definition 3.38
The set Hk(P) of all strings similar to string P is:
Hk(P) = {X : X ∈ A*, DH(X, P) ≤ k},
where DH(X, P) is the Hamming distance. □
Example 3.39
Let the string be x = abba. We construct the approximate prefix automaton for Hamming distance k = 1. For this purpose we use a modification of Algorithm 2.5. The modification consists in removing the self-loop in the initial state q0. After that we make all states final states. The transition diagram of the resulting Hamming prefix automaton is depicted in Fig. 3.53.
Figure 3.53: Transition diagram of the Hamming prefix automaton accepting APref(abba) for Hamming distance k = 1 from Example 3.39
Example 3.40
Let the string be x = abba. We construct the approximate factor automaton using Hamming distance k = 1.
1. We use Algorithm 3.10 modified for the factor automaton (see Algorithm 3.14). For the construction of the finite automaton accepting string x and all strings with Hamming distance at most 1, we use a modification of Algorithm 2.5. The modification consists in removing the self-loop in the initial state q0. The resulting finite automaton has the transition diagram depicted in Fig. 3.54.
Figure 3.54: Transition diagram of the Hamming automaton accepting H1(abba) with Hamming distance k = 1 from Example 3.40
2. We use the principle of inserting ε-transitions from state 0 to states 1, 2, 3 and 4. Moreover, all states are fixed as final states. The transition diagram with inserted ε-transitions is depicted in Fig. 3.55.
3. We replace ε-transitions by non-ε-transitions. The resulting automaton has the transition diagram depicted in Fig. 3.56.
4. The final operation is the construction of the equivalent deterministic finite automaton. Its transition table is shown in Table 3.10.
The transition diagram of the resulting deterministic approximate factor automaton is depicted in Fig. 3.57. All its states are final states. □
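The language of the automaton just constructed can be cross-checked by brute force: it consists of all strings whose Hamming distance from some factor of x is at most k. A sketch (our own test harness, not from the book):

```python
from itertools import product

def hamming(u, v):
    """Hamming distance of two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

def approx_factors(x, alphabet, k):
    """All strings within Hamming distance k of some factor of x,
    i.e. the language of the approximate (Hamming) factor automaton,
    enumerated exhaustively for checking purposes."""
    result = set()
    for i in range(len(x) + 1):
        for j in range(i, len(x) + 1):
            f = x[i:j]
            for cand in product(alphabet, repeat=len(f)):
                w = "".join(cand)
                if hamming(w, f) <= k:
                    result.add(w)
    return result
```

For x = abba and k = 1 the set contains, besides every factor, strings such as aaba (distance 1 from abba), but not baab, whose distance from the only length-4 factor is 4.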
Figure 3.55: Transition diagram of the Hamming factor automaton with ε-transitions inserted and final states fixed from Example 3.40
Figure 3.56: Transition diagram of the Hamming factor automaton after removal of ε-transitions from Example 3.40
Example 3.41
Let us construct the backbone of the Hamming factor automaton from Example 3.40. The construction of the backbone consists in the intersection of the Hamming factor automaton having the transition diagram depicted in Fig. 3.57 and the Hamming prefix automaton having the transition diagram depicted in Fig. 3.53. The result of this intersection is shown in Fig. 3.58. In this automaton the pairs of states
((2′4′, 2′P), (32′4′, 2′P)) and ((3′, 3′P), (3′4′, 3′P))
are equivalent.
A backbone equivalent to the one depicted in Fig. 3.58 can be obtained using the following approach. We can recognize that the sets of states of the Hamming factor automaton (see Fig. 3.57)
(2′4′, 32′4′) and (3′4, 3′, 3′4′)
are equivalent. After minimization we obtain the Hamming factor automaton depicted in Fig. 3.59. The backbone of the automaton is obtained by removal of the transition drawn by the dashed line. □

            a         b
  0         142′3′    231′4′
  142′3′    2′4′      23′
  231′4′    3′4       32′4′
  23′       3′4′      3
  2′4′                3′
  3′4       4′
  32′4′     4         3′4′
  3′        4′
  3′4′      4′
  3         4         4′
  4
  4′

Table 3.10: Transition table of the deterministic Hamming factor automaton from Example 3.40
Figure 3.57: Transition diagram of the deterministic Hamming approximate factor automaton for x = abba, Hamming distance k = 1 from Example 3.40

                 a              b
  (0, 0P)        (142′3′, 1P)   (231′4′, 1′P)
  (142′3′, 1P)   (2′4′, 2′P)    (23′, 2P)
  (231′4′, 1′P)                 (32′4′, 2′P)
  (23′, 2P)      (3′4′, 3′P)    (3, 3P)
  (2′4′, 2′P)                   (3′, 3′P)
  (32′4′, 2′P)                  (3′4′, 3′P)
  (3′, 3′P)      (4′, 4′P)
  (3, 3P)        (4, 4P)        (4′, 4′P)
  (3′4′, 3′P)    (4′, 4′P)

Table 3.11: Transition table of the backbone of the Hamming factor automaton for string x = abba
Figure 3.58: Transition diagram of the backbone of the Hamming factor automaton for string x = abba from Example 3.41
Figure 3.59: Minimized Hamming factor automaton for string x = abba from Example 3.41
4 Borders, repetitions and periods
4.1 Basic notions
Definition 4.1 (Proper prefix)
A proper prefix of x is any element of Pref(x) not equal to x. □
Definition 4.2 (Border)
A border of string x ∈ A+ is any proper prefix of x which is simultaneously its suffix. The set of all borders of string x is bord(x) = (Pref(x) \ {x}) ∩ Suff(x). The longest border of x is Border(x). □
Definition 4.3 (Border of a finite set of strings)
A border of the set of strings S = {x1, x2, ..., x|S|} is any proper prefix of some xi ∈ S which is a suffix of some xj ∈ S, i, j ∈ <1, |S|>. The set of all borders of the set S is
mbord(S) = ∪ (i = 1..|S|) ∪ (j = 1..|S|) (Pref(xi) \ {xi}) ∩ Suff(xj).
The longest border which is a suffix of xi, i ∈ <1, |S|>, belongs to the set mBorder(S):
mBorder(S) = {ui : ui ∈ mbord(S), ui is the longest such suffix of xi, i ∈ <1, |S|>}. □
Definition 4.4 (Period)
Every string x ∈ A+ can be written in the form x = u^r v, where u ∈ A+ and v ∈ Pref(u). The length of string u, p = |u|, is a period of string x, r is an exponent of string x, and u is a generator of x. The shortest period of string x is Per(x). The set of all periods of x is periods(x). String x is pure periodic if v = ε. □
Definition 4.5 (Normal form)
Every string x ∈ A+ can be written in the normal form x = u^r v, where p = |u| is the shortest period Per(x), therefore r is the highest exponent, and v ∈ Pref(u). □
Definition 4.6 (Primitive string)
If string x ∈ A+ has the shortest period equal to its length, then we call it a primitive string. □
Let us mention that for a primitive string x it holds that Border(x) = ε.
Definition 4.7 (Border array)
The border array β[1..n] of string x ∈ A+ is the vector of the lengths of the longest borders of all prefixes of x:
β[i] = |Border(x[1..i])| for i = 1, 2, ..., n. □
Definition 4.8 (Border array of a finite set of strings)
The mborder array mβ[1..n] of a set of strings S = {x1, x2, ..., x|S|} is the vector of the lengths of the longest borders of all prefixes of strings from S:
mβ[h] = |mBorder({x1, x2, ..., x(i−1), xi[1..j], x(i+1), ..., x|S|})| for i ∈ <1, |S|>, j ∈ <1, |xi|>.
The values of variable h are used for the labelling of states of the finite automaton accepting set S:
h ≤ Σ (l = 1..|S|) |xl|. □
Definition 4.9 (Exact repetition in one string)
Let T be a string, T = a_1 a_2 ... a_n, and let a_i = a_j, a_{i+1} = a_{j+1}, ..., a_{i+m} = a_{j+m}, i < j, m ≥ 0. String x2 = a_j a_{j+1} ... a_{j+m} is an exact repetition of string x1 = a_i a_{i+1} ... a_{i+m}. x1 and x2 are called repeating factors in text T. □
Definition 4.10 (Exact repetition in a set of strings)
Let S be a set of strings, S = {x1, x2, ..., x|S|}, and let x_{p,i} = x_{q,j}, x_{p,i+1} = x_{q,j+1}, ..., x_{p,i+m} = x_{q,j+m}, where p ≠ q, or p = q and i < j, m ≥ 0. String x_{q,j} x_{q,j+1} ... x_{q,j+m} is an exact repetition of string x_{p,i} x_{p,i+1} ... x_{p,i+m}. □
Definition 4.11 (Approximate repetition in one string)
Let T be a string, T = a_1 a_2 ... a_n, and let D(a_i a_{i+1} ... a_{i+m}, a_j a_{j+1} ... a_{j+m′}) ≤ k, where m, m′ ≥ 0, D is a distance, 0 < k < n. String a_j a_{j+1} ... a_{j+m′} is an approximate repetition of string a_i a_{i+1} ... a_{i+m}. □
The approximate repetition in a set of strings can be defined in a similar way.
Definition 4.12 (Type of repetition)
Let x2 = a_j a_{j+1} ... a_{j+m} be an exact or approximate repetition of x1 = a_i a_{i+1} ... a_{i+m}, i < j, in one string. Then:
if j − i < m, the repetition is with an overlapping (O),
if j − i = m, the repetition is a square (S),
if j − i > m, the repetition is with a gap (G). □
4.2 Borders and periods
The Algorithms in this Section show, how to nd borders of a string and its
periods.
121
The main topic of our interest is pattern matching. The main goal is to
find all occurrences of a given pattern in a text. This is very simple when
the pattern is a primitive string: no two occurrences of a primitive
pattern can overlap. This is not true for patterns having a nonempty
Border. In this situation two or more occurrences of the pattern can
overlap. Such a situation for pattern p is visualised in Fig. 4.1.

Figure 4.1: Visualisation of two possible occurrences of pattern p with an
overlapping

Fig. 4.2 shows a cluster of occurrences of the highly periodic pattern
p = ababa. For pattern p the following holds:
bord(p) = {ε, a, aba},
Border(p) = aba,
periods(p) = {5, 4, 2},
Per(p) = 2,
p = (ab)^2 a is the normal form of p, r = 2.

Figure 4.2: Visualisation of a cluster of occurrences of the highly periodic
pattern p = ababa

The distance of two consecutive occurrences in the depicted cluster is given
by Per(p) = 2. The maximum number of occurrences with overlapping in such a
cluster is given by the exponent and is equal to 3 in our case. It is equal
to the exponent for pure periodic patterns.
4.2.1 Computation of borders
We use a deterministic suffix automaton and an analysis of its backbone for
the computation of the borders of a given string.
Algorithm 4.13
Computation of borders.
Input: String x ∈ A+.
Output: bord(x), Border(x).
Method:
1. Construct the deterministic suffix automaton M for the string x.
2. Analyse the automaton M:
   (a) set bord(x) := {ε},
   (b) find all sequences of transitions on the backbone of the suffix
       automaton M leading from the initial state to some final state
       other than the terminal one,
   (c) if the labelling of such a sequence of transitions is x_i, then set
       bord(x) := bord(x) ∪ {x_i}, for all i = 1, 2, ..., h, where h is the
       number of sequences of transitions found in step (b).
3. Select the longest element y of the set bord(x) and set Border(x) := y. □
Note: Dashed lines will be used for the parts of suffix and factor automata
which are outside the backbone.
Example 4.14
Let us compute bord(x) and Border(x) for string x = ababab using
Algorithm 4.13. Suffix automaton M has the transition diagram depicted in
Fig. 4.3.

Figure 4.3: Transition diagram of the suffix automaton for string x = ababab
from Example 4.14

There are three sequences of transitions in M in question:
0 (the empty sequence with labelling ε),
0 -a-> 135 -b-> 246,
0 -a-> 135 -b-> 246 -a-> 35 -b-> 46.
Then bord(x) = {ε, ab, abab} and Border(x) = abab. □
Example 4.15
Let us compute bord(x) and Border(x) for string x = abaaba using
Algorithm 4.13. Suffix automaton M has the transition diagram depicted in
Fig. 4.4.

Figure 4.4: Transition diagram of the suffix automaton for string x = abaaba
from Example 4.15

There are three sequences of transitions in M in question:
0 (the empty sequence with labelling ε),
0 -a-> 1346,
0 -a-> 1346 -b-> 25 -a-> 36.
Then bord(x) = {ε, a, aba} and Border(x) = aba. □
4.2.2 Computation of periods
The computation of periods for a given string can be based on the relation
between borders and periods. This relation is expressed in the following
lemma.
Lemma 4.16
Let x ∈ A+ be a string having border v. Then p = |x| - |v| is a period of
string x. □
Proof:
Since v is both a prefix and a suffix of x, we can display string x in the
form x = uv, where v ∈ Pref(u), and thus p = |u| = |x| - |v| is a period of
x. □
Algorithm 4.17
Computation of periods.
Input: String x ∈ A+.
Output: Set periods(x) = {p_1, p_2, ..., p_h}, h ≥ 0, and the shortest
period Per(x).
Method:
1. Compute bord(x) = {x_1, x_2, ..., x_h} and Border(x) using
   Algorithm 4.13.
2. Compute periods(x) = {p_1, p_2, ..., p_h}, where p_i = |x| - |x_i|,
   1 ≤ i ≤ h, x_i ∈ bord(x).
3. Compute Per(x) = |x| - |Border(x)|. □
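Algorithm 4.17 reduces periods to borders. The text computes bord(x) with a suffix automaton (Algorithm 4.13); the sketch below instead uses the defining property directly — a border is a proper prefix of x that is also a suffix — which yields the same sets. The function names are ours:

```python
def borders(x: str) -> list[str]:
    """bord(x): all proper prefixes of x that are also suffixes of x,
    including the empty string."""
    return [x[:k] for k in range(len(x)) if x.endswith(x[:k])]

def periods(x: str) -> list[int]:
    """periods(x) = { |x| - |b| : b in bord(x) } (step 2 of Algorithm 4.17)."""
    return [len(x) - len(b) for b in borders(x)]

def Per(x: str) -> int:
    """The shortest period, |x| - |Border(x)| (step 3 of Algorithm 4.17)."""
    return min(periods(x))
```

For x = ababab this gives bord(x) = {ε, ab, abab}, periods(x) = {6, 4, 2} and Per(x) = 2, matching Example 4.18 below.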
Example 4.18
Let us compute the set of periods and the shortest period Per(x) for string
x = ababab. The set of borders bord(x) and Border(x) computed in
Example 4.14 are:
bord(ababab) = {ε, ab, abab},
Border(ababab) = abab.
The resulting set is periods(x) = {6, 4, 2}.
The shortest period is Per(x) = 2.
The normal form of string x is (ab)^3. □
4.3 Border arrays
Let us recall the definition of the border array (see Def. 4.7). Borders and
periods are related to situations when a pattern is found. Border arrays are
related to a situation when some prefix of the pattern is found. Figure 4.5
shows a situation when prefix u of pattern p is found and then a mismatch
occurs. The next occurrence of prefix u of pattern p may start inside the
first one, and this depends on the length of Border(u).

Figure 4.5: Visualisation of two possible occurrences of prefix u of
pattern p

The next algorithm is devoted to the computation of the border array of one
string.
Algorithm 4.19
Computation of a border array.
Input: String x ∈ A+.
Output: Border array β[1..n], where n = |x|.
Method:
1. Construct the nondeterministic factor automaton M_1 for x.
2. Construct the equivalent deterministic factor automaton M_2 and preserve
   the d-subsets.
3. Initialize all elements of border array β[1..n] to the value zero.
4. Analyse the multiple d-subsets of the deterministic factor automaton for
   states on the backbone from left to right:
   if the d-subset has the form {i_1, i_2, ..., i_h} (this sequence is
   ordered), then set β[j] := i_1 for j = i_2, i_3, ..., i_h. □
Example 4.20
Let us construct the border array for string x = abaababa (Fibonacci string
f_5) using Algorithm 4.19. Nondeterministic factor automaton M_1 has the
transition diagram depicted in Fig. 4.6.

Figure 4.6: Transition diagram of nondeterministic factor automaton M_1
for string x = abaababa from Example 4.20

The equivalent deterministic factor automaton M_2 with preserved d-subsets
has the transition diagram depicted in Fig. 4.7.

Figure 4.7: Transition diagram of deterministic factor automaton M_2 for
string x = abaababa from Example 4.20

Now we analyse the d-subsets, starting with the d-subset 13468 and
continuing to the right. The result of this analysis is shown in the next
table:

Analyzed state   Values of border array elements
13468            β[3] = 1, β[4] = 1, β[6] = 1, β[8] = 1
257              β[5] = 2, β[7] = 2
368              β[6] = 3, β[8] = 3

The resulting border array β(abaababa) is summed up in the next table:

i        1  2  3  4  5  6  7  8
symbol   a  b  a  a  b  a  b  a
β[i]     0  0  1  1  2  3  2  3
□
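The d-subset analysis of Algorithm 4.19 can be simulated without building the automaton explicitly: the d-subset of the backbone state reached by the prefix x[:d] is exactly the set of end positions of the occurrences of that prefix in x. A direct, quadratic sketch for illustration only (border_array is our name, not the text's):

```python
def border_array(x: str) -> list[int]:
    """Border array beta[1..n] of x, computed by simulating the d-subset
    analysis of Algorithm 4.19 (prefixes processed from left to right)."""
    n = len(x)
    beta = [0] * (n + 1)                  # beta[0] is unused padding
    for d in range(1, n + 1):             # prefix x[:d] = backbone state of depth d
        # its d-subset: end positions of all occurrences of x[:d] in x
        occ = [e for e in range(d, n + 1) if x[e - d:e] == x[:d]]
        for j in occ[1:]:                 # multiple d-subset {i_1, i_2, ...}:
            beta[j] = occ[0]              # beta[j] := i_1 (which equals d)
    return beta[1:]
```

For x = abaababa this returns [0, 0, 1, 1, 2, 3, 2, 3], the table of Example 4.20; note how β[6] and β[8] are first set to 1 (d-subset 13468) and later overwritten with 3 (d-subset 368), just as in the example.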
Definition 4.21
The backbone of factor automaton M for a set of strings S = {x_1, x_2, ...,
x_{|S|}} is the part of factor automaton M enabling the sequences of
transitions for all strings from the set S starting in the initial state,
and nothing else. □
Definition 4.22
The depth of a state q of the factor automaton on its backbone is the
number of backbone transitions which are necessary to reach state q from
the initial state. □
Algorithm 4.23
Computation of an mborder array.
Input: Set of strings S = {x_1, x_2, ..., x_{|S|}}, x_i ∈ A+, i ∈ <1, |S|>.
Output: mborder array mβ[1..n], where n = Σ_{i=1}^{|S|} |x_i|.
Method:
1. Construct the nondeterministic factor automaton M_1 for S.
2. Construct the equivalent deterministic factor automaton M_2 and preserve
   the d-subsets.
3. Extract the backbone of M_2, creating automaton M_3 = (Q, A, δ, q_0, F).
4. Set n := |Q| - 1.
5. Initialize all elements of mborder array mβ[1..n] to the value zero.
6. Analyse the multiple d-subsets of automaton M_3 from left to right
   (starting with states having minimal depth):
   if the d-subset has the form {i_1, i_2, ..., i_h} (this sequence is
   ordered according to the depth of each state), then set mβ[j] := i_1
   for j = i_2, i_3, ..., i_h. □
State      a         b
0          13_1 4_2  23_2 4_1
13_1 4_2             24_1
23_2 4_1   3_1 4_2   3_2
24_1       3_1       3_2
3_1 4_2              4_1
3_1                  4_1
3_2        4_2
4_1
4_2

Table 4.1: Transition table of deterministic factor automaton M_2 from
Example 4.24
Note: We will use a labelling of states reflecting their depth instead of
the running numbering as in Definition 4.8.
Example 4.24
Let us construct the mborder array for the set of strings S = {abab, abba}.
In the first step, we construct nondeterministic factor automaton M_1 for
set S. Its transition diagram is depicted in Fig. 4.8.

Figure 4.8: Transition diagram of nondeterministic factor automaton M_1
for the set S = {abab, abba} from Example 4.24

Table 4.1 is the transition table of the deterministic factor automaton
M_2. Its transition diagram is depicted in Fig. 4.9. The dashed lines and
circles show the part of the automaton outside the backbone; the backbone
of M_3 is therefore drawn by solid lines.

Figure 4.9: Transition diagram of deterministic factor automaton M_2 for
the set S = {abab, abba} from Example 4.24

Now we analyse the d-subsets. The result is in the next table:

Analyzed state   Values of mborder array elements
13_1 4_2         mβ[4_2] = mβ[3_1] = 1
24_1             mβ[4_1] = 2

The resulting mborder array mβ(S) is shown in the next table:

state       1  2  3_1  3_2  4_1  4_2
symbol      a  b  a    b    b    a
mβ[state]   0  0  1    0    2    1
□
4.4 Repetitions
4.4.1 Classification of repetitions
Problems of repetitions of factors in a string over a finite alphabet can
be classified according to various criteria. We will use five criteria for
the classification of repetition problems, leading to a five-dimensional
space in which each point corresponds to a particular problem of repetition
of a factor in a string. Let us make a list of all dimensions, including
the possible values in each dimension:
1. Number of strings:
- one,
- finite number greater than one,
- infinite number.
2. Repetition of factors (see Definition 4.12):
- with overlapping,
- square,
- with gap.
3. Specification of the factor:
- repeated factor is given,
- repeated factor is not given,
- length l of the repeated factor is given exactly,
- length of the repeated factor is less than a given l,
- length of the repeated factor is greater than a given l,
- finding the longest repeated factor.
4. The way of finding repetitions:
- exact repetition,
- approximate repetition with Hamming distance (R-repetition),
- approximate repetition with Levenshtein distance (DIR-repetition),
- approximate repetition with generalized Levenshtein distance
  (DIRT-repetition),
- Δ-approximate repetition,
- Γ-approximate repetition,
- (Δ, Γ)-approximate repetition.
5. Importance of symbols in the factor:
- take care of all symbols,
- don't care about some symbols.
The above classification is visualised in Figure 4.10. If we count the
number of possible problems of finding repetitions in a string, we obtain
N = 3 · 3 · 2 · 7 · 2 = 252.
Figure 4.10: Classification of repetition problems
In order to facilitate references to a particular problem of repetition in
a string, we will use abbreviations for all problems. These abbreviations
are summarized in Table 4.2.

Dimension   1   2   3   4        5
            O   O   F   E        C
            F   S   N   R        D
            I   G       D
                        T
                        Δ
                        Γ
                        (Δ, Γ)

Table 4.2: Abbreviations of repetition problems

Using this method we can, for example, refer to the overlapping exact
repetition in one string of a given factor, where all symbols are taken
care of, as the OOFEC problem.
Instead of a single repetition problem we will also use the notion of a
family of repetition problems. In this case we will use the symbol ?
instead of a particular symbol. For example, ?S??? is the family of all
problems concerning square repetitions.
Each repetition problem can have several instances:
1. verify whether some factor is repeated in the text or not,
2. find the first repetition of some factor,
3. find the number of all repetitions of some factor,
4. find all repetitions of some factor and where they are.
If we take into account all possible instances, the number of
repetition-in-string problems grows further.
4.4.2 Exact repetitions in one string
In this section we show how to use a factor automaton for finding exact
repetitions in one string (the O?NEC problem). The main idea is based on
the construction of the deterministic factor automaton. First, we construct
a nondeterministic factor automaton for a given string. The next step is to
construct the equivalent deterministic factor automaton. During this
construction, we memorize the d-subsets. The repetitions that we are
looking for are obtained by analyzing these d-subsets. The next algorithm
describes the computation of the d-subsets of a deterministic factor
automaton.
Algorithm 4.25
Computation of repetitions in one string.
Input: String T = a_1 a_2 ... a_n.
Output: Deterministic factor automaton M_D accepting Fact(T) and d-subsets
for all states of M_D.
Method:
1. Construct the nondeterministic factor automaton M_N accepting Fact(T):
   (a) Construct the finite automaton M accepting string T = a_1 a_2 ... a_n
       and all its prefixes:
       M = ({q_0, q_1, q_2, ..., q_n}, A, δ, q_0, {q_0, q_1, ..., q_n}),
       where δ(q_i, a_{i+1}) = q_{i+1} for all i ∈ <0, n - 1>.
   (b) Construct the finite automaton M' from the automaton M by inserting
       ε-transitions:
       δ(q_0, ε) = {q_1, q_2, ..., q_{n-1}, q_n}.
   (c) Replace all ε-transitions by non-ε-transitions. The resulting
       automaton is M_N.
2. Construct the deterministic factor automaton M_D equivalent to automaton
   M_N and memorize the d-subsets during this construction.
3. Analyse the d-subsets to compute repetitions. □
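A compact way to carry out steps 1 and 2 is to represent the NFA states by the positions 1..n of T: after the ε-transitions are removed, the initial state has a transition on a symbol to every position where that symbol occurs, and position i moves to i + 1 on the next symbol of T. A sketch of the subset construction under this encoding (the function name factor_dsubsets is ours):

```python
from collections import deque

def factor_dsubsets(text: str) -> list[tuple[int, ...]]:
    """d-subsets of the deterministic factor automaton of `text`
    (Algorithm 4.25), as tuples of positions 1..n, in BFS order."""
    n = len(text)
    # transitions of the initial state: symbol -> all positions of that symbol
    init = {}
    for i, a in enumerate(text, start=1):
        init.setdefault(a, []).append(i)
    queue = deque(tuple(p) for p in init.values())
    seen = set(queue)
    dsubsets = []
    while queue:
        d = queue.popleft()
        dsubsets.append(d)
        succ = {}                       # one subset-construction step
        for i in d:
            if i < n:                   # position i moves to i + 1 on text[i]
                succ.setdefault(text[i], []).append(i + 1)
        for nxt in map(tuple, succ.values()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return dsubsets
```

For text = "ababa" the reachable d-subsets are (1, 3, 5), (2, 4), (3, 5), (4,) and (5,) — exactly the rows of the transition table in Example 4.26; the multiple d-subsets among them are the ones carrying repetition information.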
Factor automaton M' constructed by Algorithm 4.25 has, after step 1(b), the
transition diagram depicted in Fig. 4.11.

Figure 4.11: Transition diagram of factor automaton M' with ε-transitions
constructed in step 1(b) of Algorithm 4.25

Factor automaton M_N has, after step 1(c) of Algorithm 4.25, the transition
diagram depicted in Fig. 4.12.

Figure 4.12: Transition diagram of factor automaton M_N after the removal
of ε-transitions in step 1(c) of Algorithm 4.25

The next example shows the construction of the deterministic factor
automaton and the analysis of the d-subsets.
Let us make a note concerning labelling: labels used as the names of states
are selected in order to indicate positions in the string. This labelling
will be useful later.
Example 4.26
Let us use text T = ababa. At first, we construct the nondeterministic
factor automaton M'(ababa) = (Q', A, δ', 0, Q') with ε-transitions. Its
transition diagram is depicted in Figure 4.13.

Figure 4.13: Transition diagram of factor automaton M'(ababa) from
Example 4.26

Then we remove the ε-transitions; the resulting nondeterministic factor
automaton M_N(ababa) = (Q_N, A, δ_N, 0, Q_N) is depicted in Figure 4.14 and
its transition table is Table 4.3.

Figure 4.14: Transition diagram of nondeterministic factor automaton
M_N(ababa) from Example 4.26

State   a        b
0       1, 3, 5  2, 4
1                2
2       3
3                4
4       5
5

Table 4.3: Transition table of nondeterministic factor automaton M_N(ababa)
from Example 4.26
As the next step, we construct the equivalent deterministic factor
automaton M_D(ababa) = (Q_D, A, δ_D, 0, Q_D). During this operation we
memorize the created d-subsets. We suppose, taking into account the
labelling of the states of the nondeterministic factor automaton, that the
d-subsets are ordered in the natural way. The extended transition table
(with ordered d-subsets) of deterministic factor automaton M_D(ababa) is
shown in Table 4.4. The transition diagram of M_D is depicted in
Figure 4.15.

State   d-subset   a        b
D_0     0          1, 3, 5  2, 4
D_1     1, 3, 5             2, 4
D_2     2, 4       3, 5
D_3     3, 5                4
D_4     4          5
D_5     5

Table 4.4: Transition table of automaton M_D(ababa) from Example 4.26

Figure 4.15: Transition diagram of deterministic factor automaton
M_D(ababa) from Example 4.26
Now we start the analysis of the resulting d-subsets. The d-subset
d(D_1) = {1, 3, 5} shows that factor a repeats at positions 1, 3 and 5 of
the given string, and its length is one. The d-subset d(D_2) = {2, 4} shows
that factor ab repeats, its occurrences in the string end at positions 2
and 4, and its length is two. Moreover, suffix b of this factor also
repeats at the same positions as factor ab. The d-subset d(D_3) = {3, 5}
shows that factor aba repeats, its occurrences in the string end at
positions 3 and 5, and its length is three. Moreover, its suffix ba also
repeats at the same positions. Suffix a of factor aba also repeats at
positions 3 and 5, but we have already obtained this information during the
analysis of the d-subset d(D_1) = {1, 3, 5}. Analysis of the d-subsets
having only single states brings no further information on repeating
factors.
A summary of these observations is collected in a repetition table. The
repetition table contains one row for each d-subset. Each row contains the
d-subset, the repeating factor, and a list of repetitions. The list of
repetitions indicates the positions of the repeating factor and the types
of repetition. □
Definition 4.27
Let T be a string. The repetition table for T contains the following items:
1. d-subset,
2. corresponding factor,
3. list of repetitions of the factor, containing elements of the form
   (i, X_i), where i is the position of the factor in string T and X_i is
   the type of repetition:
   F - the first occurrence of the factor,
   O - repetition with overlapping,
   S - repetition as a square,
   G - repetition with a gap. □
The repetition table for string T = ababa from Example 4.26 is shown in
Table 4.5.

d-subset   Factor   List of repetitions
1, 3, 5    a        (1, F), (3, G), (5, G)
2, 4       ab       (2, F), (4, S)
2, 4       b        (2, F), (4, G)
3, 5       aba      (3, F), (5, O)
3, 5       ba       (3, F), (5, S)

Table 4.5: Repetition table of ababa
The construction of the repetition table is based on the following
observations, illustrated in Figure 4.16; Lemmata 4.28 and 4.29 show its
correctness.
Lemma 4.28
Let T be a string and M_D(T) be the deterministic factor automaton for T
with states labelled by the corresponding d-subsets. If a factor
u = a_1 a_2 ... a_m, m ≥ 1, repeats in string T and its occurrences start
at positions x + 1 and y + 1, x ≠ y, then there exists a d-subset in
M_D(T) containing the pair x + m, y + m.
Proof
Let M_N(T) = (Q_N, A, δ_N, q_0, Q_N) be the nondeterministic factor
automaton for T and let u = a_1 a_2 ... a_m be the factor starting at
positions x + 1 and y + 1 in T, x ≠ y. Then there are transitions in
M_N(T) from state 0 to states x + 1 and y + 1 for symbol a_1
(δ_N(0, a_1) contains x + 1 and y + 1).
It follows from the construction of M_N(T) that:
Figure 4.16: Repeated factor u = a_1 a_2 ... a_m in M_N(T)

δ_N(x + 1, a_2) = x + 2,              δ_N(y + 1, a_2) = y + 2,
δ_N(x + 2, a_3) = x + 3,              δ_N(y + 2, a_3) = y + 3,
...
δ_N(x + m - 1, a_m) = x + m,          δ_N(y + m - 1, a_m) = y + m.
The deterministic factor automaton M_D(T) = (Q_D, A, δ_D, D_0, Q_D) then
contains states D_0, D_1, D_2, ..., D_m having this property:
δ_D(D_0, a_1) = D_1,   x + 1, y + 1 ∈ D_1,
δ_D(D_1, a_2) = D_2,   x + 2, y + 2 ∈ D_2,
...
δ_D(D_{m-1}, a_m) = D_m,   x + m, y + m ∈ D_m.
We can conclude that the d-subset D_m contains the pair x + m, y + m. □
Lemma 4.29
Let T be a string and let M_D(T) be the deterministic factor automaton for
T with states labelled by the corresponding d-subsets. If a d-subset D_m
contains two elements x + m and y + m, then there exists a factor
u = a_1 a_2 ... a_m, m ≥ 1, starting at both positions x + 1 and y + 1 in
string T.
Proof
Let M_N(T) be the nondeterministic factor automaton for T. If a d-subset
D_m contains the elements x + m, y + m, then it holds for δ_N of M_N(T)
that x + m, y + m ∈ δ_N(0, a_m), and
δ_N(x + m - 1, a_m) = x + m,
δ_N(y + m - 1, a_m) = y + m for some a_m ∈ A.
Then the d-subset D_{m-1} such that δ_D(D_{m-1}, a_m) = D_m must contain
x + m - 1, y + m - 1 such that x + m - 1, y + m - 1 ∈ δ_N(0, a_{m-1}),
δ_N(x + m - 2, a_{m-1}) = x + m - 1,
δ_N(y + m - 2, a_{m-1}) = y + m - 1,
and for the same reason the d-subset D_1 must contain x + 1, y + 1 such
that x + 1, y + 1 ∈ δ_N(0, a_1) and δ_N(x, a_1) = x + 1,
δ_N(y, a_1) = y + 1.
Then there exists the sequence of transitions in M_D(T):
Figure 4.17: Repeated factor u = a_1 a_2 ... a_m in M_D(T)
(D_0, a_1 a_2 ... a_m) ⊢ (D_1, a_2 ... a_m)
                       ⊢ (D_2, a_3 ... a_m)
                       ...
                       ⊢ (D_{m-1}, a_m)
                       ⊢ (D_m, ε),
where
x + 1, y + 1 ∈ D_1,
...
x + m, y + m ∈ D_m.
This sequence of transitions corresponds to two different sequences of
transitions in M_N(T) going through state x + 1:
(0, a_1 a_2 ... a_m) ⊢ (x + 1, a_2 ... a_m)
                     ⊢ (x + 2, a_3 ... a_m)
                     ...
                     ⊢ (x + m - 1, a_m)
                     ⊢ (x + m, ε),
(x, a_1 a_2 ... a_m) ⊢ (x + 1, a_2 ... a_m)
                     ⊢ (x + 2, a_3 ... a_m)
                     ...
                     ⊢ (x + m - 1, a_m)
                     ⊢ (x + m, ε).
Similarly, two sequences of transitions go through state y + 1:
(0, a_1 a_2 ... a_m) ⊢ (y + 1, a_2 ... a_m)
                     ⊢ (y + 2, a_3 ... a_m)
                     ...
                     ⊢ (y + m - 1, a_m)
                     ⊢ (y + m, ε),
(y, a_1 a_2 ... a_m) ⊢ (y + 1, a_2 ... a_m)
                     ⊢ (y + 2, a_3 ... a_m)
                     ...
                     ⊢ (y + m - 1, a_m)
                     ⊢ (y + m, ε).
It follows from this that the factor u = a_1 a_2 ... a_m is present twice
in string T, at the different positions x + 1 and y + 1. □
The following lemma is a simple consequence of Lemma 4.29.
Lemma 4.30
Let u be a repeating factor in string T. Then all factors of u are also
repeating factors in T. □
Definition 4.31
If u is a repeating factor in text T and there is no longer factor of the
form vuw, with v ≠ ε or w ≠ ε (not both empty), which is also a repeating
factor, then we call u a maximal repeating factor. □
Definition 4.32
Let M_D(T) be a deterministic factor automaton. The depth of each state D
of M_D is the length of the longest sequence of transitions leading from
the initial state to state D. □
If there exists a sequence of transitions from the initial state to state D
which is shorter than the depth of D, it corresponds to a suffix of the
maximal repeating factor.
Lemma 4.33
Let u be a maximal repeating factor in string T. The length of this factor
is equal to the depth of the state in M_D(T) indicating the repetition
of u. □
Proof
The path for a maximal repeating factor u = a_1 a_2 ... a_m starts in the
initial state, because states x + 1 and y + 1 of the nondeterministic
factor automaton M_N(T) are direct successors of its initial state, and
therefore δ_D(D_0, a_1) = D_1 and x + 1, y + 1 ∈ D_1. Therefore there
exists a sequence of transitions in the deterministic factor automaton
M_D(T):
(D_0, a_1 a_2 ... a_m) ⊢ (D_1, a_2 ... a_m)
                       ⊢ (D_2, a_3 ... a_m)
                       ...
                       ⊢ (D_{m-1}, a_m)
                       ⊢ (D_m, ε). □
One more observation follows from Example 4.26.
Lemma 4.34
If some state in M_D(T) has a corresponding d-subset containing one element
only, then its successor also has a corresponding d-subset containing one
element only.
Proof
This follows from the construction of the deterministic factor automaton.
The transition table of the nondeterministic factor automaton M_N(T) has
more than one state in a row only for the initial state. All other states
have at most one successor for a particular input symbol. Therefore, in the
equivalent deterministic factor automaton M_D(T), a state corresponding to
a d-subset having one element may have only one successor for one symbol,
and this successor has a corresponding d-subset containing just one
element. □
We can use this observation during the construction of the deterministic
factor automaton M_D(T) in order to find repetitions. It is enough to
construct only the part of M_D(T) containing the d-subsets with at least
two elements. The rest of M_D(T) gives no information on repetitions.
Algorithm 4.35
Construction of a repetition table containing exact repetitions in a given
string.
Input: String T = a_1 a_2 ... a_n.
Output: Repetition table R for string T.
Method:
1. Construct the deterministic factor automaton
   M_D(T) = (Q_D, A, δ_D, 0, Q_D) for the given string T.
   Memorize for each state q ∈ Q_D:
   (a) the d-subset D(q) = {r_1, r_2, ..., r_p},
   (b) d = depth(q),
   (c) the maximal repeating factor for state q: maxfactor(q) = x, |x| = d.
2. Create rows in repetition table R for each state q having D(q) with
   more than one element:
   (a) the row for the maximal repeating factor x of state q has the form:
       (r_1, r_2, ..., r_p, x, (r_1, F), (r_2, X_2), (r_3, X_3), ...,
       (r_p, X_p)),
       where X_i, 2 ≤ i ≤ p, is equal to
       i. O, if r_i - r_{i-1} < d,
       ii. S, if r_i - r_{i-1} = d,
       iii. G, if r_i - r_{i-1} > d,
   (b) for each suffix y of x (such that the row for y was not created
       before) create a row of the form:
       (r_1, r_2, ..., r_p, y, (r_1, F), (r_2, X_2), (r_3, X_3), ...,
       (r_p, X_p)),
       where X_i, 2 ≤ i ≤ p, is deduced in the same manner, with d = |y|. □
An example of the repetition table is shown in Example 4.26 for string
T = ababa.
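A direct way to reproduce the output of Algorithm 4.35 is to enumerate the repeating factors together with the end positions of their occurrences (their d-subsets) and classify each consecutive pair of occurrences by comparing their distance with the factor length. This is an illustrative quadratic sketch, not the automaton-based construction itself (repetition_table is our name):

```python
def repetition_table(text: str):
    """Rows (d-subset, factor, [(end position, type), ...]) for every
    repeating factor of `text`, with types F/O/S/G as in Algorithm 4.35."""
    n = len(text)
    rows, done = [], set()
    for d in range(n, 0, -1):                 # longer factors first
        for s in range(n - d + 1):
            u = text[s:s + d]
            if u in done:
                continue
            done.add(u)
            # the d-subset of u: end positions of all its occurrences
            ends = [e for e in range(d, n + 1) if text[e - d:e] == u]
            if len(ends) < 2:
                continue                      # u is not a repeating factor
            reps = [(ends[0], "F")]
            for prev, e in zip(ends, ends[1:]):
                gap = e - prev
                reps.append((e, "O" if gap < d else "S" if gap == d else "G"))
            rows.append((tuple(ends), u, reps))
    return rows
```

For text = "ababa" this yields exactly the five rows of Table 4.5; for example the row ((3, 5), "aba", [(3, "F"), (5, "O")]).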
4.4.3 Complexity of computation of exact repetitions
The time and space complexity of the computation of exact repetitions in a
string is treated in this section.
The time complexity is composed of two parts:
1. The complexity of the construction of the deterministic factor
   automaton. Measured by the number of states and transitions of the
   resulting factor automaton, this complexity is linear. More exactly,
   the number of its states is
   NS ≤ 2n - 2,
   and the number of its transitions is
   NT ≤ 3n - 4.
2. The second part of the overall complexity is the construction of the
   repetition table. The number of rows of this table is the number of
   different multiple d-subsets. The highest number of multiple d-subsets
   is reached by the factor automaton for the text T = a^n; the repeating
   factors of this text are a, a^2, ..., a^{n-1}.
For the computation of repetitions using the factor automata approach it is
necessary to construct only the part of the deterministic factor automaton
containing all the multiple states. As a matter of fact, a simple state has
at most one next state, and it is a simple state, too. Therefore, during
the construction of the deterministic factor automaton, we can stop the
construction of the next part of this automaton as soon as we reach a
simple state.
Example 4.36
Let us have the text T = a^n, n > 0, and construct the deterministic factor
automaton M_D(a^n) for it. The transition diagram of this automaton is
depicted in Fig. 4.18. Automaton M_D(a^n) has n + 1 states and n
transitions.

Figure 4.18: Transition diagram of deterministic factor automaton M_D(a^n)
for text T = a^n from Example 4.36

The number of multiple states is n - 1.
To construct this automaton in order to find all repetitions, we must
construct the whole automaton, including the initial state and the state n
(the terminal state). Repetition table R has the form shown in Table 4.6. □
The opposite case to the previous one is a text composed of symbols which
are all different. The length of such a text is limited by the size of the
alphabet.

d-subset       Factor    List of repetitions
1, 2, ..., n   a         (1, F), (2, S), (3, S), ..., (n, S)
2, ..., n      aa        (2, F), (3, O), (4, O), ..., (n, O)
...
n - 1, n       a^{n-1}   (n - 1, F), (n, O)

Table 4.6: Repetition table R for text T = a^n from Example 4.36
Example 4.37
Let the alphabet be A = {a, b, c, d} and the text T = abcd. Deterministic
factor automaton M_D(abcd) for text T has the transition diagram depicted
in Fig. 4.19. Automaton M_D(abcd) has n + 1 states and 2n - 1 transitions.

Figure 4.19: Transition diagram of deterministic factor automaton
M_D(abcd) for text T = abcd from Example 4.37

All respective d-subsets are simple. To construct this automaton in order
to find all repetitions, we must construct all next states of the initial
state for all symbols of the text. The number of these states is just n.
The repetition table is empty. □
Now, after presenting both limit cases, we will try to find some case in
between with maximal complexity. We guess that the next example shows it:
the text is selected in such a way that all proper suffixes of a prefix of
the text appear in it and therefore are repeating.
Example 4.38
Let the text be T = abcdbcdcdd. Deterministic factor automaton M_D(T) has
the transition diagram depicted in Fig. 4.20. Automaton M_D has 17 states
and 25 transitions, while text T has 10 symbols. The number of multiple
d-subsets is 6.

Figure 4.20: Transition diagram of deterministic factor automaton M_D(T)
for text T = abcdbcdcdd from Example 4.38

To construct this automaton in order to find all repetitions, we must
construct all the multiple states and moreover the states corresponding to
the single d-subsets 0, 1, 5, 8, A.
The result is that we must construct 11 states of the total number of 17
states. Repetition table R is shown in Table 4.7. □

d-subset   Factor   List of repetitions
25         b        (2, F), (5, G)
36         bc       (3, F), (6, G)
47         bcd      (4, F), (7, S)
368        c        (3, F), (6, G), (8, G)
479        cd       (4, F), (7, G), (9, S)
479A       d        (4, F), (7, G), (9, G), (10, S)

Table 4.7: Repetition table R for text T = abcdbcdcdd from Example 4.38
factor automaton is reached for text T = ab
n2
c. Let us show such factor
automaton in this context.
Example 4.39
Let the text be T = ab
4
c. Deterministic factor automaton M
D
(T) has tran-
sition diagram depicted in Fig. 4.21. Automaton M
D
(T) has 10 (2 6 2)
Figure 4.21: Transition diagram of deterministic factor automaton M
D
(T)
for text T = ab
4
c from Example 4.39
states and 14 (3 6 4) transitions while the text has 6 symbols. The num-
ber of multiple states is 3 (6 3). To construct this automaton in order to
nd all repetitions, we must construct the 3 multiple states and moreover 3
143
simple states. Therefore we must construct 6 states from the total number
of 10 states. Repetition table R is shown in Table 4.8. 2
d-subset List of repetitions
2345 (b, F)(2, F), (3, S), (4, S), (5, S)
345 (bb, F)(3, F), (4, O), (5, O)
45 (bbb, F)(4, F), (5, O)
Table 4.8: Repetition table R for text T = ab
4
c from Example 4.39
In the previous examples we have used three measures of complexity:
1. The number of multiple states of the deterministic factor automaton.
   This number is equal to the number of rows in the resulting repetition
   table, because each row of the repetition table corresponds to one
   multiple d-subset. Moreover, it corresponds to the number of repeating
   factors.
2. The number of states which it is necessary to construct in order to get
   all information on repetitions. We must reach a simple state on all
   paths starting in the initial state. We already know that there is at
   most one successor of a simple state, and it is a simple state, too
   (Lemma 4.34). The number of states which it is necessary to construct
   is therefore greater than the number of multiple states.
3. The total number of repetitions (occurrences) of all repeating factors
   in the text. This number corresponds to the number of items in the last
   column of the repetition table, headed List of repetitions.
The results concerning the measures of complexity from the previous
examples are summarized in Table 4.9.

Text                      No. of multiple  No. of necessary  No. of
                          states           states            repetitions
a^n                       n - 1            n + 1             (n^2 + n - 2)/2
a_1 a_2 ... a_n           0                n + 1             0
(all symbols unique)
a_1 a_2 ... a_m a_2 ...   (m^2 - m)/2      (m^2 + m)/2       Σ_{i=1}^{m-1} i(m - i + 1)
a_m ... a_{m-1} a_m a_m
ab^{n-2}c                 n - 3            n                 (n^2 - 3n)/2

Table 4.9: Measures of complexity from Examples 4.36, 4.37, 4.38, 4.39
Let us show how the complexity measures in Table 4.9 have been computed.
Example 4.40
Text T = a^n was used in Example 4.36. The number of multiple states is
n - 1, which is the number of repeating factors. The number of necessary
states is n + 1, because the initial and the terminal states must also be
constructed. The number of repetitions is given by the sum:
n + (n - 1) + (n - 2) + ... + 2 = (n^2 + n - 2)/2. □
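The closed form of this sum is easy to check numerically; a throwaway sketch under our own naming:

```python
def repetitions_in_a_power(n: int) -> int:
    """Occurrences of all repeating factors a, a^2, ..., a^{n-1} of T = a^n:
    the factor a^k occurs n - k + 1 times, so the sum runs over k = 1..n-1."""
    return sum(n - k + 1 for k in range(1, n))

# closed form from Example 4.40: (n^2 + n - 2) / 2
```

For instance, repetitions_in_a_power(5) = 5 + 4 + 3 + 2 = 14 = (25 + 5 - 2)/2.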
Example 4.41
Text T = abcd was used in Example 4.37. This automaton has no multiple
state. The number of necessary states is n + 1. It means that, in order to
recognize that no repetition exists in such a text, all states of this
automaton must be constructed. □
Example 4.42
Text T = abcdbcdcdd, used in Example 4.38, has a very special form. It
consists of the prefix abcd followed by all its proper suffixes. It is
possible to construct such a text only for some n. The length n of the
text must satisfy the condition:
n = Σ_{i=1}^{m} i = (m^2 + m)/2,
where m is the length of the prefix in question. It follows that
m = (-1 + √(1 + 8n))/2,
and therefore m = O(√n).
The number of multiple states is
(m - 1) + (m - 2) + ... + 1 = (m^2 - m)/2.
The number of necessary states is greater by m, which is the number of the
simple states that are the next states of the multiple states and of the
initial state. Therefore the number of necessary states is (m^2 + m)/2.
The number of repetitions is
m + 2(m - 1) + 3(m - 2) + ... + (m - 1) · 2 = Σ_{i=1}^{m-1} i(m - i + 1).
This sum is Θ(m^3), so the number of repetitions is O(m^3) = O(n√n). □
Example 4.43
Text T = ab^{n-2}c, used in Example 4.39, leads to the factor automaton
having the maximal number of states and transitions. The number of multiple
states is equal to n - 3 and the number of necessary states is equal to n.
The number of repetitions is
(n - 2) + (n - 3) + ... + 2 = (n^2 - 3n)/2. □
It follows from the described experiments that the complexity of deterministic factor automata, and therefore the complexity of the computation of repetitions for a text of length n, has these properties:
1. The number of multiple states is linear. It means that the repetition table has O(n) rows. This is the space complexity of the computation of all repeated factors.
2. The number of necessary states is again linear. It means that the time complexity of the computation of all repeated factors is O(n).
3. The number of repetitions is O(n^2), which is the time and space complexity of the computation of all occurrences of repeated factors.
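The counts above can be cross-checked by brute force on small texts; the following Python sketch (ours, not from the text) counts all occurrences of all repeated factors, i.e. the "number of repetitions":

```python
from collections import Counter

def repetition_count(text):
    # count every occurrence of every factor that occurs more than once
    occurrences = Counter(text[i:j]
                          for i in range(len(text))
                          for j in range(i + 1, len(text) + 1))
    return sum(count for count in occurrences.values() if count > 1)

n = 7
print(repetition_count("a" * n))   # (n^2 + n - 2)/2 for T = a^n
print(repetition_count("abcd"))    # 0: all symbols unique
print(repetition_count("abbbbc"))  # (n^2 - 3n)/2 for T = ab^(n-2)c, n = 6
```

The quadratic running time of this sketch mirrors point 3 above: listing all occurrences of repeated factors is inherently O(n^2) in the worst case.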
4.4.4 Exact repetitions in a finite set of strings
The idea of using a factor automaton for finding exact repetitions in one string can also be used for finding exact repetitions in a finite set of strings (F?NEC problem). The next algorithm is an extension of Algorithm 4.25 for a finite set of strings.
Algorithm 4.44
Computation of repetitions in a finite set of strings.
Input: Set of strings S = {x_1, x_2, ..., x_|S|}, x_i ∈ A+, i = 1, 2, ..., |S|.
Output: Multiple front end MFE(M_D) of the deterministic factor automaton M_D accepting Fact(S) and d-subsets for all states of MFE(M_D).
Method:
1. Construct nondeterministic factor automata M'_i for all strings x_i, i = 1, 2, ..., |S|:
(a) Construct finite automaton M_i accepting string x_i and all its prefixes for all i = 1, 2, ..., |S|.
(b) Construct finite automaton M'_i from automaton M_i by inserting ε-transitions from the initial state to all other states for all i = 1, 2, ..., |S|.
2. Construct automaton Mε by merging the initial states of automata M'_i for i = 1, 2, ..., |S|.
3. Replace all ε-transitions by non-ε-transitions. The resulting automaton is M_N.
4. Construct the multiple front end MFE(M_D) of the deterministic factor automaton M_D equivalent to automaton M_N using Algorithm 3.21 and save the d-subsets during this construction. □
We show in the next example the construction of a factor automaton and the analysis of d-subsets created during the construction of the factor automaton for a finite set of strings.
Example 4.45
Let us construct the factor automaton for the set of strings S = {abab, abba}. First, we construct factor automata M_1 and M_2 for both strings in S. Their transition diagrams are depicted in Figs 4.22 and 4.23, respectively.
Figure 4.22: Transition diagram of factor automaton M_1 accepting Fact(abab) from Example 4.45
Figure 4.23: Transition diagram of factor automaton M_2 accepting Fact(abba) from Example 4.45
In the second step we construct automaton Mε accepting language L(Mε) = Fact(abab) ∪ Fact(abba). Its transition diagram is depicted in Fig. 4.24.
In the third step we construct automaton M_N by removing ε-transitions from automaton Mε. Its transition diagram is depicted in Fig. 4.25.
Figure 4.24: Transition diagram of automaton Mε accepting set Fact(abab) ∪ Fact(abba) from Example 4.45
Figure 4.25: Transition diagram of nondeterministic factor automaton M_N accepting set Fact(abab) ∪ Fact(abba) from Example 4.45
            a           b
0           1₁1₂3₁4₂    2₁2₂3₂4₁
1₁1₂3₁4₂                2₁2₂4₁
2₁2₂3₂4₁    3₁4₂        3₂
2₁2₂4₁      3₁          3₂
3₁4₂                    4₁
3₁                      4₁
3₂          4₂
4₁
4₂

Table 4.10: Transition table of deterministic factor automaton M_D from Example 4.45
The last step is to construct the deterministic factor automaton M_D. Its transition table is shown in Table 4.10. The transition diagram of the resulting deterministic factor automaton M_D is depicted in Fig. 4.26.
Figure 4.26: Transition diagram of deterministic factor automaton M_D accepting set Fact(abab) ∪ Fact(abba) from Example 4.45 (the multiple front end is highlighted)
Now we do the analysis of d-subsets of the resulting automaton M_D. The result of this analysis is the repetition table shown in Table 4.11 for set S = {abab, abba}. □
Definition 4.46
Let S be a set of strings S = {x_1, x_2, ..., x_|S|}. The repetition table for S contains the following items:
1. d-subset,
2. corresponding factor,
d-subset    Factor   List of repetitions
1₁1₂3₁4₂    a        (1, 1, F), (2, 1, F), (1, 3, G), (2, 4, G)
2₁2₂4₁      ab       (1, 2, F), (2, 2, F), (1, 4, S)
2₁2₂3₂4₁    b        (1, 2, F), (2, 2, F), (2, 3, S), (1, 4, G)
3₁4₂        ba       (1, 3, F), (2, 4, F)

Table 4.11: Repetition table for set S = {abab, abba} from Example 4.45
3. list of repetitions of the factor containing elements of the form (i, j, X_ij), where
   i is the index of the string in S,
   j is the position in string x_i,
   X_ij is the type of repetition:
      F - the first occurrence of the factor in string x_i,
      O - repetition of the factor in x_i with overlapping,
      S - repetition as a square in x_i,
      G - repetition with a gap in x_i.   □
Let us suppose that each element of a d-subset constructed by Algorithm 4.44 keeps two kinds of information:
- the index of the string in S to which it belongs,
- the depth (position) in this string.
Moreover, we suppose that it is possible to identify the longest factor (maximal repeating factor) to which the d-subset belongs.
Algorithm 4.47
Constructing a repetition table containing exact repetitions in a finite set of strings.
Input: Multiple front end of the factor automaton for set of strings S = {x_1, x_2, ..., x_|S|}, x_i ∈ A+, i = 1, 2, ..., |S|.
Output: Repetition table R for set S.
Method: Let us suppose that the d-subset of multiple state q has the form r_1, r_2, ..., r_p. Create a row in repetition table R for each multiple state q; the row for the maximal repeating factor x of state q has the form:
(r_1, r_2, ..., r_p, x, (i_1, j_1, F), (i_2, j_2, X_{i_2 j_2}), ..., (i_p, j_p, X_{i_p j_p})),
where i_l is the index of the string in S, l = 1, 2, ..., p,
j_l is the position in string x_{i_l}, l = 1, 2, ..., p,
X_{i_l j_l} is the type of repetition:
i: O, if j_l - j_{l-1} < |x|,
ii: S, if j_l - j_{l-1} = |x|,
iii: G, if j_l - j_{l-1} > |x|. □
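The case analysis i-iii above can be illustrated by a small helper (hypothetical, not from the text) that classifies the occurrences of one factor inside one string by the distances between consecutive end positions:

```python
def classify(positions, factor_len):
    # F for the first occurrence, then O (overlapping), S (square) or
    # G (gap) according to the distance to the previous occurrence
    result = [(positions[0], "F")]
    for prev, cur in zip(positions, positions[1:]):
        gap = cur - prev
        kind = "O" if gap < factor_len else ("S" if gap == factor_len else "G")
        result.append((cur, kind))
    return result

print(classify([2, 4], 2))   # ab in abab: first occurrence, then a square
print(classify([1, 3], 1))   # a in abab: the repetitions have a gap
```

The same classification applied per string reproduces the rows of Table 4.11 above.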
4.4.5 Computation of approximate repetitions
We have used the factor automata for finding exact repetitions either in one string or in a finite set of strings. A similar approach can be used for finding approximate repetitions as well. We will use approximate factor automata for this purpose. As before, a part of the deterministic approximate factor automaton will be useful for the repetition finding. This part we will call the mixed multiple front end.
Let us suppose that finite automaton M accepts the set
Approx(x) = {y : D(x, y) ≤ k},
where D is a metric and k is the maximum distance. We can divide automaton M into two parts:
- the exact part, which is used for accepting string x,
- the approximate part, which is used when other strings in Approx(x) are accepted.
For the next algorithm we need to distinguish states of the exact part from states of the approximate part. Let us do this distinction by labelling the states in question.
Definition 4.48
A mixed multiple front end of a deterministic approximate factor automaton is a part of this automaton containing only multiple states with:
a) d-subsets containing only states from the exact part of the nondeterministic automaton,
b) d-subsets containing a mix of states from the exact part and the approximate part of the nondeterministic automaton with at least one state from the exact part.
Let us call such states mixed multiple states. □
We can construct the mixed multiple front end by a slightly modified Algorithm 3.21. The modification consists in changing point 3(b) in this way:
(b) if δ'(q, a) is a mixed multiple state then Q' = Q' ∪ {δ'(q, a)}.
4.4.6 Approximate repetitions - Hamming distance
In this Section, we show how to find approximate repetitions using the Hamming distance (O?NRC problem).
Example 4.49
Let string x = abba. We construct an approximate factor automaton using Hamming distance k = 1.
1. We construct a finite automaton accepting string x and all strings with Hamming distance at most 1 from x. The set of these strings is denoted H_1(abba). The resulting finite automaton has the transition diagram depicted in Fig. 4.27.
Figure 4.27: Transition diagram of the Hamming automaton accepting H_1(abba) with Hamming distance k = 1 from Example 4.49
2. We use the principle of inserting the ε-transitions from state 0 to states 1, 2, 3 and 4 and we fix all states as final states. The transition diagram with inserted ε-transitions and fixed final states is depicted in Fig. 4.28.
Figure 4.28: Transition diagram of the Hamming factor automaton with fixed final states and inserted ε-transitions from Example 4.49
3. We replace ε-transitions by non-ε-transitions. The resulting automaton has the transition diagram depicted in Fig. 4.29.
4. The final operation is the construction of the equivalent deterministic finite automaton. Its transition table is Table 4.12.
The transition diagram of the resulting deterministic approximate factor automaton is depicted in Fig. 4.30.
Figure 4.29: Transition diagram of the nondeterministic Hamming factor automaton after removal of ε-transitions from Example 4.49

          a        b
0         142'3'   231'4'
142'3'    2'4'     23'
231'4'    43'      32'4'
2'4'               3'
23'       3'4'     3
43'       4'
32'4'     4        3'4'
3'        4'
3'4'      4'
3         4        4'
4
4'

Table 4.12: Transition table of the deterministic Hamming factor automaton from Example 4.49
Figure 4.30: Transition diagram of the deterministic Hamming factor automaton for x = abba, Hamming distance k = 1 from Example 4.49
Construction of the repetition table is based on the following observation concerning mixed multiple states:
1. If only exact states are in the respective d-subset then exact repetitions take place.
2. Let us suppose that a mixed multiple state corresponding to factor x has a d-subset containing the pair (r_1, r_2), where r_1 is an exact state and r_2 is an approximate state. It means that state r_1 corresponds to the exact factor x. There is a sequence of transitions for factor x also to the state r_2. Therefore in the exact part there must be a factor y such that the distance of x and y is at most k. It means that factor y is an approximate repetition of x.
Now we construct the approximate repetition table. We take into account the repetition of factors which are longer than k. In this case k = 1 and therefore we select repetitions of factors having length greater than or equal to two. The next table contains information on the approximate repetition of factors of the string x = abba.
d-subset   Factor   Approximate repetitions
23'        ab       (2, ab, F), (3, bb, O)
32'4'      bb       (3, bb, F), (2, ab, O), (4, ba, O)
43'        ba       (4, ba, F), (3, bb, O)
As we can see, the approximate repetition table is similar to the repetition table expressing the exact repetitions (see Definition 4.27). But there are two important differences:
1. Not all multiple d-subsets appear in the first column. The only d-subsets used are those having at least one state corresponding to the exact part of the nondeterministic approximate factor automaton.
2. In the last column there are triples containing the repeating factors. This is motivated by the fact that an approximate factor can be different from the original factor. The triple (i, x, F) always corresponds to the first exact factor.
Let us go back to the approximate repetition table in Example 4.49. The first row contains d-subset 23'. It means that factor ab at position 2 has the approximate repetition bb at position 3.
4.4.7 Approximate repetitions - Levenshtein distance
Let us note that the Levenshtein distance between strings x and y is defined as the minimum number of editing operations delete, insert and replace which are necessary to convert string x into string y. In this section we show the solution of the O?NDC problem.
Example 4.50
Let string x = abba and Levenshtein distance k = 1. Find all approximate repetitions in this string.
We construct an approximate factor automaton using Levenshtein distance k = 1.
1. We construct a finite automaton accepting string x and all strings with Levenshtein distance at most 1. The set of these strings is denoted L_1(abba). The resulting finite automaton has the transition diagram depicted in Fig. 4.31.
Figure 4.31: Transition diagram of the Levenshtein automaton accepting L_1(abba), with Levenshtein distance k = 1 from Example 4.50
2. We use the principle of inserting the ε-transitions from state 0 to states 1, 2, 3, and 4 and we fix all states as final states. The transition diagram with inserted ε-transitions and fixed final states is depicted in Fig. 4.32.
Figure 4.32: Transition diagram of the Levenshtein factor automaton with final states fixed and ε-transitions inserted from Example 4.50
3. We replace all ε-transitions by non-ε-transitions. The resulting automaton has the transition diagram depicted in Fig. 4.33. Its transition table is shown in Table 4.13.
Figure 4.33: Transition diagram of the nondeterministic Levenshtein factor automaton after removal of ε-transitions from Example 4.50
4. The final operation is to construct the equivalent deterministic finite automaton. Its transition table is shown in Table 4.14.
      a               b
0     140'1'2'3'4'    230'1'2'3'4'
1     1'2'            21'
2     2'3'            32'
3     43'             3'4'
4     4'              4'
0'    1'
1'                    2'
2'                    3'
3'    4'
4'

Table 4.13: Transition table of the nondeterministic Levenshtein factor automaton from Example 4.50

               a              b
0              140'1'2'3'4'   230'1'2'3'4'
140'1'2'3'4'   1'2'4'         21'2'3'4'
230'1'2'3'4'   41'2'3'4'      32'3'4'
1'2'4'                        2'3'
21'2'3'4'      2'3'4'         32'3'
41'2'3'4'      4'             2'3'4'
32'3'4'        43'4'          3'4'
32'3'          43'4'          3'4'
2'3'4'         4'             3'
2'3'           4'             3'
43'4'          4'             4'
3'4'           4'
3'             4'
4'

Table 4.14: Transition table of the deterministic Levenshtein factor automaton from Example 4.50
The transition diagram of the resulting deterministic Levenshtein factor automaton is depicted in Fig. 4.34.
Figure 4.34: Transition diagram of the deterministic Levenshtein factor automaton for the string x = abba from Example 4.50 (the mixed multiple front end is highlighted)
Now we construct the repetition table. We take into account the repetition of factors longer than k (the number of allowed errors). Approximate repetition table R is shown in Table 4.15.
4.4.8 Approximate repetitions - ∆ distance
Let us note that the ∆ distance is defined by Def. 1.18. This distance is defined as the local distance for each position of the string. The number of errors is not cumulated as in the previous (and following) cases of finding approximate repetitions. In this section we show the solution of the O?N∆C problem.
Example 4.51
Let string x = abbc over ordered alphabet A = {a, b, c}. We construct an approximate factor automaton using ∆-distance equal to one.
1. We construct a finite automaton accepting string x and all strings having ∆-distance at most one. The set of all these strings is denoted ∆_1(abbc). This finite automaton has the transition diagram depicted in Fig. 4.35.
d-subset    Factor   Approximate repetitions
21'2'3'4'   ab       (2, ab, F), (3, abb, O)
32'3'       abb      (3, abb, F), (2, ab, O), (3, bb, O)
43'4'       abba     (4, abba, F), (3, abb, O), (4, bba, O)
32'3'4'     bb       (3, bb, F), (2, ab, O), (3, abb, O), (4, ba, O), (4, bba, O)
41'2'3'4'   ba       (4, ba, F), (3, bb, S), (4, bba, O)
43'4'       bba      (4, bba, F), (3, bb, O), (4, ba, O)

Table 4.15: Approximate repetition table R for string x = abba with Levenshtein distance k = 1 from Example 4.50
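The rows of Table 4.15 can be spot-checked against the definition with a plain dynamic-programming Levenshtein distance (a standard routine, given here as a sketch):

```python
def lev(a, b):
    # classical dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete
                           cur[-1] + 1,          # insert
                           prev[j - 1] + (ca != cb)))  # replace/match
        prev = cur
    return prev[-1]

print(lev("ab", "abb"))    # ab repeats approximately as abb
print(lev("bb", "ba"))     # bb repeats approximately as ba
print(lev("abba", "bba"))  # abba repeats approximately as bba
```

Each pair of factors listed in one row of the table is within distance k = 1.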
Figure 4.35: Transition diagram of the automaton accepting ∆_1(abbc) with ∆-distance k = 1 from Example 4.51
2. We use the principle of inserting ε-transitions from state 0 to states 1, 2, 3, and 4 and making all states final states. The transition diagram of the automaton with inserted ε-transitions and fixed final states is depicted in Fig. 4.36.
Figure 4.36: Transition diagram of the ∆ factor automaton with final states fixed and ε-transitions inserted from Example 4.51
3. We replace ε-transitions by non-ε-transitions. The resulting automaton has the transition diagram depicted in Fig. 4.37.
Figure 4.37: Transition diagram of the nondeterministic ∆ factor automaton after the removal of ε-transitions from Example 4.51
4. The final operation is to construct the equivalent deterministic factor automaton. Table 4.16 is its transition table and its transition diagram is depicted in Fig. 4.38.
Figure 4.38: Transition diagram of the deterministic ∆ factor automaton for x = abbc, ∆-distance = 1, from Example 4.51 (the mixed multiple front end is highlighted)
Now we construct the repetition table. We take into account the repetitions of factors longer than the allowed distance. In this case, the distance is equal to 1 and therefore we select repetitions of factors having length greater than or equal to two. Table 4.17 contains information on the approximate repetitions of factors of the string x = abbc. □
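The entries of Table 4.17 can be verified directly from the definition of the ∆-distance; in this sketch the alphabet ordering a < b < c of the example is assumed:

```python
def delta_dist(x, y, order="abc"):
    # Delta distance of equal-length strings: the maximum per-position
    # difference over the ordered alphabet (errors are not cumulated)
    assert len(x) == len(y)
    pos = {c: i for i, c in enumerate(order)}
    return max(abs(pos[a] - pos[b]) for a, b in zip(x, y))

print(delta_dist("ab", "bb"))    # ab repeats approximately as bb
print(delta_dist("abb", "bbc"))  # abb repeats approximately as bbc
```

Note that ∆(ab, bc) = 1 as well, which is why bc appears as a square repetition of ab in the table.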
4.4.9 Approximate repetitions - Γ distance
The Γ-distance is defined by Def. 1.18. This distance is defined as a global distance, which means that the local errors are cumulated. In this section we show the solution of the O?NΓC problem.
Example 4.52
Let string x = abbc over ordered alphabet A = {a, b, c}. We construct an approximate factor automaton using Γ-distance equal to two.
1. We construct a finite automaton accepting string x and all strings having Γ-distance at most two. The set of all these strings is denoted Γ_2(abbc). This finite automaton has the transition diagram depicted in Fig. 4.39.
          a        b        c
0         12'3'    231'4'   42'3'
12'3'     2'3'     23'4'    2'3'4'
231'4'    2'3'     32'4'    42'3'
42'3'     3'       3'4'     3'4'
2'3'      3'       3'4'     3'4'
23'4'     3'       34'      3'4'
2'3'4'    3'       3'4'     3'4'
32'4'     3'       3'4'     43'
34'                4'       4
3'                 4'       4'
3'4'               4'       4'
43'                4'       4'
4
4'

Table 4.16: Transition table of the deterministic ∆ factor automaton from Example 4.51

d-subset   Factor   Approximate repetitions
23'4'      ab       (2, ab, F), (3, bb, O), (4, bc, S)
34'        abb      (3, abb, F), (4, bbc, O)
32'4'      bb       (3, bb, F), (2, ab, O), (4, bc, O)
42'3'      bc       (4, bc, F), (2, ab, S), (3, bb, O)
43'        bbc      (4, bbc, F), (3, abb, O)

Table 4.17: Approximate repetition table for string x = abbc from Example 4.51
Figure 4.39: Transition diagram of the automaton accepting Γ_2(abbc) with Γ-distance k = 2 from Example 4.52
2. Now we insert ε-transitions from state 0 to states 1, 2, 3, and 4 and we make all states final. The transition diagram of the automaton with inserted ε-transitions and fixed final states is depicted in Fig. 4.40.
Figure 4.40: Transition diagram of the Γ factor automaton with final states fixed and ε-transitions inserted from Example 4.52
3. We replace the ε-transitions by non-ε-transitions. The resulting nondeterministic factor automaton has the transition diagram depicted in Fig. 4.41.
Figure 4.41: Transition diagram of the nondeterministic Γ factor automaton after the removal of ε-transitions from Example 4.52
4. The final operation is to construct the equivalent deterministic finite automaton.
Analyzing Table 4.18 we can recognize that the following sets are sets of equivalent states:
{2'3'', 2'3''4'}, {3'2''4'', 43'2''}, {3', 3'4'}, {3'', 3''4', 3''4'', 43''}.
Only states 43'' and 43'2'' have an impact on the repetition table. Let us replace all equivalent states by the respective sets. Then we obtain the transition diagram of the optimized deterministic Γ factor automaton depicted in Fig. 4.42. Now we construct the repetition table. We take into account the repetitions of factors longer than two (the allowed distance). Table 4.19 contains information on the approximate repetition of one factor. The repetition of factor bc indicated by d-subset 43'2'' is not included in the repetition table because the length of this factor is |bc| = 2. □
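The Γ-distance of two equal-length strings is simply the sum of the per-position differences, so the single row of Table 4.19 can be checked with a one-line sketch (alphabet ordering a < b < c assumed):

```python
def gamma_dist(x, y, order="abc"):
    # Gamma distance: per-position differences over the ordered
    # alphabet are summed, i.e. the errors are cumulated
    pos = {c: i for i, c in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in zip(x, y))

print(gamma_dist("abb", "bbc"))  # the repetition reported in Table 4.19
print(gamma_dist("bc", "ab"))    # also within distance 2, but |bc| = 2
```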
4.4.10 Approximate repetitions - (∆, Γ) distance
The (∆, Γ)-distance is defined by Def. 1.18. This distance combines a local bound ∆ for each position with a global bound Γ, for which the local errors are cumulated. In this section we show the solution of the O?N(∆, Γ)C problem.
Example 4.53
Let string x = abbc over ordered alphabet A = {a, b, c}. We construct an approximate factor automaton using (∆, Γ)-distance equal to (1, 2).
           a          b          c
0          12'3'4''   231'4'     42'3'1''
12'3'4''   2'3''      23'4''     2'3''4'
231'4'     3'2''4''   32'4'      43'2''
42'3'1''   3''        3'2''4''   3''4'
2'3''      3''        3'         3''4''
23'4''     3'         34''       3'4'
2'3''4'    3''        3'         3''4''
3'2''4''              3''4''     4'
32'4'      3''4''     3'4'       43''
43'2''                3''4''     4'
34''       4''        4'         4
3'4'                  4''        4'
3'                    4''        4'
3''4'                            4''
3''                              4''
3''4''                           4''
43''                             4''
4
4'
4''

Table 4.18: Transition table of the deterministic Γ factor automaton for the string x = abbc from Example 4.52
1. We construct a finite automaton accepting string x and all strings having (∆, Γ)-distance at most (1, 2). The set of all these strings is denoted (∆, Γ)_{1,2}(abbc). This finite automaton has the transition diagram depicted in Fig. 4.43.
2. Now we insert ε-transitions from state 0 to states 1, 2, 3, and 4 and we make all states final. The transition diagram of the automaton with inserted ε-transitions and fixed final states is depicted in Fig. 4.44.
3. We replace ε-transitions by non-ε-transitions. The resulting nondeterministic factor automaton has the transition diagram depicted in Fig. 4.45.
Figure 4.42: Transition diagram of the optimized deterministic Γ factor automaton for string x = abbc from Example 4.52 (the mixed multiple front end is highlighted)

d-subset   Factor   Approximate repetitions
34''       abb      (3, abb, F), (4, bbc, O)
43''       bbc      (4, bbc, F), (3, abb, O)

Table 4.19: Approximate repetition table for string x = abbc, Γ-distance equal to two, from Example 4.52
Figure 4.43: Transition diagram of the (∆, Γ) automaton accepting (∆, Γ)_{1,2}(abbc) with (∆, Γ)-distance (k, l) = (1, 2) from Example 4.53
Figure 4.44: Transition diagram of the (∆, Γ) factor automaton with final states fixed and ε-transitions inserted from Example 4.53
Figure 4.45: Transition diagram of the nondeterministic (∆, Γ) factor automaton after the removal of ε-transitions from Example 4.53
4. The final operation is to construct the equivalent deterministic finite automaton. Table 4.20 is its transition table.

          a        b        c
0         12'3'    231'4'   42'3'
12'3'     2'3''    23'4''   2'3''4'
231'4'    3'2''    32'4'    43'2''
42'3'     3''      3'4''    3''4'
2'3''     3''      3'       3''4''
23'4''    3'       34''     3'4'
2'3''4'   3''      3'       3''4''
3'2''              3''4''   4'
32'4'     3''      3'4'     43''
43'2''             3''4''   4'
34''               4'       4
3'4''              4''      4'
3'4'               4''      4'
3'                 4''      4'
3''4'                       4''
3''                         4''
3''4''                      4''
43''                        4''
4
4'
4''

Table 4.20: Transition table of the deterministic (∆, Γ) factor automaton from Example 4.53
Analyzing Table 4.20 we can recognize that the following sets are sets of equivalent states:
{2'3'', 2'3''4'}, {3'2'', 43'2''}, {3', 3'4', 3'4''}, {3'', 3''4', 3''4'', 43''}.
We replace all equivalent states by the respective sets. Then we obtain the transition diagram of the optimized deterministic (∆, Γ) factor automaton depicted in Fig. 4.46. Now we construct the repetition table. We take into account the repetitions of factors longer than two (the allowed distance). Table 4.21 contains information on the approximate repetition of one factor. □
4.4.11 Exact repetitions in one string with don't care symbols
The don't care symbol (◦) is defined by Def. 1.14. The next example shows the principle of finding repetitions in the presence of don't care symbols (O?NED problem).
Figure 4.46: Transition diagram of the optimized deterministic (∆, Γ) factor automaton for the string x = abbc from Example 4.53 (the mixed multiple front end is highlighted)

d-subset   Factor   Approximate repetitions
34''       abb      (3, abb, F), (4, bbc, O)
43''       bbc      (4, bbc, F), (3, abb, O)

Table 4.21: Approximate repetition table for string x = abbc, (∆, Γ)-distance equal to (1, 2), from Example 4.53
Example 4.54
Let string x = a◦aab over alphabet A = {a, b, c}. Symbol ◦ is the don't care symbol. We construct a don't care factor automaton.
1. We construct a finite automaton accepting the set of strings described by string x with the don't care symbol. This set is DC(x) = {aaaab, abaab, acaab}. This finite automaton has the transition diagram depicted in Fig. 4.47.
Figure 4.47: Transition diagram of the DC automaton accepting DC(x) from Example 4.54
2. We insert ε-transitions from state 0 to states 1, 2, 3, 4, and 5 and we make all states final. The transition diagram of the DC(a◦aab) factor automaton with inserted ε-transitions and fixed final states is depicted in Fig. 4.48.
Figure 4.48: Transition diagram of the DC(a◦aab) factor automaton with inserted ε-transitions and fixed final states from Example 4.54
3. We replace ε-transitions by non-ε-transitions. The resulting nondeterministic factor automaton DC_N(a◦aab) has the transition diagram depicted in Fig. 4.49.
Figure 4.49: Transition diagram of nondeterministic factor automaton DC_N(a◦aab) after the removal of ε-transitions from Example 4.54
4. The final operation is the construction of the equivalent deterministic factor automaton DC_D(a◦aab). Table 4.22 is the transition table of the nondeterministic factor automaton having the transition diagram depicted in Fig. 4.49. Table 4.23 is the transition table of the deterministic factor automaton DC_D(a◦aab). The transition diagram of the deterministic factor automaton DC_D(a◦aab) is depicted in Fig. 4.50.
   a            b      c
0  1, 2, 3, 4   2, 5   2
1  2            2      2
2  3
3  4
4               5
5
Table 4.22: Transition table of nondeterministic factor automaton DC_N(a◦aab) from Example 4.54
       a      b    c
0      1234   25   2
1234   234    25   2
2      3
25     3
234    34     5
3      4
34     4      5
4             5
5
Table 4.23: Transition table of deterministic factor automaton DC_D(a◦aab) from Example 4.54
Figure 4.50: Transition diagram of deterministic factor automaton DC_D(a◦aab) from Example 4.54 (the multiple front end is highlighted)
The last step is the construction of the repetition table. It is shown in
Table 4.24.
d-subset   Factor   Repetitions
1234 a (1, F), (2, S), (3, S), (4, S)
25 ab (2, F), (5, G)
234 aa (2, F), (3, O), (4, O)
34 aaa (3, F), (4, O)
Table 4.24: Repetition table for string x = a◦aab from Example 4.54
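The set DC(x) of step 1 can be generated mechanically; in this sketch (ours) the ASCII character '*' stands in for the don't care symbol ◦:

```python
from itertools import product

def dc_strings(pattern, alphabet="abc", wildcard="*"):
    # expand a pattern with don't care symbols into the set DC(pattern):
    # a wildcard position may be any symbol of the alphabet
    slots = [alphabet if c == wildcard else c for c in pattern]
    return {"".join(p) for p in product(*slots)}

print(dc_strings("a*aab"))  # the three strings of Example 4.54
```

This expansion is exponential in the number of don't care symbols, which is exactly why the automaton-based treatment above avoids building DC(x) explicitly for longer patterns.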
4.5 Computation of periods revisited
Let us recall the definition of a period (see Def. 4.4). We have seen in Section 4.2.2 the principle of the computation of periods in the framework of the computation of borders. Now we show the computation of periods without regard to the computation of borders. For the computation of the periods of a given string we can use the backbone of a factor automaton, and we will construct the repetition table. Such a repetition table is a prefix repetition table. The next Algorithm shows how to find the periods of a string.
Algorithm 4.55
Input: String x ∈ A+.
Output: Sequence of periods p_1, p_2, ..., p_h, h ≥ 0, and the shortest period Per(x).
Method:
1. Construct factor automaton M for string x.
2. Compute the prefix repetition table using the backbone of the deterministic factor automaton for string x.
3. Inspect the prefix repetition table in order to find all rows where squares are indicated. If these squares cover a long enough prefix of x, then the lengths of the repeated strings are periods of x. More precisely: if the prefix of length m is repeating as squares
(m, F), (2m, S), (3m, S), ..., (jm, S),
|x| - jm ≤ m, and x[jm + 1..|x|] ∈ Pref(x[1..m]), then m is a period of x.
4. Inspect the prefix repetition table in order to find all rows where there are repetitions with overlapping of prefixes longer than one half of string x. If such a prefix has length m > |x|/2 and the suffix x[m + 1..|x|] ∈ Pref(x[1..m]), then m is a period of x.
5. |x| is also a period of x.
6. The shortest period is Per(x). □
Note: If Per(x) = |x| then x is a primitive string.
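Steps 3-5 are equivalent to reading the periods off the border array of Section 4.2.2, because p is a period of x exactly when x has a border of length |x| - p. A Python sketch of this alternative route (ours, not the book's pseudocode):

```python
def periods(x):
    # border[i] = length of the longest border of the prefix x[:i]
    n = len(x)
    border = [0] * (n + 1)
    b = 0
    for i in range(1, n):
        while b and x[i] != x[b]:
            b = border[b]
        if x[i] == x[b]:
            b += 1
        border[i + 1] = b
    # every border of the whole string yields a period n - |border|
    out, b = [], border[n]
    while b > 0:
        out.append(n - b)
        b = border[b]
    out.append(n)  # |x| is always a period
    return out

print(periods("abababababa"))  # Example 4.56
print(periods("abcd"))         # primitive string: only |x|
```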
Example 4.56
Let us compute the periods of string x = abababababa. The nondeterministic factor automaton has the transition diagram depicted in Fig. 4.51. The transition table of the deterministic factor automaton is shown in Table 4.25.
Figure 4.51: The nondeterministic factor automaton for string x = abababababa from Example 4.56; A and B represent the numbers 10 and 11, respectively
The backbone of the deterministic factor automaton has the transition diagram depicted in Fig. 4.52. The prefix repetition table is Table 4.26.
1. During the inspection of this table we find two rows with squares: row 2 and row 4. In row 2 we see that the prefix ab is repeated four times: (4, S), (6, S), (8, S), and (A, S). The rest of the string is a, which is a prefix of ab. Therefore the first period is 2. In row 4 we see,
a b
0 13579B 2468A
13579B 2468A
2468A 3579B
3579B 468A
468A 579B
579B 68A
68A 79B
79B 8A
8A 9B
9B A
A B
B
Table 4.25: Transition table of the deterministic factor automaton from
Example 4.56
d-subset   Prefix   Repetitions of prefixes
13579B a (1, F), (3, G), (5, G), (7, G), (9, G), (B, G)
2468A ab (2, F), (4, S), (6, S), (8, S), (A, S)
3579B aba (3, F), (5, O), (7, O), (9, O), (B, O)
468A abab (4, F), (6, O), (8, S), (A, O)
579B ababa (5, F), (7, O), (9, O), (B, O)
68A ababab (6, F), (8, O), (A, O)
79B abababa (7, F), (9, O), (B, O)
8A abababab (8, F), (A, O)
9B ababababa (9, F), (B, O)
Table 4.26: The prex repetition table from Example 4.56
Figure 4.52: Transition diagram of the backbone of the deterministic factor
automaton for string x = abababababa from Example 4.56
that the prefix abab is repeated once as a square: (8, S). The rest of the string is aba, which is a prefix of abab. Therefore the second period is 4.
2. Moreover, there are five cases with overlapping of prefixes longer than 5:
(6) ababab, ababa
(7) abababa, baba
(8) abababab, aba
(9) ababababa, ba
(10) ababababab, a
The suffixes in cases 6, 8, and 10 are prefixes of x: ababa, aba, a. It means that 6, 8 and 10 are periods of x, too.
3. The last period of x is |x|.
The set of all periods is {2, 4, 6, 8, 10, 11}. Per(x) = 2. □
Example 4.57
Let us compute the periods of string x = a^4. The transition diagram of the nondeterministic factor automaton is depicted in Fig. 4.53. The transition diagram of the backbone of the deterministic factor automaton is depicted in Fig. 4.54.
Figure 4.53: Transition diagram of nondeterministic factor automaton for string x = a^4 from Example 4.57
Figure 4.54: Transition diagram of the backbone of deterministic factor automaton for string x = a^4 from Example 4.57
The prefix repetition table has the form:

d-subset   Prefix   Repetitions of prefixes
1234       a        (1, F), (2, S), (3, S), (4, S)
234        aa       (2, F), (3, O), (4, S)
34         aaa      (3, F), (4, O)

We see that there are four periods: 1, 2, 3, 4. Per(a^4) = 1.
The transition table of the deterministic factor automaton is shown in Table 4.27.

       a
0      1234
1234   234
234    34
34     4

Table 4.27: Transition table of the deterministic factor automaton from Example 4.57
Example 4.58
Let us compute the periods of string x = abcd. The transition diagram of the factor automaton is depicted in Fig. 4.55. This factor automaton is deterministic.
Figure 4.55: Transition diagram of the factor automaton for string x = abcd
It means that nothing is repeated and the shortest period is equal to the length of the string; therefore string x = abcd is a primitive string (see Def. 4.6).
5 Simulation of nondeterministic pattern matching automata - fail function
Deterministic pattern matching automata have in some cases large space complexity. This is especially true for automata for approximate pattern matching. This situation led to the construction of algorithms for the simulation of nondeterministic pattern matching automata. These simulation algorithms have an acceptable space complexity, but their time complexity is, in some cases, greater than linear. We can divide the methods used for the simulation of nondeterministic pattern matching automata into three categories:
1. using a fail function,
2. dynamic programming,
3. bit parallelism.
The use of the fail function we will discuss in this Chapter. The dynamic programming and bit parallelism will be covered in the next Chapter.
5.1 Searching automata
The group of methods which use a fail function is based on the following principle:
The nondeterministic pattern matching automaton is used, but the minimum number of its self loops in the initial state are removed in order to obtain a deterministic finite automaton. If this operation succeeds and the resulting finite automaton is deterministic, then it can be used. Afterwards some transitions called backward transitions are added. No input symbol is read during such transitions. They are used in the case when the forward transitions cannot be used. The well known Morris-Pratt (MP), Knuth-Morris-Pratt (KMP) and Aho-Corasick (AC) algorithms belong to this category.
The base of the algorithms of this category is the notion of a searching automaton, which is an extended deterministic finite automaton.
Definition 5.1
A searching automaton is a sixtuple SA = (Q, A, δ, φ, q_0, F), where
  Q is a finite set of states,
  A is a finite input alphabet,
  δ: Q × A → Q ∪ {fail} is the forward transition function,
  φ: (Q − {q_0}) × A* → Q is the backward transition function,
  q_0 is the initial state,
  F ⊆ Q is the set of final states.
A configuration of searching automaton SA is a pair (q, w), where q ∈ Q, w ∈ A*. The initial configuration is (q_0, w), where w is the complete input text. The final configuration is (q, w), where q ∈ F and w ∈ A* is the unread part of the text. This configuration means that the pattern was found and its position in the text is just before w. The searching automaton performs forward and backward transitions. The transition relation ⊢ ⊆ (Q × A*) × (Q × A*) is defined in this way:
1. if δ(q, a) = p, then (q, aw) ⊢ (p, w) is a forward transition,
2. if φ(q, x) = p, then (q, w) ⊢ (p, w) is a backward transition, where x is the suffix of the part of the text read before reaching state q. 2
Just one input symbol is read during a forward transition. If δ(q, a) = fail then a backward transition is performed and no symbol is read. The forward and backward transition functions δ and φ have the following properties:
1. δ(q_0, a) ≠ fail for all a ∈ A,
2. if φ(q, x) = p then the depth of p is strictly less than the depth of q, where the depth of state q is the length of the shortest sequence of forward transitions from state q_0 to state q.
The first condition ensures that no backward transition is performed in the initial state. The second condition ensures that the total number of backward transitions is less than the number of forward transitions. It follows that the total number of performed transitions is less than 2n, where n is the length of the text.
5.2 MP and KMP algorithms
The MP and KMP algorithms are simulators of the SFOECO automaton (see Section 2.2.1) for exact matching of one pattern. We will show both algorithms in the following example. Let us mention that the backward transition function is simplified to
φ: (Q − {q_0}) → Q.

Example 5.2
Let us construct MP and KMP searching automata for pattern P = ababb and compare them with the pattern matching automaton for P. The construction of both the deterministic pattern matching automaton and the MP searching automaton is shown in Fig. 5.1. For the resulting MP and KMP searching automata we can construct Table 5.1, containing the forward transition function δ, the backward transition function φ for the MP algorithm, and the optimized backward transition function φ_opt for the KMP algorithm. The reason for the introduction of the optimized backward transition function φ_opt will follow from the next example. 2
Example 5.3
The MP searching automaton for pattern P = ababb and text T = abaababbb performs the following sequence of transitions (the backward transition function φ is used):
     δ            φ    φ_opt
     a     b
0    1     0
1    fail  2      0    0
2    3     fail   0    0
3    fail  4      1    0
4    fail  5      2    2
5    fail  fail   0    0

Table 5.1: Forward transition function δ and backward transition functions φ and φ_opt from Example 5.2
(0, abaababbb) ⊢ (1, baababbb)
               ⊢ (2, aababbb)
               ⊢ (3, ababbb)   fail
               ⊢ (1, ababbb)   fail
               ⊢ (0, ababbb)
               ⊢ (1, babbb)
               ⊢ (2, abbb)
               ⊢ (3, bbb)
               ⊢ (4, bb)
               ⊢ (5, b)
The pattern is found in state 5 at position 8. 2
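The run above can be replayed by a small simulator of the searching automaton. The tables `delta` and `phi` below are transcribed from Table 5.1 for P = ababb; the function and variable names are our own sketch, not the book's notation:

```python
# forward transition function delta (missing entries stand for fail)
# and backward transition function phi, per Table 5.1 (P = ababb)
delta = {(0, 'a'): 1, (0, 'b'): 0, (1, 'b'): 2,
         (2, 'a'): 3, (3, 'b'): 4, (4, 'b'): 5}
phi = {1: 0, 2: 0, 3: 1, 4: 2, 5: 0}

def mp_run(text, final=5):
    state, occurrences = 0, []
    for pos, symbol in enumerate(text, start=1):
        # backward transitions: no symbol is read
        while (state, symbol) not in delta and state != 0:
            state = phi[state]
        state = delta.get((state, symbol), 0)   # forward transition
        if state == final:
            occurrences.append(pos)             # pattern ends at this position
    return occurrences

print(mp_run("abaababbb"))  # [8]
```

The simulator reports position 8, matching the transition sequence shown above.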
Let us mention one important observation. We can see that two subsequent backward transitions are performed from state 3 for the input symbol a, leading to states 1 and 0. The reason is that δ(3, a) = δ(1, a) = fail. This variant of the searching automaton is called the MP (Morris-Pratt) automaton and the related algorithm shown below is called the MP algorithm. It is possible to compute an optimized backward transition function φ_opt having in this situation the value φ_opt(3) = 0. The result is that each such sequence of backward transitions is replaced by just one backward transition. The algorithm using the optimized backward transition function φ_opt is the KMP (Knuth-Morris-Pratt) algorithm. After this informal explanation, we show the direct construction of the MP searching automaton and the computation of the backward and optimized backward transition functions. We start with the MP searching algorithm.
179
Figure 5.1: SFOECO and MP searching automata for pattern P = ababb
from Example 5.2
180
var TEXT: array[1..N] of char;
    PATTERN: array[1..M] of char;
    PHI: array[1..M] of integer; { backward transition function }
    I, J: integer;
    FOUND: boolean;
...
I := 1; J := 1;
while (I <= N) and (J <= M) do
begin
  while (J > 0) and (TEXT[I] <> PATTERN[J]) do
    J := PHI[J];
  J := J + 1;
  I := I + 1
end;
FOUND := J > M;
...
The variables used in the MP searching algorithm are shown in Figure 5.2.

Figure 5.2: Variables used in MP searching algorithm, pos = I − J + 1

The computation of the backward transition function is based on the notion of repetitions of prefixes of the pattern in the pattern itself. The situation is depicted in Fig. 5.3.

Figure 5.3: Repetition of prefix v in prefix u of the pattern

If prefix u = u_1 u_2 ... u_{j-1} of the pattern matches the substring u = t_{i-j+1} t_{i-j+2} ... t_{i-1} of the text and u_j ≠ t_i, then it is not necessary to compare prefix v of the pattern with the substring t_{i-j+2} t_{i-j+3} ... of the text at the next position. Instead of this comparison we can shift the pattern to the right. The length of this shift is the length of Border(u).
Example 5.4
Let us show the repetitions and periods of prefixes of pattern P = ababb in Fig. 5.4.

Figure 5.4: Repetitions and periods of P = ababb

The shift is represented by the value of the backward transition function for position j in the pattern:
φ(j) = |Border(p_1 p_2 ... p_j)|.
If there is no repetition of a prefix of the pattern in itself, then the shift is equal to j, because the period is equal to zero. 2
For the computation of the function φ for the pattern P we will use the fact that the value of φ(j) is equal to the element β[j] of the border array for the pattern P.
Example 5.5
Let us compute the border array for pattern P = ababb. The transition diagram of the nondeterministic factor automaton is depicted in Fig. 5.5.

Figure 5.5: Transition diagram of the nondeterministic factor automaton for pattern P = ababb from Example 5.5

Table 5.2 is the transition table of the equivalent deterministic factor automaton. The transition diagram of the deterministic factor automaton is depicted in Fig. 5.6.
      a    b
0     13   245
13         24
24    3    5
245   3    5
3          4
4          5
5

Table 5.2: Transition table of the deterministic factor automaton from Example 5.5
Figure 5.6: Transition diagram of the deterministic factor automaton for pattern P = ababb from Example 5.5

The analysis of d-subsets on the backbone of the deterministic factor automaton is shown in this table:

Analyzed state   Value of border array element
13               β[3] = 1
24               β[4] = 2

The values of the elements of the border array are shown in this table:

j       1  2  3  4  5
symbol  a  b  a  b  b
β[j]    0  0  1  2  0

Let us recall that φ(j) = β[j]. 2
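The border array of the example can also be obtained by the classical linear-time computation, without building the factor automaton. This is a hedged sketch of that standard technique, not the book's d-subset analysis:

```python
def border_array(p):
    """beta[j] = length of the longest border of the prefix p_1...p_j."""
    beta = [0] * (len(p) + 1)
    for j in range(2, len(p) + 1):
        b = beta[j - 1]                 # try to extend the previous border
        while b > 0 and p[j - 1] != p[b]:
            b = beta[b]                 # fall back to a shorter border
        beta[j] = b + 1 if p[j - 1] == p[b] else 0
    return beta[1:]                     # values for j = 1..m

print(border_array("ababb"))    # [0, 0, 1, 2, 0]
print(border_array("abaabaa"))  # [0, 0, 1, 1, 2, 3, 4]
```

The first result agrees with the table above; the second with the border array of Example 5.8.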
The next algorithm constructs the MP searching automaton.
Algorithm 5.6
Construction of MP searching automaton.
Input: Pattern P = p_1 p_2 ... p_m.
Output: MP searching automaton.
Method:
1. The initial state is q_0.
2. Each state q of the MP searching automaton corresponds to a prefix p_1 p_2 ... p_j of the pattern. δ(q, p_{j+1}) = q', where q' corresponds to prefix p_1 p_2 ... p_j p_{j+1}.
3. The state corresponding to the complete pattern p_1 p_2 ... p_m is the final state.
4. Define δ(q_0, a) = q_0 for all a for which no transition was defined in step 2.
5. δ(q, a) = fail for all a ∈ A and q ∈ Q for which δ(q, a) was not defined in steps 2 and 4.
6. Function φ is the backward transition function. It is equal to the border array for pattern P. 2
The next algorithm computes the optimized backward transition function φ_opt on the base of the backward transition function φ.
Algorithm 5.7
Computation of optimized backward transition function φ_opt.
Input: Backward transition function φ.
Output: Optimized backward transition function φ_opt.
Method:
1. φ_opt(q) = φ(q) for all states of depth equal to one.
2. Let us suppose that the function φ_opt has been computed for all states having depth less than or equal to d. Let q have depth d + 1 and let φ(q) = p. If δ(q, a) = fail and δ(p, a) = fail, then φ_opt(q) = φ_opt(φ(q)), else φ_opt(q) = φ(q). 2
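Algorithm 5.7 can be sketched for the single-pattern case, where the test "δ(q, a) = fail and δ(p, a) = fail" reduces to comparing the single forward symbols of the two states: a mismatch in state j also fails in state β[j] exactly when both states expect the same next pattern symbol. This is our own sketch under that assumption; `border_array` is the standard border computation:

```python
def border_array(p):
    beta = [0] * (len(p) + 1)
    for j in range(2, len(p) + 1):
        b = beta[j - 1]
        while b > 0 and p[j - 1] != p[b]:
            b = beta[b]
        beta[j] = b + 1 if p[j - 1] == p[b] else 0
    return beta

def phi_opt(p):
    """Optimized backward function: skip states that fail on the same symbol."""
    beta = border_array(p)
    m = len(p)
    opt = [0] * (m + 1)
    for j in range(1, m + 1):
        b = beta[j]
        # state j reads p[j] forward; if state b expects the same symbol,
        # a fail in j is also a fail in b, so jump over b
        if j < m and p[j] == p[b]:
            opt[j] = opt[b]
        else:
            opt[j] = b
    return opt[1:]

print(phi_opt("ababb"))    # [0, 0, 0, 2, 0]
print(phi_opt("abaabaa"))  # [0, 0, 1, 0, 0, 1, 4]
```

The two outputs reproduce the φ_opt columns of Table 5.1 and Table 5.5, respectively.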
Example 5.8
Let us construct the KMP searching automaton for pattern P = abaabaa and alphabet A = {a, b}. First, we construct the forward transition function. It is depicted in Fig. 5.7.

Figure 5.7: Forward transition function of KMP searching automaton for pattern P = abaabaa from Example 5.8

Now we will construct both the nonoptimized backward transition function φ and the optimized backward transition function φ_opt using Algorithms 4.19 and 5.7. To construct the nonoptimized backward transition function, we construct the border array for pattern P = abaabaa. The nondeterministic factor automaton has the transition diagram depicted in Fig. 5.8.

Figure 5.8: Transition diagram of the nondeterministic factor automaton for pattern P = abaabaa from Example 5.8

The transition diagram of the useful part of the deterministic factor automaton is depicted in Fig. 5.9.

Figure 5.9: Part of the transition diagram of the deterministic factor automaton for pattern P = abaabaa from Example 5.8

The analysis of d-subsets of the deterministic factor automaton is summarized in Table 5.3.

d-subset  Values of elements of border array
13467     β[3] = 1, β[4] = 1, β[6] = 1, β[7] = 1
25        β[5] = 2
36        β[6] = 3
47        β[7] = 4

Table 5.3: Computation of the border array for pattern P = abaabaa from Example 5.8
The resulting border array is in Table 5.4.

Index   1  2  3  4  5  6  7
Symbol  a  b  a  a  b  a  a
β       0  0  1  1  2  3  4

Table 5.4: The border array for pattern P = abaabaa from Example 5.8

The values β[1] and β[2] are equal to zero because strings a and ab have borders of zero length. This border array is equal to the backward transition function φ. The optimized backward transition function φ_opt is for some indices different from function φ. Both functions φ and φ_opt have values according to Table 5.5.

I   φ[I]  φ_opt[I]
1   0     0
2   0     0
3   1     1
4   1     0
5   2     0
6   3     1
7   4     4

Table 5.5: Functions φ and φ_opt for pattern P = abaabaa from Example 5.8

The complete MP and KMP automata are depicted in Fig. 5.10.

Figure 5.10: KMP searching automaton for pattern P = abaabaa from Example 5.8; nontrivial values of φ are shown by dashed lines, nontrivial values of φ_opt are shown by dotted lines, trivial values of both functions lead to the initial state 0

Let us have a text starting with prefix T = abaabac... We show the behaviour of both variants of the KMP automaton. The first one is the MP variant and it uses φ.
(0, abaabac...) ⊢ (1, baabac...)
                ⊢ (2, aabac...)
                ⊢ (3, abac...)
                ⊢ (4, bac...)
                ⊢ (5, ac...)
                ⊢ (6, c...)   fail
                ⊢ (3, c...)   fail
                ⊢ (1, c...)   fail
                ⊢ (0, c...)
                ⊢ (0, ...)
                ...
The second one is the KMP variant using φ_opt.
(0, abaabac...) ⊢ (1, baabac...)
                ⊢ (2, aabac...)
                ⊢ (3, abac...)
                ⊢ (4, bac...)
                ⊢ (5, ac...)
                ⊢ (6, c...)   fail
                ⊢ (1, c...)   fail
                ⊢ (0, c...)
                ⊢ (0, ...)
                ...
We can see from these two sequences of transitions that the MP variant compares symbol c 4 times and the KMP variant compares symbol c 3 times. 2
Both variants of the KMP algorithm have linear time and space complexities. If we have text T = t_1 t_2 ... t_n and pattern P = p_1 p_2 ... p_m, then the KMP algorithm requires [Cro97]:
- 2m − 3 symbol comparisons during the preprocessing phase (computation of function φ or φ_opt),
- 2n − 1 symbol comparisons during the searching phase,
- m elements of memory to store the values of function φ or φ_opt.
The final result is that the time complexity is O(n + m) and the space complexity is O(m). Let us note that this complexity is not influenced by the size of the alphabet, in contrast with deterministic finite automata.
5.3 AC algorithm
The AC (Aho-Corasick) algorithm is a simulator of the SFFECO automaton (see Section 2.2.5) for exact matching of a finite set of patterns. It is based on the same principle as the KMP algorithm and uses the searching automaton with the restricted backward transition function:
φ: (Q − {q_0}) → Q.
We start with the construction of the forward transition function.
Algorithm 5.9
Construction of the forward transition function of AC automaton.
Input: Finite set of patterns S = {P_1, P_2, ..., P_|S|}, where P_i ∈ A+, 1 ≤ i ≤ |S|.
Output: Deterministic finite automaton M = (Q, A, δ, q_0, F) accepting the set S.
Method:
1. Q = {q_0}, q_0 is the initial state.
2. Create all possible states. Each new state q of the AC automaton corresponds to some prefix a_1 a_2 ... a_j of one or more patterns. Define δ(q, a_{j+1}) = q', where q' corresponds to prefix a_1 a_2 ... a_j a_{j+1} of one or more patterns. Q = Q ∪ {q'}.
3. For state q_0 define δ(q_0, a) = q_0 for all such a that δ(q_0, a) was not defined in step 2.
4. δ(q, a) = fail for all q and a for which δ(q, a) was not defined in steps 2 or 3.
5. Each state corresponding to a complete pattern will be a final state. This holds also in the case when one pattern is either a prefix or a substring of another pattern. 2
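Steps 1 and 2 of Algorithm 5.9 amount to building a trie of the pattern set. A minimal sketch follows; here absent dictionary entries play the role of fail, the self loop of the initial state is left to the search procedure, and marking patterns that end inside other patterns as final is deferred to the backward function. All names are ours:

```python
def build_goto(patterns):
    """Forward (goto) function as a trie: goto[q] maps a symbol to the next
    state; out[q] collects the patterns whose last symbol ends in state q."""
    goto, out = [{}], [set()]           # state 0 is the initial state
    for p in patterns:
        q = 0
        for a in p:
            if a not in goto[q]:
                goto.append({})
                out.append(set())
                goto[q][a] = len(goto) - 1
            q = goto[q][a]
        out[q].add(p)                   # q corresponds to the complete pattern
    return goto, out

goto, out = build_goto(["ab", "babb", "bb"])
print(len(goto))   # 8: one state per distinct prefix, plus the initial state
```

For the set S = {ab, babb, bb} of the later Example 5.12 this yields eight states, one per distinct prefix.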
The same result as by Algorithm 5.9 can be obtained by a modification of the SFFECO automaton. This modification consists of two steps:
1. Removing some number of self loops in the initial state. Removed are the self loops for symbols for which there exist transitions from the initial state to the next states.
2. Determinization of the automaton resulting from step 1.
Step 2 must be used only in the case when some patterns in set S = {P_1, P_2, ..., P_|S|} have equal prefixes. Otherwise the resulting automaton is deterministic after step 1.
The next algorithm constructs the backward transition function. Let us note that the depth of state q is the number of forward transitions from state q_0 to state q.
Algorithm 5.10
Construction of the backward transition function of AC automaton for set S = {P_1, P_2, ..., P_|S|}.
Input: Deterministic finite automaton M accepting set S.
Output: Complete AC automaton with backward transition function φ.
Method:
1. Let Q = Q_0 ∪ Q_1 ∪ ... ∪ Q_maxd, where the elements of Q_d are states with depth d. Let us note:
   (a) Q_0 = {q_0},
   (b) Q_i ∩ Q_j = ∅ for i ≠ j, 0 ≤ i, j ≤ maxd.
2. φ(q) = q_0 for all q ∈ Q_1.
3. For d = 2, ..., maxd and all states in Q_d construct the mborder array mβ[2..n]. 2
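The depth-by-depth processing of Algorithm 5.10 is commonly realized as a breadth-first search over the trie: a state of depth 1 falls back to the initial state, and a deeper state follows the fail links of its parent. A sketch under that formulation (the hand-written `goto` below is the trie for S = {ab, babb, bb} with the book's subscripted state names renamed to integers: 1 = a, 2 = ab, 3 = b, 4 = ba, 5 = bab, 6 = babb, 7 = bb):

```python
from collections import deque

def build_fail(goto):
    """Backward (fail) function phi by breadth-first search over the trie."""
    fail = [0] * len(goto)
    queue = deque(goto[0].values())     # depth-1 states: phi = initial state
    while queue:
        q = queue.popleft()
        for a, nxt in goto[q].items():
            f = fail[q]
            while f and a not in goto[f]:
                f = fail[f]             # walk the fail links of the parent
            fail[nxt] = goto[f].get(a, 0)
            queue.append(nxt)
    return fail

goto = [{'a': 1, 'b': 3}, {'b': 2}, {}, {'a': 4, 'b': 7},
        {'b': 5}, {'b': 6}, {}, {}]
print(build_fail(goto))   # [0, 0, 3, 0, 1, 2, 7, 3]
```

The computed values (ab falls back to b, ba to a, bab to ab, babb to bb, bb to b) agree with the mborder array of the later Example 5.12.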
The computation of the backward transition function is depicted in Fig. 5.11.

Figure 5.11: Computation of φ(q')
This backward transition function is not optimized, similarly to the case of the MP searching automaton. The optimized version φ_opt of function φ is constructed by the next algorithm.
Algorithm 5.11
Construction of optimized backward transition function φ_opt of AC searching automaton.
Input: AC searching automaton with backward transition function φ (not optimized).
Output: Optimized backward transition function φ_opt for the input AC searching automaton.
Method:
1. Let Q = {q_0} ∪ Q_1 ∪ ... ∪ Q_maxd, where the elements of Q_d are states with the depth d.
2. φ_opt(q) = φ(q) = q_0 for all q ∈ Q_1.
3. For d = 2, 3, ..., maxd and all states q ∈ Q_d do:
   (a) X = {a : a ∈ A, δ(q, a) ≠ fail},
   (b) Y = {a : a ∈ A, δ(φ(q), a) ≠ fail},
   (c) if Y ⊆ X then φ_opt(q) = φ_opt(φ(q)) else φ_opt(q) = φ(q). 2
The principle of the construction of the optimized backward transition function φ_opt is depicted in Fig. 5.12.
is depicted in Fig. 5.12.
b
c
a
a
j j j
opt opt
( )= q ( (q))
q
b
X= a,b , Y= a , Y X { } { }
j(q)
START
Figure 5.12: Construction of
opt
(q)
Example 5.12
Let us construct the AC searching automaton for the set of patterns S = {ab, babb, bb} and alphabet A = {a, b, c}. First we construct the forward transition function. It is depicted in Fig. 5.13.

Figure 5.13: Forward transition function of AC searching automaton for set of patterns S = {ab, babb, bb} and A = {a, b, c} from Example 5.12

Further we will construct both backward transition functions φ and φ_opt. To construct the nonoptimized backward transition function we construct the mborder array mβ for the set of patterns S = {ab, babb, bb}. The nondeterministic factor automaton has the transition diagram depicted in Fig. 5.14.

Figure 5.14: Transition diagram of the nondeterministic factor automaton for set of patterns S = {ab, babb, bb} from Example 5.12

Table 5.6 is the transition table of the deterministic factor automaton for the set of patterns S = {ab, babb, bb}. The transition diagram of this factor automaton is depicted in Fig. 5.15.
                         a           b
0                        {1_1, 2_2}  {1_23, 2_1, 2_3, 3, 4}
{1_1, 2_2}                           {2_1, 3}
{1_23, 2_1, 2_3, 3, 4}   {2_2}       {2_3, 4}
{2_1, 3}                             {4}
{2_2}                                {3}
{2_3, 4}
{3}                                  {4}
{4}

Table 5.6: Transition table of the deterministic factor automaton for set S = {ab, babb, bb} from Example 5.12
The analysis of d-subsets of the deterministic factor automaton is summarized in Table 5.7. The next table shows the resulting mborder array:

State   1_1  1_23  2_1   2_2  2_3   3    4
Symbol  a    b     b     a    b     b    b
mβ      0    0     1_23  1_1  1_23  2_1  2_3
Figure 5.15: Transition diagram of the deterministic factor automaton for set of patterns S = {ab, babb, bb} from Example 5.12

d-subset                 Values of elements of mborder array
{1_23, 2_1, 2_3, 3, 4}   mβ[2_1] = 1_23, mβ[2_3] = 1_23, mβ[3] = 1_23, mβ[4] = 1_23
{1_1, 2_2}               mβ[2_2] = 1_1
{2_1, 3}                 mβ[3] = 2_1
{2_3, 4}                 mβ[4] = 2_3

Table 5.7: Computation of the mborder array for set of patterns S = {ab, babb, bb} from Example 5.12

The values of mβ[1_1] and mβ[1_23] are equal to zero because strings a and b have borders of zero length. This mborder array is equal to the backward transition function φ. The optimized backward transition function φ_opt differs from φ for some states. The forward transition function and both functions φ and φ_opt have values according to Table 5.8. The complete AC searching automaton is depicted in Fig. 5.16. The backward transition function φ is shown by dashed lines; the optimized backward transition function is shown by dotted lines in the cases when φ ≠ φ_opt. Fig. 5.17 shows the transition diagram of the deterministic automaton for the nondeterministic SFFECO automaton for S = {ab, babb, bb}. It is included in order to enable a comparison of the deterministic finite automaton and the AC searching automaton.
Let us show the sequence of transitions of the resulting AC searching automaton for input string bbabb.

(0, bbabb) ⊢ (1_23, babb)
           ⊢ (2_3, abb)    bb found, fail
           ⊢ (1_23, abb)
           ⊢ (2_2, bb)
           ⊢ (3, b)        ab found
           ⊢ (4, ε)        bb and babb found

For comparison we show the sequence of transitions of the deterministic finite automaton:

(0, bbabb) ⊢ ({0, 1_23}, babb)
           ⊢ ({0, 1_23, 2_3}, abb)    bb found
           ⊢ ({0, 1_1, 2_2}, bb)
           ⊢ ({0, 1_23, 2_1, 3}, b)   ab found
           ⊢ ({0, 1_23, 2_3, 4}, ε)   bb and babb found
2
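The whole run of the example can be reproduced by a compact AC implementation. This is our own sketch using the common output-set formulation: each state inherits the output of its fail target, which plays the role of the "bb found, fail" reporting steps above.

```python
from collections import deque

def aho_corasick(patterns):
    """Build goto (forward), fail (backward) and output sets."""
    goto, out = [{}], [set()]
    for p in patterns:                       # trie construction
        q = 0
        for a in p:
            if a not in goto[q]:
                goto.append({}); out.append(set())
                goto[q][a] = len(goto) - 1
            q = goto[q][a]
        out[q].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:                             # fail function by BFS
        q = queue.popleft()
        out[q] |= out[fail[q]]               # patterns ending in fail[q] end here too
        for a, nxt in goto[q].items():
            f = fail[q]
            while f and a not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(a, 0)
            queue.append(nxt)
    return goto, fail, out

def ac_search(text, patterns):
    goto, fail, out = aho_corasick(patterns)
    q, found = 0, []
    for i, a in enumerate(text, 1):
        while q and a not in goto[q]:        # backward transitions
            q = fail[q]
        q = goto[q].get(a, 0)                # forward transition (self loop in 0)
        for p in out[q]:
            found.append((i, p))             # pattern p ends at position i
    return found

print(sorted(ac_search("bbabb", ["ab", "babb", "bb"])))
# [(2, 'bb'), (4, 'ab'), (5, 'babb'), (5, 'bb')]
```

The reported occurrences (bb at position 2, ab at position 4, bb and babb at position 5) match the transition sequence shown above.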
       δ                       φ     φ_opt
       a     b     c
0      1_1   1_23  0
1_1    fail  2_1   fail        0     0
1_23   2_2   2_3   fail        0     0
2_1    fail  fail  fail        1_23  1_23
2_2    fail  3     fail        1_1   0
2_3    fail  fail  fail        1_23  1_23
3      fail  4     fail        2_1   1_23
4      fail  fail  fail        2_3   1_23

Table 5.8: The forward and both backward transition functions for set of patterns S = {ab, babb, bb} from Example 5.12

Figure 5.16: Complete AC searching automaton for set of patterns S = {ab, babb, bb} from Example 5.12
Similarly to the KMP algorithm, the AC algorithm has linear time and space complexities. If we have alphabet A, text T = t_1 t_2 ... t_n and set of patterns S = {P_1, P_2, ..., P_|S|}, where |S| = Σ_{i=1}^{|S|} |P_i|, then the AC algorithm needs [CH97b]:
- O(|S|) time for the preprocessing phase (construction of the AC searching automaton),
- O(n) time for the searching phase,
- O(|S| · |A|) space to store the AC searching automaton.
The space requirement can be reduced to O(|S|). In this case the searching phase has time complexity O(n log |A|). See more details in Crochemore and Hancart [CH97b].

Figure 5.17: Transition diagram of the deterministic finite automaton for S = {ab, babb, bb} from Example 5.12; transitions for symbol c from all states lead to state 0
6 Simulation of nondeterministic finite automata - dynamic programming and bit parallelism
In the case when the space complexity or the preprocessing time of the DFA makes it unusable, we can use some deterministic simulation method for the corresponding NFA. At the beginning of this section we describe a basic simulation method, which is the base of the other simulation methods presented further in this section. This method can be used for any general NFA. The other simulation methods improve the complexity of the simulation, but on the other hand they place some requirements on the NFAs in order to be more efficient.
The simulation methods will be presented on NFAs for exact and approximate string matching. In this section we will use another version of NFAs for approximate string matching, where all transitions for the edit operations replace and insert are labeled by the whole alphabet instead of the complement of the matching symbol. This simplifies the formulae for the simulation while the behavior of the NFAs practically does not change.
6.1 Basic simulation method
In Algorithm 6.1 we show the basic algorithm, which is very similar to the transformation of an NFA to the equivalent DFA. In each step of the run of the NFA a new set of active states is computed by evaluating all transitions from all states of the previous set of active states. This ensures that all possible paths labeled by the input string are considered in the NFA. The simulation finishes when the end of the input text is reached or when there is no active state (i.e., no accepting path exists).
Algorithm 6.1 (Simulation of run of NFA - basic method)
Input: NFA M = (Q, A, δ, q_0, F), input text T = t_1 t_2 ... t_n.
Output: Output of run of NFA.
Method: Set S of active states is used.
  S := ε-CLOSURE({q_0})
  i := 1
  while i ≤ n and S ≠ ∅ do
    /* transitions are performed for all elements of S */
    S := ⋃_{q ∈ S} ε-CLOSURE(δ(q, t_i))
    if S ∩ F ≠ ∅ then
      write(information associated with each final state in S ∩ F)
    endif
    i := i + 1
  endwhile
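Algorithm 6.1 translates almost literally into a set-based sketch. The NFA used below for illustration is the one for exact matching of P = aba with a self loop in q_0 (the same automaton appears in Example 6.5); the function and table names are ours:

```python
def run_nfa(delta, q0, final, text, closure=None):
    """Basic simulation: delta maps (state, symbol) to a set of states;
    closure maps a state to its epsilon-CLOSURE (identity when None)."""
    clo = closure or (lambda q: {q})
    S = set(clo(q0))                                   # active states
    hits = []
    for i, a in enumerate(text, 1):
        S = {p for q in S for r in delta.get((q, a), set()) for p in clo(r)}
        if S & final:
            hits.append((i, frozenset(S & final)))
        if not S:
            break                                      # no active state: stop
    return hits

# NFA for exact matching of P = aba; the self loop in q0 covers alphabet {a,b,c}
delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (0, 'c'): {0},
         (1, 'b'): {2}, (2, 'a'): {3}}
print(run_nfa(delta, 0, {3}, "accabcaaba"))  # [(10, frozenset({3}))]
```

The final state becomes active exactly once, after the occurrence of aba ending at position 10.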
In the transformation of an NFA to the equivalent DFA, all the possible configurations of set S of active states are evaluated, as well as all the possible transitions among these configurations, and a deterministic state is assigned to each such configuration of S. Using the simulation method from Algorithm 6.1, only the current configuration of set S is evaluated in each step of the simulation of the NFA: only the used configurations are evaluated during the simulation.
It is also possible to combine the simulation and the transformation. When processing an input text, we can store the used configurations of S in some state-cache and assign deterministic states to them. In such a way we transform the NFA to the DFA incrementally, but we evaluate only used states and transitions. If the state-cache is full, we can use one of the cache techniques for making room in the state-cache, e.g., removing the least recently used states. This solution has the following advantages: we can control the amount of used memory (the size of the state-cache) and for the most frequent configurations we do not always need to compute the most frequent transitions, but we can directly use the information stored in the state-cache (using a DFA transition has better time complexity than computing a new configuration).
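The combination of simulation and transformation described above can be sketched with a dictionary as the state-cache; eviction of least recently used entries is omitted here, and all names are ours:

```python
def make_cached_step(delta, closure=None):
    """Incremental determinization sketch: each used configuration of active
    states becomes a cached 'deterministic' transition the first time it is
    evaluated; later uses are a single dictionary lookup."""
    clo = closure or (lambda q: {q})
    cache = {}                                # the state-cache
    def step(S, a):
        key = (frozenset(S), a)
        if key not in cache:                  # evaluated at most once per key
            cache[key] = frozenset(p for q in S
                                   for r in delta.get((q, a), set())
                                   for p in clo(r))
        return cache[key]
    return step, cache

delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}, (2, 'a'): {3}}
step, cache = make_cached_step(delta)
S = frozenset({0})
for a in "abab":
    S = step(S, a)
print(sorted(S), len(cache))  # [0, 2] 4
```

A bounded cache would additionally evict entries (e.g., least recently used) when full, as the text suggests.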
Theorem 6.2
The basic simulation method shown in Algorithm 6.1 simulates the run of an NFA.
Proof
Let M = (Q, A, δ, q_0, F) be an NFA and T = t_1 t_2 ... t_n be an input text. Algorithm 6.1 considers really all paths (i.e., sequences of configurations) leading from q_0.
At the beginning of the algorithm, set S of active states contains ε-CLOSURE({q_0}): each path must start in q_0 and then some of the ε-transitions leading from q_0 can be used. In this way all configurations (q_j, T), q_j ∈ Q, (q_0, T) ⊢*_M (q_j, T), are considered.
In each i-th step of the algorithm, 1 ≤ i ≤ n, all transitions (relevant to T) leading from all states of S are evaluated, i.e., both the transitions labeled by t_i as well as the ε-transitions. At first all states reachable by transitions labeled by t_i from each state of S are inserted into the new set S and then the ε-CLOSURE of the new set S is also inserted into the new set S. In this way all configurations (q_j, t_{i+1} t_{i+2} ... t_n), q_j ∈ Q, (q_0, T) ⊢*_M (q_j, t_{i+1} t_{i+2} ... t_n), are considered in the i-th step of the algorithm.
In each step of the algorithm, the set S of active states is tested for final states and for each such found active final state the information associated with this state is reported. 2
6.1.1 Implementation
6.1.1.1 NFA without ε-transitions This basic method can be implemented using bit vectors. Let M = (Q, A, δ, q_0, F) be an NFA without ε-transitions and T = t_1 t_2 ... t_n be an input text of M. We implement the transition function as a transition table T (of size |Q| × |A|) of bit vectors:

  T[i, a] = [τ_0, τ_1, ..., τ_{|Q|−1}]ᵀ                                  (5)

where a ∈ A and bit τ_j = 1, if q_j ∈ δ(q_i, a), or τ_j = 0 otherwise. Then we also implement the set F of final states as a bit vector F:

  F = [f_0, f_1, ..., f_{|Q|−1}]ᵀ                                        (6)

where bit f_j = 1, if q_j ∈ F, or f_j = 0 otherwise. In each step i, 0 ≤ i ≤ n, (i.e., after reading symbol t_i) of the simulation of the NFA run, the set S of active states is represented by bit vector S_i:

  S_i = [s_{0,i}, s_{1,i}, ..., s_{|Q|−1,i}]ᵀ                            (7)

where bit s_{j,i} = 1, if state q_j is active (i.e., q_j ∈ S) in the i-th simulation step, or s_{j,i} = 0 otherwise.
When constructing S_{i+1}, we can evaluate all transitions labeled by t_{i+1} leading from state q_j, which is active in the i-th step, at once, just using the bitwise operation or for bit-vector S_{i+1} and bit-vector T[j, t_{i+1}]. This implementation is used in Algorithm 6.3.
Note that the first for cycle of Algorithm 6.3 is a multiplication of vector S_i by matrix T[·, t_i]. This approach is also used in quantum automata [MC97], where quantum computation is used.
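The bit-vector implementation can be sketched with Python integers standing for the vectors S_i and T[j, a] (a sketch only: a real implementation would pack the bits into machine words of size w, as in the complexity analysis; all names are ours):

```python
def run_nfa_bits(table, final_mask, n_states, text):
    """Bit-vector variant of the basic method: bit j of a mask stands for
    state q_j; table[a][j] is the mask of states entered from q_j on a."""
    S = 1                                    # only q_0 active at the beginning
    hits = []
    for i, a in enumerate(text, 1):
        row = table.get(a, [0] * n_states)
        new = 0
        for j in range(n_states):
            if S >> j & 1:                   # q_j active: or-in its row
                new |= row[j]
        S = new
        if S & final_mask:
            hits.append(i)                   # some final state is active
        if not S:
            break
    return hits

# bit-vector transition table for the NFA of P = aba (cf. Example 6.5)
table = {'a': [0b0011, 0b0000, 0b1000, 0b0000],
         'b': [0b0001, 0b0100, 0b0000, 0b0000],
         'c': [0b0001, 0b0000, 0b0000, 0b0000]}
print(run_nfa_bits(table, 0b1000, 4, "accabcaaba"))  # [10]
```

On the text of Example 6.5 the final state q_3 (mask 0b1000) becomes active only at position 10.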
Theorem 6.4
The simulation of the run of a general NFA runs in time O(n |Q| ⌈|Q|/w⌉) and space¹ O(|A| |Q| ⌈|Q|/w⌉), where n is the length of the input text, |Q| is the number of states of the NFA, A is the input alphabet, and w is the size of the used computer word in bits.

¹ For the space complexity of this theorem we expect a complete transition table for the representation of the transition function.

Algorithm 6.3 (Simulation of run of NFA - bit-vector implementation of basic method)
Input: Transition table T and set F of final states of NFA, input text T = t_1 t_2 ... t_n.
Output: Output of run of NFA.
Method:
  S_0 := [100...0]   /* only q_0 is active at the beginning */
  i := 1
  while i ≤ n and S_{i−1} ≠ [00...0] do
    S_i := [00...0]
    for j := 0, 1, ..., |Q| − 1 do
      if s_{j,i−1} = 1 then   /* q_j is active in (i−1)-th step */
        S_i := S_i or T[j, t_i]   /* evaluate transitions for q_j */
      endif
    endfor
    for j := 0, 1, ..., |Q| − 1 do
      if s_{j,i} = 1 and f_j = 1 then   /* if q_j is an active final state */
        write(information associated with final state q_j)
      endif
    endfor
    i := i + 1
  endwhile

Proof
See the basic simulation method in Algorithm 6.3. The main while-cycle is performed at most n times and both inner for-cycles are performed just |Q| times. If |Q| > w (i.e., more than one computer word must be used to implement the bit-vector for all states of the NFA), each elementary bitwise operation must be split into ⌈|Q|/w⌉ bitwise operations. It gives us time complexity O(n |Q| ⌈|Q|/w⌉). The space complexity is given by the implementation of the transition function by transition table T, which contains (|Q| × |A|) bit-vectors, each of size ⌈|Q|/w⌉. 2
Example 6.5
Let M be an NFA for the exact string matching for pattern P = aba and T = accabcaaba be an input text.
The transition table δ and its bit-vector representation T are as follows (each bit vector is written as a row; bit j corresponds to state q_j):

  δ_M   a     b    A − {a, b}
  0     0,1   0    0
  1           2
  2     3
  3

  T_M   a     b     A − {a, b}
  0     1100  1000  1000
  1     0000  0010  0000
  2     0001  0000  0000
  3     0000  0000  0000
The process of the simulation of M over T is displayed by the set S of active states and its bit-vector representation S_i:

       -     a       c     c     a       b       c     a       a       b       a
  S_i  1000  1100    1000  1000  1100    1010    1000  1100    1100    1010    1101
  S    q_0   q_0q_1  q_0   q_0   q_0q_1  q_0q_2  q_0   q_0q_1  q_0q_1  q_0q_2  q_0q_1q_3
6.1.1.2 NFA with ε-transitions If we have an NFA with ε-transitions, we can transform it to an equivalent NFA without ε-transitions using Algorithm 6.6. There are also other algorithms for removing ε-transitions, but this algorithm does not change the states of the NFA.
Algorithm 6.6 ensures that all transitions labeled by symbols (not ε-transitions) leading from all states of ε-CLOSURE({q_0}) lead also from q_0. Then for each state q ∈ Q and symbol a ∈ A all states accessible from δ(q, a) (i.e., ε-CLOSURE(δ(q, a))) are inserted into δ'(q, a).
Note that the resulting NFA M' can contain inaccessible states. Each state q, q ∈ S − {q_0}, of NFA M' is inaccessible if the only incoming transitions into this state q in NFA M are the ε-transitions leading from q_0.
The ε-transitions can also be removed directly during the construction of the NFA.
There is also another possibility of the simulation of an NFA with ε-transitions. We can implement the ε-transitions by a table E (of size |Q|) of bit-vectors:
Algorithm 6.6 (Removing ε-transitions from NFA)
Input: NFA M = (Q, A, δ, q_0, F) with ε-transitions.
Output: NFA M' = (Q, A, δ', q_0, F') without ε-transitions.
Method:
  S := ε-CLOSURE({q_0})
  for each a ∈ A do
    δ'(q_0, a) := ⋃_{q ∈ S} ε-CLOSURE(δ(q, a))
  endfor
  for each q ∈ Q − {q_0} do
    for each a ∈ A do
      δ'(q, a) := ε-CLOSURE(δ(q, a))
    endfor
  endfor
  if S ∩ F ≠ ∅ then
    F' := F ∪ {q_0}
  else
    F' := F
  endif
  E[i] = [e_0, e_1, ..., e_{|Q|−1}]ᵀ                                     (8)

where bit e_j = 1, if q_j ∈ ε-CLOSURE({q_i}), or e_j = 0 otherwise.
This implementation is used in Algorithm 6.7, where ε-CLOSURE(S) is computed in each step of the simulation. The time and space complexities are asymptotically the same as in Algorithm 6.3, therefore Theorem 6.4 holds for all NFAs (with as well as without ε-transitions).
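Algorithm 6.6 in a set-based sketch, with ε-CLOSURE computed by a simple reachability search; the tiny example NFA and all names are ours:

```python
def eps_closure(eps, q):
    """epsilon-CLOSURE of one state; eps maps a state to the set of states
    reachable by a single epsilon-transition."""
    seen, stack = {q}, [q]
    while stack:
        for r in eps.get(stack.pop(), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def remove_eps(states, alphabet, delta, eps, q0, final):
    """Same state set, epsilon-free transition function delta2."""
    S = eps_closure(eps, q0)
    delta2 = {}
    for a in alphabet:                      # transitions of the initial state
        delta2[(q0, a)] = set()
        for q in S:
            for r in delta.get((q, a), set()):
                delta2[(q0, a)] |= eps_closure(eps, r)
    for q in states - {q0}:                 # all remaining states
        for a in alphabet:
            delta2[(q, a)] = set()
            for r in delta.get((q, a), set()):
                delta2[(q, a)] |= eps_closure(eps, r)
    final2 = final | ({q0} if S & final else set())
    return delta2, final2

# tiny NFA: 0 --eps--> 1 --a--> 2
delta2, final2 = remove_eps({0, 1, 2}, {'a'}, {(1, 'a'): {2}},
                            {0: {1}}, 0, {2})
print(delta2[(0, 'a')], final2)  # {2} {2}
```

After removal, state 0 gains the a-transition that previously required the ε-step through state 1, and state 1 becomes inaccessible, as the text predicts.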
Lemma 6.8
Let NFA M = (Q, A, δ, q_0, F) be implemented by bit-vectors as shown in the previous subsection. Then Algorithm 6.6 runs in time O(|A| |Q|² ⌈|Q|/w⌉) and space O(|A| |Q| ⌈|Q|/w⌉), where w is the length of the used computer word in bits.
Proof
Let the ε-transitions be implemented by table E as shown above (see Formula 8). The statement in the first for cycle of Algorithm 6.6 contains the cycle performed for all states of S (O(|Q|)), in which there is nested another for cycle performed for all states of δ(q, a) (O(|Q|)).
Algorithm 6.7 (Simulation of run of NFA with ε-transitions - bit-vector implementation of basic method)
Input: Transition tables T and E, and set F of final states of NFA, input text T = t_1 t_2 ... t_n.
Output: Output of run of NFA.
Method:
  /* only ε-CLOSURE({q_0}) is active at the beginning */
  S_0 := [100...0] or E[0]
  i := 1
  while i ≤ n and S_{i−1} ≠ [00...0] do
    S_i := [00...0]
    for j := 0, 1, ..., |Q| − 1 do
      if s_{j,i−1} = 1 then   /* q_j is active in (i−1)-th step */
        S_i := S_i or T[j, t_i]   /* evaluate transitions for q_j */
      endif
    endfor
    for j := 0, 1, ..., |Q| − 1 do   /* construct ε-CLOSURE(S_i) */
      if s_{j,i} = 1 then
        S_i := S_i or E[j]
      endif
    endfor
    for j := 0, 1, ..., |Q| − 1 do
      if s_{j,i} = 1 and f_j = 1 then   /* if q_j is an active final state */
        write(information associated with final state q_j)
      endif
    endfor
    i := i + 1
  endwhile
The statement inside the next two nested for cycles (O(|Q|) and O(|A|)) also contains a cycle (ε-CLOSURE() for all states of δ(q, a): O(|Q|)).
Therefore the total time complexity is O(|A| |Q|² ⌈|Q|/w⌉). The space complexity is given by the size of the new copy M' of the NFA and no other space is needed. 2
The basic simulation method presented in this section can be used for any NFA and it runs in time O(n |Q| ⌈|Q|/w⌉) and space O(|A| |Q| ⌈|Q|/w⌉), where w is the length of the used computer word in bits. The other simulation methods shown below attempt to improve the time and space complexity, but they cannot be used for a general NFA.
6.2 Dynamic programming
Dynamic programming is a general technique widely used in various branches of computer science. It was also utilized in the approximate string matching using the Hamming distance [Mel95] (and [GL89] for the detection of all permutations of a pattern with at most k errors in a text) and in the approximate string matching using the Levenshtein distance [WF74, Sel80, Ukk85, LV88, Mel95].
In this section we describe how dynamic programming simulates the NFAs for the approximate string matching using the Hamming and Levenshtein distances. In dynamic programming, the set of active states is represented by a vector of integer variables.
6.2.1 Algorithm
The dynamic programming for the string and sequence matching computes
matrix D of size (m+1) × (n+1). Each element d_{j,i}, 0 ≤ j ≤ m, 0 < i ≤ n,
usually contains the edit distance between the string ending at the i-th position
in text T and the prefix of pattern P of length j.
6.2.2 String matching
6.2.2.1 Exact string matching The dynamic programming for the exact
string matching is the same as the dynamic programming for the approximate
string matching using the Hamming distance in which k = 0. See
the paragraph below.
6.2.2.2 Approximate string matching using Hamming distance
In the approximate string matching using the Hamming distance each element
d_{j,i}, 0 < i ≤ n, 0 ≤ j ≤ m, contains the Hamming distance between
string t_{i-j+1} . . . t_i and string p_1 . . . p_j. Elements of matrix D are computed
as follows:

d_{j,0} := j,                                  0 ≤ j ≤ m
d_{0,i} := 0,                                  0 ≤ i ≤ n
d_{j,i} := min(if t_i = p_j then d_{j-1,i-1},
               d_{j-1,i-1} + 1),               0 < i ≤ n, 0 < j ≤ m        (9)

Formula 9 exactly represents the simulation of the NFA for approximate string
matching in which transition replace is labeled by all symbols of the alphabet,
as shown in Fig. 6.1. However, we can optimize the formula a little bit and we
get Formula 10.
Figure 6.1: NFA for the approximate string matching using the Hamming
distance (m = 4, k = 3)
d_{j,0} := k + 1,                              0 < j ≤ m
d_{0,i} := 0,                                  0 ≤ i ≤ n
d_{j,i} := if t_i = p_j then d_{j-1,i-1}
           else d_{j-1,i-1} + 1,               0 < i ≤ n, 0 < j ≤ m        (10)

In Formula 10 term d_{j-1,i-1} represents matching: position i in text T
is increased, position j in pattern P is increased and edit distance d is the
same. Term d_{j-1,i-1} + 1 represents edit operation replace: position i in
text T is increased, position j in pattern P is increased and edit distance d
is increased. The value of d_{0,i}, 0 ≤ i ≤ n, is set to 0, because the Hamming
distance between two empty strings (the prefix of length 0 of the pattern and
the string of length 0 ending at position i) is 0. The value of d_{j,0}, 0 < j ≤ m,
is set to k + 1, where k is the maximum number of allowed errors (maximum
Hamming distance). In such a way it holds d_{j,i} > k for all i, j, 0 ≤ i < m,
i < j ≤ m, so all the items not satisfying condition j ≤ i exceed the maximum
acceptable value k.
Algorithm 6.9 shows the use of matrix D in the approximate string
matching.
An example of matrix D using Formula 10 for searching for pattern
P = adbbca in text T = adcabcaabadbbca with k = 3 is shown in Table 6.1.
Since we are interested in at most k errors in the found string, each value
Algorithm 6.9 (Simulation of run of NFA: dynamic programming)
Input: Pattern P = p_1 p_2 . . . p_m, input text T = t_1 t_2 . . . t_n, maximum
number of errors allowed k, k < m.
Output: Output of run of NFA.
Method:
Compute the 0-th column of matrix D
for i := 1, 2, . . . , n do
  Compute the i-th column of matrix D for input symbol t_i
  if d_{m,i} ≤ k then
    write('Pattern P with d_{m,i} errors ends at position i in text T.')
  endif
endfor
D - a d c a b c a a b a d b b c a
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a 4 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d 4 5 0 2 2 1 2 2 1 1 2 0 2 2 2 2
b 4 5 6 1 3 2 2 3 3 1 2 3 0 2 3 3
b 4 5 6 7 2 3 3 3 4 3 2 3 3 0 3 4
c 4 5 6 6 8 3 3 4 4 5 4 3 4 4 0 4
a 4 4 6 7 6 9 4 3 4 5 5 5 4 5 5 0
Table 6.1: Matrix D for pattern P = adbbca, text T = adcabcaabadbbca,
and k = 3 using the Hamming distance
of d_{j,i} greater than k + 1 can be replaced by value k + 1, which represents
that the value is greater than k. It is useful in some implementations, since we
need just ⌈log₂(k + 2)⌉ bits for each number.
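The recurrence of Formula 10 driven by Algorithm 6.9 can be sketched in Python as follows (a minimal sketch; the function name and column representation are mine, and, as Section 6.2.3 notes, only the previous column needs to be stored):

```python
# Dynamic programming for approximate string matching under the Hamming
# distance (Formula 10, Algorithm 6.9), keeping only the previous column.

def hamming_matches(P, T, k):
    """Return (end_position, errors) pairs, 1-based, where a string with
    at most k mismatches against P ends in T."""
    m, n = len(P), len(T)
    col = [0] + [k + 1] * m          # 0-th column: d[0,0]=0, d[j,0]=k+1
    out = []
    for i in range(1, n + 1):
        new = [0] * (m + 1)          # d[0,i] = 0 (self-loop of q0)
        for j in range(1, m + 1):
            # match keeps the distance, replace increases it by one
            new[j] = col[j - 1] if T[i - 1] == P[j - 1] else col[j - 1] + 1
        col = new
        if col[m] <= k:
            out.append((i, col[m]))
    return out
```

For P = adbbca, T = adcabcaabadbbca and k = 3 this reports the two occurrences visible in the last row of Table 6.1: one with 3 mismatches ending at position 7 and the exact occurrence ending at position 15.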
Theorem 6.10
The dynamic programming algorithm described by Formula 10 simulates a
run of the NFA for the approximate string matching using the Hamming
distance.
Proof
In the dynamic programming for the approximate string matching using the
Hamming distance there is for each depth j, 0 < j ≤ m, of NFA in each
step i of the run one integer variable d_{j,i} that contains the Hamming distance
between string p_1 . . . p_j and the string in text T ending at position i. Since
for each value l of the Hamming distance there is one level of states in NFA,
integer variable d_{j,i} = l contains the level number of the topmost active
state in the j-th depth of NFA. Each value of d_{j,i} greater than k + 1 can be
Figure 6.2: Dynamic programming uses for each depth of states of NFA one
integer variable d
replaced by value k + 1; it represents that there is no active state in the
j-th depth of NFA in the i-th step of the run. So in dynamic programming, the
set of active states from Section 6.1 is implemented by the vector of integer
variables, one variable for each depth of NFA.
In Formula 10 term d_{j-1,i-1} represents the matching transition: an active
state is moved from depth j - 1 to the next depth j within level d_{j-1,i-1} and
symbol t_i is read from the input. Term d_{j-1,i-1} + 1 represents transition
replace: an active state is moved from depth j - 1 and level d_{j-1,i-1} to the
next depth j and the next level d_{j-1,i-1} + 1 and symbol t_i is read from the
input.
If the condition in the if statement of Formula 10 holds (i.e., p_j is read),
only one value (d_{j-1,i-1}) is considered.
The self-loop of the initial state is represented by setting d_{0,i} := 0,
0 ≤ i ≤ n.
Therefore all transitions (paths) of the NFA are considered.
At the beginning, only the initial state is active, therefore d_{0,0} = 0 and
d_{j,0} = k + 1, 0 < j ≤ m.
If d_{m,i} ≤ k, then we report that pattern P was found with d_{m,i} errors
(the final state of level d_{m,i} is active) ending at position i. □
6.2.2.3 Approximate string matching using Levenshtein distance
In the approximate string matching using the Levenshtein distance [Sel80,
Ukk85] each element d_{j,i}, 0 < i ≤ n, 0 ≤ j ≤ m, contains the Levenshtein
distance between the string ending at position i in T and string p_1 . . . p_j.
Since the Levenshtein distance compares two strings of not necessarily equal
length, element d_{j,i} is always valid (symbols can be deleted from the pattern).
It also implies that we can directly determine only the ending position of
the found string in the text.
Figure 6.3: Range of beginnings of an occurrence of P ending at position i in T

If the found string ends at position i, it can start in front of position
i - m + 1 (at most k symbols inserted) or behind position i - m + 1 (at
most k symbols deleted): the beginning of the occurrence of pattern P can be
located at position i - m + 1 + l, -k ≤ l ≤ k, as shown in Figure 6.3.
Figure 6.4: The first and the last possible occurrence of P in T

It also implies that the first possible occurrence of the pattern in the text
can end at position m - k and the last occurrence can start at position
n - m + k + 1, as shown in Figure 6.4.
Elements of matrix D are computed as follows:

d_{j,0} := j,                                  0 ≤ j ≤ m
d_{0,i} := 0,                                  0 ≤ i ≤ n
d_{j,i} := min(if t_i = p_j then d_{j-1,i-1}
                 else d_{j-1,i-1} + 1,
               if j < m then d_{j,i-1} + 1,
               d_{j-1,i} + 1),                 0 < i ≤ n, 0 < j ≤ m        (11)
Formula 11 exactly represents the simulation of the NFA for approximate string
matching in which transitions replace and insert are labeled by all symbols of
the alphabet, as shown in Fig. 6.5.

Figure 6.5: NFA for the approximate string matching using the Levenshtein
distance (m = 4, k = 3)
In Formula 11 term d_{j-1,i-1} represents matching and term d_{j-1,i-1} + 1
represents edit operation replace. Term d_{j,i-1} + 1 represents edit operation
insert: position i in text T is increased, position j in pattern P is
not increased and edit distance d is increased. Term d_{j-1,i} + 1 represents
edit operation delete: position i in text T is not increased, position j in
pattern P is increased and edit distance d is increased.
Pattern P is found with at most k differences ending at position i if
d_{m,i} ≤ k, 1 ≤ i ≤ n. The maximum number of differences of the found
string is D_L(P, t_{i-l+1} . . . t_i) = d_{m,i}, m - k ≤ l ≤ m + k, l < i.
An example of matrix D for searching for pattern P = adbbca in text
T = adcabcaabadbbca is shown in Table 6.2.
Theorem 6.11
The dynamic programming algorithm described by Formula 11 simulates a
run of the NFA for the approximate string matching using the Levenshtein
distance.
Proof
The proof is similar to the proof of Theorem 6.10. We only have to show
the simulation of edit operations insert and delete.
D - a d c a b c a a b a d b b c a
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d 2 1 0 1 1 1 2 1 1 1 1 0 1 2 2 1
b 3 2 1 1 2 1 2 2 2 1 2 1 0 1 2 2
b 4 3 2 2 2 2 2 3 3 2 2 2 1 0 1 2
c 5 4 3 2 3 3 2 3 4 3 3 3 2 1 0 1
a 6 5 4 3 2 4 3 2 3 4 3 4 3 2 1 0
Table 6.2: Matrix D for pattern P = adbbca and text T = adcabcaabadbbca
using the Levenshtein distance
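Formula 11 translates directly into a column-by-column Python sketch (the function name and representation are mine; the delete term uses the current column, so it is resolved in increasing order of j):

```python
# Dynamic programming for approximate string matching under the
# Levenshtein distance (Formula 11), one column at a time.

def levenshtein_matches(P, T, k):
    """Return (end_position, differences) pairs, 1-based, where a string
    within Levenshtein distance k of P ends in T."""
    m, n = len(P), len(T)
    col = list(range(m + 1))             # 0-th column: d[j,0] = j
    out = []
    for i in range(1, n + 1):
        new = [0] * (m + 1)              # d[0,i] = 0
        for j in range(1, m + 1):
            # match / replace
            best = col[j - 1] if T[i - 1] == P[j - 1] else col[j - 1] + 1
            if j < m:                    # insert (not in depth m)
                best = min(best, col[j] + 1)
            best = min(best, new[j - 1] + 1)   # delete
            new[j] = best
        col = new
        if col[m] <= k:
            out.append((i, col[m]))
    return out
```

For P = adbbca and T = adcabcaabadbbca this reproduces the last row of Table 6.2, e.g. distance 2 at ending position 4, distance 1 at position 14, and distance 0 at position 15.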
In Formula 11 term d_{j-1,i-1} represents the matching transition and term
d_{j-1,i-1} + 1 represents transition replace. Term d_{j,i-1} + 1 represents
transition insert: an active state is moved from level d_{j,i-1} to the next level
d_{j,i-1} + 1 within depth j and symbol t_i is read from the input. Term
"if j < m then" provides that the insert transition is not considered in depth m,
where the NFA has no insert transition. Term d_{j-1,i} + 1 represents transition
delete: an active state is moved from depth j - 1 and level d_{j-1,i} to the
next depth j and the next level d_{j-1,i} + 1 and no symbol is read from the
input.
Since replace and insert transitions are labeled by all symbols of input
alphabet A, values d_{j-1,i-1} + 1 and d_{j,i-1} + 1 are considered even if t_i = p_j.
The contribution of the matching transition (i.e., d_{j-1,i-1}) is considered only
if t_i = p_j.
Thus all transitions of the NFA are considered. □
In [GP89] they compress the matrix D. They use the property that
(d_{j,i} - d_{j-1,i-1}) ∈ {0, 1}: the number of errors of an occurrence can only be
nondecreasing when reading symbols of that occurrence. Using this property
they shorten each column of the matrix D to k + 1 entries. They represent
the matrix D diagonal by diagonal in such a way that each line j, 0 ≤ j ≤ k,
of that new matrix contains the number of the last entry of the diagonal of
matrix D containing j errors.
6.2.2.4 Approximate string matching using generalized Levenshtein
distance For the approximate string matching using the generalized
Levenshtein distance we modify Formula 11 for the Levenshtein distance
such that we add the term for edit operation transpose. The resulting
formula is as follows:

d_{j,0} := j,                                  0 ≤ j ≤ m
d_{0,i} := 0,                                  0 ≤ i ≤ n
d_{j,i} := min(if t_i = p_j then d_{j-1,i-1}
                 else d_{j-1,i-1} + 1,
               if j < m then d_{j,i-1} + 1,
               d_{j-1,i} + 1,
               if i > 1 and j > 1
                 and t_{i-1} = p_j and t_i = p_{j-1}
                 then d_{j-2,i-2} + 1),        0 < i ≤ n, 0 < j ≤ m        (12)
Formula 12 exactly represents the simulation of the NFA for approximate string
matching shown in Fig. 6.6.

Figure 6.6: NFA for the approximate string matching using the generalized
Levenshtein distance (m = 4, k = 3)
In Formula 12 term d_{j-1,i-1} represents matching, term d_{j-1,i-1} + 1
represents edit operation replace, term d_{j,i-1} + 1 represents edit operation
insert, term d_{j-1,i} + 1 represents edit operation delete, and term
d_{j-2,i-2} + 1 represents edit operation transpose: position i in text T is
increased by 2, position j in pattern P is increased by 2 and edit distance d is
increased by 1.
An example of matrix D for searching for pattern P = adbbca in text
T = adbcbaabadbbca is shown in Table 6.3.
D - a d b c b a a b a d b b c a
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a 1 0 1 1 1 1 0 0 1 0 1 1 1 1 0
d 2 1 0 1 2 2 1 1 1 1 0 1 2 2 1
b 3 2 1 0 1 2 2 2 1 2 1 0 1 2 2
b 4 3 2 1 1 1 2 3 2 2 2 1 0 1 2
c 5 4 3 2 1 1 2 3 3 3 3 2 1 0 1
a 6 5 4 3 2 2 1 2 4 3 4 3 2 1 0
Table 6.3: Matrix D for pattern P = adbbca and text T = adbcbaabadbbca
using the generalized Levenshtein distance
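Adding the transpose term of Formula 12 to the Levenshtein sketch only requires remembering one more column, i - 2 (names and representation are mine):

```python
# Dynamic programming for the generalized Levenshtein distance
# (Formula 12): Levenshtein recurrence plus a transpose term that
# consults column i-2.

def gen_lev_matches(P, T, k):
    """Return (end_position, differences) pairs, 1-based."""
    m, n = len(P), len(T)
    prev2 = None                          # column i-2 (unused while i = 1)
    col = list(range(m + 1))              # column 0: d[j,0] = j
    out = []
    for i in range(1, n + 1):
        new = [0] * (m + 1)               # d[0,i] = 0
        for j in range(1, m + 1):
            best = col[j - 1] if T[i - 1] == P[j - 1] else col[j - 1] + 1
            if j < m:
                best = min(best, col[j] + 1)          # insert
            best = min(best, new[j - 1] + 1)          # delete
            if (i > 1 and j > 1
                    and T[i - 2] == P[j - 1] and T[i - 1] == P[j - 2]):
                best = min(best, prev2[j - 2] + 1)    # transpose
            new[j] = best
        prev2, col = col, new
        if col[m] <= k:
            out.append((i, col[m]))
    return out
```

For P = adbbca and T = adbcbaabadbbca this matches the last row of Table 6.3: distance 1 at ending position 6 (one transposition turns adbcba into adbbca) and distance 0 at position 14.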
Theorem 6.12
The dynamic programming algorithm described by Formula 12 simulates a
run of the NFA for the approximate string matching using the generalized
Levenshtein distance.
Proof
The proof is similar to the proof of Theorem 6.11. We only have to show
the simulation of edit operation transpose.
In Formula 12 term d_{j-1,i-1} represents the matching transition, term
d_{j-1,i-1} + 1 represents transition replace, term d_{j,i-1} + 1 represents
transition insert, and term d_{j-1,i} + 1 represents transition delete.
Term d_{j-2,i-2} + 1 represents transition transpose: an active state is moved
from depth j - 2 and level d_{j-2,i-2} to depth j and level d_{j-2,i-2} + 1 and
symbols t_{i-1} and t_i are read from the input. When representing transition
transpose we need no new integer variable for the state on the transpose
transition in NFA. □
6.2.3 Time and space complexity
In the algorithms presented in this subsection, matrix D of size mostly
(m + 1) × (n + 1) is computed, but in practice one needs only the one
previous column d_{i-1} (or two, d_{i-1} and d_{i-2}, for the generalized
Levenshtein distance) in order to compute column d_i. In the columns we do
not need to store the 0-th item, since this item always contains the same value
(except in step 0). Therefore the space complexity of this matrix is O(m).
All operations shown in the formulae for computing each element of matrix D
can be performed in constant time, therefore the time complexity of the
simulation of the run of NFAs for the approximate string and sequence
matching is O(mn).
In [GP89] they compacted the matrix D for the approximate string
matching using the Levenshtein distance exploiting the property that
d_{j,i} - d_{j-1,i-1} ∈ {0, 1}, 0 < i ≤ n, 0 < j ≤ m. For each diagonal of the
matrix D they store only the positions in which the value increases.
Let us remark that the size of the input alphabet can be reduced to
a reduced alphabet A', A' ⊆ A, |A'| ≤ m + 1 (A' contains all the symbols used
in pattern P and one special symbol for all the symbols not contained in
P). So |A| is bounded by the length m of pattern P.
The simulation of NFA using dynamic programming has time complexity
O(mn) and space complexity O(m).
6.3 Bit parallelism
Bit parallelism is a method that uses bit vectors and benefits from the feature
that the same bitwise operations (or, and, add, etc.) over groups of bits
(or over individual bits) can be performed at once in parallel over the whole
bit vector. The representatives of bit parallelism are the Shift-Or, Shift-And,
and Shift-Add algorithms.
Bit parallelism was used for the exact string matching (Shift-And in
[Döm64]), the multiple exact string matching (Shift-And in [Shy76]), the
approximate string matching using the Hamming distance (Shift-Add in
[BYG92]), the approximate string matching using the Levenshtein distance
(Shift-Or in [BYG92] and Shift-And in [WM92]) and for the generalized
pattern matching (Shift-Or in [Abr87]), where the pattern consists not only
of symbols but also of sets of symbols.
In this Section we will discuss only the Shift-Or and Shift-Add algorithms.
The Shift-And algorithm is the same as the Shift-Or algorithm, but the
meaning of 0 and 1 is exchanged, as is the use of the bitwise operations and
and or.
6.3.1 Algorithm
The Shift-Or algorithm uses matrices R^l, 0 ≤ l ≤ k, of size m × (n + 1) and
mask matrix D of size m × |A|. Each element r^l_{j,i}, 0 < j ≤ m, 0 ≤ i ≤ n,
contains 0, if the edit distance between string p_1 . . . p_j and the string ending
at position i in text T is at most l, or 1, otherwise. Each element d_{j,x},
0 < j ≤ m, x ∈ A, contains 0, if p_j = x, or 1, otherwise. The matrices are
implemented as tables of bit vectors as follows:

R^l_i = [r^l_{1,i}, r^l_{2,i}, . . . , r^l_{m,i}]^T and
D[x]  = [d_{1,x}, d_{2,x}, . . . , d_{m,x}]^T,
                                  0 ≤ i ≤ n, 0 ≤ l ≤ k, x ∈ A.            (13)
6.3.2 String matching
6.3.2.1 Exact string matching In the exact string matching, vectors
R^0_i, 0 ≤ i ≤ n, are computed as follows [BYG92]:

r^0_{j,0} := 1,                                0 < j ≤ m
R^0_i := shl(R^0_{i-1}) or D[t_i],             0 < i ≤ n                   (14)

In Formula 14 operation shl() is the bitwise operation left shift that
inserts 0 at the beginning of the vector and operation or is the bitwise
operation or. Term shl(R^0_{i-1}) or D[t_i] represents matching: position i in
text T is increased, the position in pattern P is increased by operation shl(),
and the positions corresponding to the input symbol t_i are selected by term
or D[t_i]. Pattern P is found at position t_{i-m+1} . . . t_i if r^0_{m,i} = 0,
0 < i ≤ n.
An example of mask matrix D for pattern P = adbbca is shown in
Table 6.4 and an example of matrix R^0 for exact searching for pattern P =
adbbca in text T = adcabcaabadbbca is shown in Table 6.5.
D  a b c d A - {a, b, c, d}
a  0 1 1 1 1
d  1 1 1 0 1
b  1 0 1 1 1
b  1 0 1 1 1
c  1 1 0 1 1
a  0 1 1 1 1
Table 6.4: Matrix D for pattern P = adbbca
R^0  - a d c a b c a a b a d b b c a
a    1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d    1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
b    1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
b    1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Table 6.5: Matrix R^0 for the exact string matching (for pattern P = adbbca
and text T = adcabcaabadbbca)
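With arbitrary-precision integers Formula 14 is a few lines of Python (a sketch with names of my choosing; bit j-1 of each integer corresponds to pattern position j, and 0 means "active", as in Shift-Or):

```python
# Shift-Or exact string matching (Formula 14). Python ints serve as
# bit vectors; a 0 bit marks an active state.

def shift_or(P, T):
    """Return 1-based end positions of exact occurrences of P in T."""
    m = len(P)
    full = (1 << m) - 1
    D = {c: full for c in set(T) | set(P)}    # mask matrix D
    for j, c in enumerate(P):
        D[c] &= ~(1 << j) & full              # 0 where p_{j+1} equals c
    R = full                                  # r^0_{j,0} = 1
    ends = []
    for i, c in enumerate(T, start=1):
        # shl inserts a 0 bit: the always-active initial state
        R = ((R << 1) | D[c]) & full
        if R >> (m - 1) & 1 == 0:             # final state active
            ends.append(i)
    return ends
```

On P = adbbca and T = adcabcaabadbbca it reports the single exact occurrence ending at position 15, matching the last row of Table 6.5.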
Theorem 6.13
The Shift-Or algorithm described by Formula 14 simulates a run of the NFA
for the exact string matching.
Proof
In the Shift-Or algorithm for the exact string matching there is one bit vector
R^0_i, 0 ≤ i ≤ n, which represents the set of active states of NFA. In the
vector, 0 represents an active state and 1 represents a non-active state of the
simulated NFA. So in the Shift-Or algorithm, the set of active states from
Section 6.1 is implemented by a bit vector.
In Formula 14, term shl(R^0_{i-1}) or D[t_i] represents the matching
transition: each active state is moved to the next position to the right² in the
same level. All active states are moved at once and only the transitions
corresponding to the read symbol t_i are selected by mask vector D[t_i], which
changes 0 to 1 in each state whose incoming matching transition is not labeled
by t_i. The initial state of NFA is not in vector R^0 and it is implemented by
inserting 0 at the beginning of the vector in operation shl(): the initial state
is always active because of its self-loop.
At the beginning only the initial state is active, therefore R^0_0 = 1^(m).
If r^0_{m,i} = 0, 0 < i ≤ n, then we report that the final state is active and
thus the pattern is found ending at position i in text T. □
6.3.2.2 Approximate string matching using Hamming distance
In the approximate string matching using the Hamming distance, vectors
R^l_i, 0 ≤ l ≤ k, 0 ≤ i ≤ n, are computed as follows:

r^l_{j,0} := 1,                                0 < j ≤ m, 0 ≤ l ≤ k
R^0_i := shl(R^0_{i-1}) or D[t_i],             0 < i ≤ n
R^l_i := (shl(R^l_{i-1}) or D[t_i])
         and shl(R^{l-1}_{i-1}),               0 < i ≤ n, 0 < l ≤ k        (15)

In Formula 15 operation and is the bitwise operation and. Term
shl(R^l_{i-1}) or D[t_i] represents matching and term shl(R^{l-1}_{i-1})
represents edit operation replace: position i in text T is increased, the
position in pattern P is increased, and edit distance l is increased. Pattern P
is found with at most k mismatches at position t_{i-m+1} . . . t_i if r^k_{m,i} = 0,
0 < i ≤ n. The maximum number of mismatches of the found string is
D_H(P, t_{i-m+1} . . . t_i) = l, where l is the minimum number such that
r^l_{m,i} = 0.
An example of matrices R^l for searching for pattern P = adbbca in text
T = adcabcaabadbbca with at most k = 3 mismatches is shown in Table 6.6.

² In the Shift-Or algorithm, transitions from states in our figures are
implemented by operation shl() (left shift) because of easier implementation
in the case when the number of states of NFA is greater than the length of the
computer word and vectors R have to be divided into two or more bit vectors.
R^0  - a d c a b c a a b a d b b c a
a    1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d    1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
b    1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
b    1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
R^1  - a d c a b c a a b a d b b c a
a    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d    1 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1
b    1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1
b    1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
R^2  - a d c a b c a a b a d b b c a
a    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d    1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    1 1 1 0 1 0 0 1 1 0 0 1 0 0 1 1
b    1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1
c    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
R^3  - a d c a b c a a b a d b b c a
a    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d    1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
b    1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 1
c    1 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1
a    1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0
Table 6.6: Matrices R^l for the approximate string matching using the Hamming
distance (P = adbbca, k = 3, and T = adcabcaabadbbca)
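Formula 15 extends the exact Shift-Or sketch with one vector per level; in each step every level is updated from its own shifted vector and the shifted vector of the level below from the previous step (names are mine):

```python
# Shift-Or approximate matching under the Hamming distance (Formula 15):
# one bit vector per level l = 0..k, 0 bits mark active states.

def shift_or_hamming(P, T, k):
    """Return 1-based end positions of strings within Hamming distance
    k of P in T."""
    m = len(P)
    full = (1 << m) - 1
    D = {c: full for c in set(T) | set(P)}
    for j, c in enumerate(P):
        D[c] &= ~(1 << j) & full
    R = [full] * (k + 1)                 # r^l_{j,0} = 1 on every level
    ends = []
    for i, c in enumerate(T, start=1):
        new = [((R[0] << 1) | D[c]) & full]
        for l in range(1, k + 1):
            # match on level l, and replace coming up from level l-1
            new.append((((R[l] << 1) | D[c]) & (R[l - 1] << 1)) & full)
        R = new
        if R[k] >> (m - 1) & 1 == 0:     # final state of level k active
            ends.append(i)
    return ends
```

On P = adbbca, T = adcabcaabadbbca and k = 3 this reports end positions 7 and 15, matching the zeros in the last row of matrix R^3 in Table 6.6; with k = 0 it degenerates to exact Shift-Or.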
Figure 6.7: Bit parallelism uses one bit vector R for each level of states
of NFA
Theorem 6.14
The Shift-Or algorithm described by Formula 15 simulates a run of the NFA
for the approximate string matching using the Hamming distance.
Proof
In the Shift-Or algorithm for the approximate string matching there is for
each level l, 0 ≤ l ≤ k, of states of NFA one bit vector R^l_i, 0 ≤ i ≤ n. So in
the Shift-Or algorithm, the set of active states from Section 6.1 is implemented
by bit vectors: one vector for each level of states.
In Formula 15 term shl(R^l_{i-1}) or D[t_i] represents the matching
transition (see the proof of Theorem 6.13). Term shl(R^{l-1}_{i-1}) represents
transition replace: each active state of level l - 1 is moved to the next depth
in level l.
The self-loop of the initial state is implemented by inserting 0 at the
beginning of vector R^0 within operation shl(). 0 is inserted also at the
beginning of each vector R^l, 0 < l ≤ k. Since the first state of the l-th level
is connected with the initial state by a sequence of l transitions labeled by A,
each of these first states is active from the l-th step of the simulation till the
end. The impact of the 0s inserted at the beginning of vectors R^l does not
appear before the l-th step, therefore also vectors R^l simulate the NFA
correctly.
If r^l_{m,i} = 0, 0 < i ≤ n, the final state of the l-th level is active and we
can report that the pattern is found with at most l errors ending at position i
in the text. In fact, we report just the minimum such l in each step. □
6.3.2.3 Approximate string matching using Levenshtein distance
In the approximate string matching using the Levenshtein distance, vectors
R^l_i, 0 ≤ l ≤ k, 0 ≤ i ≤ n, are computed by Formula 16. To prevent
insert transitions leading into final states we use the auxiliary vector V
defined by Formula 17.

r^l_{j,0} := 0,                                0 < j ≤ l, 0 < l ≤ k
r^l_{j,0} := 1,                                l < j ≤ m, 0 ≤ l ≤ k
R^0_i := shl(R^0_{i-1}) or D[t_i],             0 < i ≤ n
R^l_i := (shl(R^l_{i-1}) or D[t_i])
         and shl(R^{l-1}_{i-1} and R^{l-1}_i)
         and (R^{l-1}_{i-1} or V),             0 < i ≤ n, 0 < l ≤ k        (16)

V = [v_1, v_2, . . . , v_m]^T, where v_m = 1 and v_j = 0 for all j,
1 ≤ j < m.                                                                 (17)
In Formula 16 term shl(R^l_{i-1}) or D[t_i] represents matching, term
shl(R^{l-1}_{i-1}) represents edit operation replace, and term shl(R^{l-1}_i)
represents edit operation delete: the position in pattern P is increased, the
position in text T is not increased, and edit distance l is increased. Term
R^{l-1}_{i-1} represents edit operation insert: the position in pattern P is not
increased, the position in text T is increased, and edit distance l is increased.
Term or V provides that no insert transition leads from any final state.
Pattern P is found with at most k differences ending at position i if
r^k_{m,i} = 0, 0 < i ≤ n. The maximum number of differences of the found
string is l, where l is the minimum number such that r^l_{m,i} = 0.
An example of matrices R^l for searching for pattern P = adbbca in text
T = adcabcaabadbbca with at most k = 3 errors is shown in Table 6.7.
Theorem 6.15
The Shift-Or algorithm described by Formula 16 simulates a run of the NFA
for the approximate string matching using the Levenshtein distance.
Proof
The proof is similar to the proof of Theorem 6.14; we only have to add the
simulation of the insert and delete transitions.
In Formula 16 term shl(R^l_{i-1}) or D[t_i] represents the matching
transition and term shl(R^{l-1}_{i-1}) represents transition replace (see the
proof of Theorem 6.14). Term R^{l-1}_{i-1} represents transition insert: each
active state of level l - 1 is moved into level l within the same depth. Term
shl(R^{l-1}_i) represents transition delete: each active state of level l - 1 is
moved to the next depth in level l while no symbol is read from the input
(position i in text T is the same).
R^0  - a d c a b c a a b a d b b c a
a    1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
d    1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1
b    1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
b    1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
R^1  - a d c a b c a a b a d b b c a
a    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d    1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0
b    1 1 0 0 1 0 1 1 1 0 1 0 0 0 1 1
b    1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1
c    1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
R^2  - a d c a b c a a b a d b b c a
a    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
c    1 1 1 0 1 1 0 1 1 1 1 1 0 0 0 0
a    1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 0
R^3  - a d c a b c a a b a d b b c a
a    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c    1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
a    1 1 1 0 0 1 0 0 0 1 0 1 0 0 0 0
Table 6.7: Matrices R^l for the approximate string matching using the
Levenshtein distance (P = adbbca, k = 3, and T = adcabcaabadbbca)
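Formula 16 can also be sketched directly in Python (names are mine): the delete term uses the current step's vector of level l-1, so the levels are computed in increasing order of l within one step.

```python
# Shift-Or approximate matching under the Levenshtein distance
# (Formulas 16 and 17): match, replace, delete, and V-masked insert.

def shift_or_lev(P, T, k):
    """Return 1-based end positions of strings within Levenshtein
    distance k of P in T."""
    m = len(P)
    full = (1 << m) - 1
    D = {c: full for c in set(T) | set(P)}
    for j, c in enumerate(P):
        D[c] &= ~(1 << j) & full
    V = 1 << (m - 1)                        # v_m = 1: block insert at final
    # r^l_{j,0} = 0 for j <= l, 1 otherwise
    R = [full & ~((1 << l) - 1) for l in range(k + 1)]
    ends = []
    for i, c in enumerate(T, start=1):
        new = [((R[0] << 1) | D[c]) & full]
        for l in range(1, k + 1):
            x = (R[l] << 1) | D[c]                  # match
            x &= (R[l - 1] & new[l - 1]) << 1       # replace and delete
            x &= R[l - 1] | V                       # insert, except from final
            new.append(x & full)
        R = new
        if R[k] >> (m - 1) & 1 == 0:
            ends.append(i)
    return ends
```

On P = adbbca, T = adcabcaabadbbca and k = 3 the reported end positions coincide with the zeros in the last row of matrix R^3 in Table 6.7 (and with the entries of value at most 3 in the last row of Table 6.2).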
At the beginning, only the initial state q_0 and all states located on the
same ε-diagonal as q_0 are active (i.e., all states of CLOSURE(q_0) are
active), therefore the l-th bit of vector R^l, 0 < l ≤ k, is 0 in the initial
setting of the vector. The states in front of the l-th bit in vector R^l can also
be 0, since they have no impact (the l-th bit is always 0, since all states of
CLOSURE(q_0) are always active due to the self-loop of q_0). Therefore the
bits behind the l-th bit are set in the initial setting, while the initial setting
of the bits in front of the l-th bit can be arbitrary.
In the case of the Levenshtein distance, the situation with inserting 0s
at the beginning of vectors R^l, 0 < l ≤ k, during the shl() operation is
slightly different. Since all the states of the ε-diagonal leading from q_0 are
always active, these 0 insertions have no impact. □
6.3.2.4 Approximate string matching using generalized Levenshtein
distance In order to construct the Shift-Or algorithm for the approximate
string matching using the generalized Levenshtein distance [Hol97],
we modify Formula 16 for the approximate string matching using the
Levenshtein distance such that we add the term representing edit operation
transpose. Since the NFA for the approximate string matching using the
generalized Levenshtein distance has one auxiliary state on each transition
transpose (see Figure 6.6), we have to introduce new bit vectors S^l_i,
0 ≤ l < k, 0 ≤ i < n, as follows:

S^l_i = [s^l_{1,i}, s^l_{2,i}, . . . , s^l_{m,i}]^T, 0 ≤ l < k, 0 ≤ i < n. (18)

Vectors R^l_i, 0 ≤ l ≤ k, 0 ≤ i ≤ n, and S^l_i, 0 ≤ l < k, 0 ≤ i < n, are then
computed as follows:
r^l_{j,0} := 0,                                0 < j ≤ l, 0 < l ≤ k
r^l_{j,0} := 1,                                l < j ≤ m, 0 ≤ l ≤ k
R^0_i := shl(R^0_{i-1}) or D[t_i],             0 < i ≤ n
R^l_i := (shl(R^l_{i-1}) or D[t_i])
         and shl(R^{l-1}_{i-1} and R^{l-1}_i and (S^{l-1}_{i-1} or D[t_i]))
         and (R^{l-1}_{i-1} or V),             0 < i ≤ n, 0 < l ≤ k
s^l_{j,0} := 1,                                0 < j ≤ m, 0 ≤ l < k
S^l_i := shl(R^l_{i-1}) or shr(D[t_i]),        0 < i < n, 0 ≤ l < k        (19)
Term shl(R^l_{i-1}) or D[t_i] represents matching, term shl(R^{l-1}_{i-1})
represents edit operation replace, term shl(R^{l-1}_i) represents edit operation
delete, and term R^{l-1}_{i-1} represents edit operation insert.
Term (S^{l-1}_{i-1} or D[t_i]) represents edit operation transpose: the
position in pattern P is increased by 2, the position in text T is also increased
by 2, but the edit distance is increased just by 1. The increase of both
positions by 2 is provided using vector S^l_i.
Pattern P is found with at most k differences ending at position i if
r^k_{m,i} = 0, 0 < i ≤ n. The maximum number of differences of the found
string is l, where l is the minimum number such that r^l_{m,i} = 0.
An example of matrices R^l and S^l for searching for pattern P = adbbca
in text T = adbcbaabadbbca with at most k = 3 errors is shown in Table 6.8.
Theorem 6.16
The Shift-Or algorithm described by Formula 19 simulates a run of the NFA
for the approximate string matching using the generalized Levenshtein
distance.
Proof
In Formula 19 term shl(R^l_{i-1}) or D[t_i] represents the matching
transition, term shl(R^{l-1}_{i-1}) represents transition replace, term
R^{l-1}_{i-1} represents transition insert, and term shl(R^{l-1}_i) represents
transition delete (see the proof of Theorem 6.15).
Term (S^{l-1}_{i-1} or D[t_i]) represents edit operation transpose. In this
transition all states of level l are moved to the next position to the right
(shl(R^l_{i-1})) of auxiliary level l'. Then only the transitions corresponding
to input symbol t_i have to be selected. This is provided by mask vector D[t_i].
Since each transition leading to a state of depth j is labeled by symbol p_{j+1},
we have to shift the mask vector in the opposite direction to that in which
vector R^l_{i-1}
R^0  - a d b c b a a b a d b b c a
a    1 0 1 1 1 1 0 0 1 0 1 1 1 1 0
d    1 1 0 1 1 1 1 1 1 1 0 1 1 1 1
b    1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
b    1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
c    1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
S^0
a    1 1 0 1 1 1 1 1 1 1 0 1 1 1 1
d    1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
b    1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
b    1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
c    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
R^1
a    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d    1 0 0 0 1 1 0 0 0 0 0 0 1 1 0
b    1 1 0 0 0 1 1 1 0 1 0 0 0 1 1
b    1 1 1 0 0 0 1 1 1 1 1 0 0 0 1
c    1 1 1 1 0 0 1 1 1 1 1 1 0 0 0
a    1 1 1 1 1 1 0 1 1 1 1 1 1 0 0
S^1
a    1 1 0 1 1 1 1 1 1 1 0 1 1 1 1
d    1 1 1 0 1 0 1 1 0 1 1 0 0 1 1
b    1 1 1 0 1 1 1 1 0 1 1 0 0 1 1
b    1 1 1 1 0 1 1 1 1 1 1 1 1 0 1
c    1 1 1 1 1 1 0 1 1 1 1 1 1 1 0
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
R^2
a    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    1 1 0 0 0 0 0 1 0 0 0 0 0 0 0
c    1 1 1 0 0 0 0 1 1 1 1 0 0 0 0
a    1 1 1 1 0 0 0 0 1 1 1 1 0 0 0
S^2
a    1 1 0 1 1 1 1 1 1 1 0 1 1 1 1
d    1 1 1 0 1 0 1 1 0 1 1 0 0 1 1
b    1 1 1 0 1 0 1 1 0 1 1 0 0 1 1
b    1 1 1 1 0 1 1 1 1 1 1 1 1 0 1
c    1 1 1 1 1 1 0 0 1 0 1 1 1 1 0
a    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
R^3
a    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c    1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
a    1 1 1 0 0 0 0 0 1 0 1 0 0 0 0
Table 6.8: Matrices R^l and S^l for the approximate string matching using
the generalized Levenshtein distance (P = adbbca, k = 3, and T =
adbcbaabadbbca)
is shifted (shr(D[t_i])). All states of auxiliary level l' are stored in vector
S^l_i := shl(R^l_{i-1}) or shr(D[t_i]), 0 < i < n, 0 ≤ l < k.
In the next step i + 1 of the computation all states of auxiliary level l' are
moved to the next position to the right of level l + 1 and only the transitions
corresponding to input symbol t_{i+1} are selected by mask vector D[t_{i+1}].
Since each transition leading to a state of depth j + 1 is labeled by symbol
p_j, we have to shift mask vector D[t_{i+1}] in the same direction in which
vector S^l_i is shifted. Therefore we insert term shl(S^{l-1}_{i-1} or D[t_i])
into Formula 16, as well as the formula for the computation of vector S^l_i,
obtaining Formula 19. □
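Formula 19 adds the auxiliary S vectors to the Levenshtein sketch. The following is a Python sketch under my own naming; shr() is implemented as a right shift that fills the vacated top bit with 1 (that bit is shifted out again before it can influence any R vector, so the fill value does not affect the reported matches):

```python
# Shift-Or approximate matching under the generalized Levenshtein
# distance (Formula 19): R vectors per level plus auxiliary S vectors
# for the transpose transitions.

def shift_or_gen_lev(P, T, k):
    """Return 1-based end positions of strings within generalized
    Levenshtein distance k of P in T."""
    m = len(P)
    full = (1 << m) - 1
    D = {c: full for c in set(T) | set(P)}
    for j, c in enumerate(P):
        D[c] &= ~(1 << j) & full
    V = 1 << (m - 1)
    R = [full & ~((1 << l) - 1) for l in range(k + 1)]
    S = [full] * k                           # s^l_{j,0} = 1
    ends = []
    for i, c in enumerate(T, start=1):
        newR = [((R[0] << 1) | D[c]) & full]
        for l in range(1, k + 1):
            x = (R[l] << 1) | D[c]                            # match
            x &= (R[l - 1] & newR[l - 1]                      # replace, delete
                  & (S[l - 1] | D[c])) << 1                   # transpose
            x &= R[l - 1] | V                                 # insert
            newR.append(x & full)
        # S^l_i := shl(R^l_{i-1}) or shr(D[t_i]); shr fills the top with 1
        newS = [((R[l] << 1) | (D[c] >> 1) | V) & full for l in range(k)]
        R, S = newR, newS
        if R[k] >> (m - 1) & 1 == 0:
            ends.append(i)
    return ends
```

On P = adbbca and T = adbcbaabadbbca it reproduces Table 6.8: with k = 1 the transposed occurrence adbcba is reported ending at position 6, together with positions 13 and 14.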
6.3.3 Other methods of bit parallelism
In the previous sections we described only the Shift-Or algorithm. If we exchange the meaning of 0's and 1's and the usage of ands and ors in the formulae presented, we get the Shift-And algorithm [WM92].
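To make the duality concrete, here is a minimal sketch of the Shift-And variant for exact matching (the function name and the use of Python integers as stand-ins for machine-word bit-vectors are our own; a 1 bit means "this pattern prefix currently matches"):

```python
def shift_and(pattern, text):
    """Shift-And exact matching: bit j of the state vector is 1 iff
    pattern[0..j] is a suffix of the text read so far."""
    m = len(pattern)
    # Mask D[c] has bit j set iff pattern[j] == c.
    D = {}
    for j, c in enumerate(pattern):
        D[c] = D.get(c, 0) | (1 << j)
    state = 0
    matches = []
    for i, c in enumerate(text):
        # Shift in a 1 (the always-active initial state), keep only
        # the prefixes whose next symbol really is c.
        state = ((state << 1) | 1) & D.get(c, 0)
        if state & (1 << (m - 1)):
            matches.append(i - m + 1)  # start position of an occurrence
    return matches
```

For example, shift_and("abab", "ababab") reports the overlapping occurrences starting at positions 0 and 2.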
Another method of bit parallelism is the Shift-Add algorithm [BYG92]. In this algorithm we have one bit-vector, which contains m blocks of bits of size b = ⌈log2 m⌉ (one block for each depth of the NFA for the approximate string matching using the Hamming distance). The formula for computing such a vector then consists of shifting the vector by b bits and adding the mask vector for the current input symbol t. This mask vector contains 1 in the (b·j)-th position if t ≠ p_{j+1}, and 0 elsewhere. This algorithm runs in time O(⌈mb/w⌉ n).
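A sketch of this counting scheme follows (our own naming; to keep the sketch short we simply make each block one bit wider than strictly needed, so that no carry can cross block boundaries, instead of the overflow handling a real word-sized implementation requires):

```python
def shift_add(pattern, text, k):
    """Shift-Add sketch for Hamming distance: the state packs m counters
    of b bits each; counter j holds the number of mismatches between
    pattern[0..j] and the text substring ending at the current position."""
    m = len(pattern)
    b = m.bit_length() + 1               # counters can hold up to m, so no carry occurs
    full = (1 << (b * m)) - 1
    # D[c] adds 1 to counter j whenever pattern[j] != c.
    D = {c: sum(1 << (b * j) for j in range(m) if pattern[j] != c)
         for c in set(text)}
    state = 0
    starts = []
    for i, c in enumerate(text):
        # Shift every counter to the next depth, then add this symbol's mismatches.
        state = ((state << b) + D[c]) & full
        if i >= m - 1 and (state >> (b * (m - 1))) <= k:
            starts.append(i - m + 1)     # occurrence with at most k mismatches
    return starts
```

For instance, shift_add("abc", "abdabc", 1) finds the approximate occurrence "abd" at position 0 and the exact one at position 3.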
We can also use the Shift-Add algorithm for the weighted approximate string matching using the Hamming distance. In such a case each block of bits in the mask vector contains the binary representation of the weight of the corresponding edit operation replace. We also have to enlarge the length of the block of bits to prevent a carry into the next block of bits. Note that the Shift-Add algorithm can also be considered an implementation of dynamic programming.
In [BYN96a] they improve the approximate string matching using the Levenshtein distance in such a way that they search for any of the first (k+1) symbols of the pattern. If they find any of them, they start the NFA simulation, and if the simulation then reaches the initial situation (i.e., only the states located on the diagonal leading from the initial state are active), they again start searching for any of the first (k+1) symbols of the pattern, which is faster than the simulation.
The Shift-Or algorithm can also be used for the exact string matching with don't care symbols, classes of symbols, and complements, as shown in [BYG92]. In [WM92] they extend this method to unlimited wild cards and apply the above features to the approximate string matching. They also use bit parallelism for regular expressions, for the weighted approximate string matching, for sets of patterns, and for situations when errors are not allowed in some parts of the pattern (another mask vector is used).
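The classes-of-symbols extension changes only the preprocessing of the mask matrix D; the scanning loop is the unchanged Shift-Or loop. A hedged sketch (our own naming; each pattern position is a set of allowed symbols, with None standing for a don't care position):

```python
def shift_or_classes(pattern, text):
    """Shift-Or matching where pattern is a list of symbol classes:
    position j matches symbol c iff c is in pattern[j] (None = don't care).
    In Shift-Or a 0 bit means 'this prefix still matches'."""
    m = len(pattern)
    full = (1 << m) - 1
    # D[c] has a 0 at bit j iff symbol c is allowed at pattern position j.
    D = {c: full for c in set(text)}
    for j, cls in enumerate(pattern):
        for c in D:
            if cls is None or c in cls:
                D[c] &= ~(1 << j)
    state = full
    matches = []
    for i, c in enumerate(text):
        state = ((state << 1) | D[c]) & full
        if not state & (1 << (m - 1)):
            matches.append(i - m + 1)
    return matches
```

For example, the pattern [{"a"}, None, {"c"}] ("a, anything, c") matches both "abc" and "adc" in the text "abcadc".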
All the previous cases have good time complexity if the NFA simulation fits in one computer word (each vector in one computer word). If the NFA is larger, it is necessary to partition the NFA or the pattern, as described in [BYN96b, BYN96a, BYN97, BYN99, NBY98, WM92].
The Shift-Or algorithm can also be used for multiple pattern matching [BYG92, BYN97] or for the distributed pattern matching [HIMM99].
6.3.4 Time and space complexity
The time and space analysis of the Shift-Or algorithm is as follows [WM92]. Denote the computer word size by w. The preprocessing requires O(m|A|) time plus O(k⌈m/w⌉) time to initialize the k vectors. The running time is O(nk⌈m/w⌉). The space complexity is O(|A|⌈m/w⌉) for the mask matrix D plus O(k⌈m/w⌉) for the k vectors.
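To make the role of the k vectors concrete, here is a hedged sketch of the Shift-And formulation with up to k Levenshtein errors in the style of [WM92] (Python integers stand in for the ⌈m/w⌉-word vectors; the function name and comments are our own):

```python
def shift_and_k_errors(pattern, text, k):
    """Approximate matching (Levenshtein distance <= k) with k+1 state
    vectors R[0..k]: bit j of R[l] is 1 iff pattern[0..j] matches some
    suffix of the text read so far with at most l errors."""
    m = len(pattern)
    full = (1 << m) - 1
    D = {}
    for j, c in enumerate(pattern):
        D[c] = D.get(c, 0) | (1 << j)
    shl = lambda x: ((x << 1) | 1) & full     # shift, feeding the always-active initial state
    R = [(1 << l) - 1 for l in range(k + 1)]  # level l starts with l deletions allowed
    ends = []
    for i, c in enumerate(text):
        old = R[:]                            # the vectors R^l_{i-1}
        R[0] = shl(old[0]) & D.get(c, 0)
        for l in range(1, k + 1):
            R[l] = ((shl(old[l]) & D.get(c, 0))  # match
                    | shl(old[l - 1])            # substitution
                    | shl(R[l - 1])              # deletion of a pattern symbol
                    | old[l - 1]) & full         # insertion of a text symbol
        if R[k] & (1 << (m - 1)):
            ends.append(i)                    # an occurrence ends at position i
    return ends
```

Each vector fits in ⌈m/w⌉ machine words, and each text symbol costs O(k) vector operations, matching the O(nk⌈m/w⌉) bound above.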
References
[Abr87] K. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039–1051, 1987.

[BYG92] R. A. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35(10):74–82, 1992.

[BYN96a] R. A. Baeza-Yates and G. Navarro. A fast heuristic for approximate string matching. In N. Ziviani, R. Baeza-Yates, and K. Guimarães, editors, Proceedings of the 3rd South American Workshop on String Processing, pages 47–63, Recife, Brazil, 1996. Carleton University Press.

[BYN96b] R. A. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In D. S. Hirschberg and E. W. Myers, editors, Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, number 1075 in Lecture Notes in Computer Science, pages 1–23, Laguna Beach, CA, 1996. Springer-Verlag, Berlin.

[BYN97] R. A. Baeza-Yates and G. Navarro. Multiple approximate string matching. In F. K. H. A. Dehne, A. Rau-Chaplin, J.-R. Sack, and R. Tamassia, editors, Proceedings of the 5th Workshop on Algorithms and Data Structures, number 1272 in Lecture Notes in Computer Science, pages 174–184, Halifax, Nova Scotia, Canada, 1997. Springer-Verlag, Berlin.

[BYN99] R. A. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127–158, 1999.

[CH97a] R. Cole and R. Hariharan. Tighter upper bounds on the exact complexity of string matching. SIAM J. Comput., 26(3):803–856, 1997.

[CH97b] M. Crochemore and C. Hancart. Automata for matching patterns. In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, volume 2 Linear Modeling: Background and Application, chapter 9, pages 399–462. Springer-Verlag, Berlin, 1997.

[Döm64] B. Dömölki. An algorithm for syntactical analysis. Computational Linguistics, (3):29–46, 1964.

[GL89] R. Grossi and F. Luccio. Simple and efficient string matching with k mismatches. Inf. Process. Lett., 33(3):113–120, 1989.
[GP89] Z. Galil and K. Park. An improved algorithm for approximate string matching. In G. Ausiello, M. Dezani-Ciancaglini, and S. Ronchi Della Rocca, editors, Proceedings of the 16th International Colloquium on Automata, Languages and Programming, number 372 in Lecture Notes in Computer Science, pages 394–404, Stresa, Italy, 1989. Springer-Verlag, Berlin.

[HIMM99] J. Holub, C. S. Iliopoulos, B. Melichar, and L. Mouchard. Distributed string matching using finite automata. In R. Raman and J. Simpson, editors, Proceedings of the 10th Australasian Workshop On Combinatorial Algorithms, pages 114–128, Perth, WA, Australia, 1999.

[Hol97] J. Holub. Simulation of NFA in approximate string and sequence matching. In J. Holub, editor, Proceedings of the Prague Stringology Club Workshop '97, pages 39–46, Czech Technical University, Prague, Czech Republic, 1997. Collaborative Report DC-97-03.

[LV88] G. M. Landau and U. Vishkin. Fast string matching with k differences. J. Comput. Syst. Sci., 37(1):63–78, 1988.

[MC97] C. Moore and J. P. Crutchfield. Quantum automata and quantum grammars. 1997. http://xxx.lanl.gov/abs/quant-ph/9707031.

[Mel95] B. Melichar. Approximate string matching by finite automata. In V. Hlaváč and R. Šára, editors, Computer Analysis of Images and Patterns, number 970 in Lecture Notes in Computer Science, pages 342–349. Springer-Verlag, Berlin, 1995.

[NBY98] G. Navarro and R. Baeza-Yates. Improving an algorithm for approximate pattern matching. Technical Report TR/DCC-98-5, Dept. of Computer Science, University of Chile, 1998.

[Sel80] P. H. Sellers. The theory and computation of evolutionary distances: Pattern recognition. J. Algorithms, 1(4):359–373, 1980.

[Shy76] R. K. Shyamasundar. A simple string matching algorithm. Technical report, Tata Institute of Fundamental Research, 1976.

[Ukk85] E. Ukkonen. Finding approximate patterns in strings. J. Algorithms, 6(1–3):132–137, 1985.

[WF74] R. A. Wagner and M. Fischer. The string-to-string correction problem. J. Assoc. Comput. Mach., 21(1):168–173, 1974.

[WM92] S. Wu and U. Manber. Fast text searching allowing errors. Commun. ACM, 35(10):83–91, 1992.