Pattern Matching Algorithms
PATTERN MATCHING
Computers are well known for performing numerical computations, but they are equally
capable of processing textual data. A document generally contains textual data, or simply
text. Computers are used to edit documents, to search documents, to transmit documents
over the Internet, to display documents on monitors, and to print documents on printers.
Text processing is mainly concerned with manipulating or moving characters, or with
searching for patterns or words.
Before proceeding to the search algorithms for text processing, we recall the operations on
strings. Representing a string as an array of characters is simple and efficient.
For example: We may wish to determine whether or not the substring “DATA” occurs in the
text: ADVANCED DATA STRUCTURES
Two substrings are said to match when they are equal character by character, from the
first character to the last. It follows that a match occurs if the number of
character-for-character matches is equal to the length of both the substring sought and
the text substring.
A mismatch between two substrings therefore means that the two substrings are not the
same character for character; in other words, the number of character-for-character
matches is less than the length of the two substrings.
This section uses the following notation. For a text string T of length n and a pattern
string P of length m, we want to find whether P is a substring of T. A match means that there
is a substring of T starting at some index i that matches P character by character, so that
T[i] = P[0], T[i+1] = P[1], ..., T[i+m-1] = P[m-1].
T:  A D V A N C E D   D A T A   S T R U C T U R E S
P:                    D A T A
A brute-force algorithm solves a problem in the simplest, most direct, or most obvious way.
ADVANCED DATA STRUCTURES UNIT - VI CSE
A simple approach to string matching starts by comparing P (the pattern string) with T
(the text string) from the first character of T and the first character of P. If there is a
mismatch, the comparison restarts from the second character of T, and so on.
The brute-force pattern matching algorithm is not efficient in the worst case: its running
time is O(nm). When m = n/2, this is a quadratic running time of O(n^2).
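The brute-force method described above can be sketched as follows. This is a minimal illustration, not the notes' own program; the function name `brute_force_match` is an assumption chosen for this example.

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the lowest index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every possible starting position i of the pattern in the text.
    for i in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or a full match.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # all m characters matched
    return -1


print(brute_force_match("ADVANCED DATA STRUCTURES", "DATA"))  # prints 9
```

The two nested loops make the O(nm) worst case visible: up to n - m + 1 starting positions, each costing up to m comparisons.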
The Boyer-Moore algorithm scans the characters of the search pattern from right to left. If
a mismatch is found, the pattern is shifted by some number of characters. Comparing the
pattern backwards in this way is called the “looking-glass heuristic”.
Algorithm:
Program Logic:
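One possible sketch of this logic is shown below. It assumes the simplified Boyer-Moore variant that uses only the last-occurrence function last(c); the names `build_last` and `boyer_moore_match` are hypothetical and chosen for this illustration.

```python
def build_last(pattern: str) -> dict:
    """Map each pattern character to the index of its last occurrence."""
    return {ch: idx for idx, ch in enumerate(pattern)}


def boyer_moore_match(text: str, pattern: str) -> int:
    """Return the lowest index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    last = build_last(pattern)
    i = m - 1  # current index in the text
    j = m - 1  # current index in the pattern (scanned right to left)
    while i < n:
        if text[i] == pattern[j]:
            if j == 0:
                return i  # the whole pattern matched, starting at i
            i -= 1
            j -= 1
        else:
            # Characters absent from the pattern have last(c) = -1.
            l = last.get(text[i], -1)
            # Shift the pattern and restart the scan at its right end.
            i += m - min(j, l + 1)
            j = m - 1
    return -1


print(boyer_moore_match("XYXZXXYXTZXYXZXYYXXY", "XYXZXY"))  # prints 10
```

When the mismatched text character occurs in the pattern to the left of position j, the pattern moves by j - l positions; otherwise it moves by a single position.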
Analysis:
Computing the last function takes O(m + |Σ|) time, and the actual search takes O(nm)
time in the worst case. Therefore the worst-case running time of the Boyer-Moore algorithm
is O(nm + |Σ|). This means the worst-case running time is quadratic when m is proportional
to n (for example m = n/2), the same as the naïve algorithm.
The Boyer-Moore algorithm is extremely fast when the alphabet is large relative to the
length of the pattern.
Example:
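Consider a text T = X Y X Z X X Y X T Z X Y X Z X Y Y X X Y and a pattern P = X Y X Z X Y
We first build the last table using the function last(c), where c is a character of the text
and last(c) is the index of the last occurrence of c in the pattern:
c        X   Y   Z   T
last(c)  4   5   3  -1
The only remaining character in the text is the letter T, which is not present in the
pattern, i.e. last(T) = -1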
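The last table for this example can also be computed mechanically. The sketch below assumes the alphabet is passed in explicitly, which mirrors the O(m + |Σ|) cost mentioned in the analysis; the function name `last_table` is an assumption for this illustration.

```python
def last_table(pattern: str, alphabet: str) -> dict:
    """Return last(c) for every character c of the alphabet."""
    # Every character starts at -1, meaning "not present in the pattern".
    table = {c: -1 for c in alphabet}
    # A later occurrence overwrites an earlier one, so the last index remains.
    for index, c in enumerate(pattern):
        table[c] = index
    return table


print(last_table("XYXZXY", "XYZT"))  # {'X': 4, 'Y': 5, 'Z': 3, 'T': -1}
```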
STEP 1:
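X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
X Y X Z X Y
0 1 2 3 4 5   j
Mismatch at j = 5 (text character X against pattern character Y).
l = last(X) = 4, so the pattern shifts by j - l = 1 position.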
STEP 2:
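X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
  X Y X Z X Y
  0 1 2 3 4 5   j
The comparisons at j = 5 and j = 4 succeed; the mismatch occurs at j = 3 (text character X
against pattern character Z).
l = last(X) = 4, j = 3
Since j < l + 1 (3 < 4 + 1), the last occurrence of X in the pattern lies to the right of the
mismatch, so the pattern shifts by 1 position.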
STEP 3:
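X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
    X Y X Z X Y
    0 1 2 3 4 5   j
Mismatch at j = 5 (text character X against pattern character Y).
l = last(X) = 4, so the pattern shifts by j - l = 1 position.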
STEP 4:
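X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
      X Y X Z X Y
      0 1 2 3 4 5   j
Mismatch at j = 5 (text character T against pattern character Y).
l = last(T) = -1, so the pattern shifts by j - l = 6 positions, past the character T.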
STEP 5:
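X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
                  X Y X Z X Y
                  0 1 2 3 4 5   j
Mismatch at j = 5 (text character X against pattern character Y).
l = last(X) = 4, so the pattern shifts by j - l = 1 position.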
STEP 6:
X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
                    X Y X Z X Y
                    0 1 2 3 4 5   j
Matched: all six pattern characters agree with the text.
The given pattern is therefore found in the given text, starting at text index 10.
Both the brute-force algorithm and the basic Boyer-Moore (B-M) algorithm may compare the
same text characters repeatedly, which reduces the efficiency of pattern matching in the
worst case. The Knuth-Morris-Pratt (K-M-P) algorithm avoids this repeated comparison of
characters.
The basic idea behind this algorithm is to build a prefix array, sometimes also called the
π array. This prefix array is built using the prefix and suffix information of the pattern.
The KMP algorithm achieves a running time of O(m + n), which is optimal in the worst case,
where n is the length of the text and m is the length of the pattern.
Let us first understand how to compute the prefix array for a given pattern.
Algorithm 1:
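A minimal sketch of the prefix array computation is given here. It follows the standard KMP prefix (failure) function; the function name `compute_prefix` is an assumption for this illustration.

```python
def compute_prefix(pattern: str) -> list:
    """prefix[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    m = len(pattern)
    prefix = [0] * m
    k = 0  # length of the currently matched prefix
    for i in range(1, m):
        # Fall back to shorter matching prefixes until one can be extended.
        while k > 0 and pattern[i] != pattern[k]:
            k = prefix[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        prefix[i] = k
    return prefix


print(compute_prefix("abadab"))  # [0, 0, 1, 0, 1, 2]
```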
Example:
STEP 1:
0 1 2 3 4 5
a b a d a b
STEP 2:
0 1 2 3 4 5
a b a d a b
0 0
STEP 3:
0 1 2 3 4 5
a b a d a b
0 0 1
STEP 4:
0 1 2 3 4 5
a b a d a b
0 0 1 0
STEP 5:
prefix: a , a b , a b a , a b a d
suffix: a , d a , a d a , b a d a
0 1 2 3 4 5
a b a d a b
0 0 1 0 1
STEP 6:
prefix: a , a b , a b a , a b a d , a b a d a
suffix: b , a b , d a b , a d a b , b a d a b
0 1 2 3 4 5
a b a d a b
0 0 1 0 1 2
NOTE:
If there is more than one matching prefix-suffix pair, enter the largest matching length
into the prefix table.
E.g.: string a b a b a
Prefix: a , a b , a b a , a b a b
Suffix: a , b a , a b a , b a b a
Both a and a b a appear as a prefix and as a suffix, so the entry made in the prefix table is
the largest matching length, 3.
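Once the prefix array is available, the search itself can be sketched as follows. This is an illustrative version of the standard KMP scan, not necessarily the program the notes intended; the function name `kmp_match` is an assumption, and the prefix array is rebuilt inside the function so the example runs on its own.

```python
def kmp_match(text: str, pattern: str) -> int:
    """Return the lowest index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Build the prefix array of the pattern.
    prefix = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = prefix[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        prefix[i] = k
    # Scan the text once; the text index i never moves backwards.
    j = 0  # number of pattern characters currently matched
    for i in range(n):
        while j > 0 and text[i] != pattern[j]:
            j = prefix[j - 1]  # reuse what is already known to match
        if text[i] == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1
    return -1


print(kmp_match("ADVANCED DATA STRUCTURES", "DATA"))  # prints 9
```

Because the text index only moves forward and each fallback shortens the matched prefix, the total work stays within O(m + n).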
Tries:
A trie (pronounced ‘try’; the name comes from the word retrieval) is a tree-based data
structure for storing strings in order to support fast pattern matching. The main application
for tries is in information retrieval. The trie uses the characters (digits) of the keys to
organize and search the dictionary.
[Figure: trie whose root has the children a and b; each path from the root to a leaf spells
one stored word.]
The above trie shows words like allot, alone, and, are, bat, bad. The idea is that all strings
sharing a common prefix descend from a common node. Tries are used in spell checking
programs.
A trie is a data structure that supports pattern matching queries in time proportional to the
pattern size.
Advantages:
In tries the keys are searched using common prefixes, which makes lookups fast. In a
binary search tree, by contrast, the lookup of keys depends upon the height of the tree.
Tries take less space when they contain a large number of short strings, because nodes
are shared between the keys.
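The sharing of prefixes can be seen in a small sketch. It assumes a dictionary of children per node, which is only one of several possible node layouts; the class and method names are chosen for this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            # Reuse the existing child when the prefix is already stored.
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, s: str):
        """Follow s from the root; return the node reached, or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix: str) -> bool:
        return self._walk(prefix) is not None


trie = Trie()
for w in ["allot", "alone", "and", "are", "bat", "bad"]:
    trie.insert(w)
print(trie.contains("alone"), trie.contains("al"), trie.starts_with("al"))
# prints: True False True
```

A lookup follows one node per character of the query, so its cost is proportional to the length of the query string rather than to the number of stored words.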
A digital search tree is a binary tree in which each node contains one element.
Every element is attached as a node according to its binary representation.
The bits are read from left to right.
All the keys in the left sub-tree of a node at level i have bit 0 at the ith position; similarly,
all the keys in the right sub-tree of a node at level i have bit 1 at the ith position.
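The rules above can be sketched in code. The example assumes that each key is a capital letter coded as its position in the alphabet in 5 bits (A = 00001, B = 00010, and so on); this encoding and all names used here are assumptions made for illustration.

```python
class DSTNode:
    def __init__(self, key, bits):
        self.key = key
        self.bits = bits
        self.left = None
        self.right = None


def encode(letter, width=5):
    # Assumption: a letter is coded as its alphabet position (A = 1) in `width` bits.
    return format(ord(letter) - ord("A") + 1, "0{}b".format(width))


def dst_insert(root, key):
    """Insert key and return the root of the tree."""
    bits = encode(key)
    if root is None:
        return DSTNode(key, bits)
    node, level = root, 0
    while node.bits != bits:  # stop if the key is already present
        # The bit at the current level, read from the left, picks the branch.
        side = "left" if bits[level] == "0" else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, DSTNode(key, bits))
            break
        node, level = child, level + 1
    return root


def dst_search(root, key):
    bits = encode(key)
    node, level = root, 0
    while node is not None and node.bits != bits:
        node = node.left if bits[level] == "0" else node.right
        level += 1
    return node is not None


root = None
for k in "ATFRCHIN":
    root = dst_insert(root, k)
print(root.key, root.left.key, root.right.key)  # prints: A F T
```

Unlike a trie, every node of a digital search tree holds a key, so a search compares the whole key at each node and uses one bit only to choose the branch.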
Example:
Consider following stream of keys with binary representation to construct a digital search
tree
Keys, in insertion order: A T F R C H I N
[Figure: digital search tree for these keys. Root: A (00001). Next level: F and T. Next level:
C and H, and R. Last level: I and N.]
NOTE:
Binary trie:
A binary trie is a binary tree that has two kinds of nodes:
Branch nodes
Element nodes
A branch node has two children, one left and one right.
An element node has a single data member.
Example:
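Consider the elements 0000, 0010, 0001, 1001, 1000, and 1100. The binary trie can be built
as shown below; the number in the square at each branch node is the bit number examined
at that node.
[Figure: binary trie for these six elements, with branch nodes on Level 1 to Level 4.]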
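A binary trie for these six elements can be sketched as follows. The sketch assumes that all keys are bit strings of equal length and that an element node is pushed down only as far as needed to separate it from the other keys; the class and function names are hypothetical.

```python
class Branch:
    """Branch node: two child links and no data."""
    def __init__(self):
        self.left = None
        self.right = None


class Element:
    """Element node: a single data member."""
    def __init__(self, key):
        self.key = key


def _place(branch, key, level):
    # Bit `level` of the key (read from the left) selects the child.
    if key[level] == "0":
        branch.left = _insert(branch.left, key, level + 1)
    else:
        branch.right = _insert(branch.right, key, level + 1)


def _insert(node, key, level):
    if node is None:
        return Element(key)
    if isinstance(node, Element):
        if node.key == key:
            return node  # key already present
        # Replace the element by a branch node and push the element down.
        existing = node.key
        node = Branch()
        _place(node, existing, level)
    _place(node, key, level)
    return node


def trie_insert(root, key):
    return _insert(root, key, 0)


def trie_search(root, key):
    node, level = root, 0
    while isinstance(node, Branch):
        node = node.left if key[level] == "0" else node.right
        level += 1
    return node is not None and node.key == key


root = None
for key in ["0000", "0010", "0001", "1001", "1000", "1100"]:
    root = trie_insert(root, key)
print(trie_search(root, "1001"), trie_search(root, "0100"))  # prints: True False
```

In the resulting trie, the branch node reached after the leading bit 0 has only a left child, because 0000, 0010 and 0001 all have 0 as their second bit; such degree-one branch nodes are the ones a compressed binary trie removes.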
A binary trie may contain branch nodes whose degree is one. To create a compressed
binary trie, eliminate the degree-one branch nodes.
[Figure: compressed binary trie for the example above.]
Patricia:
Patricia stands for Practical Algorithm To Retrieve Information Coded In Alphanumeric.
Building a Patricia is quite simple.
In a Patricia every node has a bit index, which is written at the node. The trie is searched
based on this number.
Let us understand the procedure for building a Patricia with the help of an example.
Index 4 3 2 1 0
A 0 0 0 0 1
S 1 0 0 1 1
E 0 0 1 0 1
R 1 0 0 1 0
C 0 0 0 1 1
H 0 1 0 0 0
I 0 1 0 0 1
STEP 1: Insert A: 0 0 0 0 1
As this is the very first node, we simply create it as the root node. To obtain its bit index,
we search for the index of the leftmost 1:
Index 4 3 2 1 0
A     0 0 0 0 1
The leftmost 1 is at index 0, hence the bit index of A is 0. The 0th bit of A has value 1,
hence the right link of A points back to the node itself.
STEP 2: Insert S: 1 0 0 1 1
We start by searching for S in the existing trie. The bit index of A is 0, and the 0th bit of S
is 1, which means S should be attached as the right child of A. Before attaching the node S,
however, we must find the bit index of S. Since S would be attached to A, A is the closest
node to S, hence we compare S and A. S (1 0 0 1 1) and A (0 0 0 0 1) first differ, reading
from the left, at index 4, so the bit index of S is 4.
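The comparison used to obtain a bit index can be sketched with integer bit operations. The sketch assumes the keys are stored as 5-bit integers and that bit index 0 is the rightmost bit, as in the table above; the function names are hypothetical.

```python
def leftmost_set_bit(key: int) -> int:
    """Index of the leftmost 1 bit, used for the very first key."""
    return key.bit_length() - 1


def leftmost_differing_bit(a: int, b: int) -> int:
    """Index of the leftmost bit at which keys a and b differ."""
    # XOR leaves a 1 exactly at the positions where a and b differ.
    return (a ^ b).bit_length() - 1


def bit(key: int, index: int) -> int:
    """Value of the bit of key at the given index."""
    return (key >> index) & 1


A, S = 0b00001, 0b10011
print(leftmost_set_bit(A))           # prints 0
print(leftmost_differing_bit(S, A))  # prints 4
print(bit(S, 0))                     # prints 1
```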
STEP 3: Insert E: 0 0 1 0 1
To insert E into the existing trie we search from the root. The bit index at the root is 4. The
4th bit of E is 0, so we move onto the left branch. At A, the bit index is 0. The 0th bit of E
is 1.
Hence E could be attached as the right child of A. But before attaching E to A, we must find
the bit index of E. The closest node to E is A, hence we compare E and A.
As the bit index of E is 2, we cannot attach E as a child of A (since the bit index of A is 0).
Hence we traverse upwards. As the bit index of S is 4, we must attach E as a child of S. The
bit index of E is 2 and the 2nd bit of E is 1, hence the right link of E points back to the node
itself.
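STEP 4: Insert R: 1 0 0 1 0
We start from the root, whose bit index is 4. The 4th bit of R is 1, so R is to be attached as
the right child of S. S is the nearest neighbor of R, hence we compare S and R. R (1 0 0 1 0)
and S (1 0 0 1 1) differ only at index 0, so the bit index of R is 0.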
STEP 5: Insert C: 0 0 0 1 1
The search path is S, E, A. The 0th bit of C is 1, hence C could be attached as the right child
of A. A is the nearest node to C.
But the bit index of A is 0 and the bit index of C is 1, hence C cannot be a child of A and we
traverse up. The 1st bit of C is 1, so the right link of C points back to the node itself.
STEP 6: Insert H: 0 1 0 0 0
We cannot attach H as a child of A, so we traverse up towards S. The 3rd bit of H is 1, so
the right link of H points back to the node itself.
STEP 7: Insert I: 0 1 0 0 1