Pattern Matching Algorithms

The document discusses various pattern matching algorithms. It begins by introducing the problem of pattern searching in text and representing strings as character arrays. It then describes three algorithms: the brute force algorithm, Boyer-Moore algorithm, and Knuth-Morris-Pratt algorithm. The brute force algorithm has quadratic runtime. Boyer-Moore improves on this by shifting the pattern by larger amounts when a mismatch is found. Knuth-Morris-Pratt uses prefix information to avoid re-examining characters.

Uploaded by

rajasekharv86

ADVANCED DATA STRUCTURES UNIT - VI CSE

PATTERN MATCHING
Computers are widely known for performing numerical computations, but they are equally
capable of processing textual data. A document generally contains textual data, or simply
text. Computers are used to edit documents, to search documents, to transmit documents
over the Internet, to display documents on monitors, and to print documents on printers.

The main concerns in text processing are the manipulation or movement of characters and
the search for patterns or words.

Before proceeding to the search algorithms for text processing, we recall the operations on
strings. Representing a string as an array of characters is simple and efficient.

Pattern Matching Algorithms:


The pattern searching problem and its related variations are commonly encountered in
computing.

For example: We may wish to determine whether or not the substring “DATA” occurs in the
text: ADVANCED DATA STRUCTURES

To examine text in a computer we are essentially limited to examining it one character at a
time. We therefore need a way of deciding when a match has been made.

Two substrings can be said to match when they are equal character-by-character from the
first to the last character. It follows that a match has been made if the number of
character-for-character matches is equal to the length of both the substring sought and the
text substring.

A mismatch between two substrings must therefore mean that the two substrings are not
the same character-for-character or in other words the number of character-for-character
matches is less than the length of the two substrings.

This section uses the following notation: for a text string T of length n and a pattern string P of
length m, we want to find whether P is a substring of T. A match means that there
is a substring of T starting at some index i that matches P character-by-character, so that

T[i] = P[0], T[i+1] = P[1], . . . , T[i+m-1] = P[m-1]

For example, with T = ADVANCED DATA STRUCTURES and P = DATA, the match starts at index i = 9:

T: A D V A N C E D   D A T A   S T R U C T U R E S
P:                   D A T A

Brute-Force Algorithm:

A brute-force algorithm solves a problem in the most simple, direct or obvious way.

A simple method of string matching is to start comparing P (the pattern string) and T
(the text string) from the first character of T and the first character of P. If there is a mismatch,
the comparison starts again from the second character of T, and so on.

The brute-force pattern matching algorithm is not efficient in the worst case: its running
time is O(nm). The algorithm has a quadratic running time, O(n^2), when m = n/2.
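
The brute-force method described above can be sketched in Python as follows. This is a minimal illustration, not part of the original notes; the function name brute_force_match is chosen here for the example.

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the lowest index i at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every possible starting index i of the pattern in the text.
    for i in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or the end of P.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched
            return i
    return -1


print(brute_force_match("ADVANCED DATA STRUCTURES", "DATA"))  # prints 9
```

The two nested loops make the O(nm) worst case visible: up to n - m + 1 starting positions, each costing up to m comparisons.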

Boyer-Moore Algorithm:

In the brute-force method, it may be necessary to compare every character in the text T in
order to locate a pattern P as a substring. The Boyer-Moore (B-M) pattern matching algorithm
can sometimes avoid comparisons between P and a considerable fraction of the characters in T.

This algorithm scans the characters of the search pattern from right to left. If a mismatch is
found, the pattern is shifted by some number of characters. Scanning the pattern from right to
left is also called the "looking glass heuristic".

Algorithm:

Boyer-Moore (Text: T[0..n-1], Pattern: P[0..m-1])
{
    // set i and j to the last index of P
    i ← m-1;
    j ← m-1;
    // loop to the end of the text string
    while (i < n)
        // if both characters match
        if (P[j] = T[i])
            // reached the first character of P
            if (j = 0)
                // found a match
                return i;
            else
                // go to the previous character
                i ← i-1;
                j ← j-1;
        else
            // skip over the whole word or shift to last occurrence
            i ← i + m - min(j, 1 + last[T[i]]);
            j ← m-1;
    return -1; // no match
}
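
As a runnable counterpart to the pseudocode, here is a minimal Python sketch. It uses only the last-occurrence rule given in these notes; the function name boyer_moore_match and the use of a dictionary for the last table are assumptions made for this illustration.

```python
def boyer_moore_match(text: str, pattern: str) -> int:
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # last[c] = index of the rightmost occurrence of c in the pattern.
    last = {ch: index for index, ch in enumerate(pattern)}
    i = j = m - 1  # start comparing at the last character of the pattern
    while i < n:
        if pattern[j] == text[i]:
            if j == 0:
                return i  # the whole pattern matched, starting at i
            i -= 1
            j -= 1
        else:
            l = last.get(text[i], -1)  # -1 if the character is not in P
            i += m - min(j, 1 + l)     # shift the pattern
            j = m - 1                  # restart at the end of the pattern
    return -1


print(boyer_moore_match("XYXZXXYXTZXYXZXYYXXY", "XYXZXY"))  # prints 10
```

The dictionary lookup with a default of -1 plays the role of last(c) for characters that do not occur in the pattern.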

Program Logic:

• The shift depends on two values, j and l.
• j is the current pattern index (where the mismatch occurs) and l is the value of the
last(c) function, which is the index of the rightmost occurrence in the pattern of the
character 'c' (or -1 if 'c' does not occur in the pattern).

Analysis:
The computation of the last function takes O(m + |Σ|) time and the actual search takes O(nm)
time in the worst case. Therefore the worst-case running time of the Boyer-Moore algorithm is
O(nm + |Σ|). This means the worst-case running time is quadratic when m is proportional to n
(for example m = n/2), the same as the naive algorithm.
In practice, the Boyer-Moore algorithm is extremely fast when the alphabet is large relative to
the length of the pattern.
Example:

Consider a text T = X Y X Z X X Y X T Z X Y X Z X Y Y X X Y

to be matched against the pattern P = X Y X Z X Y

We first build the last table using the last(c) function, where 'c' represents a character from T.

Consider the pattern:

Index:   0 1 2 3 4 5
Pattern: X Y X Z X Y

c:        X   Y   Z
last(c):  4   5   3

The remaining character in the text string is T, which is not present in the pattern, i.e. last(T) = -1
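
The last(c) values for this example can be checked with a short sketch. The helper name last_table and the explicit alphabet argument are assumptions made for this illustration; characters that do not occur in the pattern get -1.

```python
def last_table(pattern: str, alphabet: str) -> dict[str, int]:
    """Map each character of the alphabet to its rightmost index in pattern."""
    table = {ch: -1 for ch in alphabet}  # -1 means "not in the pattern"
    for index, ch in enumerate(pattern):
        table[ch] = index  # later (more rightward) occurrences overwrite earlier ones
    return table


print(last_table("XYXZXY", "XYZT"))  # prints {'X': 4, 'Y': 5, 'Z': 3, 'T': -1}
```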

STEP 1:

T: X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
P: X Y X Z X Y
j: 0 1 2 3 4 5

Mismatch at j = 5 (T[5] = X, P[5] = Y)

We find l = last(c): last(X) = 4, j = 5

As 1 + l <= j (1 + 4 <= 5), we shift the pattern by j - l, i.e. by 1 position

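STEP 2:

T: X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
P:   X Y X Z X Y
j:   0 1 2 3 4 5

Mismatch at j = 3 (T[4] = X, P[3] = Z), after P[5] and P[4] have matched

l = last(X) = 4, j = 3
As j < l + 1 (3 < 4 + 1), the last occurrence of X in the pattern lies to the right of the
mismatch position, so the pattern shifts by only 1 position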

STEP 3:

T: X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
P:     X Y X Z X Y
j:     0 1 2 3 4 5

Mismatch at j = 5 (T[7] = X, P[5] = Y)

l = last(X) = 4, j = 5. As l + 1 <= j (4 + 1 <= 5), the pattern shifts by (j - l) positions, i.e. by 1 position

STEP 4:

T: X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
P:       X Y X Z X Y
j:       0 1 2 3 4 5

Mismatch at j = 5 (T[8] = T, P[5] = Y)

l = last(T) = -1, j = 5. As l + 1 <= j (-1 + 1 <= 5), the pattern shifts by (j - l) = (5 - (-1))
positions, i.e. by 6 positions

STEP 5:

T: X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
P:                   X Y X Z X Y
j:                   0 1 2 3 4 5

Mismatch at j = 5 (T[14] = X, P[5] = Y)

The pattern will shift by 1 position

STEP 6:

T: X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
P:                     X Y X Z X Y
j:                     0 1 2 3 4 5

Matched

Now the match for the given pattern is found in the given string, starting at index 10 of T

Knuth-Morris-Pratt Algorithm:

In the worst case, the brute-force and B-M algorithms compare the same text characters
against the pattern repeatedly, which reduces the efficiency of pattern matching. The K-M-P
algorithm avoids this repeated comparison of characters.

The basic idea behind this algorithm is to build a prefix array. Sometimes this array is also
called the π array. The prefix array is built using the prefix and suffix information of the
pattern: each entry is the length of the longest proper prefix that is also a suffix of the part
of the pattern ending at that position.

This overlap between prefix and suffix is what the K-M-P algorithm uses to decide how far the
pattern can be shifted after a mismatch.

The KMP algorithm achieves a running time of O(m + n), which is optimal in the worst case,
where n is the length of the text and m is the length of the pattern.
Let us first understand how to compute the prefix array for given pattern

Algorithm 1:

Compute-prefix (char p[size])
{
    // problem description: this algorithm computes the prefix
    // table for the given pattern
    // input: pattern p of length m
    // output: prefix table for the given pattern

    prefix-table[0] ← 0
    k ← 0   // length of the currently matched prefix

    for (q ← 1 to m-1) do   // m is the length of the pattern
    {
        while (k > 0 AND p[k] != p[q])
            k ← prefix-table[k-1]

        if (p[k] = p[q]) then
            k ← k+1

        prefix-table[q] ← k
    }

    return prefix-table;
}
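
The prefix computation can be sketched in Python as follows. This is an illustrative transcription under the assumption of 0-based indexing; the name compute_prefix is chosen for this example.

```python
def compute_prefix(pattern: str) -> list[int]:
    """Return the prefix table of the pattern (0-based indices)."""
    m = len(pattern)
    prefix_table = [0] * m
    k = 0  # length of the currently matched prefix
    for q in range(1, m):
        # Fall back to shorter prefixes until the next character fits.
        while k > 0 and pattern[k] != pattern[q]:
            k = prefix_table[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        prefix_table[q] = k
    return prefix_table


print(compute_prefix("abadab"))  # prints [0, 0, 1, 0, 1, 2]
```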

Example:

Suppose the given pattern is "abadab"

The prefix or π array for this pattern can be built as follows

STEP 1:

Initially, put 0 in the 0th location of the prefix array

Index:   0 1 2 3 4 5
Pattern: a b a d a b
Prefix:  0
STEP 2:

Consider a string from the given pattern

ab from abadab

prefix: a
suffix: b
There is no matching prefix-suffix, so the length is 0

Index:   0 1 2 3 4 5
Pattern: a b a d a b
Prefix:  0 0

STEP 3:

Consider a string from the given pattern

aba from abadab

prefix: a , a b
suffix: a , b a
The length of the matching prefix-suffix is 1

Index:   0 1 2 3 4 5
Pattern: a b a d a b
Prefix:  0 0 1

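STEP 4:

Consider a string from the given pattern

abad from abadab

prefix: a , a b , a b a
suffix: d , a d , b a d
There is no matching prefix-suffix, so the length is 0

Index:   0 1 2 3 4 5
Pattern: a b a d a b
Prefix:  0 0 1 0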

STEP 5:

Consider a string from the given pattern

abada from abadab

prefix: a , a b , a b a , a b a d
suffix: a , d a , a d a , b a d a
The length of the matching prefix-suffix is 1

Index:   0 1 2 3 4 5
Pattern: a b a d a b
Prefix:  0 0 1 0 1

STEP 6:

Consider a string from the given pattern

abadab from abadab

prefix: a , a b , a b a , a b a d , a b a d a
suffix: b , a b , d a b , a d a b , b a d a b
The length of the matching prefix-suffix is 2

Index:   0 1 2 3 4 5
Pattern: a b a d a b
Prefix:  0 0 1 0 1 2

NOTE:

If there is more than one matching prefix-suffix, then enter the largest matching length
into the prefix table

E.g.: string a b a b a

Prefix: a , a b , a b a , a b a b

Suffix: a , b a , a b a , b a b a

Both "a" (length 1) and "a b a" (length 3) match, so the entry made in the prefix table is the
largest matching length, 3.
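
The notes above cover only the construction of the prefix table. As a hedged sketch of how the table is typically used during the search phase, the following Python example repeats the prefix computation and adds a matcher. The names compute_prefix and kmp_match are chosen for this illustration and do not come from the original notes.

```python
def compute_prefix(pattern: str) -> list[int]:
    """Prefix table: longest proper prefix that is also a suffix, per position."""
    prefix_table = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = prefix_table[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        prefix_table[q] = k
    return prefix_table


def kmp_match(text: str, pattern: str) -> int:
    """Return the index of the first match of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    prefix_table = compute_prefix(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # On a mismatch, reuse the overlap instead of re-reading the text.
        while k > 0 and pattern[k] != ch:
            k = prefix_table[k - 1]
        if pattern[k] == ch:
            k += 1
        if k == m:
            return i - m + 1
    return -1


print(kmp_match("abacabadabab", "abadab"))  # prints 4
```

The text index i never moves backwards, which is where the O(m + n) bound comes from.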

Tries:
A trie (pronounced 'try'; the name comes from retrieval) is a tree-based data structure for
storing strings in order to support fast pattern matching. The main application for tries is in
information retrieval. The trie uses the individual characters (digits) of the keys to organize
and search the dictionary.

The example trie is as follows (each level of indentation is one level of the trie):

root
   a
      l
         l - o - t
         o - n - e
      n
         d
         t
      r - e
   b
      a
         d
         t

The above trie shows words like allot, alone, and, are, bat, bad. The idea is that all strings
sharing a common prefix descend from a common node. Tries are used in spell
checking programs.

A trie is a data structure that supports pattern matching queries in time proportional to the
pattern size.
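
A minimal Python sketch of such a trie follows. The class and method names (TrieNode, Trie, insert, contains, starts_with) and the use of a dictionary for the children are assumptions made for this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            # Words with a common prefix share the nodes of that prefix.
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, prefix: str):
        """Follow prefix from the root; return the node reached or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix: str) -> bool:
        return self._walk(prefix) is not None


trie = Trie()
for word in ["allot", "alone", "and", "are", "bat", "bad"]:
    trie.insert(word)
print(trie.contains("alone"), trie.contains("al"), trie.starts_with("al"))
# prints True False True
```

Each lookup visits one node per character of the query, so its cost depends on the length of the query and not on the number of stored words.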

Advantages:

• In tries the keys are searched using common prefixes, so lookup is faster: the lookup
time depends on the length of the key, whereas in a binary search tree it depends upon the
height of the tree.

• Tries take less space when they contain a large number of short strings, as nodes
are shared between the keys.

Digital Search Tree:

• A digital search tree is a binary tree in which each node contains one element
• Every element is attached as a node according to its binary representation
• The bits are read from left to right
• All the keys in the left sub-tree of a node at level i have bit 0 at the ith position; similarly,
all the keys in the right sub-tree of a node at level i have bit 1 at the ith position
Example:

Consider the following stream of keys, with their binary representations, to construct a
digital search tree

Key:    A     T     F     R     C     H     I     N
Binary: 00001 10011 00101 10010 00011 01000 01001 01110

Inserting the keys in this order gives the following tree:

A 00001
   left:  F 00101
      left:  C 00011
      right: H 01000
         left:  I 01001
         right: N 01110
   right: T 10011
      left:  R 10010
NOTE:

• Assume that every key has a fixed number of bits

• If we read bit 0 then we move onto the left sub-branch

• If we read bit 1 then we move onto the right sub-branch
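
The insertion rule can be sketched in Python as below, using the keys of the example. The names DSTNode, dst_insert and dst_search are chosen for this illustration, and keys are assumed to be distinct bit strings of equal length.

```python
class DSTNode:
    def __init__(self, key: str, bits: str):
        self.key = key    # label of the element, e.g. "A"
        self.bits = bits  # binary representation, e.g. "00001"
        self.left = None
        self.right = None


def dst_insert(root, key: str, bits: str):
    """Insert (key, bits) and return the root of the tree."""
    if root is None:
        return DSTNode(key, bits)
    node, level = root, 0
    while node.bits != bits:  # stop if the key is already present
        go_left = bits[level] == "0"  # bit 0: left branch, bit 1: right branch
        child = node.left if go_left else node.right
        if child is None:
            if go_left:
                node.left = DSTNode(key, bits)
            else:
                node.right = DSTNode(key, bits)
            break
        node, level = child, level + 1
    return root


def dst_search(root, bits: str):
    """Return the key stored under bits, or None if it is absent."""
    node, level = root, 0
    while node is not None and node.bits != bits:
        node = node.left if bits[level] == "0" else node.right
        level += 1
    return None if node is None else node.key


root = None
keys = [("A", "00001"), ("T", "10011"), ("F", "00101"), ("R", "10010"),
        ("C", "00011"), ("H", "01000"), ("I", "01001"), ("N", "01110")]
for key, bits in keys:
    root = dst_insert(root, key, bits)
print(root.left.key, root.right.key, dst_search(root, "01110"))  # prints F T N
```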

Binary trie:
A binary trie is a binary tree that has two kinds of nodes:
• Branch nodes
• Element nodes
A branch node has two children, one left and one right.
An element node has a single data member.

Example:
Consider the elements 0000, 0010, 0001, 1001, 1000, and 1100. The binary trie for these
elements branches on one bit per level; in the figure, the number in the square of each branch
node is the bit number.

[Figure: binary trie with branch nodes at Level 1 to Level 4 and, below them, the element nodes
0000 0001 0010 1000 1001 1100 from left to right]
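
A hedged Python sketch of a binary trie with the two kinds of nodes is given below. It assumes that all keys are bit strings of the same length and that every element node sits below one branch node per bit; the names BranchNode, ElementNode, trie_insert and trie_search are chosen for this illustration.

```python
class BranchNode:
    def __init__(self):
        self.left = None   # followed when the current bit is 0
        self.right = None  # followed when the current bit is 1


class ElementNode:
    def __init__(self, data: str):
        self.data = data   # the single data member


def trie_insert(root: BranchNode, bits: str) -> None:
    node = root
    # Create or follow one branch node per bit, except for the last bit.
    for bit in bits[:-1]:
        side = "left" if bit == "0" else "right"
        if getattr(node, side) is None:
            setattr(node, side, BranchNode())
        node = getattr(node, side)
    # The last bit selects where the element node is attached.
    side = "left" if bits[-1] == "0" else "right"
    setattr(node, side, ElementNode(bits))


def trie_search(root: BranchNode, bits: str) -> bool:
    node = root
    for bit in bits:
        if not isinstance(node, BranchNode):
            return False
        node = node.left if bit == "0" else node.right
        if node is None:
            return False
    return isinstance(node, ElementNode) and node.data == bits


root = BranchNode()
for element in ["0000", "0010", "0001", "1001", "1000", "1100"]:
    trie_insert(root, element)
print(trie_search(root, "1001"), trie_search(root, "0011"))  # prints True False
```

In this sketch the branch node reached by the bits 11 has only one child (towards 1100), which is an example of the degree-one branch nodes that a compressed binary trie removes.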

Compressed Binary Trie:

A binary trie may contain branch nodes whose degree is one. To create a compressed
binary trie, eliminate the degree-one branch nodes.

[Figure: compressed binary trie for the binary trie above]

Patricia:
Patricia stands for Practical Algorithm To Retrieve Information Coded In Alphanumeric.
Building a Patricia is quite simple.
In a Patricia every node has a bit index. This number is written at every node, and the trie is
searched based on this number.

Let us understand the procedure of building a Patricia with the help of an example

Index 4 3 2 1 0
A 0 0 0 0 1
S 1 0 0 1 1
E 0 0 1 0 1
R 1 0 0 1 0
C 0 0 0 1 1
H 0 1 0 0 0
I 0 1 0 0 1

STEP 1: Insert A: 0 0 0 0 1

As this is the very first node, we simply create it as the root node. To obtain its bit index, we
search for the index of the leftmost 1

Index 4 3 2 1 0
A     0 0 0 0 1

The leftmost 1 is at index 0.
Hence the bit index of A is 0. The 0th index bit of A has value 1. Hence the right link of A
points back to the node itself

STEP 2: Insert S: 1 0 0 1 1

We start searching for S in the existing trie. The bit index of A is 0. The 0th index bit of S is 1.
That means S should be attached as the right child of A. But before attaching the node S
we must find out the bit index of S. As S should be attached to A, A is the closest node of S.
Hence we compare S and A.

The leftmost index at which the bits of S and A differ is 4. Hence the bit index of S is 4.

But now S cannot be attached as a child of A, because the bit index of S is greater than the bit
index of A. Hence S should be moved up. The 4th index bit of S is 1, hence the right link of S
points back to the node itself.

STEP 3: Insert E: 0 0 1 0 1

To insert E in the existing trie we search from the root. The bit index at the root is 4. The
4th index bit of E is 0, so we move onto the left branch. At A, the bit index is 0. The 0th index
bit of E is 1.

Hence E could be attached as the right child of A. But before attaching E to A, we must find the
bit index of E. The closest node of E is A, hence we compare E and A.

As the bit index of E is 2, we cannot attach E as a child of A (since the bit index of A is 0).
Hence we traverse upwards. As the bit index of S is 4, we must attach E as a child of S. The bit
index of E is 2 and the 2nd index bit of E is 1. Hence the right link of E points back to the node
itself

STEP 4: Insert R: 1 0 0 1 0

We start from the root, where the bit index is 4. The 4th index bit of R is 1. That is, R is to be
attached as the right child of S. S is the nearest neighbor of R, hence we compare S and R

The bits of S and R differ at index 0, so the bit index of R is 0. The 0th index bit of R is 0.
Hence the left link of R points back to the node itself

STEP 5: Insert C: 0 0 0 1 1

The search path will be S - E - A. At A the bit index is 0, and the 0th index bit of C is 1.
Hence C could be attached as the right child of A. A is the nearest node of C.
But the bit index of A is 0 and the bit index of C is 1. Hence C cannot be a child of A. We
traverse up. The 1st index bit of C is 1, so the right link of C points back to the node itself.

STEP 6: Insert H: 0 1 0 0 0

The search traverses through S - E - C - A. The nearest node of H is A, so we compare H and A

The bit index of H is 3, so we cannot attach H as a child of A. We traverse up, towards S. The
3rd index bit of H is 1, so the right link of H points back to the node itself

STEP 7: Insert I: 0 1 0 0 1

At node S, the bit index is 4. The 4th index bit of I is 0, so we move left

At node H, the bit index is 3. The 3rd index bit of I is 1, so we move right
H and I are now the nearest nodes, so we compare them
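
In the steps above, the bit index of a new key is found by comparing it with its nearest key and taking the leftmost index at which the two differ (for the very first key, the index of its leftmost 1). As a small sketch, assuming the keys are held as integers with index 0 as the rightmost bit, this comparison can be written as follows. The function name differing_bit_index is chosen for this illustration.

```python
def differing_bit_index(a: int, b: int) -> int:
    """Leftmost (highest) bit index at which a and b differ, or -1 if equal."""
    # XOR leaves a 1 exactly where the two keys differ.
    return (a ^ b).bit_length() - 1


A, S, E, R, C, H, I = 0b00001, 0b10011, 0b00101, 0b10010, 0b00011, 0b01000, 0b01001

print(differing_bit_index(A, 0))  # first key: index of its leftmost 1, prints 0
print(differing_bit_index(S, A))  # prints 4
print(differing_bit_index(E, A))  # prints 2
print(differing_bit_index(R, S))  # prints 0
print(differing_bit_index(C, A))  # prints 1
print(differing_bit_index(H, A))  # prints 3
```

For the last step, the same comparison of I with H gives differing_bit_index(I, H) = 0, since the two keys differ only in their rightmost bit.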

Multi Way Trie:


The multi-way trie is an ordered prefix trie. The trie structure is built using the prefixes of
the given words. Consider the words SET and SKY.
Both words share the prefix S, so in the multi-way trie the root has a single child S, which
branches to E (followed by T) and to K (followed by Y).