Unit 8
String Matching
Introduction
• What is String Matching?
• Suppose T is a large text string and P is a shorter pattern string.
• In simple terms, we want to find all the occurrences of some string P in a
larger string T.
Example :
T = ababaabbababbabaabbab
P = ababb
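As a quick check of what "all occurrences" means for this example, the sketch below leans on the C standard library function strstr and restarts the search one character after each hit. The helper name find_all and its output-array parameters are illustrative assumptions, not part of the slides.

```c
#include <string.h>

/* Hypothetical helper: stores up to max_out shifts at which pat occurs in
   txt (overlapping occurrences included) and returns how many were found. */
int find_all(const char *txt, const char *pat, int *out, int max_out)
{
    int count = 0;
    if (*pat == '\0')
        return 0;                          /* assumption: empty pattern is ignored */
    const char *hit = strstr(txt, pat);    /* first occurrence, or NULL */
    while (hit != NULL) {
        if (count < max_out)
            out[count] = (int)(hit - txt); /* shift = offset from start of txt */
        count++;
        hit = strstr(hit + 1, pat);        /* resume one character later */
    }
    return count;
}
```

For T = ababaabbababbabaabbab and P = ababb, this should report a single occurrence at index 8 (counting from 0).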
• Algorithms
1) The naive string matching algorithm
2) The Rabin-Karp algorithm
3) String matching with finite automata
4) The Knuth-Morris-Pratt algorithm (KMP algorithm)
The naive string matching algorithm
• Match the pattern string against the text character by character.
• When there is a mismatch, shift the pattern one character to the right
relative to the text, and start comparing again from the beginning of the pattern.
• Execution of Naïve Method
Suppose we have
text T = "XYZABCDEXYZDEFGHXYZ" and
pattern P = "XYZ"
Index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
T:      X  Y  Z  A  B  C  D  E  X  Y  Z  D  E  F  G  H  X  Y  Z
Index:  0  1  2
P:      X  Y  Z
• Output
• Pattern found at index = 0
• Pattern found at index = 8
• Pattern found at index = 16
• Length of T: n = 19; length of P: m = 3
• Outer loop: n - m + 1 = 17 shifts
• Inner loop: at most m = 3 comparisons per shift
Algorithm
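#include <stdio.h>
#include <string.h>

/* Prints every index at which pattern pat occurs in text txt. */
void naive_method(const char *pat, const char *txt)
{
    int m = (int)strlen(pat);
    int n = (int)strlen(txt);
    for (int i = 0; i <= n - m; i++) {       /* try each shift i */
        int j;
        for (j = 0; j < m; j++) {
            if (txt[i + j] != pat[j])
                break;                        /* mismatch at this shift */
        }
        if (j == m)                           /* all m characters matched */
            printf("Pattern found at index %d\n", i);
    }
}
Worst-case time complexity: O(mn).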
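To make the O(mn) bound concrete, the sketch below runs the same two loops but counts character comparisons instead of printing matches. The function name naive_comparisons is an illustrative assumption.

```c
#include <string.h>

/* Hypothetical helper: runs the naive matcher's loops and returns the number
   of character comparisons performed. Matches are not reported. */
long naive_comparisons(const char *txt, const char *pat)
{
    int n = (int)strlen(txt);
    int m = (int)strlen(pat);
    long comparisons = 0;
    for (int i = 0; i <= n - m; i++) {
        for (int j = 0; j < m; j++) {
            comparisons++;                 /* one comparison of txt[i+j] with pat[j] */
            if (txt[i + j] != pat[j])
                break;
        }
    }
    return comparisons;
}
```

For T = "aaaaaaaaaa" (n = 10) and P = "aaab" (m = 4), every one of the n - m + 1 = 7 shifts costs all m = 4 comparisons, giving 28 in total, which is the worst case. For the XYZ example above, most shifts fail on the first character, so the count is only 23.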
Rabin-Karp Method
• A string-search algorithm that compares hash values of strings rather than
the strings themselves; characters are compared only when the hash values match.
• For efficiency, the hash value of the next position in the text is computed
cheaply from the hash value of the current position.
[Figure: pattern P → hash value p mod q; text T → algorithm]
Example
• Given T = 3 1 4 1 5 9 2 6 5 3 5 and P = 26
• Now find hash value : p mod q = 26 mod 11 = 4
• Here we have taken q=11 (prime number)
[3 1] 4 1 5 9 2 6 5 3 5
31 mod 11 = 9; 9 is not equal to 4, so don't compare.
3 [1 4] 1 5 9 2 6 5 3 5
14 mod 11 = 3; 3 is not equal to 4, so don't compare.
3 1 [4 1] 5 9 2 6 5 3 5
41 mod 11 = 8; 8 is not equal to 4, so don't compare.
3 1 4 [1 5] 9 2 6 5 3 5
15 mod 11 = 4; equal to 4, but 15 is not 26: spurious hit.
3 1 4 1 [5 9] 2 6 5 3 5
59 mod 11 = 4; equal to 4, but 59 is not 26: spurious hit.
3 1 4 1 5 [9 2] 6 5 3 5
92 mod 11 = 4; equal to 4, but 92 is not 26: spurious hit.
3 1 4 1 5 9 [2 6] 5 3 5
26 mod 11 = 4; equal to 4, and 26 is 26: an exact match.
3 1 4 1 5 9 2 [6 5] 3 5
65 mod 11 = 10; 10 is not equal to 4, so don't compare.
3 1 4 1 5 9 2 6 [5 3] 5
53 mod 11 = 9; 9 is not equal to 4, so don't compare.
3 1 4 1 5 9 2 6 5 [3 5]
35 mod 11 = 2; 2 is not equal to 4, so don't compare.
As we can see, when the hash values match, the characters are then compared
to ensure that a match has indeed been found.
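The walkthrough above can be sketched in C roughly as follows. This is a minimal illustration that assumes the text and pattern are strings of decimal digits (radix 10) and that q is small enough for int arithmetic; the function name rabin_karp and its parameters are illustrative, not taken from the slides. Each window hash is rolled forward from the previous one instead of being recomputed.

```c
#include <string.h>

/* Hypothetical sketch of Rabin-Karp on digit strings, radix d = 10, modulus q.
   Stores up to max_out match shifts in out, returns the number of matches,
   and writes the number of spurious hits to *spurious. */
int rabin_karp(const char *txt, const char *pat, int q,
               int *out, int max_out, int *spurious)
{
    const int d = 10;
    int n = (int)strlen(txt), m = (int)strlen(pat);
    int found = 0;
    *spurious = 0;
    if (m == 0 || m > n)
        return 0;

    int h = 1;                              /* h = d^(m-1) mod q */
    for (int k = 1; k < m; k++)
        h = (h * d) % q;

    int p = 0, t = 0;                       /* hashes of pattern and first window */
    for (int k = 0; k < m; k++) {
        p = (d * p + (pat[k] - '0')) % q;
        t = (d * t + (txt[k] - '0')) % q;
    }

    for (int s = 0; s <= n - m; s++) {
        if (p == t) {                       /* hash hit: verify the characters */
            if (memcmp(txt + s, pat, (size_t)m) == 0) {
                if (found < max_out)
                    out[found] = s;
                found++;
            } else {
                (*spurious)++;              /* same hash, different digits */
            }
        }
        if (s < n - m) {                    /* roll: drop txt[s], append txt[s+m] */
            t = (d * (t - (txt[s] - '0') * h) + (txt[s + m] - '0')) % q;
            if (t < 0)
                t += q;                     /* C's % can yield a negative value */
        }
    }
    return found;
}
```

With T = "31415926535", P = "26" and q = 11, this should report one match at shift 6 (counting from 0) and three spurious hits, in line with the windows 15, 59 and 92 above.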
Example
• T = B A E C D E A A D A C
• P = A D A
• First assign a unique number to each character, such as
A=1 , B=2 , C=3 , D=4 , E=5
Position:  1  2  3  4  5  6  7  8  9 10 11
T:         B  A  E  C  D  E  A  A  D  A  C
Value:     2  1  5  3  4  5  1  1  4  1  3
Position:  1  2  3
P:         A  D  A
Value:     1  4  1
• First find the hash value of the pattern:
• 141 mod 11 = 9
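The remaining steps of this example follow the same procedure as the digit example: hash each 3-character window of T and compare characters only where the hash equals 9. A minimal sketch, assuming the letter numbering A=1 to E=5 given above and recomputing each window hash from scratch for clarity; the names window_hash and first_match are illustrative.

```c
#include <string.h>

/* Hypothetical helper: hash of the first m characters of s, reading letters
   as decimal digits with A=1, B=2, ..., E=5, taken modulo q. */
int window_hash(const char *s, int m, int q)
{
    int h = 0;
    for (int k = 0; k < m; k++)
        h = (10 * h + (s[k] - 'A' + 1)) % q;
    return h;
}

/* Returns the first 0-based shift where the window hash equals the pattern
   hash and the characters also match, or -1 if there is no such shift. */
int first_match(const char *txt, const char *pat, int q)
{
    int n = (int)strlen(txt), m = (int)strlen(pat);
    int p = window_hash(pat, m, q);
    for (int s = 0; s + m <= n; s++) {
        if (window_hash(txt + s, m, q) != p)
            continue;                       /* hashes differ: skip comparison */
        if (memcmp(txt + s, pat, (size_t)m) == 0)
            return s;                       /* hash hit confirmed by characters */
    }
    return -1;
}
```

For T = "BAECDEAADAC" and P = "ADA" with q = 11, the first window BAE gives 215 mod 11 = 6, and the match is found at 0-based shift 7, which is position 8 in the 1-based numbering of the table.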
String Matching with Finite Automata
• What is a finite automaton?
• A finite automaton is a simple machine that recognizes patterns.
• It is also known as a finite state machine.
• It is defined by a 5-tuple.
• It has a set of states and rules for moving from one state to another.
• 5-Tuples of Finite Automata
• A finite automaton M is a 5-tuple (Q, q0, A, Σ, δ), where
• Q is a finite set of states
• q0 ∈ Q is the start state
• A ⊆ Q is a set of accepting states
• Σ is a finite input alphabet
• δ is the transition function that gives the next state for a given current state
and input.
Algorithm
Input: text string T[1..n], transition function δ, and pattern length m
Result: all valid shifts displayed
FINITE-AUTOMATON-MATCHER (T, m, δ)
    n ← length[T]
    q ← 0
    for i ← 1 to n
        q ← δ(q, T[i])
        if q = m
            print "pattern occurs with shift" i - m
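The matcher above takes δ as given. One straightforward (not the most efficient) way to build it is sketched below: for each state q and input character c, δ(q, c) is the length of the longest prefix of the pattern that is a suffix of the first q pattern characters followed by c. The names build_delta and fa_match, the 256-entry alphabet and the 64-character pattern limit are all assumptions made for this illustration. Text indices are 0-based here, so the reported shift is i - m + 1 rather than i - m.

```c
#include <string.h>

#define FA_ALPHABET 256   /* assumption: one transition per byte value */
#define FA_MAX_M 64       /* assumption: patterns of at most 64 characters */

/* Fills delta[q][c] for every state q = 0..m and every byte value c. */
void build_delta(const char *pat, int m, int delta[][FA_ALPHABET])
{
    for (int q = 0; q <= m; q++) {
        for (int c = 0; c < FA_ALPHABET; c++) {
            int k = (q + 1 < m) ? q + 1 : m;     /* longest candidate prefix */
            while (k > 0) {
                /* Is pat[0..k-1] a suffix of pat[0..q-1] followed by c? */
                if ((unsigned char)pat[k - 1] == c &&
                    memcmp(pat, pat + q - (k - 1), (size_t)(k - 1)) == 0)
                    break;
                k--;
            }
            delta[q][c] = k;
        }
    }
}

/* Runs the automaton over txt; stores up to max_out shifts in out and
   returns the number of occurrences of pat. */
int fa_match(const char *txt, const char *pat, int *out, int max_out)
{
    static int delta[FA_MAX_M + 1][FA_ALPHABET];
    int n = (int)strlen(txt), m = (int)strlen(pat), found = 0;
    if (m == 0 || m > FA_MAX_M)
        return 0;
    build_delta(pat, m, delta);
    int q = 0;                                   /* start state */
    for (int i = 0; i < n; i++) {
        q = delta[q][(unsigned char)txt[i]];
        if (q == m) {                            /* accepting state reached */
            if (found < max_out)
                out[found] = i - m + 1;
            found++;
        }
    }
    return found;
}
```

Once δ is built, each text character costs exactly one table lookup, so the matching loop takes time proportional to n; the price is the preprocessing time and the table of (m + 1) × |Σ| entries.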
Refer to the class notes for this topic.
Knuth Morris Pratt (KMP) Algorithm
• This algorithm is named after the scientists Knuth, Morris and Pratt.
• The basic idea behind this algorithm is to use the prefix and suffix
information of the pattern to avoid re-comparing characters that have already matched.
• Let us first understand about how to find prefix and suffix :
String    Prefixes       Suffixes
AB        A              B
ABC       A, AB          C, BC
ABCD      A, AB, ABC     D, CD, BCD
Steps of KMP Algorithm
1) Find the prefix array (π table) of the pattern.
2) Use the π table as a reference for shifting the pattern while matching it against the text.
3) When all characters of the pattern match the text, use the following formula
to find the index of the match, where i is the text index of the last matched character:
i - length of pattern + 1
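The three steps can be sketched in C as follows. This is a minimal illustration with 0-based indices; the function name kmp_search, the output-array parameters and the 64-character pattern limit are assumptions made for the example.

```c
#include <string.h>

#define KMP_MAX_M 64   /* assumption: patterns of at most 64 characters */

/* Hypothetical sketch of KMP: stores up to max_out match indices in out and
   returns the number of occurrences of pat in txt. */
int kmp_search(const char *txt, const char *pat, int *out, int max_out)
{
    int n = (int)strlen(txt), m = (int)strlen(pat), found = 0;
    int pi[KMP_MAX_M];
    if (m == 0 || m > KMP_MAX_M)
        return 0;

    /* Step 1: pi[j] = length of the longest proper prefix of pat[0..j]
       that is also a suffix of pat[0..j]. */
    pi[0] = 0;
    for (int j = 1, k = 0; j < m; j++) {
        while (k > 0 && pat[j] != pat[k])
            k = pi[k - 1];
        if (pat[j] == pat[k])
            k++;
        pi[j] = k;
    }

    /* Step 2: scan the text once; on a mismatch fall back through pi
       instead of moving backwards in the text. */
    for (int i = 0, k = 0; i < n; i++) {
        while (k > 0 && txt[i] != pat[k])
            k = pi[k - 1];
        if (txt[i] == pat[k])
            k++;
        if (k == m) {
            /* Step 3: index of the match = i - length of pattern + 1 */
            if (found < max_out)
                out[found] = i - m + 1;
            found++;
            k = pi[k - 1];               /* continue, allowing overlaps */
        }
    }
    return found;
}
```

The text index i never decreases, which is why the search phase is linear in n once the π table has been built.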
Example
Index:    0 1 2 3 4 5 6
Pattern:  a b a b a d a
π:        0 0 1 2 3 0 1
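Each entry of this table can be checked directly against the usual definition: the value at index j is the length of the longest proper prefix of the pattern's first j + 1 characters that is also a suffix of them. The brute-force sketch below applies that definition literally; it is meant for verifying small tables by hand, not as the efficient construction, and the function name is an illustrative assumption.

```c
#include <string.h>

/* Hypothetical helper: fills pi[0..m-1] straight from the definition by
   trying every proper prefix length, longest first. */
void prefix_table_by_definition(const char *pat, int m, int *pi)
{
    for (int j = 0; j < m; j++) {
        pi[j] = 0;
        for (int len = j; len > 0; len--) {      /* proper: len < j + 1 */
            /* compare prefix pat[0..len-1] with the suffix ending at j */
            if (memcmp(pat, pat + j + 1 - len, (size_t)len) == 0) {
                pi[j] = len;
                break;
            }
        }
    }
}
```

For the pattern "ababada" this should reproduce the row 0 0 1 2 3 0 1; for example, at index 4 the string "ababa" has "aba" as both a prefix and a suffix, giving 3.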
GTU Questions
• What is a finite automaton? How can it be used in string matching? (3 marks)
• Explain the Rabin-Karp string matching algorithm. (7 marks)
• What is a finite automaton? Explain the use of finite automata for string
matching with a suitable example. (7 marks)
Thank You