String Matching
Algorithms
Topics
Basics of Strings
Brute-force String Matcher
Rabin-Karp String Matching Algorithm
KMP Algorithm
In string matching problems, it is required to find the
occurrences of a pattern in a text.
These problems find applications in text processing,
text-editing, computer security, and DNA sequence
analysis.
Find and Change in word processing
Sequence of the human cyclophilin 40 gene
CCCAGTCTGG AATACAGTGG CGCGATCTCG GTTCACTGCA
ACCGCCGCCT CCCGGGTTCA AACGATTCTC CTGCCTCAGC
CGCGATCTCG : DNA binding protein GATA-1
CCCGGG : DNA binding protein Sma 1
C: Cytosine, G: Guanine, A: Adenine, T: Thymine
CSE5311 Kumar
Text : T[1..n] of length n and Pattern P[1..m] of length m.
The elements of P and T are characters drawn from a finite
alphabet set Σ.
For example, Σ = {0,1} or Σ = {a,b, . . . , z}, or Σ = {c, g, a, t}.
The character arrays of P and T are also referred to as
strings of characters.
Pattern P is said to occur with shift s in text T
if 0 ≤ s ≤ n-m and
T[s+1..s+m] = P[1..m], that is,
T[s+j] = P[j] for 1 ≤ j ≤ m;
such a shift is called a valid shift.
The string-matching problem is the problem of finding all
valid shifts with which a given pattern P occurs in a given
text T.
Brute force string-matching algorithm
To find all valid shifts or possible values of s so that
P[1..m] = T[s+1..s+m] ;
There are n-m+1 possible values of s.
Procedure BF_String_Matcher(T,P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n-m
    do if P[1..m] = T[s+1..s+m]
      then report that shift s is valid
This algorithm takes Θ((n-m+1)m) time in the worst case.
Example: T = a c a a b c, P = a a b
s = 0: a c a ≠ a a b
s = 1: c a a ≠ a a b
s = 2: a a b = a a b (matches)
s = 3: a b c ≠ a a b
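The brute-force procedure can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `bf_string_matcher` is chosen here, and shifts are reported 0-based, which matches the definition of s because T[s+1..s+m] in 1-based indexing is the slice `text[s:s+m]`.

```python
def bf_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    valid_shifts = []
    # Try each of the n - m + 1 candidate shifts in turn.
    for s in range(n - m + 1):
        # The slice comparison costs up to m character comparisons,
        # which gives the (n - m + 1) * m worst-case bound.
        if text[s:s + m] == pattern:
            valid_shifts.append(s)
    return valid_shifts


print(bf_string_matcher("acaabc", "aab"))  # [2]
```

On the example text the only valid shift is s = 2.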
Rabin-Karp Algorithm
Let Σ = {0,1,2, . . .,9}.
We can view a string of k consecutive characters as
representing a length-k decimal number.
Let p denote the decimal number for P[1..m].
Let t_s denote the decimal value of the length-m
substring T[s+1..s+m] of T[1..n] for s = 0, 1, . . ., n-m.
t_s = p if and only if
T[s+1..s+m] = P[1..m], that is, if and only if s is a valid shift.
By Horner's rule,
p = P[m] + 10(P[m-1] + 10(P[m-2] + . . . + 10(P[2] + 10 P[1]) . . . ))
We can compute p in O(m) time.
Similarly, we can compute t_0 from T[1..m] in O(m) time.
Example with m = 4:
6378 = 8 + 7·10 + 3·10^2 + 6·10^3
= 8 + 10(7 + 10(3 + 10(6)))
= 8 + 70 + 300 + 6000
In general,
p = P[m] + 10(P[m-1] + 10(P[m-2] + . . . + 10(P[2] + 10 P[1]) . . . ))
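The nested form above is Horner's rule, which evaluates the number with one multiplication and one addition per digit. A small Python sketch follows; the function name `horner_value` and the `radix` parameter are illustrative assumptions, and the input is assumed to consist of digit characters valid in that radix.

```python
def horner_value(digits: str, radix: int = 10) -> int:
    """Evaluate a digit string as a number using Horner's rule."""
    value = 0
    # Process digits from most significant (P[1]) to least (P[m]):
    # each step shifts the running value one position left and adds a digit.
    for ch in digits:
        value = radix * value + int(ch)
    return value


print(horner_value("6378"))  # 6378
```

The loop body runs once per character, which is where the O(m) bound comes from.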
t_{s+1} can be computed from t_s in constant time:
t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]
Example: T = 314152
t_s = 31415, s = 0, m = 5 and T[s+m+1] = 2
t_{s+1} = 10(31415 - 10000·3) + 2 = 14152
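The constant-time update can be checked with a short Python sketch. The function name `window_values` is an assumption made for illustration; it takes a string of decimal digits with 1 ≤ m ≤ n and uses 0-based indexing, so T[s+1] becomes `text[s]` and T[s+m+1] becomes `text[s + m]`.

```python
def window_values(text: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for a string of decimal digits."""
    n = len(text)
    high = 10 ** (m - 1)      # weight of the leading digit of a window
    t = int(text[:m])         # t_0, the value of the first window
    values = [t]
    for s in range(n - m):
        # Drop the leading digit, shift left one place, append the next digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```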
Thus p and t_0, t_1, . . ., t_{n-m} can all be computed in O(n+m)
time, assuming each arithmetic operation on these numbers takes constant time.
All occurrences of the pattern P[1..m] in the text
T[1..n] can then be found in O(n+m) time.
However, p and t_s may be too large to work with
conveniently.
The computation of p and t_0 and the recurrence are therefore carried out modulo q.
In general, with a d-ary alphabet {0,1,…,d-1}, q is chosen such that
d·q fits within a computer word.
The recurrence equation can be rewritten as
t_{s+1} = (d(t_s - T[s+1]h) + T[s+m+1]) mod q,
where h ≡ d^{m-1} (mod q) is the value of the digit “1” in the high
order position of an m-digit text window.
Note that t_s ≡ p (mod q) does not imply that t_s = p.
However, if t_s is not equivalent to p mod q,
then t_s ≠ p, and the shift s is invalid.
We use t_s ≡ p (mod q) as a fast heuristic test to rule out the
invalid shifts.
Further testing is done to eliminate spurious hits:
- an explicit test to check whether
P[1..m] = T[s+1..s+m]
t_{s+1} = (d(t_s - T[s+1]h) + T[s+m+1]) mod q
h ≡ d^{m-1} (mod q)
Example:
T = 31415; P = 26, n = 5, m = 2, d = 10, q = 11
h = 10^1 mod 11 = 10
p = 26 mod 11 = 4
t_0 = 31 mod 11 = 9
t_1 = (10(9 - (3·10 mod 11)) + 4) mod 11
= (10(9 - 8) + 4) mod 11 = 14 mod 11 = 3
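Continuing this example for every shift shows why the explicit comparison is needed. The sketch below is illustrative only: the function name `window_hashes` is an assumption, the text is assumed to be a string of decimal digits, and the code relies on Python's `%` returning a non-negative result for a positive modulus (in languages where it can be negative, add q before reducing).

```python
def window_hashes(text: str, m: int, d: int = 10, q: int = 11) -> list[int]:
    """Return t_s mod q for every shift s of a digit string."""
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    t = 0
    for ch in text[:m]:                  # Horner's rule, reduced mod q
        t = (d * t + int(ch)) % q
    hashes = [t]
    for s in range(len(text) - m):
        # Rolling update: remove text[s], shift, append text[s + m].
        t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
        hashes.append(t)
    return hashes


text, pattern, q = "31415", "26", 11
p = int(pattern) % q
hashes = window_hashes(text, len(pattern), 10, q)
print(p, hashes)  # 4 [9, 3, 8, 4]

for s, t in enumerate(hashes):
    if t == p:
        is_match = text[s:s + len(pattern)] == pattern
        print(s, "match" if is_match else "spurious hit")  # 3 spurious hit
```

At s = 3 the window is 15, and 15 ≡ 26 ≡ 4 (mod 11), so the hash test passes although 15 ≠ 26. The explicit check rejects this shift.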
Procedure RABIN-KARP-MATCHER(T,P,d,q)
Input : Text T, pattern P, radix d (which is typically |Σ|),
and the prime q.
Output : valid shifts s where P matches
1. n ← length[T];
2. m ← length[P];
3. h ← d^{m-1} mod q;
4. p ← 0;
5. t_0 ← 0;
6. for i ← 1 to m
7.    do p ← (d·p + P[i]) mod q;
8.       t_0 ← (d·t_0 + T[i]) mod q;
9. for s ← 0 to n-m
10.   do if p = t_s
11.      then if P[1..m] = T[s+1..s+m]
12.         then print “pattern occurs with shift” s
13.      if s < n-m
14.         then t_{s+1} ← (d(t_s - T[s+1]h) + T[s+m+1]) mod q
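A runnable Python sketch of the procedure is given below. Several details are assumptions made for illustration rather than part of the slides: characters are mapped to digits with `ord`, the default radix is 256, the default modulus is the prime 2^31 - 1, shifts are 0-based, and an empty pattern returns no shifts.

```python
def rabin_karp_matcher(text: str, pattern: str,
                       d: int = 256, q: int = 2_147_483_647) -> list[int]:
    """Return all valid shifts of pattern in text using a rolling hash mod q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                          # line 3: d^(m-1) mod q
    p = t = 0
    for i in range(m):                            # lines 6-8: preprocessing
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):                    # line 9: every shift
        # Lines 10-12: compare characters only when the hashes agree,
        # which filters out spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                             # lines 13-14: roll the hash
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("acaabc", "aab"))        # [2]
print(rabin_karp_matcher("31415", "26", 10, 11))  # []
```

The second call uses a deliberately small modulus, so a spurious hit occurs and is rejected by the explicit comparison.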
Comments on Rabin-Karp Algorithm
All characters are interpreted as radix-d digits
h is initialized to the value of the high-order digit position of an
m-digit window
p and t_0 are computed in O(m + m) = O(m) time
The loop of line 9 takes Θ((n-m+1)m) time in the worst case
The loop 6-8 takes O(m) time
The overall worst-case running time is O((n-m+1)m)
Exercises
-- Homework
Study KMP Algorithm for String Matching
-- Knuth Morris Pratt (KMP)
Study Boyer-Moore Algorithm for String matching
Extend the Rabin-Karp method to the problem of searching a text string for an
occurrence of any one of a given set of k patterns. Start by assuming that all
k patterns have the same length. Then generalize your solution to allow the
patterns to have different lengths.
Let P be a set of n points in the plane. We define the depth of a point p in P as
the number of convex hulls that need to be peeled (removed) for p to become
a vertex of the convex hull. Design an O(n^2) algorithm to find the depths of all
points in P.
The input is two strings of characters A = a_1, a_2, …, a_n and B = b_1, b_2, …,
b_n. Design an O(n) time algorithm to determine whether B is a cyclic shift of
A. In other words, the algorithm should determine whether there exists an
index k, 1 ≤ k ≤ n, such that a_i = b_{(k+i) mod n} for all i, 1 ≤ i ≤ n.