String Matching Algorithms
Article in International Journal Of Engineering And Computer Science · March 2018
DOI: 10.18535/ijecs/v7i3.19
All content following this page was uploaded by Preeti Narooka on 04 February 2020.
www.ijecs.in
International Journal Of Engineering And Computer Science ISSN:2319-7242
Volume 7 Issue 3 March 2018, Page No. 23769-23772
Index Copernicus Value (2015): 58.10, 76.25 (2016) DOI: 10.18535/ijecs/v7i3.19
String Matching Algorithms
Mukku Bhagya Sri, Rachita Bhavsar, Preeti Narooka
Computer Department
Terna Engineering College, Nerul
Computer Department
Terna Engineering College, Nerul
Assistant Professor
Computer Department
Terna Engineering College, Nerul
Abstract:
To analyze the content of documents, pattern matching algorithms are used to find all occurrences of a limited set of patterns within an input text or document. For this task, this work uses four existing string matching algorithms: the Brute Force algorithm, the Knuth-Morris-Pratt (KMP) algorithm, the Boyer-Moore algorithm and the Rabin-Karp algorithm. It also proposes three new string matching algorithms: the Enhanced Boyer-Moore algorithm, the Enhanced Rabin-Karp algorithm and the Enhanced Knuth-Morris-Pratt algorithm.
Findings: The experiments use two types of documents, .txt and .docx. The performance measures are search time, number of iterations and accuracy. The experimental results indicate that the enhanced KMP algorithm gives better accuracy than the other string matching algorithms. Application/Improvements: These algorithms are typically used in text mining, document classification, content analysis and plagiarism detection. In future work, the algorithms are to be enhanced further to improve their performance, and additional document types will be used for experimentation.
Keywords: Brute Force, Boyer Moore, Information Retrieval, Knuth-Morris-Pratt, Pattern Matching, Rabin Karp
I. Introduction
String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) occur within a larger string or text. Let Σ be an alphabet (a finite set). Formally, both the pattern and the searched text are vectors of elements of Σ. Σ may be a usual human alphabet (for example, the letters A through Z in the Latin alphabet). Other applications may use a binary alphabet (Σ = {0,1}) or, in bioinformatics, the DNA alphabet (Σ = {A,C,G,T}).[11]
We assume that the text is an array T[1..n] of length n, that the pattern is an array P[1..m] of length m, and that m <= n. The character arrays T and P are often called strings of characters. We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s+1 in text T) if 0 <= s <= n-m and T[s+1..s+m] = P[1..m]. If P occurs with shift s in T, then we call s a valid shift; otherwise, we call s an invalid shift. The string matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.
A large number of algorithms are known to solve the string matching problem. Based on the number of patterns searched for, the algorithms can be classified as single-pattern and multiple-pattern algorithms. Applications may require exact or approximate string matching.
Exact String Matching Problem
We are given a text string T and a pattern string P, and we want to find all occurrences of P in T. In the exact string matching problem, the pattern must be found exactly inside the text. Consider the following example:
T=AGCCTAAGCTCCTAAGTC
Mukku Bhagya Sri, IJECS Volume 7 Issue 3 March 2018 Page No. 23769-23772 Page 23769
P=CCTA
There are two occurrences of P in T, with shifts s=2 and s=10, marked with brackets below:
AG[CCTA]AGCT[CCTA]AGTC
A brute force method for exact string matching:
T=ACCACTAGA
P=ACTA
  ACTA
   ACTA
    ACTA
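The brute force method illustrated above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `brute_force_search` is an assumption, and shifts are reported 0-based, which coincides with the shift s defined in the Introduction (P begins at position s+1 of T).

```python
def brute_force_search(text, pattern):
    """Return every valid shift s at which pattern occurs in text.

    Hypothetical helper for illustration: the pattern is slid one
    position at a time and re-compared from its first character.
    """
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # every candidate shift 0..n-m
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1                      # extend the match character by character
        if j == m:                      # all m characters matched
            shifts.append(s)
    return shifts


print(brute_force_search("ACCACTAGA", "ACTA"))
print(brute_force_search("AGCCTAAGCTCCTAAGTC", "CCTA"))
```

For the two examples in the text, the sketch reports shift 3 for P=ACTA and shifts 2 and 10 for P=CCTA.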
If the brute force method is used, many characters that have already been matched will be matched again, because each time a mismatch occurs the pattern is moved only one step. There are many exact string matching algorithms, and nearly all of them are concerned with how to slide the pattern. A few of them are listed below.
II. Methodology
The main goal of this research work is to match patterns of text by analyzing the contents of documents using string matching algorithms. For this task, the work uses four existing string matching algorithms: the Brute Force algorithm, the Knuth-Morris-Pratt (KMP) algorithm, the Boyer-Moore algorithm and the Rabin-Karp algorithm. It also proposes three new string matching algorithms: the Enhanced Boyer-Moore algorithm, the Enhanced Rabin-Karp algorithm and the Enhanced Knuth-Morris-Pratt algorithm. The performance factors used are the time taken to search for the pattern, the number of iterations required and the accuracy, for single-word search, multiple-word search and file search. This paper, however, studies the Knuth-Morris-Pratt (KMP) and Rabin-Karp algorithms in detail.
Fig: Methodology
Existing Algorithms:
1. Rabin-Karp Algorithm
The Rabin-Karp algorithm is one of the simpler string searching algorithms. It was developed by Michael O. Rabin and Richard M. Karp in 1987. The algorithm uses a hash function to discover potential occurrences of the pattern in the input text. For a text of length n and p patterns of combined length m, its average and best-case running time is O(n+m) in space O(p), and its worst-case time is O(nm) in space O(m). The algorithm first computes the hash value of the pattern and then the hash value of every length-m substring of the input text. If the hash values of the pattern and a text substring match, the substring is compared with the pattern and the match is reported; otherwise, the hash value of the next length-m substring is computed and compared.
Algorithm: Rabin-Karp
RABIN-KARP-MATCHER(T,P,d,q)
1 n = T.length
2 m = P.length
3 h = d^(m-1) mod q
4 p = 0
5 t_0 = 0
6 for i = 1 to m
7     p = (d*p + P[i]) mod q
8     t_0 = (d*t_0 + T[i]) mod q
9 for s = 0 to n-m
10     if p == t_s
11         if P[1..m] == T[s+1..s+m]
12             print "Pattern occurs with shift" s
13     if s < n-m
14         t_(s+1) = (d*(t_s - T[s+1]*h) + T[s+m+1]) mod q
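The pseudocode above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the name `rabin_karp_search` and the default values d=256 and q=101 are chosen here for the example, characters are turned into digits with `ord`, and shifts are reported 0-based.

```python
def rabin_karp_search(text, pattern, d=256, q=101):
    """Return all shifts at which pattern occurs in text.

    d (radix) and q (prime modulus) are illustrative defaults,
    not values taken from the paper.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)      # weight of the high-order digit, d^(m-1) mod q
    p = 0                     # hash of the pattern
    t = 0                     # hash of the current text window
    for i in range(m):        # preprocessing, as in lines 6-8 of the pseudocode
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes only signal a candidate; verify to rule out collisions.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Rolling update: drop text[s], append text[s + m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp_search("AGCCTAAGCTCCTAAGTC", "CCTA"))
```

Because Python's `%` operator returns a non-negative result for a positive modulus, the rolling update needs no extra correction when the intermediate difference is negative.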
The procedure works as follows. All characters are interpreted as radix-d digits. The subscripts on t are
provided only for clarity; the program works correctly if all the subscripts are dropped. Line 3 initializes h to the value of the high-order digit position of an m-digit window. Lines 4-8 compute p as the value of P[1..m] mod q and t_0 as the value of T[1..m] mod q. The for loop of lines 9-14 iterates through all possible shifts s, maintaining the following invariant: whenever line 10 is executed, t_s = T[s+1..s+m] mod q.
Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt (KMP) algorithm is a linear-time string searching algorithm developed through analysis of the brute force (naive) algorithm. It was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris; the three published it jointly in 1977. The KMP algorithm reduces the total number of comparisons of the pattern against the input string. A matching time of O(n) is achieved by avoiding comparisons with elements of S that have earlier been compared.
1. The prefix function, π
The prefix function π for a pattern summarizes how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern p. In other words, it avoids backtracking on the string S.
2. The KMP matcher
With string S, pattern p and prefix function π as inputs, the matcher finds the occurrences of p in S and returns the number of shifts of p after which each occurrence is found.
3. Running-time analysis
The running time for computing the prefix function is Θ(m), and the running time of the matching function is Θ(n).
Algorithm: Knuth-Morris-Pratt
1 n = T.length
2 m = P.length
3 π = COMPUTE-PREFIX-FUNCTION(P)
4 q = 0
5 for i = 1 to n
6     while q > 0 and P[q+1] ≠ T[i]
7         q = π[q]
8     if P[q+1] == T[i]
9         q = q+1
10     if q == m
11         print "Pattern occurs with shift" i-m
12         q = π[q]
Enhanced Algorithms:
Enhanced Rabin Karp Algorithm
This searching algorithm uses a hash function to find any one of a set of patterns in the input text. Hashing offers a simple way to reduce the total number of character comparisons. For a text of length N and patterns P of combined length M, its best-case running time is O(N+M), and its worst-case time is O(NM). The algorithm first computes the hash value of the pattern. It then checks the input text, window by window, against that hash value. If a mismatch occurs, the window is shifted to the next character, the hash value is recalculated, and the same process continues. Otherwise, the algorithm returns the index position at which the match begins.
Algorithm: Enhanced Rabin Karp Algorithm
1 Function relation(S,P,n,m,k,q)
2 Begin
3 h ← k^(m-1) mod q;
4 p ← 0;
5 t_0 ← 0;
6 for i = 1 to m do
7     p ← (k*p + P[i]) mod q;
8     t_0 ← (k*t_0 + S[i]) mod q;
9 End for
10 For j = 0 to n-m do
    a) If p = t_j then
        i) If P = S[j+1..j+m] then
            Out j+1;
        ii) End if
    b) End if
11 If j < n-m then
12     t_(j+1) ← (k*(t_j - S[j+1]*h) + S[j+m+1]) mod q;
13 End for
14 End.
Enhanced Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm is one of the efficient string matching algorithms. It searches for occurrences of a pattern p within a main text t by using the observation that, when a mismatch occurs, the word itself embodies sufficient information to determine where the next match can begin, thus avoiding re-examination of previously matched characters. The enhanced algorithm uses a bit table to detect a mismatch of the pattern in the input text. It performs the comparison from left to right. It uses the bit table for the comparison: if the bits match, it returns the index in the text; otherwise, it checks the next bit.
Algorithm: Enhanced Knuth-Morris-Pratt Algorithm
1 KMP_search(E(p),E(T))
2 Begin
3 Preprocess E(p) to obtain the next_bit table
4 While (not end of input) do
    a) Get next bit b;
    b) If (j >= 0) & (b != E(p)[j]) do
    c) End if
    d) If (j = |E(p)|)
        i) Return a match
        ii) j ← -1
    e) End if
    f) j ← j+1
    g) End while
5 End.
Variants:
Rabin-Karp Algorithm
A. Long patterns and Σ
For long patterns and large Σ, the Boyer-Moore algorithm gives much better efficiency than other string matching algorithms. The algorithm involves two heuristics that allow it to skip many text characters altogether. It makes successive comparisons from right to left. When a mismatch occurs, each heuristic proposes an amount by which the shift can be increased without skipping any valid shift, and the larger of the two is chosen.
B. Repetition Factors
An efficient string matching algorithm based on repetition factors was developed by Galil and Seiferas. The algorithm has linear running time and requires only O(1) storage beyond P and T.
C. Approximate String Matching
The Bitap algorithm performs approximate string matching based on the Levenshtein distance between strings. It requires much less preprocessing and uses mostly bitwise operations, making it extremely fast.
D. Dictionary Matching
The Aho-Corasick algorithm can match multiple (but finitely many) patterns in a text in parallel, achieving linear running time.
E. Polymorphic String Matching
A combination of more than one string matching algorithm (for example, a fusion of KMP and Boyer-Moore) can be used to provide a better functional algorithm with decreased space and [...]
Knuth-Morris-Pratt
A real-time version of KMP can be implemented using a separate failure function table for each character in the alphabet. If a mismatch occurs on character x in the text, the failure function table for character x is consulted for the index i in the pattern at which the mismatch took place. This returns the length of the longest substring ending at i that matches a prefix of the pattern, with the added condition that the character after the prefix is x. With this restriction, character x in the text need not be checked again in the next phase, and so only a constant number of operations are executed between the processing of each index of the text. This satisfies the real-time computing restriction.
III. Conclusion
This research work analyzes the performance measures of existing and enhanced string matching algorithms. The performance factors are time, number of iterations and accuracy, for a single line, multiple lines and a file. From the analysis, among the existing algorithms the KMP algorithm gives better accuracy for all the inputs. Among the enhanced algorithms, the enhanced KMP algorithm gives better accuracy. Comparing the existing and enhanced KMP algorithms, the enhanced KMP algorithm gives the better accuracy.
IV. Acknowledgement
We feel privileged to express our deepest sense of gratitude to our guide, Prof. Preeti. Her prompt and kind help led to the completion of this work.
References
[1] Verma A, Kaur I, Singh I. Comparative analysis of data mining tools and techniques for information retrieval. Indian Journal of Science and Technology.
[2] Al-Mazroi A, Rashid NA. A Fast Hybrid Algorithm for the Exact String Matching Problem. American Journal of Engineering and Applied Sciences. 2011.
[3] Algorithm book by Cormen.
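As a supplement to the Knuth-Morris-Pratt discussion in this paper, the prefix function and matcher can be sketched in Python as follows. This is a sketch under assumptions, not the authors' implementation: the names `compute_prefix_function` and `kmp_search` are chosen here, indices are 0-based (so `pi[q]` corresponds to π[q+1] in the 1-based pseudocode), and shifts are reported 0-based.

```python
def compute_prefix_function(pattern):
    """pi[q] is the length of the longest proper prefix of pattern[:q + 1]
    that is also a suffix of pattern[:q + 1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # length of the current border
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                   # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text, pattern):
    """Return all shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                                   # number of pattern characters matched
    for i in range(n):                      # the text index never moves backwards
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]                   # reuse what is known about the pattern
        if pattern[q] == text[i]:
            q += 1
        if q == m:                          # full match ending at text[i]
            shifts.append(i - m + 1)
            q = pi[q - 1]                   # continue, allowing overlapping matches
    return shifts


print(compute_prefix_function("ababaca"))
print(kmp_search("AGCCTAAGCTCCTAAGTC", "CCTA"))
```

The matcher never re-reads a text character, which is where the Θ(n) matching time stated in the paper comes from; the Θ(m) preprocessing is the single pass in `compute_prefix_function`.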