String Matching Algorithms

AADI JAIN
3 min readApr 7, 2022

--

Have you ever imagined, how google or any search engine performs searching of keywords that you have entered in the bucket of millions, trillions of words and sentences?
Do you how the implementation of detecting plagiarism is done, due to which all students are scared of their assignments?
Yes, behind all these findings and searching for a pattern either in a text or sequence is called String Matching particular to finding a pattern in a given text.

“String Matching” algorithms can be very helpful for text-editing programs as these are also used, for example, to search for particular patterns in DNA sequences.

Do data science also uses technique of String Matching?

Applications of String Matching Algorithms:

  • Digital Forensics
  • Spelling Checker
  • Spam Filters
  • Intrusion Detection System

There are various algorithms that are useful in the case of searching a string within another string which helps in performing time-efficient tasks in multiple domains. In this article, we’ll be going to concentrate on the Naive String Algorithm(Character comparison) and Rabin Karp Algorithm(Hashing String Matching).

Naive String-Matching Algorithm
A very simple and inefficient way to check whether a pattern occurs inside a given text at a particular index or not, we see if there’s a replica of the needle starting at the first character of our Haystack. If it’s not we move ahead and compare the same with the next character, and so forth.
The number of comparisons in the worst-case scenario for this algorithm is O(m*(n-m+1)), where m is the length of the pattern and n is the length of the text. Remember, we always assume that m≤n as the pattern to be searched in the text must always be smaller than the main text.

Naive String Algorithm

Because the information gathered about the text for one value of s is completely ignored when examining other values of s, the Naive string match is not an efficient strategy. The worst case in Naive is further improved by Rabin and Karp, let’s have a look upon that👀

The Rabin-Karp Algorithm

Rabin and Karp devised a string-matching algorithm that works well in reality and generalises to other methods for similar issues like two-dimensional pattern matching. The Rabin-Karp algorithm uses Θ(m) preprocessing time, and its worst-case running time is Θ((n — m +1)m). However, it has a better average-case running time based on specific assumptions.

In the Rabin-Karp algorithm. Each character is a decimal digit, and we compute values modulo from a prime number say 13. (a) A text string. A window of length 5 is shown. The numerical value of this number is computed modulo 13, yielding the value 7. (b) (b) For each conceivable position of a length-5 window, the same text string with values computed modulo 13.
Using the pattern P = 31415, we check for windows with a modulo 13 value of 7, because 31415 = 7. (mod 13). Two such windows are found. Starting at the first text position 7 is actually the appearance of the pattern, and starting at the second text position 13 is a false match. Computing the value for a window in constant time, given the value for the previous window. The first window has value 31415. Dropping the high-order digit 3, shifting left (multiplying by 10), and then adding in the low order digit 2 gives us the new value 14152. However, because both computations are modulo 13, the value computed for the first window is 7, and the value computed for the next window is 8.

The following procedure of Rabin-Karp-Matcher works as follows:

Rabin-Karp Algorithm

The RABIN-KARP-MATCHER takes Θ(m) preprocessing time, and its matching time is Θ((n — m + 1)m) in the worst case, since (just like the naive string-matching algorithm) the Rabin-Karp algorithm explicitly verifies every valid shift ‘s’. If P = a^m and T = a^n, then the verification takes time Θ((n — m + 1)m), since each of the n — m + 1 possible shifts is valid.

Limitations of Rabin-Karp Algorithm

Spurious Hit

It is a condition when the hash value of the window of the text matches, the hash value of the pattern but the window is not the actual pattern.

The time complexity of this algorithm is increased due to the spurious hits. In order to minimize it, we use modulus. It reduces the spurious hits.

When spurious hits occur a number for all the windows, the worst-case complexity occurs.

There are various other algorithms as well like Knuth-Morris-Pratt Algorithm, Boyer Moore Algorithm, etc.

However, because both computations are modulo 13, the value computed for the first window is 7, and the value computed for the next window is 8.
To get the gist, see this explanatory video of mine: https://youtu.be/OdMtqRruMfg

--

--

AADI JAIN

A learner, who learn things and try to express my learning by writing it down. Trying to be a good engineer :)