Boyer-Moore-Horspool Algorithm

Introduction

In developing an application, it is common to have acceptance testing completed before deploying to production. In a typical Create-Read-Update-Delete (CRUD) application, users and business analysts complete the acceptance testing; and now it can be automated thanks to Quality & Assurance developers.

In these kind of CRUD applications, an acceptance consists of (and not limited to):

Create a record and check in database if it created
Retrieve the record and records from the database and view validity after creating or updating
Edit a record and verify it did update in the database
Delete a record and check if it is either deleted or flagged as inactive in the database. This is dependent on how the application was designed by the architect.
Also, check for race conditions when two users are editing the same record
And stress testing by adding multiple users using the application.

However, not all applications are CRUD applications. There are many applications which convert data from legacy systems to modern systems, say from ASCII text files to store in database or from XML (eXtended Markup Language) to JSON (JavaScript Object Notation) etc.

How will analysts or stake holders accept that this data has been migrated correctly without any loss of critical information? The answer is simple we compare the incoming data into the application and compare it to the outgoing and it should match exactly.

String comparison is the solution. Comparing the two documents can be done by seraching if one string (the pattern) occurs in the other string (the source). Let us take a simple example:

Question: Does the word ‘grapes‘ occur in the sentence ‘The grapes are sour.’

Answer: Yes, as is shown by coloring above.

The naive approach is to compare if each character from the 1st position matches with the first character in the pattern. The first letter in the sentence start ‘T’ and not match the first letter (‘g’) of the pattern. So we compare ‘g’ with the second letter which is ‘h’ and so on until the first letter matches with one of the words in the sentence. So we go on until we find a match – character by character in the sentence.

In the Knuth-Morris-Pratt algorithm is based on the naive algorithm. However, in case there is a partial match as shown in the wikepedia link example then it will move that many characters and start, instead of jumping one character to one character, for partial matches.

The BMH (Boyer-Moore-Horspool) is based on the Knuth-Morris-Pratt algorithm. The difference is it starts the comparison from right to left for the pattern while parsing from left to right in the sentence.

A good introduction video of Boyer-Moore-Horspool by Mike Slade is below:

I hope you saw the video for an introduction of the algorithm. I give a simple example below of the method of pattern searching the BMH agorithm.

index	0	1	2	3	4	5	6	7	8
Source	f	o	r		a	p	p	l	e
Pattern	a	p	p	l	e

As’a’ in source does not match the last letter in pattern =’apple’ by 5

index	0	1	2	3	4	5	6	7	8
Source	f	o	r		a	p	p	l	e
Pattern					a	p	p	l	e

After moving, we compare from right to left and we find a match.

Now with that introduction, let us do some implementation and coding.

Implementation

We will write the program in kotlin and the code can be found in github. I have written unit tests to show the implementation in the code. These unit tests can be run in Intellij IDE after configuring or on the command line as

./gradlew test

A simple ‘happy path’ unit test is given :

  @Test
  fun `Happy path`() {
    val source = "Test Tooth String. Molar teeth are growing."
    val pattern = "Tooth"
    val returnValue = horspool.boyerMooreHorspoolSearch(source, pattern)
     assert(returnValue == 5)
  }

In this code, the pattern ‘Tooth‘ occurs after the word ‘Test ‘ in the long string source. So the horspool returns the value 5. Remember that we count from zero(0) in the array. The main class which has the function computing the pattern search is:

class Horspool {

    fun boyerMooreHorspoolSearch(source: String, pattern: String): Int {
        val pattChar = pattern.toCharArray()
        val patternLength: Int = pattChar.size
        if (patternLength == 0) {
            return 0
        }
        val src = source.toCharArray()
        val srcLength: Int = src.size

        var shift = IntArray(200) { patternLength }
        for (k in 0..(patternLength - 2)) {
            shift[pattChar[k].toInt()] = patternLength - 1 - k
        }

        var i = 0
        var j: Int
        while ((i + patternLength) <= srcLength) {
            j = patternLength - 1
            while (source[i + j] == pattChar[j]) {
                j -= 1
                if (j < 0) {
                    return i
                }
            }
            i += shift[src[i + patternLength - 1].toInt()]
        }
        return -1
    }
}

Now we can also search for phrases if we need using the same code as seen below:

@Test
fun `Phrase search`() {
  val source = "The cow jumped over the moon."
  val pattern = "jumped over"
  val returnValue = horspool.boyerMooreHorspoolSearch(source, pattern)
  assert(returnValue == 8)
}

So the value is position 8 in the source string where the phrase ‘jumped over‘ occurs.

All the implementation and call to the function is done in unit tests. You can also find a unit test which finds positions of multiple occurrences of the pattern in the source string.

Conclusion

This algorithm can be used to compare say two documents and state clearly whether they are the same in an efficiently.

The code can be extended as shown in one of the unit test to check if pattern occurs multiple times in the source string.

Of course, more work would be needed to be done if we want to exclude certain words before we compare the two documents. I leave that as an exercise for you.

References

The code for the Horspool Example
The Introduction to Horspool Algorithm by Mike Slade

7 thoughts on “Boyer-Moore-Horspool Algorithm – What & When Used?”

7 comments

erotik
March 2, 2021, 7:02 AM

Im grateful for the article post. Thanks Again. Keep writing. Sonnnie Darrel Renny

Igor
January 26, 2021, 3:04 PM

This is a great article! Thanks for writing it.

- admin
  February 23, 2021, 9:53 PM
  
  you are welcome Igor!
  
Sitaram Ramachandrula
January 25, 2021, 10:14 AM

Thanks for the interesting article. It may be worthwhile to compare and contrast with “Levenshtein distance” and “Longest common subsequence problem” algorithms. Both do the string matching with slightly different motives and applications.

Secondly when matching sentences, is it worthwhile to hash each word in the sentence? Won’t that help faster matches?

- Rajeev
  February 23, 2021, 10:05 PM
  
  You are welcome! thank you for the note on the other algorithms. I am not sure about hashing each word, however that is more for Machine Language. About speed in searches I have not done extensive testing on it and cannot comment.
  
Prena Shah
January 24, 2021, 12:27 PM

Very nice article!

- Rajeev
  February 23, 2021, 10:06 PM
  
  Thank you Prena!

Introduction