Boyer-Moore-Horspool Algorithm – What & When Used?

Introduction

In developing an application, it is common to have acceptance testing completed before deploying to production. In a typical Create-Read-Update-Delete (CRUD) application, users and business analysts complete the acceptance testing; and now it can be automated thanks to Quality & Assurance developers.

In these kind of CRUD applications, an acceptance consists of (and not limited to):

  • Create a record and check in database if it created
  • Retrieve the record and records from the database and view validity after creating or updating
  • Edit a record and verify it did update in the database
  • Delete a record and check if it is either deleted or flagged as inactive in the database. This is dependent on how the application was designed by the architect.
  • Also, check for race conditions when two users are editing the same record
  • And stress testing by adding multiple users using the application.

However, not all applications are CRUD applications. There are many applications which convert data from legacy systems to modern systems, say from ASCII text files to store in database or from XML (eXtended Markup Language) to JSON (JavaScript Object Notation) etc.

How will analysts or stake holders accept that this data has been migrated correctly without any loss of critical information?  The answer is simple we compare the incoming data into the application and compare it to the outgoing and it should match exactly.

Boyer-Moore-Horspool Algorithm

String comparison is the solution. Comparing the two documents can be done by seraching if one string (the pattern) occurs in the other string (the source). Let us take a simple example:

Question: Does the word ‘grapes‘ occur in the sentence ‘The grapes are sour.’

Answer: Yes, as is shown by coloring above. 

The naive approach is to compare if each character from the 1st position matches with the first character in the pattern. The first letter in the sentence start ‘T’ and not match the first letter (‘g’) of the pattern. So we compare ‘g’ with the second letter which is ‘h’ and so on until the first letter matches with one of the words in the sentence. So we go on until we find a match – character by character in the sentence. 

In the Knuth-Morris-Pratt algorithm is based on the naive algorithm. However, in case there is a partial match as shown in the wikepedia link example then it will move that many characters and start, instead of jumping one character to one character, for partial matches.

The BMH (Boyer-Moore-Horspool) is based on the Knuth-Morris-Pratt algorithm. The difference is it starts the comparison from right to left  for the pattern while parsing from left to right in the sentence.

A good introduction video of Boyer-Moore-Horspool by Mike Slade is below:

I hope you saw the video for an introduction of the algorithm. I give a simple example below of the method of pattern searching the BMH agorithm.

index012345678
Sourceforapple
Patternapple
As’a’ in source does not match the last letter in pattern =’apple’ by 5
index012345678
Sourceforapple
Patternapple
After moving, we compare from right to left and we find a match.

Now with that introduction, let us do some implementation and coding.

Implementation

We will write the program in kotlin and the code can be found in github. I have written unit tests to show the implementation in the code. These unit tests can be run in Intellij IDE after configuring or on the command line as

./gradlew test

A simple ‘happy path’ unit test is given :

  @Test
  fun `Happy path`() {
    val source = "Test Tooth String. Molar teeth are growing."
    val pattern = "Tooth"
    val returnValue = horspool.boyerMooreHorspoolSearch(source, pattern)
     assert(returnValue == 5)
  }

In this code, the pattern ‘Tooth‘ occurs after the word ‘Test ‘ in the long string source. So the horspool returns the value 5. Remember that we count from zero(0) in the array. The main class which has the function computing the pattern search is:

class Horspool {

    fun boyerMooreHorspoolSearch(source: String, pattern: String): Int {
        val pattChar = pattern.toCharArray()
        val patternLength: Int = pattChar.size
        if (patternLength == 0) {
            return 0
        }
        val src = source.toCharArray()
        val srcLength: Int = src.size

        var shift = IntArray(200) { patternLength }
        for (k in 0..(patternLength - 2)) {
            shift[pattChar[k].toInt()] = patternLength - 1 - k
        }

        var i = 0
        var j: Int
        while ((i + patternLength) <= srcLength) {
            j = patternLength - 1
            while (source[i + j] == pattChar[j]) {
                j -= 1
                if (j < 0) {
                    return i
                }
            }
            i += shift[src[i + patternLength - 1].toInt()]
        }
        return -1
    }
}

Now we can also search for phrases if we need using the same code as seen below:

@Test
fun `Phrase search`() {
  val source = "The cow jumped over the moon."
  val pattern = "jumped over"
  val returnValue = horspool.boyerMooreHorspoolSearch(source, pattern)
  assert(returnValue == 8)
}

So the value is position 8 in the source string where the phrase ‘jumped over‘ occurs.

All the implementation and call to the function is done in unit tests. You can also find a unit test which finds positions of multiple occurrences of the pattern in the source string.

Conclusion

This algorithm can be used to compare say two documents and state clearly whether they are the same in an efficiently.

The code can be extended as shown in one of the unit test to check if pattern occurs multiple times in the source string.

Of course, more work would be needed to be done if we want to exclude certain words before we compare the two documents. I leave that as an exercise for you.

References

  1. The code for the Horspool Example
  2. The Introduction to Horspool Algorithm by Mike Slade

Leave a comment

Your email address will not be published. Required fields are marked *

7 thoughts on “Boyer-Moore-Horspool Algorithm – What & When Used?”