Rabin-Karp Algorithm

Description

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to efficiently calculate hash values for subsequent substrings of the text. The algorithm was created by Richard M. Karp and Michael O. Rabin in 1987.

How It Works

Hash Calculation:
- Calculate hash value of the pattern
- Calculate hash value of first window of text
- Use a rolling hash function for efficiency
Window Sliding:
- Slide window one character at a time
- Update hash value using rolling hash property
- Compare new hash with pattern hash
Hash Matching:
- If hash values match, check characters
- Handle hash collisions with direct comparison
- Continue until all matches are found
Rolling Hash:
- Remove leading digit: hash = hash - val(txt[i-1]) * d^(m-1)
- Shift left: hash = hash * d
- Add trailing digit: hash = hash + val(txt[i+m-1])

History

The Rabin-Karp algorithm was developed by Richard M. Karp and Michael O. Rabin in 1987. It combines two key ideas: rolling hash functions and string matching. The algorithm was a significant improvement over naive string matching for specific use cases, particularly in detecting plagiarism by finding duplicate content in large texts. The use of rolling hash functions made it possible to update hash values efficiently, leading to its widespread adoption in document similarity detection and bioinformatics applications.

Visualization

Text:

Pattern:

Click Start to begin visualization

Implementation


def rabin_karp(text, pattern):
    """
    Find all occurrences of pattern in text using Rabin-Karp algorithm.
    
    Args:
        text: String to search in
        pattern: String to search for
    
    Returns:
        List of starting indices where pattern is found
    """
    if not pattern or not text:
        return []
    
    # Constants for hash calculation
    prime = 101
    base = 256
    
    n, m = len(text), len(pattern)
    if m > n:
        return []
    
    # Calculate pattern hash
    pattern_hash = 0
    window_hash = 0
    base_power = 1
    
    # Calculate initial hash values
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % prime
        window_hash = (window_hash * base + ord(text[i])) % prime
        if i < m - 1:
            base_power = (base_power * base) % prime
    
    matches = []
    
    # Check first window
    if pattern_hash == window_hash and text[:m] == pattern:
        matches.append(0)
    
    # Slide window
    for i in range(n - m):
        # Remove leading digit
        window_hash = window_hash - (ord(text[i]) * base_power) % prime
        if window_hash < 0:
            window_hash += prime
            
        # Add trailing digit
        window_hash = (window_hash * base + ord(text[i + m])) % prime
        
        # Check window
        start = i + 1
        if pattern_hash == window_hash and text[start:start + m] == pattern:
            matches.append(start)
    
    return matches

# Example usage
if __name__ == "__main__":
    text = "AABAACAADAABAAABAA"
    pattern = "AABA"
    
    matches = rabin_karp(text, pattern)
    print(f"Pattern found at positions: {matches}")


#include <iostream>
#include <vector>
#include <string>

class RabinKarp {
private:
    const int prime = 101;
    const int base = 256;
    
public:
    std::vector<int> search(const std::string& text, const std::string& pattern) {
        std::vector<int> matches;
        
        if (pattern.empty() || text.empty())
            return matches;
            
        int n = text.length();
        int m = pattern.length();
        
        if (m > n)
            return matches;
            
        // Calculate pattern hash
        int pattern_hash = 0;
        int window_hash = 0;
        int base_power = 1;
        
        // Calculate initial hash values
        for (int i = 0; i < m; i++) {
            pattern_hash = (pattern_hash * base + pattern[i]) % prime;
            window_hash = (window_hash * base + text[i]) % prime;
            if (i < m - 1)
                base_power = (base_power * base) % prime;
        }
        
        // Check first window
        if (pattern_hash == window_hash && 
            text.substr(0, m) == pattern) {
            matches.push_back(0);
        }
        
        // Slide window
        for (int i = 0; i < n - m; i++) {
            // Remove leading digit
            window_hash = window_hash - 
                         (text[i] * base_power) % prime;
            if (window_hash < 0)
                window_hash += prime;
                
            // Add trailing digit
            window_hash = (window_hash * base + text[i + m]) % prime;
            
            // Check window
            int start = i + 1;
            if (pattern_hash == window_hash && 
                text.substr(start, m) == pattern) {
                matches.push_back(start);
            }
        }
        
        return matches;
    }
};

int main() {
    std::string text = "AABAACAADAABAAABAA";
    std::string pattern = "AABA";
    
    RabinKarp rk;
    auto matches = rk.search(text, pattern);
    
    std::cout << "Pattern found at positions:";
    for (int pos : matches)
        std::cout << " " << pos;
    std::cout << std::endl;
    
    return 0;
}


public class RabinKarp {
    private const int prime = 101;
    private const int base = 256;
    
    public List<int> Search(string text, string pattern) {
        var matches = new List<int>();
        
        if (string.IsNullOrEmpty(pattern) || 
            string.IsNullOrEmpty(text))
            return matches;
            
        int n = text.Length;
        int m = pattern.Length;
        
        if (m > n)
            return matches;
            
        // Calculate pattern hash
        int patternHash = 0;
        int windowHash = 0;
        int basePower = 1;
        
        // Calculate initial hash values
        for (int i = 0; i < m; i++) {
            patternHash = (patternHash * base + pattern[i]) % prime;
            windowHash = (windowHash * base + text[i]) % prime;
            if (i < m - 1)
                basePower = (basePower * base) % prime;
        }
        
        // Check first window
        if (patternHash == windowHash && 
            text.Substring(0, m) == pattern) {
            matches.Add(0);
        }
        
        // Slide window
        for (int i = 0; i < n - m; i++) {
            // Remove leading digit
            windowHash = windowHash - 
                        (text[i] * basePower) % prime;
            if (windowHash < 0)
                windowHash += prime;
                
            // Add trailing digit
            windowHash = (windowHash * base + text[i + m]) % prime;
            
            // Check window
            int start = i + 1;
            if (patternHash == windowHash && 
                text.Substring(start, m) == pattern) {
                matches.Add(start);
            }
        }
        
        return matches;
    }
    
    public static void Main() {
        string text = "AABAACAADAABAAABAA";
        string pattern = "AABA";
        
        var rk = new RabinKarp();
        var matches = rk.Search(text, pattern);
        
        Console.Write("Pattern found at positions:");
        foreach (int pos in matches)
            Console.Write($" {pos}");
        Console.WriteLine();
    }
}

Complexity Analysis

Time Complexity

The time complexity varies:

Average case: O(n + m)
Worst case: O(nm) when many hash collisions
n is text length, m is pattern length
Hash calculation is O(1) with rolling hash

Space Complexity

The space complexity is O(1), which includes:

Storage for hash values
Rolling hash variables
No additional data structures needed

Advantages and Disadvantages

Advantages

Efficient average case performance
Good for multiple pattern search
Rolling hash makes updates O(1)
Suitable for plagiarism detection

Disadvantages

Poor worst-case performance
Hash collisions require extra checks
Complex hash function needed
Not suitable for very short patterns

Quick Info