MANG Problem #31: Regular Expression Matching (Hard)

1. Problem Statement

Mental Model

Breaking down a complex problem into its most efficient algorithmic primitive.

Given an input string s and a pattern p, implement regular expression matching with support for '.' and '*' where:

'.' Matches any single character.
'*' Matches zero or more of the preceding element.

The matching should cover the entire input string (not partial).

Input: s = "aab", p = "c*a*b"
Output: true

2. The Mental Model: The "Non-Deterministic State Machine"

Imagine the pattern p as a set of instructions for a machine.

A character like 'a' means: "Consume one 'a'."
A dot '.' means: "Consume any single character."
A star '*' is the complex one. It means: "Look at the previous character. You can either delete it (zero occurrences) or keep it and stay in this state (multiple occurrences)."

Because of the '*', a single state can lead to multiple future states. This is why we use Dynamic Programming to track all possible "Matches" simultaneously.

3. Visual Execution (The DP Transition)

graph TD
    Table[DP Table: s[i] x p[j]]
    Table --> CharMatch{p[j] == s[i] or '.'?}
    CharMatch -- Yes --> Diagonal[dp[i][j] = dp[i-1][j-1]]
    
    Table --> StarMatch{p[j] == '*'}
    StarMatch -- Case 1 --> ZeroOccur[dp[i][j] = dp[i][j-2]]
    StarMatch -- Case 2 --> MultiOccur[If p[j-1] matches s[i]: dp[i][j] = dp[i-1][j]]

4. Java Implementation (Bottom-Up DP)

public boolean isMatch(String s, String p) {
    int m = s.length();
    int n = p.length();
    
    // 1. Initialize (m+1) x (n+1) table
    // dp[i][j] means s[0..i-1] matches p[0..j-1]
    boolean[][] dp = new boolean[m + 1][n + 1];
    
    // 2. Base Case: Empty string matches empty pattern
    dp[0][0] = true;

    // 3. Handle patterns like a* or a*b* matching empty string
    for (int j = 2; j <= n; j++) {
        if (p.charAt(j - 1) == '*') {
            dp[0][j] = dp[0][j - 2];
        }
    }

    // 4. Fill the DP Table
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            char sc = s.charAt(i - 1);
            char pc = p.charAt(j - 1);
            
            if (pc == sc || pc == '.') {
                // Diagonal Match
                dp[i][j] = dp[i - 1][j - 1];
            } else if (pc == '*') {
                // Case 1: Match 0 of the preceding character (Skip the star and its prefix)
                dp[i][j] = dp[i][j - 2];
                
                // Case 2: Match 1 or more of the preceding character
                char preChar = p.charAt(j - 2);
                if (preChar == sc || preChar == '.') {
                    dp[i][j] = dp[i][j] || dp[i - 1][j];
                }
            }
        }
    }
    
    return dp[m][n];
}

5. Verbal Interview Script (Staff Tier)

Interviewer: "Walk me through the logic of the '' character in your DP table."*

You: "The Kleene star is the most complex state transition because it allows the pattern to effectively 'branch' into two possibilities. In my implementation, when p[j] is a star, I evaluate both paths. First, the Zero-Occurrence path: I check dp[i][j-2], which effectively asks: 'Does the string match the pattern if we delete the character before the star?' Second, the Multi-Occurrence path: if the character before the star matches the current character in the string, I check dp[i-1][j]. This is the 'Aha!' moment—by looking at the previous row in the same column, we are checking if the string ending one character earlier also matched the same star pattern. This allows the star to greedily consume as many matching characters as needed without creating an exponential recursion tree."

6. Staff-Level Follow-Ups

Follow-up 1: "Can we use Recursion with Memoization?"

The Answer: "Yes. A Top-Down approach with a Boolean[][] memo would be equally efficient in terms of time complexity. It is often easier to write during a 45-minute interview, but it carries the $O(S+P)$ stack overhead. The Bottom-Up Tabulation I implemented is generally more memory-efficient and predictable for production compilers."

Follow-up 2: "How do you handle regex with '?' or '+'?"

The Answer: "A + is just one mandatory character followed by a * (e.g., a+ is aa*). A ? is just a choice between dp[i-1][j-1] (match) or dp[i][j-1] (skip). The DP framework I built is highly extensible to any standard regular expression operator."

7. Performance Nuances (The Java Perspective)

Boolean Table Overhead: A boolean[1000][1000] matrix in Java consumes roughly 1MB of heap space. For extremely long patterns, we can reduce this to $O(P)$ space by using only the current and previous rows, as dp[i][j] only ever references row i or i-1.
String.charAt(): For large loops, I'd cache the string lengths and potentially convert the strings to char[] once to minimize the overhead of multiple charAt() calls and their associated bounds checks.

Key Takeaways

'.' Matches any single character.
'*' Matches zero or more of the preceding element.
A character like 'a' means: "Consume one 'a'."