MANG Problem #8: Alien Dictionary (Hard)

1. Problem Statement

Mental Model

Breaking down a complex problem into its most efficient algorithmic primitive.

There is a new alien language that uses the English alphabet. However, the order of the letters is unknown to you. You are given a list of strings words from the alien language dictionary, where the strings in words are sorted lexicographically by the rules of this new language.

Return a string of the unique letters in the new alien language sorted in lexicographical increasing order. If there is no solution, return "".

Input: words = ["wrt","wrf","er","ett","rftt"]
Output: "wertf"

2. The Mental Model: The "Dependency Chain"

This is a Constraint Satisfaction problem. Every time you see two words in a sorted dictionary, like "apple" and "apply", the first mismatch character (e.g., 'e' and 'y') gives you a directed relationship: e -> y ('e' must come before 'y').

In Computer Science, a set of "must come before" rules is a Directed Acyclic Graph (DAG).

Nodes: The unique characters in the language.
Edges: The relationships we extract from adjacent words.
Goal: Find a linear order that satisfies all edges. This is the definition of Topological Sort.

3. Visual Execution (Extracting Edges)

graph LR
    subgraph "Edge Extraction"
        W1[wrt] --- W2[wrf] --> E1[t -> f]
        W2 --- W3[er] --> E2[w -> e]
        W3 --- W4[ett] --> E3[r -> t]
        W4 --- W5[rftt] --> E4[e -> r]
    end
    
    subgraph "Final Graph"
        W --> E --> R --> T --> F
    end

4. Java Implementation (Kahn's BFS)

public String alienOrder(String[] words) {
    // 1. Initialize data structures
    Map<Character, Set<Character>> adj = new HashMap<>();
    Map<Character, Integer> inDegree = new HashMap<>();
    for (String w : words) {
        for (char c : w.toCharArray()) inDegree.put(c, 0);
    }

    // 2. Build the graph by comparing adjacent words
    for (int i = 0; i < words.length - 1; i++) {
        String w1 = words[i], w2 = words[i+1];
        
        // Edge Case: If w2 is a prefix of w1, it is invalid (e.g., "abc", "ab")
        if (w1.length() > w2.length() && w1.startsWith(w2)) return "";
        
        for (int j = 0; j < Math.min(w1.length(), w2.length()); j++) {
            char c1 = w1.charAt(j), c2 = w2.charAt(j);
            if (c1 != c2) {
                // If this is a new edge, add it and update inDegree
                if (!adj.getOrDefault(c1, new HashSet<>()).contains(c2)) {
                    adj.computeIfAbsent(c1, k -> new HashSet<>()).add(c2);
                    inDegree.put(c2, inDegree.get(c2) + 1);
                }
                break; // Only the first mismatch counts!
            }
        }
    }

    // 3. Kahn's Algorithm (BFS)
    StringBuilder sb = new StringBuilder();
    Queue<Character> q = new LinkedList<>();
    for (char c : inDegree.keySet()) {
        if (inDegree.get(c) == 0) q.offer(c);
    }

    while (!q.isEmpty()) {
        char curr = q.poll();
        sb.append(curr);
        if (adj.containsKey(curr)) {
            for (char next : adj.get(curr)) {
                inDegree.put(next, inDegree.get(next) - 1);
                if (inDegree.get(next) == 0) q.offer(next);
            }
        }
    }

    // 4. Cycle Check: If SB length != unique characters, a cycle exists
    return sb.length() == inDegree.size() ? sb.toString() : "";
}

5. Verbal Interview Script (Staff Tier)

Interviewer: "How do you handle invalid inputs or cycles in the dependency graph?"

You: "There are two critical validation steps. First, during graph construction, I check if a word is a longer prefix of the word that comes after it—for example, ['apple', 'app']. Since the dictionary is sorted, this is physically impossible in lexicographical order, so I return an empty string immediately. Second, I use Kahn's Algorithm for the topological sort. A key property of Kahn's is that it only processes nodes with an inDegree of 0. If the graph contains a cycle (e.g., a -> b -> a), the nodes in the cycle will never reach an inDegree of 0. Therefore, if the final length of my result string is less than the number of unique characters, I know a cycle was present and return an empty string."

6. Real-World Application: Package Managers

This exact logic is used in:

Build Systems (Maven/Gradle): Determining the order in which modules should be compiled based on their dependencies.
OS Package Managers (apt/brew): Finding a valid installation sequence for libraries that depend on each other.

7. Performance Nuances (Staff Level)

Space Efficiency: In Java, using a Map<Character, Set<Character>> is flexible but heavy. For a production system limited to English letters, a List<Integer>[] adjacency list with an int[26] in-degree array is much more CPU-cache friendly and reduces garbage collection.
BFS vs DFS: While Kahn's BFS is easier to understand, a DFS-based TopoSort using a visited and onStack state is equally valid and slightly more memory-efficient as it avoids the Queue overhead.

6. Staff-Level Interview Follow-Ups

Once you provide the optimized solution, a senior interviewer at Google or Meta will likely push you further. Here is how to handle the most common follow-ups:

Follow-up 1: "How does this scale to a Distributed System?"

If the input data is too large to fit on a single machine (e.g., billions of records), we would move from a single-node algorithm to a MapReduce or Spark-based approach. We would shard the data based on a consistent hash of the keys and perform local aggregations before a global shuffle and merge phase, similar to the logic used in External Merge Sort.

Follow-up 2: "What are the Concurrency implications?"

In a multi-threaded Java environment, we must ensure that our state (e.g., the DP table or the frequency map) is thread-safe. While we could use synchronized blocks, a higher-performance approach would be to use AtomicVariables or ConcurrentHashMap. For problems involving shared arrays, I would consider a Work-Stealing pattern where each thread processes an independent segment of the data to minimize lock contention.

7. Performance Nuances (The Java Perspective)

Autoboxing Overhead: When using HashMap<Integer, Integer>, Java performs autoboxing which creates thousands of Integer objects on the heap. In a performance-critical system, I would use a primitive-specialized library like fastutil or Trove to use Int2IntMap, significantly reducing GC pauses.
Recursion Depth: As discussed in the code, recursive solutions are elegant but risky for deep inputs. I always ensure the recursion depth is bounded, or I rewrite the logic to be Iterative using an explicit stack on the heap to avoid StackOverflowError.

Key Takeaways

Nodes: The unique characters in the language.
Edges: The relationships we extract from adjacent words.
Goal: Find a linear order that satisfies all edges. This is the definition of Topological Sort.

A curated practice bank for turning patterns into repetition.