Lesson 1 of 70 6 min

MANG Problem #8: Alien Dictionary (Hard)

Learn how to reconstruct the character order of an unknown language using Graph TopoSort and adjacency lists. Master dependency management at scale.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

1. Problem Statement

Mental Model

Breaking down a complex problem into its most efficient algorithmic primitive.

There is a new alien language that uses the English alphabet. However, the order of the letters is unknown to you. You are given a list of strings words from the alien language dictionary, where the strings in words are sorted lexicographically by the rules of this new language.

Return a string of the unique letters in the new alien language sorted in lexicographical increasing order. If there is no solution, return "".

Input: words = ["wrt","wrf","er","ett","rftt"]
Output: "wertf"

2. The Mental Model: The "Dependency Chain"

This is a Constraint Satisfaction problem. Every time you see two words in a sorted dictionary, like "apple" and "apply", the first mismatch character (e.g., 'e' and 'y') gives you a directed relationship: e -> y ('e' must come before 'y').

In Computer Science, a set of "must come before" rules is a Directed Acyclic Graph (DAG).

  • Nodes: The unique characters in the language.
  • Edges: The relationships we extract from adjacent words.
  • Goal: Find a linear order that satisfies all edges. This is the definition of Topological Sort.

3. Visual Execution (Extracting Edges)

graph LR
    subgraph "Edge Extraction"
        W1[wrt] --- W2[wrf] --> E1[t -> f]
        W2 --- W3[er] --> E2[w -> e]
        W3 --- W4[ett] --> E3[r -> t]
        W4 --- W5[rftt] --> E4[e -> r]
    end
    
    subgraph "Final Graph"
        W --> E --> R --> T --> F
    end

4. Java Implementation (Kahn's BFS)

public String alienOrder(String[] words) {
    // 1. Initialize data structures
    Map<Character, Set<Character>> adj = new HashMap<>();
    Map<Character, Integer> inDegree = new HashMap<>();
    for (String w : words) {
        for (char c : w.toCharArray()) inDegree.put(c, 0);
    }

    // 2. Build the graph by comparing adjacent words
    for (int i = 0; i < words.length - 1; i++) {
        String w1 = words[i], w2 = words[i+1];
        
        // Edge Case: If w2 is a prefix of w1, it is invalid (e.g., "abc", "ab")
        if (w1.length() > w2.length() && w1.startsWith(w2)) return "";
        
        for (int j = 0; j < Math.min(w1.length(), w2.length()); j++) {
            char c1 = w1.charAt(j), c2 = w2.charAt(j);
            if (c1 != c2) {
                // If this is a new edge, add it and update inDegree
                if (!adj.getOrDefault(c1, new HashSet<>()).contains(c2)) {
                    adj.computeIfAbsent(c1, k -> new HashSet<>()).add(c2);
                    inDegree.put(c2, inDegree.get(c2) + 1);
                }
                break; // Only the first mismatch counts!
            }
        }
    }

    // 3. Kahn's Algorithm (BFS)
    StringBuilder sb = new StringBuilder();
    Queue<Character> q = new LinkedList<>();
    for (char c : inDegree.keySet()) {
        if (inDegree.get(c) == 0) q.offer(c);
    }

    while (!q.isEmpty()) {
        char curr = q.poll();
        sb.append(curr);
        if (adj.containsKey(curr)) {
            for (char next : adj.get(curr)) {
                inDegree.put(next, inDegree.get(next) - 1);
                if (inDegree.get(next) == 0) q.offer(next);
            }
        }
    }

    // 4. Cycle Check: If SB length != unique characters, a cycle exists
    return sb.length() == inDegree.size() ? sb.toString() : "";
}

5. Verbal Interview Script (Staff Tier)

Interviewer: "How do you handle invalid inputs or cycles in the dependency graph?"

You: "There are two critical validation steps. First, during graph construction, I check if a word is a longer prefix of the word that comes after it—for example, ['apple', 'app']. Since the dictionary is sorted, this is physically impossible in lexicographical order, so I return an empty string immediately. Second, I use Kahn's Algorithm for the topological sort. A key property of Kahn's is that it only processes nodes with an inDegree of 0. If the graph contains a cycle (e.g., a -> b -> a), the nodes in the cycle will never reach an inDegree of 0. Therefore, if the final length of my result string is less than the number of unique characters, I know a cycle was present and return an empty string."

6. Real-World Application: Package Managers

This exact logic is used in:

  1. Build Systems (Maven/Gradle): Determining the order in which modules should be compiled based on their dependencies.
  2. OS Package Managers (apt/brew): Finding a valid installation sequence for libraries that depend on each other.

7. Performance Nuances (Staff Level)

  1. Space Efficiency: In Java, using a Map<Character, Set<Character>> is flexible but heavy. For a production system limited to English letters, a List<Integer>[] adjacency list with an int[26] in-degree array is much more CPU-cache friendly and reduces garbage collection.
  2. BFS vs DFS: While Kahn's BFS is easier to understand, a DFS-based TopoSort using a visited and onStack state is equally valid and slightly more memory-efficient as it avoids the Queue overhead.

6. Staff-Level Interview Follow-Ups

Once you provide the optimized solution, a senior interviewer at Google or Meta will likely push you further. Here is how to handle the most common follow-ups:

Follow-up 1: "How does this scale to a Distributed System?"

If the input data is too large to fit on a single machine (e.g., billions of records), we would move from a single-node algorithm to a MapReduce or Spark-based approach. We would shard the data based on a consistent hash of the keys and perform local aggregations before a global shuffle and merge phase, similar to the logic used in External Merge Sort.

Follow-up 2: "What are the Concurrency implications?"

In a multi-threaded Java environment, we must ensure that our state (e.g., the DP table or the frequency map) is thread-safe. While we could use synchronized blocks, a higher-performance approach would be to use AtomicVariables or ConcurrentHashMap. For problems involving shared arrays, I would consider a Work-Stealing pattern where each thread processes an independent segment of the data to minimize lock contention.

7. Performance Nuances (The Java Perspective)

  1. Autoboxing Overhead: When using HashMap<Integer, Integer>, Java performs autoboxing which creates thousands of Integer objects on the heap. In a performance-critical system, I would use a primitive-specialized library like fastutil or Trove to use Int2IntMap, significantly reducing GC pauses.
  2. Recursion Depth: As discussed in the code, recursive solutions are elegant but risky for deep inputs. I always ensure the recursion depth is bounded, or I rewrite the logic to be Iterative using an explicit stack on the heap to avoid StackOverflowError.

Key Takeaways

  • Nodes: The unique characters in the language.
  • Edges: The relationships we extract from adjacent words.
  • Goal: Find a linear order that satisfies all edges. This is the definition of Topological Sort.

Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.