1. Problem Statement
Mental Model
Breaking down a complex problem into its most efficient algorithmic primitive.
There is a new alien language that uses the English alphabet. However, the order of the letters is unknown to you. You are given a list of strings words from the alien language dictionary, where the strings in words are sorted lexicographically by the rules of this new language.
Return a string of the unique letters in the new alien language sorted in lexicographical increasing order. If there is no solution, return "".
Input: words = ["wrt","wrf","er","ett","rftt"]
Output: "wertf"
2. The Mental Model: The "Dependency Chain"
This is a Constraint Satisfaction problem. Every time you see two words in a sorted dictionary, like "apple" and "apply", the first mismatch character (e.g., 'e' and 'y') gives you a directed relationship: e -> y ('e' must come before 'y').
In Computer Science, a set of "must come before" rules is a Directed Acyclic Graph (DAG).
- Nodes: The unique characters in the language.
- Edges: The relationships we extract from adjacent words.
- Goal: Find a linear order that satisfies all edges. This is the definition of Topological Sort.
3. Visual Execution (Extracting Edges)
graph LR
subgraph "Edge Extraction"
W1[wrt] --- W2[wrf] --> E1[t -> f]
W2 --- W3[er] --> E2[w -> e]
W3 --- W4[ett] --> E3[r -> t]
W4 --- W5[rftt] --> E4[e -> r]
end
subgraph "Final Graph"
W --> E --> R --> T --> F
end
4. Java Implementation (Kahn's BFS)
public String alienOrder(String[] words) {
// 1. Initialize data structures
Map<Character, Set<Character>> adj = new HashMap<>();
Map<Character, Integer> inDegree = new HashMap<>();
for (String w : words) {
for (char c : w.toCharArray()) inDegree.put(c, 0);
}
// 2. Build the graph by comparing adjacent words
for (int i = 0; i < words.length - 1; i++) {
String w1 = words[i], w2 = words[i+1];
// Edge Case: If w2 is a prefix of w1, it is invalid (e.g., "abc", "ab")
if (w1.length() > w2.length() && w1.startsWith(w2)) return "";
for (int j = 0; j < Math.min(w1.length(), w2.length()); j++) {
char c1 = w1.charAt(j), c2 = w2.charAt(j);
if (c1 != c2) {
// If this is a new edge, add it and update inDegree
if (!adj.getOrDefault(c1, new HashSet<>()).contains(c2)) {
adj.computeIfAbsent(c1, k -> new HashSet<>()).add(c2);
inDegree.put(c2, inDegree.get(c2) + 1);
}
break; // Only the first mismatch counts!
}
}
}
// 3. Kahn's Algorithm (BFS)
StringBuilder sb = new StringBuilder();
Queue<Character> q = new LinkedList<>();
for (char c : inDegree.keySet()) {
if (inDegree.get(c) == 0) q.offer(c);
}
while (!q.isEmpty()) {
char curr = q.poll();
sb.append(curr);
if (adj.containsKey(curr)) {
for (char next : adj.get(curr)) {
inDegree.put(next, inDegree.get(next) - 1);
if (inDegree.get(next) == 0) q.offer(next);
}
}
}
// 4. Cycle Check: If SB length != unique characters, a cycle exists
return sb.length() == inDegree.size() ? sb.toString() : "";
}
5. Verbal Interview Script (Staff Tier)
Interviewer: "How do you handle invalid inputs or cycles in the dependency graph?"
You: "There are two critical validation steps. First, during graph construction, I check if a word is a longer prefix of the word that comes after it—for example, ['apple', 'app']. Since the dictionary is sorted, this is physically impossible in lexicographical order, so I return an empty string immediately. Second, I use Kahn's Algorithm for the topological sort. A key property of Kahn's is that it only processes nodes with an inDegree of 0. If the graph contains a cycle (e.g., a -> b -> a), the nodes in the cycle will never reach an inDegree of 0. Therefore, if the final length of my result string is less than the number of unique characters, I know a cycle was present and return an empty string."
6. Real-World Application: Package Managers
This exact logic is used in:
- Build Systems (Maven/Gradle): Determining the order in which modules should be compiled based on their dependencies.
- OS Package Managers (apt/brew): Finding a valid installation sequence for libraries that depend on each other.
7. Performance Nuances (Staff Level)
- Space Efficiency: In Java, using a
Map<Character, Set<Character>>is flexible but heavy. For a production system limited to English letters, aList<Integer>[]adjacency list with anint[26]in-degree array is much more CPU-cache friendly and reduces garbage collection. - BFS vs DFS: While Kahn's BFS is easier to understand, a DFS-based TopoSort using a
visitedandonStackstate is equally valid and slightly more memory-efficient as it avoids the Queue overhead.
6. Staff-Level Interview Follow-Ups
Once you provide the optimized solution, a senior interviewer at Google or Meta will likely push you further. Here is how to handle the most common follow-ups:
Follow-up 1: "How does this scale to a Distributed System?"
If the input data is too large to fit on a single machine (e.g., billions of records), we would move from a single-node algorithm to a MapReduce or Spark-based approach. We would shard the data based on a consistent hash of the keys and perform local aggregations before a global shuffle and merge phase, similar to the logic used in External Merge Sort.
Follow-up 2: "What are the Concurrency implications?"
In a multi-threaded Java environment, we must ensure that our state (e.g., the DP table or the frequency map) is thread-safe. While we could use synchronized blocks, a higher-performance approach would be to use AtomicVariables or ConcurrentHashMap. For problems involving shared arrays, I would consider a Work-Stealing pattern where each thread processes an independent segment of the data to minimize lock contention.
7. Performance Nuances (The Java Perspective)
- Autoboxing Overhead: When using
HashMap<Integer, Integer>, Java performs autoboxing which creates thousands ofIntegerobjects on the heap. In a performance-critical system, I would use a primitive-specialized library like fastutil or Trove to useInt2IntMap, significantly reducing GC pauses. - Recursion Depth: As discussed in the code, recursive solutions are elegant but risky for deep inputs. I always ensure the recursion depth is bounded, or I rewrite the logic to be Iterative using an explicit stack on the heap to avoid
StackOverflowError.
Key Takeaways
- Nodes: The unique characters in the language.
- Edges: The relationships we extract from adjacent words.
- Goal: Find a linear order that satisfies all edges. This is the definition of Topological Sort.