Given a root folder/directory, find all duplicate files within it. Return groups of file paths where each group contains files with identical content.
For a directory structure like:
root/
├── a/
│ ├── f1.mp4
│ └── f2.mp4
└── b/
├── f3.mp4
└── tmp/
└── f4.mp4
Expected output format:
[
["a/f1.mp4", "a/f2.mp4"], # Group 1: identical content
["b/f3.mp4", "b/tmp/f4.mp4"] # Group 2: identical content
]
Traverse all files in the directory tree starting from the root folder
Group files by content — files with identical content should be in the same group
Return only duplicates — files that appear alone (no duplicates) should not be included in the output
Handle nested directories — recursively search through subdirectories
A straightforward solution:
Use os.walk() to traverse the directory tree
Compute a hash (e.g., MD5, SHA-256) of each file's full content
Group files by their hash value
Return groups with more than one file
def find_duplicate_files(root_path: str) -> List[List[str]]:
"""
Find all duplicate files in the directory tree.
Args:
root_path: Root directory to search
Returns:
List of groups, where each group contains paths to duplicate files
"""
pass
You will need to manually create test files in the interview environment (e.g., Google Colab)
Consider edge cases: empty files, symbolic links, file permissions
Think about the time complexity: it depends on the total size of all files
The basic approach of hashing every file's complete content can be slow for large files. Consider a multi-stage filtering approach to avoid unnecessary full hashing:
Stage 1 - File Size (Metadata):
Group files by size first
Files with different sizes cannot be duplicates
This is very fast (just metadata lookup, no I/O)
Stage 2 - Partial Hash (First Chunk):
For files with the same size, read and hash only the first chunk (e.g., 1024 bytes)
Files with different partial hashes cannot be duplicates
Much faster than reading entire large files
Stage 3 - Full Hash:
Only when size and partial hash match, compute the full content hash
This ensures correctness while minimizing expensive operations
Best case: Most files have unique sizes → O(number of files) for metadata only
Worst case: All files have the same size and same initial content → O(total file size) for full hashing
Average case: Significantly better than naive approach due to early filtering
def find_duplicate_files_optimized(root_path: str) -> List[List[str]]:
# Stage 1: Group by file size
size_groups = group_by_file_size(root_path)
# Result: {size1: [file1, file2, ...], size2: [file3, ...], ...}
# Stage 2: For each size group, compute partial hash
partial_hash_groups = {}
for files in size_groups.values():
if len(files) > 1:
group_by_partial_hash(files, partial_hash_groups)
# Result: {partial_hash1: [file1, file2, ...], partial_hash2: [file3, ...], ...}
# Note: Stage 3 will handle any rare hash collisions across different sizes
# Stage 3: For each partial hash group, compute full hash
full_hash_groups = {}
for files in partial_hash_groups.values():
if len(files) > 1:
group_by_full_hash(files, full_hash_groups)
# Result: {full_hash1: [file1, file2, ...], full_hash2: [file3, ...], ...}
# Return only groups with duplicates
return [files for files in full_hash_groups.values() if len(files) > 1]
The following questions are for verbal discussion. The interviewer does not expect you to code solutions, but be prepared to explain your approach and reasoning.
Question: Is this duplicate file detection program I/O bound or CPU bound? How would you determine this?
Discussion Points:
Determining the bottleneck:
Use profiling tools to monitor CPU usage and I/O wait time
If CPU is mostly idle while the program runs → I/O bound (waiting for disk reads)
If CPU is consistently busy (high utilization) → CPU bound (hash computation is the bottleneck)
For typical file deduplication:
Usually I/O bound because reading files from disk is slower than computing hashes
Exception: With modern SSDs and very fast I/O, especially with small files, it may become more CPU bound
With SSDs, the boundary between I/O bound and CPU bound becomes less clear
Optimization strategies:
If I/O bound: Use asynchronous I/O, read multiple files in parallel, optimize chunk size
If CPU bound: Use faster hash algorithms (e.g., xxHash instead of SHA-256), parallelize hash computation across CPU cores
Question: Which hash algorithm would you choose and why? What are the trade-offs?
Discussion Points:
Common hash algorithms:
MD5: Fast but cryptographically broken; acceptable collision rate for file deduplication
SHA-256: Slower but very low collision probability; cryptographically secure
xxHash / CityHash: Extremely fast non-cryptographic hashes; good for deduplication
Trade-offs:
Speed vs Collision Rate: Faster hashes (MD5, xxHash) have higher collision rates than slower hashes (SHA-256)
Cryptographic vs Non-cryptographic: For file deduplication, cryptographic security is not required
No perfect hash: Every hash function has potential collisions; you can always construct adversarial cases
Recommendation:
For most cases: SHA-256 provides excellent collision resistance with acceptable speed
For very large-scale systems prioritizing speed: xxHash or CityHash with additional verification
After hash matching, consider byte-by-byte comparison for absolute certainty (if critical)
Important Note on Hash Collisions:
Theoretically: Hash collisions exist (pigeonhole principle - infinite inputs map to finite output space)
Practically with SHA-256: Collision probability is ~2^-128, which is astronomically small (never observed in practice)
Trade-off: Byte-by-byte verification after hash match provides 100% certainty but doubles I/O cost
Recommendation: For SHA-256, the collision risk is negligible; verification is overkill for most systems
Question: How would you design this system to work across multiple machines with massive numbers of files?
Discussion Points:
MapReduce Approach:
Map Phase:
Each mapper processes a subset of files
Emit key-value pairs: (file_hash, file_path) or (file_size, file_path)
Shuffle Phase:
Group files with the same hash/size together
Files with similar characteristics are shuffled to the same reducer
Reduce Phase:
Each reducer receives files with the same hash
Perform final verification and output duplicate groups
Partitioning Strategy:
Partition by file size first: Route files of similar size to the same machine
Then perform hash computation and grouping on each machine
This minimizes cross-machine communication
Distributed Hash Table (DHT):
Use a distributed key-value store (e.g., Redis, Cassandra)
Store hash → [file_paths] mapping across multiple nodes
Use consistent hashing for load balancing
Considerations:
Network bandwidth: Avoid sending file content across network; compute hashes locally
Load balancing: Ensure even distribution of work across machines
Fault tolerance: Handle machine failures and retries
Question: How would you design a system that continuously monitors a directory for duplicate files and notifies users when duplicates are added or removed?
Discussion Points:
System Design:
File System Watcher:
Use OS-level file system events (e.g., inotify on Linux, FSEvents on macOS)
Monitor for file creation, modification, and deletion events
Database Schema:
Table 1: hash_to_files — maps file hash to list of file paths (for duplicate detection)
Table 2: file_to_hash — maps file path to its hash (for handling deletions)
Event Handling:
File Added: Compute hash, check if hash exists in database
If exists → duplicate detected, notify file owner
Add to both tables
File Deleted: Look up hash from file_to_hash, remove from both tables
If this was part of a duplicate group, notify other owners
Notification System:
Queue-based notification system (e.g., RabbitMQ, Kafka)
Send notifications to file owners when duplicates are detected
Scalability:
Use message queues to handle high event rates
Process events asynchronously to avoid blocking
Consider eventual consistency for very large systems
Question: How would you determine the optimal chunk size for partial hashing? What factors should you consider?
Discussion Points:
Factors to consider:
I/O throughput: Larger chunks reduce the number of I/O operations (better for HDDs)
Early differentiation: Larger chunks increase probability of detecting differences
Memory usage: Larger chunks require more memory per file being processed
Disk sector size: Align with disk block size (typically 4KB) for efficient reads
Typical values:
1KB - 4KB: Good for quick initial check with minimal I/O
64KB - 1MB: Better for SSDs where sequential read speed is high
Adaptive: Use larger chunks for large files, smaller chunks for small files
Measurement:
Profile with different chunk sizes on representative data
Measure total runtime and I/O throughput
Consider the distribution of file sizes in your specific use case