Back

Deduplicate Files

Low-Level DesignCodingOnsitePhoneMachine Learning EngineerSoftware EngineerReported May, 2026High Frequency

Basic Problem

Given a root folder/directory, find all duplicate files within it. Return groups of file paths where each group contains files with identical content.

Example

For a directory structure like:

root/

├── a/

│ ├── f1.mp4

│ └── f2.mp4

└── b/

    ├── f3.mp4
    └── tmp/
        └── f4.mp4

Expected output format:

[

    ["a/f1.mp4", "a/f2.mp4"],           # Group 1: identical content
    ["b/f3.mp4", "b/tmp/f4.mp4"]        # Group 2: identical content
]

Requirements

Traverse all files in the directory tree starting from the root folder

Group files by content — files with identical content should be in the same group

Return only duplicates — files that appear alone (no duplicates) should not be included in the output

Handle nested directories — recursively search through subdirectories

Basic Approach

A straightforward solution:

Use os.walk() to traverse the directory tree

Compute a hash (e.g., MD5, SHA-256) of each file's full content

Group files by their hash value

Return groups with more than one file

Interface

def find_duplicate_files(root_path: str) -> List[List[str]]:
    """
    Find all duplicate files in the directory tree.

    Args:
        root_path: Root directory to search

    Returns:
        List of groups, where each group contains paths to duplicate files
    """
    pass

Note for Candidate

You will need to manually create test files in the interview environment (e.g., Google Colab)

Consider edge cases: empty files, symbolic links, file permissions

Think about the time complexity: it depends on the total size of all files

Optimized Approach

The basic approach of hashing every file's complete content can be slow for large files. Consider a multi-stage filtering approach to avoid unnecessary full hashing:

Three-Stage Filtering

Stage 1 - File Size (Metadata):

Group files by size first

Files with different sizes cannot be duplicates

This is very fast (just metadata lookup, no I/O)

Stage 2 - Partial Hash (First Chunk):

For files with the same size, read and hash only the first chunk (e.g., 1024 bytes)

Files with different partial hashes cannot be duplicates

Much faster than reading entire large files

Stage 3 - Full Hash:

Only when size and partial hash match, compute the full content hash

This ensures correctness while minimizing expensive operations

Time Complexity Analysis

Best case: Most files have unique sizes → O(number of files) for metadata only
Worst case: All files have the same size and same initial content → O(total file size) for full hashing

Average case: Significantly better than naive approach due to early filtering

Example Implementation Structure

def find_duplicate_files_optimized(root_path: str) -> List[List[str]]:
    # Stage 1: Group by file size
    size_groups = group_by_file_size(root_path)
    # Result: {size1: [file1, file2, ...], size2: [file3, ...], ...}

    # Stage 2: For each size group, compute partial hash
    partial_hash_groups = {}
    for files in size_groups.values():
        if len(files) > 1:
            group_by_partial_hash(files, partial_hash_groups)
    # Result: {partial_hash1: [file1, file2, ...], partial_hash2: [file3, ...], ...}
    # Note: Stage 3 will handle any rare hash collisions across different sizes

    # Stage 3: For each partial hash group, compute full hash
    full_hash_groups = {}
    for files in partial_hash_groups.values():
        if len(files) > 1:
            group_by_full_hash(files, full_hash_groups)
    # Result: {full_hash1: [file1, file2, ...], full_hash2: [file3, ...], ...}

    # Return only groups with duplicates
    return [files for files in full_hash_groups.values() if len(files) > 1]

Additional Discussion Topics

The following questions are for verbal discussion. The interviewer does not expect you to code solutions, but be prepared to explain your approach and reasoning.

1. I/O Bound vs CPU Bound

Question: Is this duplicate file detection program I/O bound or CPU bound? How would you determine this?

Discussion Points:

Determining the bottleneck:

Use profiling tools to monitor CPU usage and I/O wait time

If CPU is mostly idle while the program runs → I/O bound (waiting for disk reads)

If CPU is consistently busy (high utilization) → CPU bound (hash computation is the bottleneck)

For typical file deduplication:

Usually I/O bound because reading files from disk is slower than computing hashes

Exception: With modern SSDs and very fast I/O, especially with small files, it may become more CPU bound

With SSDs, the boundary between I/O bound and CPU bound becomes less clear

Optimization strategies:

If I/O bound: Use asynchronous I/O, read multiple files in parallel, optimize chunk size

If CPU bound: Use faster hash algorithms (e.g., xxHash instead of SHA-256), parallelize hash computation across CPU cores

2. Hash Function Selection

Question: Which hash algorithm would you choose and why? What are the trade-offs?

Discussion Points:

Common hash algorithms:

MD5: Fast but cryptographically broken; acceptable collision rate for file deduplication

SHA-256: Slower but very low collision probability; cryptographically secure

xxHash / CityHash: Extremely fast non-cryptographic hashes; good for deduplication

Trade-offs:

Speed vs Collision Rate: Faster hashes (MD5, xxHash) have higher collision rates than slower hashes (SHA-256)

Cryptographic vs Non-cryptographic: For file deduplication, cryptographic security is not required

No perfect hash: Every hash function has potential collisions; you can always construct adversarial cases

Recommendation:

For most cases: SHA-256 provides excellent collision resistance with acceptable speed

For very large-scale systems prioritizing speed: xxHash or CityHash with additional verification

After hash matching, consider byte-by-byte comparison for absolute certainty (if critical)

Important Note on Hash Collisions:

Theoretically: Hash collisions exist (pigeonhole principle - infinite inputs map to finite output space)

Practically with SHA-256: Collision probability is ~2^-128, which is astronomically small (never observed in practice)

Trade-off: Byte-by-byte verification after hash match provides 100% certainty but doubles I/O cost

Recommendation: For SHA-256, the collision risk is negligible; verification is overkill for most systems

3. Distributed System Implementation

Question: How would you design this system to work across multiple machines with massive numbers of files?

Discussion Points:

MapReduce Approach:

Map Phase:

Each mapper processes a subset of files

Emit key-value pairs: (file_hash, file_path) or (file_size, file_path)

Shuffle Phase:

Group files with the same hash/size together

Files with similar characteristics are shuffled to the same reducer

Reduce Phase:

Each reducer receives files with the same hash

Perform final verification and output duplicate groups

Partitioning Strategy:

Partition by file size first: Route files of similar size to the same machine

Then perform hash computation and grouping on each machine

This minimizes cross-machine communication

Distributed Hash Table (DHT):

Use a distributed key-value store (e.g., Redis, Cassandra)

Store hash → [file_paths] mapping across multiple nodes

Use consistent hashing for load balancing

Considerations:

Network bandwidth: Avoid sending file content across network; compute hashes locally

Load balancing: Ensure even distribution of work across machines

Fault tolerance: Handle machine failures and retries

4. Continuous Monitoring System

Question: How would you design a system that continuously monitors a directory for duplicate files and notifies users when duplicates are added or removed?

Discussion Points:

System Design:

File System Watcher:

Use OS-level file system events (e.g., inotify on Linux, FSEvents on macOS)

Monitor for file creation, modification, and deletion events

Database Schema:

Table 1: hash_to_files — maps file hash to list of file paths (for duplicate detection)
Table 2: file_to_hash — maps file path to its hash (for handling deletions)

Event Handling:

File Added: Compute hash, check if hash exists in database

If exists → duplicate detected, notify file owner

Add to both tables

File Deleted: Look up hash from file_to_hash, remove from both tables

If this was part of a duplicate group, notify other owners

Notification System:

Queue-based notification system (e.g., RabbitMQ, Kafka)

Send notifications to file owners when duplicates are detected

Scalability:

Use message queues to handle high event rates

Process events asynchronously to avoid blocking

Consider eventual consistency for very large systems

5. Chunk Size Selection

Question: How would you determine the optimal chunk size for partial hashing? What factors should you consider?

Discussion Points:

Factors to consider:

I/O throughput: Larger chunks reduce the number of I/O operations (better for HDDs)

Early differentiation: Larger chunks increase probability of detecting differences

Memory usage: Larger chunks require more memory per file being processed

Disk sector size: Align with disk block size (typically 4KB) for efficient reads

Typical values:

1KB - 4KB: Good for quick initial check with minimal I/O

64KB - 1MB: Better for SSDs where sequential read speed is high

Adaptive: Use larger chunks for large files, smaller chunks for small files

Measurement:

Profile with different chunk sizes on representative data

Measure total runtime and I/O throughput

Consider the distribution of file sizes in your specific use case


Auto-save enabled
Loading editor…
Output
Run your code to see the output here.