Tokenize (Python)

Low-Level DesignCodingOnsitePhoneSoftware EngineerReported Apr, 2026High Frequency

Part 1: Code Reading and Understanding

You are given two functions: tokenize and detokenize. Your task is to:

Read and understand what the code is doing

Identify problems with this tokenization approach

def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
    return tokens

def detokenize(tokens, vocab: dict):
    text = ""
    reversed_vocab = {value: key for key, value in vocab.items()}
    for token in tokens:
        text += reversed_vocab[token]
    return text

Example usage:

vocab = {
    "a": 1,
    "b": 2,
    "cd": 3
}

tokens = tokenize("acdebe", vocab)

result = detokenize(tokens, vocab)

Your Task

Explain what this code does and how it works

Identify what problems exist with this implementation

Discuss when and why this approach would fail

Key Points to Consider

How does the algorithm decide what to tokenize?

What happens if a character in the text is not covered by any key in vocab?

What happens to the key variable if no match is ever found?

Is this a greedy or optimal approach?

Part 2: Code Review

You are now shown a modified version where another engineer attempted to fix the problem:

def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""

    # NEW: Handle remaining unmatched characters
    if key:
        tokens.append(vocab.get("UNK", -1))

    return tokens

vocab = {
    "a": 1,
    "b": 2,
    "cd": 3,
    "UNK": -1  # NEW: Added UNK token
}

The changes made:

Added logic after the for loop to handle the case when key is not empty

Added "UNK": -1 to the vocabulary to represent unknown tokens

Your Task

Perform a code review on this change. Provide comments, questions, and concerns about this implementation.

Points to Consider

What is the purpose of UNK?

What is the expected behavior of this change?

Does this fix actually solve the problem? Why or why not?

What edge cases might still fail?

Are there any conceptual or implementation issues?

Example Edge Cases to Think About

What if the text is "xabc" where "x" is not in vocab?

What if the text literally contains the string "UNK"? How would you distinguish between an unknown character being replaced by UNK vs the literal string "UNK" in the original text?

Does the matching strategy always produce optimal results?

Part 3: Implementation

Now implement your own version of the tokenize function that properly handles characters that cannot be covered by the vocabulary.

Requirements

Your implementation should:

Handle unknown characters — gracefully deal with text that contains characters not in vocab

Maintain correctness — ensure both tokenize and detokenize still work properly

Use UNK token — represent unknown/unmatched characters with a special token (e.g., UNK: -1)

Note on round-trip limitations — be aware that if the original text contains the literal string "UNK", it cannot be distinguished from unknown characters after tokenization. This means perfect round-trip preservation is not always possible with this approach.

Key Challenges

When should you decide that a character sequence cannot be matched?

How do you detect when you're building a key that will never match anything in vocab?

Should you check if any vocab key starts with the current key being built?

How do you balance between trying to match longer sequences vs giving up early?

Note for Candidate

Background context: In natural language processing, tokenization converts text into discrete tokens (words, subwords, or characters) that can be mapped to numerical IDs. A vocabulary (vocab) defines the valid tokens.

UNK token: Stands for "UNKNOWN" - a special token used to represent text that doesn't appear in the vocabulary

Greedy matching: The original algorithm uses a greedy first-match strategy - it takes the first valid match it encounters while building up characters one at a time

You may need to check if the current key is a prefix of any vocab key to determine if you should continue building or give up

Example Solution Approach

One approach is to add an else clause that checks whether continuing to build the current key could ever result in a match:

def tokenize(text: str, vocab: dict):
    tokens = []
    key = ""
    for i in range(len(text)):
        key += text[i]
        if key in vocab:
            tokens.append(vocab[key])
            key = ""
        else:
            # Check if any vocab key starts with the current key
            if not any(k.startswith(key) for k in vocab.keys()):
                # No possible match - treat as unknown
                tokens.append(vocab.get("UNK", -1))
                key = ""

    # Handle any remaining unmatched characters
    if key:
        tokens.append(vocab.get("UNK", -1))

    return tokens

This is just one possible approach - discuss trade-offs and alternatives with your interviewer.

Additional Discussion Topics

Be prepared to discuss these topics verbally:

1. Tokenization Strategies

Question: What are different approaches to tokenization? What are the trade-offs?

Discussion Points:

Character-level: Every character is a token (large vocab, long sequences)

Word-level: Words are tokens (very large vocab, OOV problem)

Subword tokenization: BPE, WordPiece, SentencePiece (balance between vocab size and sequence length)

Greedy vs optimal: Greedy matching vs finding optimal segmentation

Byte-level: UTF-8 bytes as tokens (no UNK needed, but longer sequences)

2. Algorithm Complexity

Question: What is the time complexity of your tokenize implementation? Can it be improved?

Discussion Points:

Current approach: O(n × m × k) where n is text length, m is average vocab key length, k is vocab size

The any(k.startswith(key) for k in vocab.keys()) check is expensive
Optimization: Use a Trie (prefix tree) data structure for O(n × m) complexity

Trade-off: Pre-processing time to build Trie vs repeated tokenization calls

3. Ambiguity and Optimal Tokenization

Question: The current greedy approach might not always produce the optimal tokenization. Can you give an example? How would you find the optimal tokenization?

Discussion Points:

Example: vocab = {"ab": 1, "abc": 2, "c": 3}, text = "abc"

Greedy gives: ["ab", "c"] (2 tokens)

Optimal is: ["abc"] (1 token)

Dynamic Programming: Find segmentation with minimum number of tokens

Viterbi algorithm: Find most likely segmentation with language model scores

Trade-off: Speed vs optimality

4. Handling UNK in Practice

Question: In real NLP systems, how are unknown tokens typically handled?

Discussion Points:

Subword tokenization: Break unknown words into known subwords (reduces UNK frequency)

Character fallback: Use character-level tokens for unknown words

Special handling: Different UNK tokens for numbers, proper nouns, etc.

Training implications: UNK tokens hurt model performance, so minimizing them is important

Byte-pair encoding (BPE): Can represent any text without UNK tokens

Auto-save enabled

Loading editor…

Output

Run your code to see the output here.