Mastering Tokenization Techniques: Advanced Quiz — Questions & Answers

This quiz contains five questions. The reference below lists each question, its answer choices, and the correct answer, so you can review your results after taking the interactive quiz above.

  1. Question 1: Byte-Pair Encoding Nuances

    Which of the following best describes how Byte-Pair Encoding (BPE) tokenization handles out-of-vocabulary words in a new text sample such as 'unhappiness'?

    • A. It recursively splits the word into the largest known subwords in the vocabulary.
    • B. It replaces unknown words with a generic <UNK> token.
    • C. It only splits on whitespace and punctuation.
    • D. It ignores unseen words entirely and skips them.
    • E. It learns character-level embeddings directly for out-of-vocabulary terms.

    Correct answer: A. It recursively splits the word into the largest known subwords in the vocabulary.
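    The splitting behaviour described in the correct answer can be sketched in a few lines of Python. The vocabulary and the `bpe_segment` helper below are illustrative assumptions, not a trained BPE model: a production BPE tokenizer replays its learned merge rules in order rather than matching longest-first, but for in-vocabulary subwords the end result is similar.

    ```python
    # Toy subword vocabulary -- in practice this is learned from a corpus.
    vocab = {"un", "happi", "ness", "happy", "the"}

    def bpe_segment(word, vocab):
        """Greedily split a word into the longest known subwords."""
        tokens, i = [], 0
        while i < len(word):
            # Try the longest remaining substring first, shrinking until a match.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # No known subword: fall back to a single character.
                tokens.append(word[i])
                i += 1
        return tokens

    print(bpe_segment("unhappiness", vocab))  # ['un', 'happi', 'ness']
    ```

    Note how the out-of-vocabulary word 'unhappiness' is never replaced by `<UNK>`; it is decomposed into known pieces, which is exactly why option A is correct and option B is not.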

  2. Question 2: Comparing Subword Methods

    In which scenario would subword-level tokenization offer a clear advantage over pure word-level tokenization?

    • A. When processing a text that contains many rare or morphologically rich words such as 'antidisestablishmentarianism'.
    • B. When dealing exclusively with common stop-words such as 'and', 'the', and 'is'.
    • C. When analyzing texts consisting of only numbers.
    • D. When tokenizing sentences written entirely in a logographic script.
    • E. When performing basic sentence boundary detection.

    Correct answer: A. When processing a text that contains many rare or morphologically rich words such as 'antidisestablishmentarianism'.
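    The advantage in the correct answer is easy to demonstrate side by side. Both toy vocabularies below are assumptions chosen for illustration, and the subword splitter is a simple greedy longest-match sketch rather than a real trained tokenizer.

    ```python
    # Toy vocabularies (assumptions for illustration).
    word_vocab = {"the", "cat", "sat", "on", "mat"}
    subword_vocab = {"anti", "dis", "establish", "ment", "arian", "ism"}

    def word_tokenize(text, vocab):
        # Word-level: any unseen word collapses to a single <UNK> token.
        return [w if w in vocab else "<UNK>" for w in text.split()]

    def subword_tokenize(word, vocab):
        # Greedy longest-match split, falling back to single characters.
        tokens, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                tokens.append(word[i])
                i += 1
        return tokens

    print(word_tokenize("antidisestablishmentarianism", word_vocab))
    # ['<UNK>'] -- all morphological information is lost
    print(subword_tokenize("antidisestablishmentarianism", subword_vocab))
    # ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
    ```

    The word-level model sees one opaque unknown, while the subword model recovers every meaningful morpheme of the rare word.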

  3. Question 3: Whitespace Tokenization Pitfalls

    Why might simple whitespace tokenization fail to accurately segment the phrase 'cannot re-enter the classroom'?

    • A. Because it cannot separate contractions or compound words such as 're-enter' into meaningful tokens.
    • B. Because it merges all tokens into a single sequence.
    • C. Because it removes all punctuation marks from the sequence.
    • D. Because it is only applicable to languages without spaces.
    • E. Because it only works for numbers, not words.

    Correct answer: A. Because it cannot separate contractions or compound words such as 're-enter' into meaningful tokens.
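    The pitfall is visible directly with Python's built-in `str.split`, which breaks only on whitespace; the regex alternative at the end is just one illustrative way to go further, not the only fix.

    ```python
    import re

    phrase = "cannot re-enter the classroom"

    # Naive whitespace tokenization: str.split() only breaks on spaces.
    tokens = phrase.split()
    print(tokens)  # ['cannot', 're-enter', 'the', 'classroom']
    # 'cannot' (can + not) and the hyphenated compound 're-enter' each
    # survive as a single opaque token; nothing below the whitespace
    # level is ever separated.

    # A slightly smarter rule-based pass that also splits on hyphens:
    print(re.findall(r"[A-Za-z]+", phrase))
    # ['cannot', 're', 'enter', 'the', 'classroom']
    ```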

  4. Question 4: Language and Script Sensitivity

    Which tokenization technique is particularly challenged by scripts that lack explicit word boundaries, such as in some East Asian languages?

    • A. Rule-based word tokenization relying on whitespace and punctuation.
    • B. Character-level tokenization.
    • C. Morphological analysis with stemming.
    • D. Frequency-based chunking.
    • E. Embedding-based vectorization.

    Correct answer: A. Rule-based word tokenization relying on whitespace and punctuation.
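    A quick demonstration of why whitespace-and-punctuation rules break down: the Japanese sentence below (an example chosen for illustration, meaning roughly "I am a student") contains no spaces at all, so rule-based splitting returns the whole sentence as a single "token", while character-level tokenization still produces usable units.

    ```python
    # Japanese example sentence with no spaces between words.
    sentence = "私は学生です"

    # Whitespace-based rules find nothing to split on:
    print(sentence.split())  # ['私は学生です'] -- one giant "token"

    # Character-level tokenization still works, at the cost of
    # longer sequences and less meaningful units:
    print(list(sentence))  # ['私', 'は', '学', '生', 'で', 'す']
    ```

    This is why option B (character-level tokenization) is not the challenged technique: it degrades gracefully where whitespace rules fail outright.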

  5. Question 5: Unigram Language Model Tokenization

    Given a corpus, what is the primary optimization goal of Unigram Language Model tokenization when generating its subword vocabulary?

    • A. Maximizing the likelihood of the observed data by selecting the most probable set of subword tokens.
    • B. Minimizing the number of unique characters in the vocabulary.
    • C. Ensuring every possible word appears as an individual token.
    • D. Equalizing token frequency distribution across all tokens.
    • E. Tokenizing each word into fixed-length chunks regardless of frequency.

    Correct answer: A. Maximizing the likelihood of the observed data by selecting the most probable set of subword tokens.
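    The "maximize the likelihood" objective can be made concrete with a small Viterbi-style search over segmentations. The probabilities below are made-up assumptions; a real Unigram LM tokenizer (as in SentencePiece) estimates them with EM and iteratively prunes the vocabulary, but the inference step it runs per word is essentially this search.

    ```python
    import math

    # Hypothetical subword probabilities (a real tokenizer learns these via EM).
    probs = {
        "un": 0.10, "happi": 0.05, "ness": 0.08,
        # Character fallbacks so every string has at least one segmentation:
        "u": 0.01, "n": 0.01, "h": 0.01, "a": 0.01,
        "p": 0.01, "i": 0.01, "e": 0.01, "s": 0.01,
    }

    def best_segmentation(word, probs):
        """Viterbi search: pick the segmentation maximizing total log-probability."""
        n = len(word)
        best = [(-math.inf, [])] * (n + 1)
        best[0] = (0.0, [])
        for i in range(1, n + 1):
            for j in range(i):
                piece = word[j:i]
                if piece in probs and best[j][0] > -math.inf:
                    score = best[j][0] + math.log(probs[piece])
                    if score > best[i][0]:
                        best[i] = (score, best[j][1] + [piece])
        return best[n][1]

    print(best_segmentation("unhappiness", probs))  # ['un', 'happi', 'ness']
    ```

    The three-subword split wins because its summed log-probability beats any character-by-character alternative, which is precisely the "most probable set of subword tokens" criterion in the correct answer.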