Byte-Pair Encoding Nuances
Which of the following best describes how Byte-Pair Encoding (BPE) tokenization handles an out-of-vocabulary word such as 'unhappiness' when it appears in a new text sample?
- A. It recursively splits the word into the largest known subwords in the vocabulary.
- B. It replaces unknown words with a generic <UNK> token.
- C. It only splits on whitespace and punctuation.
- D. It ignores unseen words entirely and skips them.
- E. It learns character-level embeddings directly for out-of-vocabulary terms.
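A minimal sketch of the behavior in question: the toy vocabulary and the greedy longest-match strategy below are illustrative stand-ins for a tokenizer's learned merge rules, but they show how an unseen word like 'unhappiness' decomposes into known subwords instead of becoming an unknown token.

```python
# Illustrative greedy longest-match splitter over a toy subword vocabulary.
# Real BPE applies its learned merge operations, but the effect on an unseen
# word is similar: it is decomposed into subwords already in the vocabulary.
TOY_VOCAB = {"un", "happi", "ness", "happ", "u", "n", "h", "a", "p", "i", "e", "s"}

def split_into_subwords(word, vocab):
    """Split `word` into the longest subwords found in `vocab`, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character (byte-level BPE never needs <UNK>).
            pieces.append(word[i])
            i += 1
    return pieces

print(split_into_subwords("unhappiness", TOY_VOCAB))
# ['un', 'happi', 'ness']
```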
Comparing Subword Methods
In which scenario would subword-level tokenization offer a clear advantage over pure word-level tokenization?
- A. When processing a text that contains many rare or morphologically rich words such as 'antidisestablishmentarianism'.
- B. When dealing exclusively with common stop-words such as 'and', 'the', and 'is'.
- C. When analyzing texts consisting of only numbers.
- D. When tokenizing sentences written entirely in a logographic script.
- E. When performing basic sentence boundary detection.
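To make the trade-off concrete, here is a toy comparison; both vocabularies are invented for illustration. A word-level vocabulary needs one entry per surface form and maps anything unseen to <UNK>, while a small subword inventory can still cover a rare, morphologically complex word.

```python
# Invented vocabularies for illustration only.
WORD_VOCAB = {"the", "debate", "over", "was", "heated"}
MORPHEME_VOCAB = {"anti", "dis", "establish", "ment", "arian", "ism"}

def greedy_split(word, vocab):
    """Greedy longest-match split; unknown spans fall back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

sentence = "the debate over antidisestablishmentarianism was heated".split()

# Word-level: the rare word collapses into an uninformative <UNK>.
print([w if w in WORD_VOCAB else "<UNK>" for w in sentence])
# ['the', 'debate', 'over', '<UNK>', 'was', 'heated']

# Subword-level: the same word decomposes into known morphemes.
print(greedy_split("antidisestablishmentarianism", MORPHEME_VOCAB))
# ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```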
Whitespace Tokenization Pitfalls
Why might simple whitespace tokenization fail to accurately segment the phrase 'cannot re-enter the classroom'?
- A. Because it cannot separate fused or hyphenated forms such as 'cannot' and 're-enter' into meaningful tokens.
- B. Because it merges all tokens into a single sequence.
- C. Because it removes all punctuation marks from the sequence.
- D. Because it is only applicable to languages without spaces.
- E. Because it only works for numbers, not words.
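A quick illustration of the pitfall; the regex and the fused-form table below are ad-hoc examples, not a production tokenizer. Plain whitespace splitting leaves 'cannot' and 're-enter' as single opaque tokens, while even a simple rule-based pass separates them.

```python
import re

phrase = "cannot re-enter the classroom"

# Whitespace splitting keeps 'cannot' and 're-enter' as single opaque tokens.
print(phrase.split())
# ['cannot', 're-enter', 'the', 'classroom']

# An illustrative rule-based pass: split on whitespace and hyphens, then expand
# known fused forms such as 'cannot' into 'can' + 'not'.
FUSED = {"cannot": ["can", "not"]}

tokens = []
for tok in re.split(r"[\s\-]+", phrase):
    tokens.extend(FUSED.get(tok, [tok]))
print(tokens)
# ['can', 'not', 're', 'enter', 'the', 'classroom']
```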
Language and Script Sensitivity
Which tokenization technique is particularly challenged by scripts that lack explicit word boundaries, such as those used for some East Asian languages?
- A. Rule-based word tokenization relying on whitespace and punctuation.
- B. Character-level tokenization.
- C. Morphological analysis with stemming.
- D. Frequency-based chunking.
- E. Embedding-based vectorization.
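A small demonstration with a Japanese sentence (roughly "I read a book at the library"): whitespace- and punctuation-driven rules find no boundaries at all, whereas character-level tokenization still produces usable units, albeit without any notion of words.

```python
sentence = "私は図書館で本を読んだ"   # no spaces between words

# Whitespace/punctuation rules find nothing to split on: one giant "token".
print(sentence.split())
# ['私は図書館で本を読んだ']

# Character-level tokenization still works, at the cost of longer sequences;
# proper word segmentation needs a dictionary or a trained model.
print(list(sentence))
# ['私', 'は', '図', '書', '館', 'で', '本', 'を', '読', 'ん', 'だ']
```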
Unigram Language Model Tokenization
Given a corpus, what is the primary optimization goal of Unigram Language Model tokenization when generating its subword vocabulary?
- A. Maximizing the likelihood of the observed data by selecting the most probable set of subword tokens.
- B. Minimizing the number of unique characters in the vocabulary.
- C. Ensuring every possible word appears as an individual token.
- D. Equalizing token frequency distribution across all tokens.
- E. Tokenizing each word into fixed-length chunks regardless of frequency.
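A compact sketch of the objective, with invented token probabilities: the real algorithm re-estimates these probabilities with EM and prunes low-contribution tokens so that the surviving vocabulary maximizes the corpus likelihood. At inference time, each word is segmented into the subword sequence with the highest unigram likelihood, found here by a small Viterbi search.

```python
import math

# Toy unigram probabilities, invented for illustration.
PROBS = {"un": 0.10, "happi": 0.05, "ness": 0.08,
         "u": 0.01, "n": 0.02, "h": 0.01, "a": 0.02, "p": 0.01,
         "i": 0.02, "e": 0.02, "s": 0.02}

def best_segmentation(word, probs):
    """Viterbi search for the segmentation with the highest unigram likelihood."""
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)   # best[k] = (log-prob, split point) for word[:k]
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces)), best[n][0]

print(best_segmentation("unhappiness", PROBS))
# (['un', 'happi', 'ness'], -7.82...)
```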