Core Concepts of Text Preprocessing & Tokenization in NLP Quiz

Test your understanding of essential NLP preprocessing techniques, including Unicode normalization, case-folding, punctuation and whitespace handling, stopword removal, and word-frequency mapping. This quiz is designed to strengthen your knowledge of foundational steps in preparing text data for natural language processing tasks.

  1. Unicode Normalization Purpose

    In text preprocessing for NLP, what is the main goal of Unicode normalization?

    1. To add extra symbols to every word
    2. To count the number of unique words in a text
    3. To convert visually identical characters into a standard form
    4. To store data in a compressed format

    Explanation: Unicode normalization ensures that characters with different byte representations but similar appearance, like accented characters, are standardized for consistency. Counting unique words is related to frequency mapping but not normalization. Adding extra symbols or compressing data are unrelated to this specific process. The correct answer focuses on textual standardization, which aids downstream NLP tasks.
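
    For illustration, a minimal Python sketch using the standard library's unicodedata module; the ligature example is an assumed input chosen to show two visually identical strings collapsing into one standard form:

    ```python
    import unicodedata

    # "file" written with the single ligature character U+FB01 looks the same
    # as the plain spelling but compares as a different string until normalized.
    ligature = "\ufb01le"   # 'ﬁle'
    plain = "file"

    print(ligature == plain)                                  # False
    print(unicodedata.normalize("NFKC", ligature) == plain)   # True
    ```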

  2. Case-Folding Functionality

    Why do NLP pipelines often use case-folding, such as converting all text to lowercase?

    1. To identify the meaning of capitalized words
    2. To reduce variability caused by uppercase and lowercase letters
    3. To translate text into another language
    4. To remove all special characters from text

    Explanation: Case-folding standardizes text, so identical words like 'Dog' and 'dog' are treated the same, simplifying analysis. Recognizing the meaning of capitalized words is not the main aim of this step. Removing special characters is related to punctuation handling, not case-folding. Translation into another language is unrelated to this process.
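
    A small sketch of case-folding in Python; str.casefold() is a slightly more aggressive variant of str.lower(), and the German example is just an assumed illustration:

    ```python
    # lower() handles ordinary case-folding; casefold() also covers
    # special cases such as the German eszett, which maps to "ss".
    print("Dog".lower() == "dog".lower())                # True
    print("STRASSE".casefold() == "straße".casefold())   # True
    ```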

  3. Handling Punctuation in Tokenization

    When tokenizing text for NLP, why might punctuation be removed?

    1. To increase the number of tokens for better results
    2. To ensure every sentence ends with a period
    3. To identify different languages within the same text
    4. To minimize irrelevant information in word analysis

    Explanation: Removing punctuation can reduce noise, helping focus on the meaningful content of words for tasks like frequency analysis. Increasing token count is not always beneficial if it adds noise. Ensuring sentences end with periods or identifying languages are not primary reasons for punctuation removal.
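
    One common way to strip punctuation in Python, sketched with the standard library's string module and str.translate (the sample sentence is an assumption for illustration):

    ```python
    import string

    text = "Hello, world! Is this... clean?"
    # Map every ASCII punctuation character to None so translate() drops it.
    table = str.maketrans("", "", string.punctuation)
    print(text.translate(table))   # "Hello world Is this clean"
    ```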

  4. Whitespace Normalization

    What does whitespace normalization achieve in text preprocessing?

    1. It removes all words shorter than four letters
    2. It merges consecutive spaces and tidies up line breaks
    3. It adds random spaces for data augmentation
    4. It translates spaces into punctuation marks

    Explanation: Whitespace normalization ensures formatting consistency, like replacing multiple spaces with one and removing stray line breaks, which is important for accurate tokenization. Translating spaces into punctuation or adding random spaces are not typical preprocessing steps. Removing short words is a separate possible filtering process, not whitespace normalization.
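
    A minimal whitespace-normalization sketch using a regular expression; the messy input string is an assumed example:

    ```python
    import re

    messy = "This   text\thas \n\n irregular   spacing."
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    clean = re.sub(r"\s+", " ", messy).strip()
    print(clean)   # "This text has irregular spacing."
    ```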

  5. Stopword Filtering Motivation

    Why is stopword removal commonly used when preparing text for NLP tasks?

    1. To convert numbers into their word forms
    2. To remove very common words that contribute little unique meaning
    3. To count sentences rather than words
    4. To discard all non-English words

    Explanation: Stopwords are frequent words like 'the' or 'and' which often add little semantic value, so removing them can highlight more informative terms. Discarding non-English words or converting numbers is outside the scope of stopword filtering. Counting sentences is unrelated to removing stopwords.
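
    A short stopword-filtering sketch; the stopword set here is a small assumed sample, not a standard list:

    ```python
    STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "on", "is"}

    def remove_stopwords(tokens):
        """Keep only tokens that are not in the stopword set."""
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords(["The", "cat", "sat", "on", "the", "mat"]))
    # ['cat', 'sat', 'mat']
    ```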

  6. Word-Frequency Map Trade-Offs

    Which is a key trade-off when building word-frequency maps (such as dictionaries) during preprocessing?

    1. Translating each word to its synonym
    2. Balancing memory usage with lookup speed
    3. Increasing the number of stopwords detected
    4. Improving punctuation detection accuracy

    Explanation: Storing word counts in frequency maps offers fast access, but uses memory that grows with vocabulary size. Increasing stopword detection or punctuation accuracy relates to other steps. Translating to synonyms is not part of frequency mapping.
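
    A sketch of building a word-frequency map with a plain dictionary, illustrating the trade-off: updates are fast, but the map grows with the vocabulary:

    ```python
    tokens = ["nlp", "text", "nlp", "data", "text", "nlp"]

    # Each distinct word becomes a key, so memory grows with vocabulary size,
    # but incrementing a count is a single average O(1) dictionary operation.
    freqs = {}
    for token in tokens:
        freqs[token] = freqs.get(token, 0) + 1

    print(freqs)   # {'nlp': 3, 'text': 2, 'data': 1}
    ```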

  7. Example of Tokenization

    Given the sentence 'Cats, dogs! Fish?', what is a possible tokenization result after removing punctuation?

    1. ['Cats', 'dogs', 'Fish']
    2. ['Cats dogs Fish']
    3. ['Cats,', 'dogs!', 'Fish?']
    4. ['Cats@', 'dogs#', 'Fish&']

    Explanation: After removing punctuation, each word is separated cleanly, resulting in the tokens 'Cats', 'dogs', and 'Fish'. The second option keeps all three words in a single string instead of splitting them into separate tokens. The third leaves punctuation attached, which this step seeks to avoid. The fourth contains symbols that do not appear in the original sentence.
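
    One possible way to produce that result in Python, sketched with a simple regular expression (other tokenizers would work too):

    ```python
    import re

    sentence = "Cats, dogs! Fish?"
    # \w+ keeps runs of letters, digits, and underscores and drops punctuation.
    tokens = re.findall(r"\w+", sentence)
    print(tokens)   # ['Cats', 'dogs', 'Fish']
    ```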

  8. Normalization and Diacritics

    How does Unicode normalization help with diacritics in words such as 'café'?

    1. It converts words into lemmatized forms
    2. It reverses the word order in a sentence
    3. It ensures words with different accent styles are treated identically
    4. It removes all accented letters from the text

    Explanation: Unicode normalization ensures that words with the same base letters but different encodings for accents (diacritics) are processed in a standardized way. Removing all accented letters would strip meaning from words. Reversing word order and lemmatization are unrelated processes.
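
    A small sketch showing the two common encodings of 'café' and how normalization makes them compare as equal:

    ```python
    import unicodedata

    nfc = "caf\u00e9"    # 'café' with the precomposed é (U+00E9)
    nfd = "cafe\u0301"   # 'café' as 'e' plus a combining acute accent (U+0301)

    print(nfc == nfd)                                 # False
    print(len(nfc), len(nfd))                         # 4 5
    print(unicodedata.normalize("NFC", nfd) == nfc)   # True
    ```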

  9. Case-Folding Side Effect

    What is one possible downside of using case-folding during preprocessing?

    1. All special characters are automatically deleted
    2. Token boundaries become inconsistent
    3. The language of the text is changed
    4. Important distinctions such as proper nouns may be lost

    Explanation: Converting all text to lowercase removes cues like capitalization, which could be important for identifying names or sentence starts. It does not delete special characters, which is a different process. Token boundaries are not directly affected. Language does not change due to case-folding.
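
    A tiny illustration of the information loss; the example sentences are assumptions:

    ```python
    headlines = ["Apple unveils a new laptop", "I baked an apple pie"]

    # After lowercasing, the company name and the fruit look identical,
    # so capitalization cues for proper nouns are gone.
    print([h.lower() for h in headlines])
    # ['apple unveils a new laptop', 'i baked an apple pie']
    ```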

  10. Stopword List Customization

    Why might a stopword list need to be customized for specific NLP applications?

    1. Stopwords always include technical jargon
    2. All stopwords are the same in every language
    3. Adding stopwords increases processing speed
    4. Certain words may be meaningful in some contexts but not others

    Explanation: Depending on the domain, some common words could carry specific meaning and should not be removed as stopwords. Stopword lists differ by language and context. Simply adding stopwords doesn't necessarily speed up processing. Stopword lists generally exclude jargon, except in specialized cases.
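
    A sketch of customizing a stopword list; the default set and the sentiment-analysis use case are assumed examples:

    ```python
    DEFAULT_STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "not", "no"}

    # For sentiment analysis, negations like "not" and "no" carry real meaning,
    # so a customized list keeps them instead of discarding them.
    sentiment_stopwords = DEFAULT_STOPWORDS - {"not", "no"}

    tokens = ["the", "movie", "was", "not", "good"]
    print([t for t in tokens if t not in sentiment_stopwords])
    # ['movie', 'was', 'not', 'good']
    ```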

  11. Whitespace Impact on Tokenization

    How can inconsistent whitespace in a text file affect tokenization?

    1. It converts numbers to words automatically
    2. It ensures all tokens are proper nouns
    3. It adds new vocabulary to the frequency map
    4. It can cause incorrect splitting of words or extra empty tokens

    Explanation: Irregular spaces or line breaks can split words incorrectly or introduce empty tokens, lowering the quality of tokenization. Whitespace alone does not add new vocabulary to the frequency map, and converting numbers to words or guaranteeing that tokens are proper nouns are matters for other processes entirely.
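
    A quick sketch of the difference between naive and whitespace-aware splitting on a messy string (the input is an assumed example):

    ```python
    messy = "cats  and\n\ndogs"

    # Splitting on a single space keeps empty strings where extra spaces were.
    print(messy.split(" "))   # ['cats', '', 'and\n\ndogs']

    # split() with no argument splits on any whitespace run and drops empties.
    print(messy.split())      # ['cats', 'and', 'dogs']
    ```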

  12. Dictionary vs. List for Word Frequencies

    Why is a dictionary (hash map) usually preferred over a list for storing word frequencies?

    1. Because it orders words alphabetically by default
    2. Because it consumes less memory for large vocabularies
    3. Because it provides faster lookup and update times
    4. Because it automatically removes stopwords

    Explanation: Dictionaries (hash maps) give quick access to counts for each word, making frequency updates efficient. They do not order words, remove stopwords, or always use less memory than lists; in fact, they can use more memory but offer much faster operations.
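
    A sketch contrasting the two structures for looking up a word's count; the generated vocabulary is an assumption for illustration:

    ```python
    vocabulary = [("word%d" % i, i) for i in range(50000)]   # list of (word, count)
    counts = dict(vocabulary)                                # same data as a dict

    target = "word49999"

    # List lookup scans entries one by one: O(n) per query.
    list_count = next(c for w, c in vocabulary if w == target)

    # Dictionary lookup hashes the key directly: O(1) on average.
    dict_count = counts[target]

    print(list_count, dict_count)   # 49999 49999
    ```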

  13. Basic Stopword Example

    In the sentence 'She and her friend went to the park', which words are most likely to be removed as stopwords?

    1. 'and', 'friend', 'her'
    2. 'went', 'friend', 'the'
    3. 'She', 'friend', 'park'
    4. 'and', 'her', 'to', 'the'

    Explanation: Common English words like 'and', 'her', 'to', and 'the' are typical stopwords because they don't add much unique meaning. 'She', 'friend', and 'park' are more content-specific and usually kept. 'Went' is a content-bearing verb and is normally kept as well rather than removed as a stopword.
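
    Applying a stopword set matching the correct option to the question's sentence, as a quick check:

    ```python
    STOPWORDS = {"and", "her", "to", "the"}

    tokens = "She and her friend went to the park".split()
    print([t for t in tokens if t.lower() not in STOPWORDS])
    # ['She', 'friend', 'went', 'park']
    ```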

  14. Frequency Map Example

    If the word 'apple' appears three times and 'banana' once in a text, what would the frequency map look like?

    1. {'apple': 'banana', 3: 1}
    2. {'apple': 1, 'banana': 3}
    3. {'apple': 3, 'banana': 1}
    4. {'apple', 'banana'}

    Explanation: In a frequency map, each word is a key and its count is the value, so 'apple' maps to 3 and 'banana' to 1. The other options either reverse the counts, mix keys and values inconsistently, or list only the words without any counts.
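
    The same mapping built with collections.Counter, assuming a token list in which 'apple' appears three times and 'banana' once:

    ```python
    from collections import Counter

    tokens = ["apple", "banana", "apple", "apple"]
    print(dict(Counter(tokens)))   # {'apple': 3, 'banana': 1}
    ```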

  15. Splitting on Whitespace Only

    What is a limitation of simple whitespace-based tokenization?

    1. It removes out-of-vocabulary words automatically
    2. It increases storage efficiency
    3. It does not handle punctuation adjoining words, such as 'hello!'
    4. It accurately detects named entities

    Explanation: Whitespace tokenization splits text at spaces and ignores punctuation, so words like 'hello!' will include the exclamation mark, possibly reducing analysis accuracy. It doesn't impact storage efficiency, out-of-vocabulary removal, or entity detection.
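
    A quick sketch of the limitation; the sample sentence is an assumed example:

    ```python
    sentence = "Well, hello! How are you?"

    # Plain whitespace splitting keeps punctuation glued to the words.
    print(sentence.split())
    # ['Well,', 'hello!', 'How', 'are', 'you?']
    ```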

  16. Tokenization Output Type

    After a typical word-level tokenization process, what type of data structure is commonly produced?

    1. A set of unique token lengths
    2. A single large string containing all tokens
    3. A numerical array representing token positions
    4. A list of strings, where each string is a token

    Explanation: Tokenization results in a list of individual word strings, making further text analysis straightforward. It does not create one concatenated string, a numeric position array, or a set of token lengths.