Test your understanding of essential NLP preprocessing techniques, including Unicode normalization, case-folding, punctuation and whitespace handling, stopword removal, and word-frequency mapping. This quiz is designed to strengthen your knowledge of foundational steps in preparing text data for natural language processing tasks.
In text preprocessing for NLP, what is the main goal of Unicode normalization?
Explanation: Unicode normalization ensures that characters with different byte representations but similar appearance, like accented characters, are standardized for consistency. Counting unique words is related to frequency mapping, not normalization. Adding extra symbols or compressing data is unrelated to this specific process. The correct answer focuses on textual standardization, which aids downstream NLP tasks.
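As a rough illustration, here is a minimal sketch using Python's standard unicodedata module (the choice of library and of the NFKC form is an assumption; the quiz does not prescribe one):

```python
import unicodedata

def normalize_text(text: str) -> str:
    # NFKC maps compatibility characters (ligatures, full-width forms, etc.)
    # and composed/decomposed accent sequences onto one canonical representation.
    return unicodedata.normalize("NFKC", text)

print(normalize_text("ﬁle"))        # 'file' -- the 'fi' ligature becomes two letters
print(normalize_text("ｈｅｌｌｏ"))  # 'hello' -- full-width letters become ASCII
```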
Why do NLP pipelines often use case-folding, such as converting all text to lowercase?
Explanation: Case-folding standardizes text, so identical words like 'Dog' and 'dog' are treated the same, simplifying analysis. Recognizing the meaning of capitalized words is not the main aim of this step. Removing special characters is related to punctuation handling, not case-folding. Translation into another language is unrelated to this process.
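A minimal Python sketch of case-folding; str.lower() covers the common case, while str.casefold() is slightly more aggressive for some non-English characters:

```python
tokens = ["Dog", "dog", "DOG", "Straße"]

# lower() handles typical English text; casefold() also folds characters such
# as the German sharp s ('ß' -> 'ss'), which matters for non-English input.
print([t.lower() for t in tokens])     # ['dog', 'dog', 'dog', 'straße']
print([t.casefold() for t in tokens])  # ['dog', 'dog', 'dog', 'strasse']
```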
When tokenizing text for NLP, why might punctuation be removed?
Explanation: Removing punctuation can reduce noise, helping focus on the meaningful content of words for tasks like frequency analysis. Increasing token count is not always beneficial if it adds noise. Ensuring sentences end with periods or identifying languages are not primary reasons for punctuation removal.
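One possible way to strip punctuation, sketched with Python's standard library (other approaches, such as regex-based tokenizers, are equally valid):

```python
import string

def strip_punctuation(text: str) -> str:
    # str.translate with a deletion table removes every ASCII punctuation
    # character in a single pass over the string.
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_punctuation("Hello, world! (Testing...)"))  # 'Hello world Testing'
```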
What does whitespace normalization achieve in text preprocessing?
Explanation: Whitespace normalization ensures formatting consistency, like replacing multiple spaces with one and removing stray line breaks, which is important for accurate tokenization. Translating spaces into punctuation or adding random spaces are not typical preprocessing steps. Removing short words is a separate possible filtering process, not whitespace normalization.
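A small sketch of whitespace normalization using a regular expression (the exact rule is an assumption; pipelines vary):

```python
import re

def normalize_whitespace(text: str) -> str:
    # Collapse any run of spaces, tabs, or newlines into a single space
    # and trim leading/trailing whitespace.
    return re.sub(r"\s+", " ", text).strip()

messy = "too   many    spaces\n\nand stray\tline breaks "
print(normalize_whitespace(messy))  # 'too many spaces and stray line breaks'
```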
Why is stopword removal commonly used when preparing text for NLP tasks?
Explanation: Stopwords are frequent words like 'the' or 'and', which often add little semantic value, so removing them can highlight more informative terms. Discarding non-English words or converting numbers is outside the scope of stopword filtering. Counting sentences is unrelated to removing stopwords.
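A minimal sketch of stopword filtering; the tiny stopword set here is illustrative, and real pipelines usually load a fuller list from a library such as NLTK or spaCy:

```python
# A small illustrative stopword set, not a complete one.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the cat sat on the mat".split()))
# ['cat', 'sat', 'on', 'mat']
```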
Which is a key trade-off when building word-frequency maps (such as dictionaries) during preprocessing?
Explanation: Storing word counts in frequency maps offers fast access, but uses memory that grows with vocabulary size. Increasing stopword detection or punctuation accuracy relates to other steps. Translating to synonyms is not part of frequency mapping.
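A short sketch of the trade-off using Python's collections.Counter (a dict subclass), where lookups are fast but every distinct word occupies memory:

```python
from collections import Counter

tokens = "to be or not to be".split()

# Counter gives O(1) average-time lookups and updates, but the structure
# grows with the number of distinct words it has to hold.
freq = Counter(tokens)
print(freq)       # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
print(len(freq))  # 4 distinct keys kept in memory
```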
Given the sentence 'Cats, dogs! Fish?', what is a possible tokenization result after removing punctuation?
Explanation: After removing punctuation, each word is separated cleanly, resulting in tokens like 'Cats', 'dogs', and 'Fish'. The second option leaves punctuation attached, which tokenization seeks to avoid. The third option contains symbols that are not present in the original sentence. The fourth does not split into separate tokens but combines the words into one string.
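One way to reproduce that result, sketched with a simple regular expression (the pattern choice is an assumption; many tokenizers would behave similarly):

```python
import re

sentence = "Cats, dogs! Fish?"

# \w+ matches runs of word characters, so punctuation is dropped and each
# word becomes its own token.
tokens = re.findall(r"\w+", sentence)
print(tokens)  # ['Cats', 'dogs', 'Fish']
```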
How does Unicode normalization help with diacritics in words such as 'café'?
Explanation: Unicode normalization ensures that words with the same base letters but different encodings for accents (diacritics) are processed in a standardized way. Removing all accented letters would strip meaning from words. Reversing word order and lemmatization are unrelated processes.
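A small sketch of the idea: the composed and decomposed spellings of 'café' compare unequal until they are normalized to the same form:

```python
import unicodedata

composed = "caf\u00e9"     # 'café' with a precomposed 'é'
decomposed = "cafe\u0301"  # 'café' as 'e' + combining acute accent

print(composed == decomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True after normalization
```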
What is one possible downside of using case-folding during preprocessing?
Explanation: Converting all text to lowercase removes cues like capitalization, which could be important for identifying names or sentence starts. It does not delete special characters, which is a different process. Token boundaries are not directly affected. Language does not change due to case-folding.
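A brief sketch of the information loss; the example words are illustrative:

```python
tokens = ["Apple", "released", "a", "new", "apple", "pie", "recipe"]

folded = [t.lower() for t in tokens]
print(folded)
# ['apple', 'released', 'a', 'new', 'apple', 'pie', 'recipe']
# The company name 'Apple' and the fruit 'apple' are now identical tokens,
# so a later named-entity step loses the capitalization cue.
```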
Why might a stopword list need to be customized for specific NLP applications?
Explanation: Depending on the domain, some common words could carry specific meaning and should not be removed as stopwords. Stopword lists differ by language and context. Simply adding stopwords doesn't necessarily speed up processing. Stopword lists generally exclude jargon, except in specialized cases.
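A sketch of customizing a base list for a hypothetical sentiment or clinical task; the specific words kept or added are illustrative assumptions:

```python
BASE_STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "not"}

# Keep negations because they matter for sentiment, and drop a word that is
# uninformative filler in this hypothetical clinical domain.
custom_stopwords = (BASE_STOPWORDS - {"not"}) | {"patient"}

tokens = "the patient is not happy".split()
print([t for t in tokens if t not in custom_stopwords])
# ['not', 'happy']
```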
How can inconsistent whitespace in a text file affect tokenization?
Explanation: Irregular spaces or line breaks can split words incorrectly or add empty tokens, lowering the quality of tokenization. Vocabulary size isn't increased by whitespace alone. Converting numbers to words and changing token types are handled by other processes.
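A short sketch contrasting a naive single-space split with whitespace-aware splitting:

```python
raw = "word1  word2\n\nword3"

# Splitting on a single space keeps empty strings and merged tokens wherever
# spacing is irregular.
print(raw.split(" "))  # ['word1', '', 'word2\n\nword3']

# Splitting on any whitespace run avoids the empty and merged tokens.
print(raw.split())     # ['word1', 'word2', 'word3']
```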
Why is a dictionary (hash map) usually preferred over a list for storing word frequencies?
Explanation: Dictionaries (hash maps) give quick access to counts for each word, making frequency updates efficient. They do not order words, remove stopwords, or always use less memory than lists; in fact, they can use more memory but offer much faster operations.
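A minimal sketch contrasting the two structures; the dictionary update is a single hashed lookup, while the list version must scan existing entries:

```python
# Dictionary: the word is the key, so each update is one hash lookup.
counts = {}
for word in ["cat", "dog", "cat"]:
    counts[word] = counts.get(word, 0) + 1   # O(1) average per update

# List of (word, count) pairs: each update scans the list for the word.
pairs = []
for word in ["cat", "dog", "cat"]:
    for i, (w, c) in enumerate(pairs):
        if w == word:
            pairs[i] = (w, c + 1)
            break
    else:
        pairs.append((word, 1))              # O(n) scan per update

print(counts)  # {'cat': 2, 'dog': 1}
print(pairs)   # [('cat', 2), ('dog', 1)]
```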
In the sentence 'She and her friend went to the park', which words are most likely to be removed as stopwords?
Explanation: Common English words like 'and', 'her', 'to', and 'the' are typical stopwords because they don't add much unique meaning. 'She', 'friend', and 'park' are more content-specific and usually kept. 'Went' is a verb, not a stopword.
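A quick sketch that filters exactly the stopwords named above (a fuller stopword list might remove additional words):

```python
STOPWORDS = {"and", "her", "to", "the"}   # the stopwords named in the explanation

tokens = "She and her friend went to the park".split()
print([t for t in tokens if t.lower() not in STOPWORDS])
# ['She', 'friend', 'went', 'park']
```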
If the word 'apple' appears three times and 'banana' once in a text, what would the frequency map look like?
Explanation: In a frequency map, each word is a key and its count is the value, so 'apple' maps to 3 and 'banana' to 1. The other options misrepresent the keys or values, list only words, reverse counts, or mix types.
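A small sketch that builds that frequency map by hand, incrementing a count for each occurrence:

```python
text = "apple banana apple apple"

freq = {}
for word in text.split():
    # Increment the running count, creating the key the first time a word appears.
    freq[word] = freq.get(word, 0) + 1

print(freq)  # {'apple': 3, 'banana': 1}
```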
What is a limitation of simple whitespace-based tokenization?
Explanation: Whitespace tokenization splits text at spaces and ignores punctuation, so words like 'hello!' will include the exclamation mark, possibly reducing analysis accuracy. It doesn't impact storage efficiency, out-of-vocabulary removal, or entity detection.
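A brief sketch of the limitation: splitting only on whitespace keeps punctuation attached to the tokens:

```python
sentence = "hello! how are you?"

# Whitespace splitting leaves punctuation glued to the neighboring words.
print(sentence.split())     # ['hello!', 'how', 'are', 'you?']

# So 'hello!' and 'hello' would be counted as different tokens.
print("hello!" == "hello")  # False
```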
After a typical word-level tokenization process, what type of data structure is commonly produced?
Explanation: Tokenization results in a list of individual word strings, making further text analysis straightforward. It does not create one concatenated string, a numeric position array, or a set of token lengths.
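A minimal sketch showing the resulting structure, a plain Python list of word strings:

```python
tokens = "Natural language processing is fun".split()

print(tokens)        # ['Natural', 'language', 'processing', 'is', 'fun']
print(type(tokens))  # <class 'list'> -- a list of word strings, not one string
```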