Understand the essential concepts and foundational techniques for anyone starting with natural language processing, including text pre-processing, feature extraction, and classic NLP tasks.
Which process involves breaking down sentences into smaller units such as words or phrases in NLP?
Explanation: Tokenization divides text into words or phrases, which are called tokens, making further analysis possible. Vectorization converts text into numbers, not chunks. Parsing refers to analyzing grammar structure. Clustering is for grouping similar items, not breaking them down.
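The splitting described above can be sketched with a minimal regex-based tokenizer; real tokenizers (e.g. in NLTK or spaCy) handle contractions, punctuation, and language-specific rules far more carefully, so treat this as an illustration only.

```python
import re

def tokenize(text):
    # Lowercase, then pull out runs of word characters as tokens.
    # Punctuation is discarded; this is the simplest possible scheme.
    return re.findall(r"\w+", text.lower())

tokens = tokenize("NLP breaks sentences into tokens.")
# tokens -> ['nlp', 'breaks', 'sentences', 'into', 'tokens']
```

Each resulting token can then feed downstream steps such as vectorization or tagging.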
What is a key difference between stemming and lemmatization in natural language preprocessing?
Explanation: Lemmatization returns the base or dictionary form using context and grammar, while stemming simply chops word endings and may not produce actual words. N-grams are contiguous sequences of tokens and are unrelated to stemming or lemmatization. Stemming is usually faster but less accurate, so option D is incorrect.
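The contrast can be illustrated with a toy suffix-stripping stemmer next to a dictionary-based lemma lookup. Both the suffix list and `LEMMA_DICT` are made-up examples, not a real stemming algorithm or lemmatizer (a production system would use something like the Porter stemmer or WordNet lemmatization).

```python
def crude_stem(word):
    # Chop common endings; may yield non-words, e.g. "running" -> "runn".
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary: a real lemmatizer uses vocabulary + grammar.
LEMMA_DICT = {"running": "run", "studies": "study", "better": "good"}

def lemmatize(word):
    # Return the dictionary form when known, otherwise the word itself.
    return LEMMA_DICT.get(word, word)

print(crude_stem("running"))   # "runn" -- not a real word
print(lemmatize("running"))    # "run"  -- a valid dictionary form
```

The stemmer is fast but lossy; the lemmatizer needs linguistic knowledge but returns valid words.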
Which method represents a document as a collection of word counts, disregarding grammar and word order?
Explanation: Bag-of-Words generates numerical vectors based on word frequency in a document, ignoring syntax and order. Part-of-Speech tagging labels word types, not document structure. Sentence segmentation divides text into sentences, and named entity recognition locates names and entities.
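A minimal sketch of the Bag-of-Words idea, assuming simple whitespace tokenization: each document becomes a vector of word counts over a shared vocabulary, with all grammar and ordering discarded.

```python
from collections import Counter

def bag_of_words(docs):
    # Build a sorted shared vocabulary from all documents.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    # Represent each document as a count vector over that vocabulary.
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vecs = bag_of_words(["the cat sat", "the cat saw the dog"])
# vocab -> ['cat', 'dog', 'sat', 'saw', 'the']
# vecs  -> [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Note that "the cat sat" and "sat the cat" would map to the same vector, which is exactly the order-insensitivity the explanation describes.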
What is the primary goal of named entity recognition in NLP applications?
Explanation: Named entity recognition finds specific entities (such as people or places) in text. Counting word frequencies is feature extraction, not entity identification. Text summarization and converting speech to text are different NLP tasks.
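A crude gazetteer-based sketch of the idea: look up known entity names in the text and tag their type. The `GAZETTEER` entries here are hypothetical; real NER systems use statistical or neural sequence models rather than fixed lists.

```python
# Hypothetical lookup table of known entities and their types.
GAZETTEER = {"ada lovelace": "PERSON", "paris": "LOCATION"}

def find_entities(text):
    # Case-insensitive substring match against the gazetteer.
    found = []
    lowered = text.lower()
    for name, label in GAZETTEER.items():
        idx = lowered.find(name)
        if idx != -1:
            # Recover the original casing from the source text.
            found.append((text[idx: idx + len(name)], label))
    return sorted(found)

entities = find_entities("Ada Lovelace visited Paris.")
# entities -> [('Ada Lovelace', 'PERSON'), ('Paris', 'LOCATION')]
```

This captures the task's goal (locating and labeling entity spans) while sidestepping the ambiguity a real model must resolve, e.g. "Paris" as a place versus a person's name.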
Why is text classification considered an important task in natural language processing?
Explanation: Text classification organizes and labels text into categories, enabling tasks like sentiment analysis or spam detection. Visualization, grammar correction, and sentence splitting are separate processes and do not involve assigning categories.
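Sentiment analysis, mentioned above, can be sketched as a toy keyword-scoring classifier. The word lists are illustrative assumptions; practical classifiers are trained models (e.g. Naive Bayes or neural networks) rather than hand-written lexicons.

```python
# Hypothetical sentiment lexicons for illustration only.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def classify_sentiment(text):
    # Score = positive-word hits minus negative-word hits.
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

label = classify_sentiment("I love this great movie")
# label -> "positive"
```

The essential point matches the explanation: the output is a category label assigned to a piece of text, which is what distinguishes classification from visualization or grammar correction.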