Test your knowledge of using hash maps and sets in preprocessing for categorical encoding, frequency counting, deduplication, and handling missing or unknown categories. This quiz covers foundational concepts and practical scenarios relevant to efficient data preparation using hash-based structures.
Which of the following best describes how a hash map aids in converting categorical values to numerical codes during data preprocessing?
Explanation: A hash map efficiently assigns a unique integer to each distinct category, enabling easy numerical encoding. Sorting categories alphabetically does not encode them for modeling. Displaying categories as color codes is for visualization, not encoding. Averaging values within categories relates to aggregation, not encoding categorical data.
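As a minimal sketch of this idea, the snippet below builds the mapping with a plain Python dict. The sample values and the names `categories`, `code_for`, and `encoded` are illustrative assumptions, and assigning codes in first-seen order is just one possible convention.

```python
categories = ["red", "green", "blue", "green", "red"]  # illustrative data

# Give each distinct category the next unused integer, in first-seen order.
code_for = {}
for value in categories:
    if value not in code_for:
        code_for[value] = len(code_for)

# Replace every value with its integer code.
encoded = [code_for[value] for value in categories]

print(code_for)  # {'red': 0, 'green': 1, 'blue': 2}
print(encoded)   # [0, 1, 2, 1, 0]
```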
When preparing a dataset, how does using a set help identify unique categories in a column containing repeated values like ['apple', 'banana', 'apple', 'orange']?
Explanation: A set inherently removes duplicates and keeps only unique values from a collection. Counting missing values requires a different method, such as checking for nulls. Sorting is not guaranteed when using a set. Converting numbers into strings is unrelated to deduplication.
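The column from the question can be passed straight to Python's built-in `set`. This is a small sketch; the variable names are assumptions, and `sorted` is used only to get a stable order for display.

```python
fruits = ['apple', 'banana', 'apple', 'orange']

# Building a set drops the repeated 'apple'.
unique_fruits = set(fruits)

print(len(unique_fruits))     # 3
print(sorted(unique_fruits))  # ['apple', 'banana', 'orange']
```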
If a new, previously unseen category appears in your test data, how can a hash map with a default value help handle this situation?
Explanation: A hash map with a default value provides a safe way to encode unknown categories by assigning them a specific code, often called 'other' or 'unknown'. Simply deleting unknown categories could lose important data. Changing case or appending them does not address encoding the unknowns appropriately.
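One way to sketch this in Python is `dict.get` with a fallback. The mapping `train_codes`, the test values, and the choice of `-1` as the unknown code are all hypothetical; any reserved code would work as long as it is used consistently.

```python
UNKNOWN_CODE = -1  # hypothetical code reserved for unseen categories

# Mapping learned from the training data (illustrative values).
train_codes = {"cat": 0, "dog": 1, "bird": 2}

test_values = ["dog", "lizard", "cat"]  # 'lizard' never appeared in training

# dict.get returns the fallback instead of raising KeyError.
encoded = [train_codes.get(value, UNKNOWN_CODE) for value in test_values]

print(encoded)  # [1, -1, 0]
```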
What is the primary advantage of using a hash map to count the frequency of categorical values in a dataset?
Explanation: Hash maps support fast access and updates for category counts, making them ideal for tallying frequencies. They do not automatically generate plots or histograms; plotting requires additional tools. Forcing all counts to be equal is incorrect, and translating languages is unrelated to frequency counting.
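A frequency tally might look like the sketch below. It uses `collections.Counter`, a dict subclass from the standard library; the sample data and names are assumptions for illustration.

```python
from collections import Counter

values = ["a", "b", "a", "c", "a", "b"]  # illustrative column

# Counter is a dict subclass: each key's count is updated in one pass.
counts = Counter(values)

print(counts["a"])            # 3
print(counts["b"])            # 2
print(counts.most_common(1))  # [('a', 3)]
```

The same tally can be kept in a plain dict with `counts[v] = counts.get(v, 0) + 1` inside a loop.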
When encoding, how can missing categorical values (such as nulls or blanks) be handled using hash maps?
Explanation: Assigning missing values a dedicated code ensures they are consistently represented after encoding and not lost or misinterpreted. Ignoring the hash map would mean losing its benefits in encoding. Doubling codes or sorting values does not address the handling of missing categories.
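The following sketch reserves one code for missing entries. The helper `is_missing`, the constant `MISSING_CODE`, and the decision to treat `None` and blank strings as missing are assumptions made for this example, not a fixed rule.

```python
MISSING_CODE = 0  # hypothetical code reserved for nulls and blanks


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


raw = ["red", None, "blue", "", "red"]  # illustrative column

code_for = {}
encoded = []
for value in raw:
    if is_missing(value):
        encoded.append(MISSING_CODE)
        continue
    if value not in code_for:
        # Real categories start at 1 so they never collide with MISSING_CODE.
        code_for[value] = len(code_for) + 1
    encoded.append(code_for[value])

print(encoded)  # [1, 0, 2, 0, 1]
```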
In deduplication, how might a set be applied to a list of tuples representing rows, such as [('A', 1), ('B', 2), ('A', 1)]?
Explanation: Because tuples are hashable, a set can store them and keeps just one instance of each unique row, making it effective for deduplication. Setting all values to zero or multiplying numbers would alter the data without removing duplicates. Set order is arbitrary, but ordering is not what deduplication relies on.
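Using the rows from the question, a sketch of both a plain set and an order-preserving variant follows. The names `seen` and `deduped` are illustrative; the second loop is one common pattern, not the only option.

```python
rows = [('A', 1), ('B', 2), ('A', 1)]

# Simplest form: the set keeps one copy of each row, in no guaranteed order.
unique_rows = set(rows)
print(len(unique_rows))  # 2

# Order-preserving variant: use the set only to remember rows already seen.
seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

print(deduped)  # [('A', 1), ('B', 2)]
```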
How can a hash map be used to facilitate one-hot encoding of categorical variables?
Explanation: Hash maps assist by linking each category to a unique column index, which is crucial for building the binary one-hot matrix. Multiplying or listing categories does not create a one-hot encoding. Ignoring rare categories would reduce information, not support encoding.
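To illustrate the category-to-column mapping, here is a small pure-Python sketch. The data and the names `column_for` and `one_hot` are assumptions; real pipelines often use a library encoder instead of building the matrix by hand.

```python
values = ["red", "green", "blue", "green"]  # illustrative column

# Map each category to a column index, in first-seen order.
column_for = {}
for value in values:
    column_for.setdefault(value, len(column_for))

# Build one row per value with a single 1 in that category's column.
one_hot = []
for value in values:
    row = [0] * len(column_for)
    row[column_for[value]] = 1
    one_hot.append(row)

print(column_for)  # {'red': 0, 'green': 1, 'blue': 2}
print(one_hot)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```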
Why might a set be preferred over a list for storing all unique categories from a column with repeated values?
Explanation: A set automatically eliminates duplicates, which is essential for storing unique categories. Sets do not guarantee order or sort items, nor do they remove missing values automatically. Changing data types is not a typical function of sets.
When searching for whether a category exists during encoding, why is querying a hash map typically faster than searching a list?
Explanation: Hash maps use hashing to jump directly to an entry, giving constant-time lookups on average, whereas checking a list means scanning it element by element. Hash maps do not encrypt or correct values, nor do they perform data type conversion as stated in the other options.
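The two kinds of lookup can be placed side by side as in the sketch below. The generated category names are made up for illustration, and no timings are shown because they depend on the machine and the data size.

```python
# Illustrative data: 10,000 made-up category names.
categories = [f"cat_{i}" for i in range(10_000)]
code_for = {name: i for i, name in enumerate(categories)}

target = "cat_9999"

# List membership compares elements one by one until a match is found.
in_list = target in categories

# Dict membership hashes the key and checks the matching slot directly.
in_map = target in code_for

print(in_list, in_map)   # True True
print(code_for[target])  # 9999
```

Both checks return the same answer; the difference is how much work each one does as the number of categories grows.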
Suppose your dataset column contains many repeated categories and you want to know how many unique categories there are. Which data structure should you use?
Explanation: A set is designed to store only unique elements, making it ideal for counting unique categories. A list preserves order but does not remove duplicates. Queues and stacks are focused on order of processing, not uniqueness.