Hash Maps and Sets for Efficient Categorical Encoding & Frequency Counting Quiz

Test your knowledge of using hash maps and sets in preprocessing for categorical encoding, frequency counting, deduplication, and handling missing or unknown categories. This quiz covers foundational concepts and practical scenarios relevant to efficient data preparation using hash-based structures.

  1. Role of Hash Maps in Encoding

    Which of the following best describes how a hash map aids in converting categorical values to numerical codes during data preprocessing?

    1. It averages numerical values within each category.
    2. It sorts categories alphabetically.
    3. It assigns unique integers to each distinct category.
    4. It displays categories as color codes.

    Explanation: A hash map stores a mapping from each distinct category to a unique integer, so encoding a value becomes a single lookup. Sorting categories alphabetically does not encode them for modeling. Displaying categories as color codes is for visualization, not encoding. Averaging values within categories relates to aggregation, not encoding categorical data.
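
    Example: a minimal sketch of this kind of integer encoding using a Python dict as the hash map. The helper name build_category_codes and the sample colors are illustrative assumptions, not part of the quiz.

```python
def build_category_codes(values):
    """Map each distinct category to the next unused integer, in first-seen order."""
    codes = {}
    for value in values:
        if value not in codes:
            # len(codes) is always the smallest integer not yet assigned.
            codes[value] = len(codes)
    return codes


colors = ["red", "green", "red", "blue", "green"]
color_codes = build_category_codes(colors)
encoded = [color_codes[c] for c in colors]

print(color_codes)  # {'red': 0, 'green': 1, 'blue': 2}
print(encoded)      # [0, 1, 0, 2, 1]
```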

  2. Using Sets for Deduplication

    When preparing a dataset, how does using a set help identify unique categories in a column containing repeated values like ['apple', 'banana', 'apple', 'orange']?

    1. By removing duplicates, leaving only unique values.
    2. By counting the number of missing values.
    3. By sorting the categories automatically.
    4. By converting numbers into strings.

    Explanation: A set inherently removes duplicates and keeps only unique values from a collection. Counting missing values requires a different method, such as checking for nulls. Sorting is not guaranteed when using a set. Converting numbers into strings is unrelated to deduplication.
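
    Example: a short sketch using the column from the question. Because a set has no guaranteed order, the values are sorted explicitly before printing.

```python
fruits = ["apple", "banana", "apple", "orange"]

# Building a set discards the repeated 'apple'.
unique_fruits = set(fruits)

print(len(unique_fruits))     # 3
print(sorted(unique_fruits))  # ['apple', 'banana', 'orange']
```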

  3. Handling Unknown Categories

    If a new, previously unseen category appears in your test data, how can a hash map with a default value help handle this situation?

    1. By converting categories to uppercase letters.
    2. By deleting all unknown categories from the data.
    3. By appending the unknown category at the end.
    4. By assigning a default code for unknown categories.

    Explanation: A hash map with a default value provides a safe way to encode unknown categories by assigning them a specific code, often called 'other' or 'unknown'. Simply deleting unknown categories could lose important data. Changing case or appending them does not address encoding the unknowns appropriately.
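
    Example: one possible way to apply a default code in Python is dict.get with a fallback value. The name UNKNOWN_CODE, the value -1, and the sample categories are assumptions made for illustration.

```python
UNKNOWN_CODE = -1  # hypothetical reserved code for categories not seen in training

# Mapping learned from the training data (illustrative values).
train_codes = {"cat": 0, "dog": 1, "bird": 2}

test_values = ["dog", "lizard", "cat"]

# dict.get returns the fallback when the key is absent, so 'lizard' does not raise KeyError.
encoded_test = [train_codes.get(v, UNKNOWN_CODE) for v in test_values]

print(encoded_test)  # [1, -1, 0]
```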

  4. Frequency Counting with Hash Maps

    What is the primary advantage of using a hash map to count the frequency of categorical values in a dataset?

    1. It translates categories into different languages.
    2. It forces all counts to be equal.
    3. It automatically plots histograms for the categories.
    4. It allows fast updates and lookups for each category's count.

    Explanation: Hash maps support fast access and updates for category counts, making them ideal for tallying frequencies. They do not automatically generate plots or histograms; plotting requires additional tools. Forcing all counts to be equal is incorrect, and translating languages is unrelated to frequency counting.
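
    Example: a sketch of frequency counting with a plain dict, checked against collections.Counter from the Python standard library. The sample values are made up for illustration.

```python
from collections import Counter

values = ["a", "b", "a", "c", "a", "b"]

# Manual tally: one lookup and one update per item.
counts = {}
for v in values:
    counts[v] = counts.get(v, 0) + 1

# Counter performs the same tally and adds helpers such as most_common.
counter = Counter(values)

print(counts)                  # {'a': 3, 'b': 2, 'c': 1}
print(counter.most_common(1))  # [('a', 3)]
```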

  5. Dealing with Missing Categories

    When encoding, how can missing categorical values (such as nulls or blanks) be handled using hash maps?

    1. By sorting missing values to the top.
    2. By doubling all existing integer codes.
    3. By ignoring the hash map entirely.
    4. By mapping them to a dedicated integer code like -1 or 0.

    Explanation: Assigning missing values a dedicated code ensures they are consistently represented after encoding and not lost or misinterpreted. Ignoring the hash map would mean losing its benefits in encoding. Doubling codes or sorting values does not address the handling of missing categories.
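
    Example: a minimal sketch that reserves -1 for missing values. Treating None and blank strings as missing, and the names MISSING_CODE and encode_with_missing, are assumptions for this illustration. If 0 were reserved instead, the real categories would need to start at 1.

```python
MISSING_CODE = -1  # hypothetical dedicated code for nulls and blanks


def encode_with_missing(values, codes):
    """Encode values with `codes`, sending None and blank strings to MISSING_CODE."""
    encoded = []
    for v in values:
        if v is None or (isinstance(v, str) and v.strip() == ""):
            encoded.append(MISSING_CODE)
        else:
            encoded.append(codes[v])
    return encoded


size_codes = {"small": 0, "medium": 1, "large": 2}
sizes = ["small", None, "large", "", "medium"]
encoded_sizes = encode_with_missing(sizes, size_codes)

print(encoded_sizes)  # [0, -1, 2, -1, 1]
```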

  6. Identifying Duplicate Rows via Sets

    In deduplication, how might a set be applied to a list of tuples representing rows, such as [('A', 1), ('B', 2), ('A', 1)]?

    1. By storing the rows in a random order.
    2. By converting all row values to zero.
    3. By multiplying each number by two.
    4. By retaining a single instance of each unique row.

    Explanation: A set keeps just one instance of each unique row, making it effective for deduplication. Setting all values to zero or multiplying numbers changes the data instead of deduplicating it. A set's iteration order is arbitrary, but that is a side effect, not the reason to use one here.
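
    Example: a sketch of row deduplication using the rows from the question. It keeps a set of rows already seen and also preserves first-seen order, which a bare set(rows) would not guarantee. Rows must be hashable, so tuples work but lists would not.

```python
rows = [("A", 1), ("B", 2), ("A", 1)]

seen = set()       # rows encountered so far
unique_rows = []   # first occurrence of each row, in original order

for row in rows:
    if row not in seen:
        seen.add(row)
        unique_rows.append(row)

print(unique_rows)  # [('A', 1), ('B', 2)]
```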

  7. Hash Maps for One-Hot Encoding Support

    How can a hash map be used to facilitate one-hot encoding of categorical variables?

    1. By ignoring rare categories entirely.
    2. By mapping categories to column indices for one-hot vectors.
    3. By listing out repeated categories multiple times.
    4. By multiplying categories by random numbers.

    Explanation: Hash maps assist by linking each category to a unique column index, which is crucial for building the binary one-hot matrix. Multiplying or listing categories does not create a one-hot encoding. Ignoring rare categories would reduce information, not support encoding.
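
    Example: a small sketch of one-hot encoding built on a category-to-column-index dict, using plain Python lists. The function name one_hot_encode and the sample data are illustrative assumptions.

```python
def one_hot_encode(values):
    """Return (column_index, matrix), where matrix[i] is the one-hot row for values[i]."""
    # First pass: give each distinct category its own column, in first-seen order.
    column_index = {}
    for v in values:
        column_index.setdefault(v, len(column_index))

    # Second pass: set a single 1 in the column that belongs to each value.
    matrix = []
    for v in values:
        row = [0] * len(column_index)
        row[column_index[v]] = 1
        matrix.append(row)
    return column_index, matrix


column_index, matrix = one_hot_encode(["red", "green", "red", "blue"])

print(column_index)  # {'red': 0, 'green': 1, 'blue': 2}
print(matrix)        # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```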

  8. Distinguishing Sets vs Lists for Category Storage

    Why might a set be preferred over a list for storing all unique categories from a column with repeated values?

    1. A set changes data types to integers.
    2. A set removes missing values automatically.
    3. A set sorts items by default, while a list does not.
    4. A set ensures no duplicates, while a list may contain repeats.

    Explanation: A set automatically eliminates duplicates, which is essential for storing unique categories. Sets do not guarantee order or sort items, nor do they remove missing values automatically. Changing data types is not a typical function of sets.

  9. Matching Hash Maps with Lookup Efficiency

    When searching for whether a category exists during encoding, why is querying a hash map typically faster than searching a list?

    1. Because hash maps encrypt the data.
    2. Because hash maps provide average constant-time lookup for keys.
    3. Because hash maps convert numbers to words.
    4. Because hash maps automatically correct typos.

    Explanation: Hash maps use hashing to locate an entry directly, giving average-case constant-time lookups, whereas checking membership in a list requires scanning its elements one by one. Hash maps do not encrypt data, correct typos, or convert numbers to words.
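
    Example: a rough sketch comparing membership checks in a list and a dict with the standard timeit module. The category names and sizes are arbitrary, and the measured times depend on the machine, so no specific numbers are claimed.

```python
import timeit

n = 10_000
category_list = [f"cat_{i}" for i in range(n)]
category_codes = {name: i for i, name in enumerate(category_list)}

target = f"cat_{n - 1}"  # last element: worst case for the linear list scan

# Both structures give the same answer; only the cost of finding it differs.
in_list = target in category_list
in_dict = target in category_codes

list_time = timeit.timeit(lambda: target in category_list, number=200)
dict_time = timeit.timeit(lambda: target in category_codes, number=200)

print(f"list scan: {list_time:.6f} s, dict lookup: {dict_time:.6f} s")
```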

  10. Accurate Category Counting with Sets

    Suppose your dataset column contains many repeated categories and you want to know how many unique categories there are. Which data structure should you use?

    1. A stack, because it reverses elements.
    2. A set, because it keeps only unique elements.
    3. A queue, because it processes items in arrival order.
    4. A list, because it preserves order.

    Explanation: A set is designed to store only unique elements, making it ideal for counting unique categories. A list preserves order but does not remove duplicates. Queues and stacks are focused on order of processing, not uniqueness.