Encoding Categorical Variables Quiz: Boost Your ML Feature Engineering Skills — Questions & Answers

This quiz contains 10 questions. Below is a complete reference of all questions, answer choices, and correct answers. You can use this section to review after taking the interactive quiz above.

  1. Question 1: Definition of Categorical Variable

    What is a categorical variable in the context of machine learning?

    • A variable that represents discrete values with a limited set of possible categories
    • A variable that contains continuous numbers
    • A highly correlated numeric field
    • A special feature that cannot be encoded
    • A variable related to time series only

    Correct answer: A variable that represents discrete values with a limited set of possible categories

  2. Question 2: One-Hot Encoding Identification

    In the example where 'Color' has categories 'Red', 'Blue', and 'Green', which encoding method creates three new binary columns for each color?

    • One-Hot Encoding
    • Label Encoding
    • Binary Encoding
    • Frequency Encoding
    • Count Encoding

    Correct answer: One-Hot Encoding
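
A minimal plain-Python sketch of what one-hot encoding does to the 'Color' example: each category becomes its own 0/1 column. The helper name `one_hot`, the `Color_` column prefix, and the sample rows are illustrative assumptions, not part of the quiz.

```python
def one_hot(values, categories):
    """Return one dict per row with a 0/1 flag for every category."""
    return [
        {f"Color_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


categories = ["Red", "Blue", "Green"]
rows = ["Red", "Green", "Blue", "Red"]  # hypothetical sample data

encoded = one_hot(rows, categories)
print(encoded[0])  # {'Color_Red': 1, 'Color_Blue': 0, 'Color_Green': 0}
```

Every row has exactly one flag set to 1, and the number of new columns equals the number of categories.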

  3. Question 3: Label Encoding Limitation

    Which statement best describes a potential drawback of using label encoding, especially with linear or distance-based models?

    • Label encoding may introduce unintended ordinal relationships between categories
    • Label encoding requires the data to be normalized first
    • Label encoding always results in multicollinearity problems
    • Label encoding is not compatible with string variables
    • Label encoding increases the number of columns for each category

    Correct answer: Label encoding may introduce unintended ordinal relationships between categories
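
To see where the unintended order comes from, here is a small plain-Python sketch. It assigns integer codes in sorted order, which is also how scikit-learn's LabelEncoder orders its classes; the color list is reused from Question 2 and the variable names are illustrative.

```python
colors = ["Red", "Blue", "Green"]

# Assign integer codes in sorted (alphabetical) order.
codes = {cat: i for i, cat in enumerate(sorted(colors))}
print(codes)  # {'Blue': 0, 'Green': 1, 'Red': 2}

# A model that treats the codes as numbers sees these "distances",
# even though the colors themselves have no natural order.
gap_blue_green = abs(codes["Blue"] - codes["Green"])
gap_blue_red = abs(codes["Blue"] - codes["Red"])
print(gap_blue_green, gap_blue_red)  # 1 2
```

The codes imply that Red is twice as far from Blue as Green is, which is purely an artifact of alphabetical sorting.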

  4. Question 4: Fit or Transform on Training Data

    To avoid data leakage when encoding categorical variables, should you fit your encoder on the training data only or on the entire dataset?

    • Fit only on the training data, then transform both training and test sets
    • Fit and transform on the entire dataset
    • Transform on the test set first, then fit on train
    • Only fit and transform on the test set
    • Encode each set completely independently

    Correct answer: Fit only on the training data, then transform both training and test sets
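
The fit-on-train, transform-both pattern might look like the following sketch with scikit-learn. The DataFrame contents, the `Color` column, and the split parameters are assumptions made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with a single categorical column.
df = pd.DataFrame(
    {"Color": ["Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue"]}
)

# Split first, so the encoder never sees the test rows while fitting.
train, test = train_test_split(df, test_size=0.25, random_state=0)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Color"]])  # learn the categories from training data only

# Apply the same fitted encoder to both sets.
train_encoded = encoder.transform(train[["Color"]]).toarray()
test_encoded = encoder.transform(test[["Color"]]).toarray()

print(train_encoded.shape, test_encoded.shape)
```

Because both sets go through the same fitted encoder, they end up with identical columns in the same order.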

  5. Question 5: Handling Unseen Categories

    If your test data contains a new category not seen during training, which encoding technique is most vulnerable to this problem?

    • One-Hot Encoding
    • Frequency Encoding
    • Mean Target Encoding
    • Hash Encoding
    • Count Encoding

    Correct answer: One-Hot Encoding
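
As a rough illustration with scikit-learn's OneHotEncoder: by default it raises an error on a category it did not see during fit, while `handle_unknown="ignore"` encodes that row as all zeros. The 'Purple' category and the variable names are made up for this sketch.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([["Red"], ["Blue"], ["Green"]])
X_test = np.array([["Purple"]])  # hypothetical category unseen in training

# Default behaviour (handle_unknown="error"): transform fails.
strict = OneHotEncoder().fit(X_train)
try:
    strict.transform(X_test)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# handle_unknown="ignore": the unseen category becomes an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(X_train)
unknown_row = lenient.transform(X_test).toarray()
print(unknown_row)  # [[0. 0. 0.]]
```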

  6. Question 6: Encoding High-Cardinality Features

    For a feature like 'Zip Code' with hundreds of unique values, which encoding approach is likely to create an impractically wide dataset?

    • One-Hot Encoding
    • Label Encoding
    • Hash Encoding
    • Mean Target Encoding
    • Frequency Encoding

    Correct answer: One-Hot Encoding
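
A quick way to see the width problem is to compare shapes. This sketch uses synthetic data (500 made-up zip codes, each appearing twice) and contrasts one-hot encoding with a simple frequency encoding; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: 500 distinct zip codes, each appearing twice.
zips = pd.Series([str(10000 + i) for i in range(500)] * 2, name="zip_code")

# One-hot encoding: one new column per distinct zip code.
one_hot = pd.get_dummies(zips)
print(one_hot.shape)  # (1000, 500)

# Frequency encoding: a single column holding each zip code's count.
freq = zips.map(zips.value_counts())
print(freq.shape)  # (1000,)
```

One-hot encoding turns a single feature into 500 columns here, while frequency encoding keeps it as one.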

  7. Question 7: Pandas get_dummies() Usage

    Given a pandas Series col = pd.Series(['dog', 'cat', 'fish']), which function would you use to perform one-hot encoding?

    • pd.get_dummies(col)
    • pd.factorize(col)
    • pd.LabelEncoder(col)
    • pd.binary(col)
    • pd.category(col)

    Correct answer: pd.get_dummies(col)
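
Running the call from the question shows the result. The exact dtype of the dummy columns depends on the pandas version (boolean in recent releases, integer in older ones), so this sketch casts to int for display.

```python
import pandas as pd

col = pd.Series(["dog", "cat", "fish"])

# One column per distinct value, ordered alphabetically.
dummies = pd.get_dummies(col)

print(list(dummies.columns))  # ['cat', 'dog', 'fish']
print(dummies.astype(int))
```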

  8. Question 8: Ordinal Encoding

    Which encoding method is best suited for categorical features with an inherent order, such as 'Low', 'Medium', 'High'?

    • Ordinal Encoding
    • One-Hot Encoding
    • Random Encoding
    • Frequency Encoding
    • Binary Encoding

    Correct answer: Ordinal Encoding
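
One possible way to encode the 'Low', 'Medium', 'High' example is scikit-learn's OrdinalEncoder with the category order passed explicitly; without it, the encoder would sort the labels alphabetically. The sample rows are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# State the intended order explicitly: Low < Medium < High.
levels = [["Low", "Medium", "High"]]
encoder = OrdinalEncoder(categories=levels)

X = np.array([["Medium"], ["Low"], ["High"], ["Low"]])  # hypothetical rows
codes = encoder.fit_transform(X)

print(codes.ravel())  # [1. 0. 2. 0.]
```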

  9. Question 9: Target Encoding Caution

    Why must you be careful when using mean target encoding in machine learning?

    • Because it may cause target leakage if encoding uses the whole dataset
    • Because it increases the number of columns drastically
    • Because it only works with numbers
    • Because it can only be used for ordinal data
    • Because it requires categories to be sorted alphabetically

    Correct answer: Because it may cause target leakage if encoding uses the whole dataset
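
A minimal sketch of mean target encoding that computes category means from the training rows only and then applies them to the test rows. The `city` and `target` columns, the data, and the global-mean fallback for unseen categories are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical training and test data.
train = pd.DataFrame(
    {"city": ["A", "A", "B", "B", "B"], "target": [1, 0, 1, 1, 0]}
)
test = pd.DataFrame({"city": ["A", "B", "C"]})

# Statistics come from the training set only, never from test rows.
global_mean = train["target"].mean()
city_means = train.groupby("city")["target"].mean()

train["city_te"] = train["city"].map(city_means)
# Unseen categories (here "C") fall back to the global training mean.
test["city_te"] = test["city"].map(city_means).fillna(global_mean)

print(test)
```

Even within the training set, each row's encoding here includes its own target value, so out-of-fold encoding or smoothing is commonly layered on top to reduce that residual leakage.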

  10. Question 10: Sklearn LabelEncoder Limitation

    Which is a limitation of sklearn's LabelEncoder for input features?

    • It can only process one column at a time
    • It replaces missing values automatically
    • It applies one-hot encoding automatically
    • It sorts categories based on frequency
    • It adds new categories in the test set automatically

    Correct answer: It can only process one column at a time
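
The one-column restriction can be checked directly. LabelEncoder expects a 1-D array and is intended for target labels; in this sketch OrdinalEncoder is shown as one alternative that accepts several feature columns at once. The two-column array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical features: a color column and a size column.
X = np.array([["Red", "S"], ["Blue", "M"], ["Red", "L"]])

# LabelEncoder works on a single 1-D column.
single = LabelEncoder().fit_transform(X[:, 0])
print(single)  # [1 0 1]

# Passing the full 2-D array raises an error.
try:
    LabelEncoder().fit_transform(X)
    two_d_ok = True
except ValueError:
    two_d_ok = False
print(two_d_ok)  # False

# OrdinalEncoder handles all feature columns in one call.
both = OrdinalEncoder().fit_transform(X)
print(both)
```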