Encoding Categorical Variables Quiz: Boost Your ML Feature Engineering Skills

  1. Definition of Categorical Variable

    What is a categorical variable in the context of machine learning?

    1. A variable that represents discrete values with a limited set of possible categories
    2. A variable that contains continuous numbers
    3. A highly correlated numeric field
    4. A special feature that cannot be encoded
    5. A variable related to time series only
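A minimal sketch of what a categorical variable looks like in practice, using a hypothetical 'Color' column and pandas' built-in category dtype:

```python
import pandas as pd

# Hypothetical example: a column holding a limited set of discrete categories.
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})
df["Color"] = df["Color"].astype("category")

# The variable takes values from a small, fixed set of categories.
n_categories = df["Color"].nunique()
```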
  2. One-Hot Encoding Identification

    In the example where 'Color' has categories 'Red', 'Blue', and 'Green', which encoding method creates three new binary columns for each color?

    1. One-Hot Encoding
    2. Label Encoding
    3. Binary Encoding
    4. Frequency Encoding
    5. Count Encoding
  3. Label Encoding Limitation

    Which statement best describes a potential drawback of using label encoding with linear or distance-based models?

    1. Label encoding may introduce unintended ordinal relationships between categories
    2. Label encoding requires the data to be normalized first
    3. Label encoding always results in multicollinearity problems
    4. Label encoding is not compatible with string variables
    5. Label encoding increases the number of columns for each category
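A quick sketch of how label encoding imposes an arbitrary order on unordered categories:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Blue", "Green"]
codes = LabelEncoder().fit_transform(colors)
# LabelEncoder assigns integers alphabetically: Blue=0, Green=1, Red=2,
# implying Blue < Green < Red even though no such order exists in the data.
```

Models that treat these integers as magnitudes (e.g. linear regression, k-NN) will pick up on this spurious ordering.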
  4. Fit or Transform on Training Data

    When encoding categorical variables, should you fit your encoder on the training data or on the entire dataset to avoid data leakage?

    1. Fit only on the training data, then transform both training and test sets
    2. Fit and transform on the entire dataset
    3. Transform on the test set first, then fit on train
    4. Only fit and transform on the test set
    5. Encode each set completely independently
  5. Handling Unseen Categories

    If your test data contains a new category not seen during training, which encoding technique is most vulnerable to this problem?

    1. One-Hot Encoding
    2. Frequency Encoding
    3. Mean Target Encoding
    4. Hash Encoding
    5. Count Encoding
  6. Encoding High-Cardinality Features

    For a feature like 'Zip Code' with hundreds of unique values, which encoding approach is likely to create an impractically wide dataset?

    1. One-Hot Encoding
    2. Label Encoding
    3. Hash Encoding
    4. Mean Target Encoding
    5. Frequency Encoding
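A sketch of the width problem with a hypothetical set of 500 distinct zip codes: one-hot encoding produces one column per unique value.

```python
import pandas as pd

# Hypothetical: 500 distinct zip codes.
zips = pd.Series([f"{i:05d}" for i in range(500)])
wide = pd.get_dummies(zips, prefix="zip")
n_cols = wide.shape[1]  # one binary column per unique zip code
```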
  7. Pandas get_dummies() Usage

    Given a pandas Series col = pd.Series(['dog', 'cat', 'fish']), which function would you use to perform one-hot encoding?

    1. pd.get_dummies(col)
    2. pd.factorize(col)
    3. pd.LabelEncoder(col)
    4. pd.binary(col)
    5. pd.category(col)
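The correct call from the question, shown end to end (pd.get_dummies sorts the resulting columns by category name):

```python
import pandas as pd

col = pd.Series(["dog", "cat", "fish"])
dummies = pd.get_dummies(col)
# One binary column per category, sorted alphabetically: cat, dog, fish.
```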
  8. Ordinal Encoding

    Which encoding method is best suited for categorical features with an inherent order, such as 'Low', 'Medium', 'High'?

    1. Ordinal Encoding
    2. One-Hot Encoding
    3. Random Encoding
    4. Frequency Encoding
    5. Binary Encoding
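With scikit-learn's OrdinalEncoder you can pass the category order explicitly, so the integer codes respect Low < Medium < High rather than alphabetical order:

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit category order preserves the inherent ranking.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
codes = encoder.fit_transform([["Medium"], ["Low"], ["High"]])
# Low -> 0, Medium -> 1, High -> 2
```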
  9. Target Encoding Caution

    Why must you be careful when using mean target encoding in machine learning?

    1. Because it may cause target leakage if encoding uses the whole dataset
    2. Because it increases the number of columns drastically
    3. Because it only works with numbers
    4. Because it can only be used for ordinal data
    5. Because it requires categories to be sorted alphabetically
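A minimal sketch of the safe pattern with hypothetical data: compute the per-category target means on the training split only, then map them onto the test split. Computing the means over the whole dataset would let test-set target values leak into the features.

```python
import pandas as pd

train = pd.DataFrame({"city": ["A", "A", "B", "B"], "y": [1, 0, 1, 1]})
test = pd.DataFrame({"city": ["A", "B"]})

# Category means computed from the TRAINING data only.
means = train.groupby("city")["y"].mean()
test["city_te"] = test["city"].map(means)
```

In practice, out-of-fold (cross-validated) target encoding is used on the training set itself for the same reason.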
  10. Sklearn LabelEncoder Limitation

    Which is a limitation of sklearn's LabelEncoder for input features?

    1. It can only process one column at a time
    2. It replaces missing values automatically
    3. It applies one-hot encoding automatically
    4. It sorts categories based on frequency
    5. It adds new categories in the test set automatically
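LabelEncoder accepts only a single 1-D array (it was designed for target labels, not feature matrices), so each feature column must be encoded separately:

```python
from sklearn.preprocessing import LabelEncoder

# One column at a time; a 2-D feature matrix would need OrdinalEncoder instead.
le = LabelEncoder()
codes = le.fit_transform(["dog", "cat", "fish"])
```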