Basics of Categorical Encoding
Which encoding technique transforms each unique category value into a new binary column in the dataset?
- One-Hot Encoding
- Label Encoding
- Ordinal Encoding
- Hash Encoding
- Frequency Encoding
Label Encoding Usage
Label Encoding is most appropriate for which kind of categorical variable?
- Ordinal variables with a meaningful order
- Nominal variables with no intrinsic order
- Continuous variables
- Variables with missing values only
- Numerical features
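As a rough illustration of integer encoding for an ordered variable, the sketch below maps sizes to codes that follow an explicitly stated order. The `sizes` data and the `order` list are made up for this example; pandas' `Categorical` is only one of several ways to do this.

```python
import pandas as pd

# Hypothetical ordinal feature; the data and the order are assumptions for illustration.
sizes = pd.Series(["small", "large", "medium", "small"])
order = ["small", "medium", "large"]

# Codes follow the position of each label in `order`, so the integers keep the ranking.
codes = pd.Categorical(sizes, categories=order, ordered=True).codes
print(list(codes))
```

Stating the order explicitly matters because encoders that sort labels alphabetically would not necessarily reproduce the intended ranking.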
Implications of Incorrect Encoding
What could happen if you apply Label Encoding to nominal variables when building a regression model?
- The model may mistakenly interpret the categories as ordered
- The model will ignore the feature
- No effect; label encoding always works
- It accelerates model training
- It automatically normalizes the data
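A small sketch of why integer codes can mislead a regression model: the codes carry magnitudes that the categories themselves do not have. The `colors` data is invented for illustration.

```python
import pandas as pd

# Illustrative nominal data with no natural order.
colors = pd.Series(["red", "green", "blue", "green"])

# Category codes follow the sorted labels: blue=0, green=1, red=2.
codes = colors.astype("category").cat.codes

# A linear model would read these as magnitudes, e.g. "red" minus "blue" equals 2.
implied_gap = int(codes[0]) - int(codes[2])
print(codes.tolist(), implied_gap)
```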
One-Hot Encoding and Feature Explosion
For a categorical feature with 50 unique values, how many columns will One-Hot Encoding produce (without dropping any column)?
- 50
- 49
- 25
- 2
- 51
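The column count can be checked directly with a minimal sketch. The category labels `cat_0` through `cat_49` are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical feature with exactly 50 distinct labels, each appearing twice.
values = [f"cat_{i}" for i in range(50)]
feature = pd.Series(values * 2, name="feature")

# One binary column is produced per distinct label when no column is dropped.
encoded = pd.get_dummies(feature)
n_columns = encoded.shape[1]
print(n_columns)
```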
Code Snippet: Pandas get_dummies
Given the code snippet: pd.get_dummies(df['color']), what does this code return?
- A DataFrame with one binary column for each unique color value
- A list of unique values in the 'color' column
- A Series with category frequencies
- A single column with encoded integers
- An error due to missing argument
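To see the return value concretely, here is a short sketch with an invented `color` column. The exact dtype of the indicator columns depends on the pandas version, so it is not asserted here.

```python
import pandas as pd

# Illustrative frame; the values are assumptions for this example.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

dummies = pd.get_dummies(df["color"])

# The result is a DataFrame whose columns are the distinct color values.
print(type(dummies).__name__)
print(list(dummies.columns))
```

Each row has exactly one active indicator, matching the color in that row.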
Handling High Cardinality
Which encoding method is typically more efficient for categorical variables with high cardinality (many unique categories)?
- Hash Encoding
- One-Hot Encoding
- Binary Encoding
- Label Encoding
- Dummy Variable Encoding
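The idea behind hash encoding can be sketched in a few lines: each category is hashed into one of a fixed number of buckets, so the output width does not grow with the number of categories. The helper `hash_encode` and the bucket count are hypothetical and written only for illustration; production libraries implement the hashing trick differently.

```python
import hashlib
from typing import List


def hash_encode(value: str, n_buckets: int = 8) -> List[int]:
    """Map a category to a fixed-width indicator vector (illustrative helper)."""
    # A stable hash keeps the mapping reproducible across runs.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_buckets
    vector = [0] * n_buckets
    vector[index] = 1
    return vector


# The width stays at n_buckets no matter how many distinct values appear.
print(len(hash_encode("user_123456")))
```

The trade-off is that distinct categories can collide in the same bucket, more often as `n_buckets` shrinks.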
Dropping First Column in One-Hot Encoding
Why might you set drop_first=True when using one-hot encoding in pandas?
- To avoid multicollinearity by dropping one redundant column
- To speed up computation by half
- It encodes categories as numbers from 1
- Required for all scikit-learn models
- It preserves the original category names
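A minimal comparison of the two settings, using an invented `color` column. With all columns kept, the indicators in each row sum to 1, so any one column is determined by the others.

```python
import pandas as pd

# Illustrative data; the values are assumptions for this example.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

full = pd.get_dummies(df["color"])
reduced = pd.get_dummies(df["color"], drop_first=True)

# The full encoding has one column per category; drop_first removes the first one.
print(full.shape[1], reduced.shape[1])
```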
Encoding for Tree-Based Models
Which encoding is generally acceptable for categorical features when using tree-based models such as Random Forest?
- Label Encoding
- One-Hot Encoding
- Frequency Encoding
- Hash Encoding
- MinMax Encoding
Ordinal Encoding Pitfalls
What is a possible drawback of using Ordinal Encoding on nominal categorical features?
- It introduces a false sense of order among categories
- It creates too many columns
- It's not supported in pandas
- It only works for missing values
- It is slower than one-hot encoding
Target Encoding Usage
What does Target Encoding (a.k.a. mean encoding) replace each category with?
- The mean of the target variable for each category
- The length of each string in the category
- Randomly assigned unique integers
- A new column per category
- The total count of each category
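A bare-bones sketch of mean encoding with pandas follows. The `city` and `target` columns are invented for illustration, and the means are computed on the full frame only to keep the example short; in practice they are usually estimated on training data alone to limit target leakage.

```python
import pandas as pd

# Illustrative data; column names and values are assumptions for this example.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "target": [1, 0, 1, 1, 0],
})

# Mean of the target within each category.
category_means = df.groupby("city")["target"].mean()

# Replace each category with its mean.
df["city_encoded"] = df["city"].map(category_means)
print(df)
```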