Definition of Categorical Variable
What is a categorical variable in the context of machine learning?
- A variable that represents discrete values with a limited set of possible categories
- A variable that contains continuous numbers
- A highly correlated numeric field
- A special feature that cannot be encoded
- A variable related to time series only
One-Hot Encoding Identification
In an example where 'Color' has the categories 'Red', 'Blue', and 'Green', which encoding method creates three new binary columns, one for each color?
- One-Hot Encoding
- Label Encoding
- Binary Encoding
- Frequency Encoding
- Count Encoding
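As an illustration of the idea behind this question, here is a minimal pure-Python sketch of one-hot encoding. The sample values in `colors` are made up for the example, and the column order (alphabetical) is simply a choice made here.

```python
# Hypothetical sample column with the three categories from the question
colors = ["Red", "Blue", "Green", "Blue"]

# One output column per distinct category, in alphabetical order
categories = sorted(set(colors))

# Each row gets a 1 in the column matching its category and 0 elsewhere
one_hot = [[1 if value == cat else 0 for cat in categories] for value in colors]

for value, row in zip(colors, one_hot):
    print(value, dict(zip(categories, row)))
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.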
Label Encoding Limitation
Which statement best describes a potential drawback of using label encoding with tree-based models?
- Label encoding may introduce unintended ordinal relationships between categories
- Label encoding requires the data to be normalized first
- Label encoding always results in multicollinearity problems
- Label encoding is not compatible with string variables
- Label encoding increases the number of columns for each category
Fit or Transform on Training Data
When encoding categorical variables, should you fit your encoder on the training data only or on the entire dataset to avoid data leakage?
- Fit only on the training data, then transform both training and test sets
- Fit and transform on the entire dataset
- Transform on the test set first, then fit on train
- Only fit and transform on the test set
- Encode each set completely independently
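The fit-then-transform pattern can be sketched with scikit-learn's OneHotEncoder. The tiny `X_train` and `X_test` lists are invented for illustration, and the split is written by hand rather than produced by a splitting utility.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical, hand-made split of a single 'Color' feature
X_train = [["Red"], ["Blue"], ["Green"], ["Blue"]]
X_test = [["Green"], ["Red"]]

encoder = OneHotEncoder()
encoder.fit(X_train)  # categories are learned from the training rows only

# The same fitted encoder is reused for both sets
train_encoded = encoder.transform(X_train).toarray()
test_encoded = encoder.transform(X_test).toarray()

print(encoder.categories_[0])
print(test_encoded)
```

Because the encoder never sees the test rows during fitting, nothing about the test set influences the learned categories.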
Handling Unseen Categories
If your test data contains a new category not seen during training, which encoding technique is most vulnerable to this problem?
- One-Hot Encoding
- Frequency Encoding
- Mean Target Encoding
- Hash Encoding
- Count Encoding
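To make the unseen-category problem concrete, the sketch below uses scikit-learn's OneHotEncoder with its default setting and with handle_unknown="ignore". The data, including the unseen 'Purple' value, is invented for the example.

```python
from sklearn.preprocessing import OneHotEncoder

X_train = [["Red"], ["Blue"], ["Green"]]
X_test = [["Blue"], ["Purple"]]  # 'Purple' never appeared in training

# Default behaviour: an unknown category raises an error at transform time
strict = OneHotEncoder()
strict.fit(X_train)
try:
    strict.transform(X_test)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown="ignore" encodes an unknown category as an all-zero row
tolerant = OneHotEncoder(handle_unknown="ignore")
tolerant.fit(X_train)
test_encoded = tolerant.transform(X_test).toarray()

print(strict_failed)
print(test_encoded)
```

Whether an all-zero row is an acceptable stand-in for an unknown category depends on the model and the data, so treat this as one possible mitigation rather than a universal fix.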
Encoding High-Cardinality Features
For a feature like 'Zip Code' with hundreds of unique values, which encoding approach is likely to create an impractically wide dataset?
- One-Hot Encoding
- Label Encoding
- Hash Encoding
- Mean Target Encoding
- Frequency Encoding
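The width problem can be seen by comparing output shapes. This sketch assumes a made-up column of 500 distinct zip codes and contrasts pandas' get_dummies with a simple frequency encoding; in a real pipeline the frequencies would normally be computed from the training data only.

```python
import pandas as pd

# Hypothetical data: 500 distinct zip codes, each appearing twice
zip_codes = pd.Series([str(10000 + i) for i in range(500)] * 2, name="zip_code")

# One-hot encoding: one new column per distinct zip code
one_hot = pd.get_dummies(zip_codes)

# Frequency encoding: a single column holding each value's relative frequency
frequency = zip_codes.map(zip_codes.value_counts(normalize=True))

print(one_hot.shape)    # rows x one column per zip code
print(frequency.shape)  # rows only, a single column
```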
Pandas get_dummies() Usage
Given a pandas Series col = pd.Series(['dog', 'cat', 'fish']), which function would you use to perform one-hot encoding?
- pd.get_dummies(col)
- pd.factorize(col)
- pd.LabelEncoder(col)
- pd.binary(col)
- pd.category(col)
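For reference, here is a short sketch of the Series from the question passed through pd.get_dummies, with pd.factorize shown for contrast. The dtype=int argument is added here only so the output prints as 0/1 rather than booleans.

```python
import pandas as pd

col = pd.Series(["dog", "cat", "fish"])

# One-hot encoding: one column per category, sorted alphabetically
dummies = pd.get_dummies(col, dtype=int)
print(dummies)

# For contrast, factorize returns integer codes (label-style encoding)
codes, uniques = pd.factorize(col)
print(codes, list(uniques))
```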
Ordinal Encoding
Which encoding method is best suited for categorical features with an inherent order, such as 'Low', 'Medium', 'High'?
- Ordinal Encoding
- One-Hot Encoding
- Random Encoding
- Frequency Encoding
- Binary Encoding
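One way to encode an ordered feature is sketched below with an ordered pandas Categorical. The `priority` values are invented, and the explicit `levels` list is what tells pandas the intended order.

```python
import pandas as pd

# The order is stated explicitly; alphabetical order would be wrong here
levels = ["Low", "Medium", "High"]

# Hypothetical sample column
priority = pd.Series(["Medium", "Low", "High", "Low"])

ordered = pd.Categorical(priority, categories=levels, ordered=True)
codes = ordered.codes  # integer position of each value within `levels`

print(list(codes))
```

The resulting integers preserve Low < Medium < High, which is exactly the relationship an ordinal encoding is meant to keep.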
Target Encoding Caution
Why must you be careful when using mean target encoding in machine learning?
- Because it may cause target leakage if encoding uses the whole dataset
- Because it increases the number of columns drastically
- Because it only works with numbers
- Because it can only be used for ordinal data
- Because it requires categories to be sorted alphabetically
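A minimal sketch of mean target encoding that computes category means from the training rows only is shown below. The `train` and `test` frames, the `city` column, and the choice of the global training mean as a fallback for unseen categories are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical training and test data
train = pd.DataFrame({"city": ["A", "A", "B", "B", "B"],
                      "target": [1, 0, 1, 1, 0]})
test = pd.DataFrame({"city": ["A", "B", "C"]})  # 'C' is unseen in training

# Statistics come from the training set only, never from the test set
global_mean = train["target"].mean()
city_means = train.groupby("city")["target"].mean()

train["city_te"] = train["city"].map(city_means)
# Unseen categories fall back to the global training mean
test["city_te"] = test["city"].map(city_means).fillna(global_mean)

print(test)
```

Even this version encodes each training row using its own target value, so out-of-fold or smoothed variants are often preferred to limit overfitting.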
Sklearn LabelEncoder Limitation
Which is a limitation of sklearn's LabelEncoder for input features?
- It can only process one column at a time
- It replaces missing values automatically
- It applies one-hot encoding automatically
- It sorts categories based on frequency
- It adds new categories in the test set automatically
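The one-column restriction can be illustrated by comparing LabelEncoder with OrdinalEncoder, which accepts a 2-D feature matrix. The `animals` list and the two-column `X` are made-up sample data.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder expects a 1-D array (it is designed for target labels)
animals = ["dog", "cat", "fish"]
label_codes = LabelEncoder().fit_transform(animals)
print(label_codes)

# OrdinalEncoder handles several feature columns in one call
X = [["dog", "small"], ["cat", "small"], ["fish", "large"]]
feature_codes = OrdinalEncoder().fit_transform(X)
print(feature_codes)
```

Both encoders assign codes by sorted category order, so 'cat' becomes 0, 'dog' 1, and 'fish' 2.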