Explore essential concepts of feature importance in Random Forest and XGBoost models. This quiz covers key terms, interpretation methods, and typical applications, helping you understand how both algorithms compute and use feature importance for better model insight.
What does a high feature importance score indicate in a Random Forest model trained to predict house prices?
Explanation: A high feature importance score signifies that the feature plays a significant role in making accurate predictions. Features with missing values may still receive high or low scores depending on their utility. Feature importance is not determined solely by whether a feature is categorical. If a feature is not used in any splits, its importance score will be low or zero.
Which method is commonly used by Random Forests to calculate default feature importance scores?
Explanation: Random Forest models typically use mean decrease in impurity, sometimes called Gini importance, to measure feature importance. This measures how much the splits on each feature reduce impurity, averaged across all trees in the forest. Root mean square error and the correlation coefficient are general model evaluation metrics, not feature importance metrics in this context. Linear regression coefficients belong to regression models, not tree ensembles.
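As a minimal sketch of how these impurity-based scores are read in practice, the snippet below fits scikit-learn's RandomForestRegressor on synthetic data; the feature names ("area", "bathrooms", "noise") and the data-generating process are made-up assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(50, 300, n),   # "area": strong signal by construction
    rng.integers(1, 6, n),     # "bathrooms": weaker signal
    rng.normal(0, 1, n),       # "noise": no signal at all
])
y = 1000 * X[:, 0] + 5000 * X[:, 1] + rng.normal(0, 1000, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds the impurity-based (MDI) scores, normalized to sum to 1
for name, score in zip(["area", "bathrooms", "noise"], model.feature_importances_):
    print(f"{name}: {score:.3f}")
```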
If the accuracy of a Random Forest model drops significantly after shuffling the values of a variable, what does this suggest about the variable’s importance?
Explanation: A significant accuracy drop after shuffling a variable’s values suggests the feature is important; it strongly influences predictions. This procedure does not directly indicate overfitting, nor does it mean the variable is always zero. Removing important features is not recommended; rather, the drop indicates the variable is worth keeping.
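This shuffle-and-remeasure procedure is exactly what scikit-learn's permutation_importance implements. Below is a sketch on a synthetic classification dataset; the dataset shape and the choice of ten repeats are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# importances_mean is the average accuracy drop across n_repeats shuffles of each column
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
for i, drop in enumerate(result.importances_mean):
    print(f"feature {i}: mean accuracy drop = {drop:.3f}")
```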
What effect does the presence of highly correlated features have on feature importance in tree-based models such as Random Forest and XGBoost?
Explanation: When features are highly correlated, tree-based models may split importance between them, reducing each feature's individual score. Their importance scores do not simply increase; instead, the shared predictive signal is distributed between them. The model does not completely ignore correlated features, nor does it assign zero importance to all but one, but each feature's importance can be diluted.
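A quick way to see this dilution is to duplicate a predictive column and compare importance scores before and after. The setup below is deliberately artificial: `signal` fully determines the label, and its exact copy is a perfectly correlated stand-in for a redundant feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
noise = rng.normal(size=1000)
y = (signal > 0).astype(int)

# Case 1: the predictive signal appears once
X1 = np.column_stack([signal, noise])
# Case 2: the same signal appears twice (perfectly correlated copies)
X2 = np.column_stack([signal, signal.copy(), noise])

imp1 = RandomForestClassifier(random_state=0).fit(X1, y).feature_importances_
imp2 = RandomForestClassifier(random_state=0).fit(X2, y).feature_importances_
print("single copy :", imp1)  # signal gets most of the importance
print("two copies  :", imp2)  # roughly the same credit, split between the copies
```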
In XGBoost, which metric used for feature importance represents the average improvement in accuracy from splits involving a feature?
Explanation: “Gain” measures the average improvement in the model’s accuracy contributed by splits involving a specific feature. “Frequency” (or “weight”) counts the number of times a feature is used in splits, not the quality of those splits. “Permutation” refers to a model-agnostic method applied outside the algorithm, not an internal XGBoost metric.
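XGBoost exposes both metrics through the Booster's get_score method, as the sketch below shows. The data and feature names (f_a through f_d) are synthetic assumptions; only the importance_type values are the library's own.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y, feature_names=["f_a", "f_b", "f_c", "f_d"])
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

# "gain" averages the split-quality improvement per feature; "weight" only counts splits
print("gain  :", booster.get_score(importance_type="gain"))
print("weight:", booster.get_score(importance_type="weight"))
```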
A variable receives a very low importance ranking in a Random Forest model predicting customer churn. What is a valid interpretation?
Explanation: A low importance score implies the feature does not notably influence the model’s predictions. The variable cannot be the target variable, as target variables are not used as features. There is no evidence of data leakage, and low importance alone does not automatically justify deleting a feature; it may still prove relevant after further investigation.
Which type of plot is commonly used to display feature importance rankings from Random Forest or XGBoost models?
Explanation: Bar plots are widely used to visualize feature importance, as they clearly show the relative importance of each feature. Pie charts do not effectively represent ranking or relative importance in this context. Scatter plots and heatmaps serve different purposes, such as showing relationships or correlations.
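A sorted horizontal bar plot is the usual presentation. Below is a minimal matplotlib sketch; the dataset and the feature_0 … feature_4 names are synthetic placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=2, random_state=0)
names = np.array([f"feature_{i}" for i in range(X.shape[1])])

model = RandomForestClassifier(random_state=0).fit(X, y)
order = np.argsort(model.feature_importances_)  # sort so the largest bar sits on top

plt.barh(names[order], model.feature_importances_[order])
plt.xlabel("Importance score")
plt.title("Random Forest feature importance")
plt.tight_layout()
plt.show()
```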
Which issue is commonly associated with default feature importance measures in Random Forest and XGBoost?
Explanation: Default importance measures can be biased toward features with many unique values or categories, because such features offer more candidate split points. Categorical features are not ignored by default. These measures are not limited to time series data and, importantly, they do not guarantee unbiased rankings.
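The sketch below illustrates this bias under an artificial setup: a pure-noise high-cardinality column tends to receive a surprisingly large share of the impurity-based (MDI) importance, while permutation importance on held-out data keeps it near zero. Exact numbers will vary with the seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
binary_signal = rng.integers(0, 2, n)   # low cardinality, truly predictive
high_card_noise = rng.normal(size=n)    # high cardinality, pure noise
y = (binary_signal ^ (rng.random(n) < 0.1)).astype(int)  # signal with 10% label noise

X = np.column_stack([binary_signal, high_card_noise])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# MDI inflates the noise column, which the trees use to overfit the label noise
print("MDI        :", clf.feature_importances_)
# Permutation importance on held-out data exposes the noise column as useless
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation:", perm.importances_mean)
```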
How can feature importance scores assist during feature selection for a classification model?
Explanation: Feature importance helps rank the most useful predictors, supporting decisions about which features to keep or remove. Eliminating all numerical features is not typical or effective. Ranking models based on training speed is unrelated to feature importance. Retaining all features without evaluation is the opposite of purposeful feature selection.
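One common way to turn importance scores into a selection step is scikit-learn's SelectFromModel, sketched below. The "median" threshold and the 20-feature dataset are illustrative choices, not fixed recipes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

selector = SelectFromModel(
    RandomForestClassifier(random_state=0),
    threshold="median",  # keep only features scoring above the median importance
).fit(X, y)

X_reduced = selector.transform(X)
print("kept", X_reduced.shape[1], "of", X.shape[1], "features")
```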
Which statement accurately describes a similarity in how feature importance is used in Random Forest and XGBoost models?
Explanation: Feature importance in both algorithms serves to help users understand which features influence predictions. Neither model restricts consideration to the first feature only. Both compute importance scores automatically during model training. Neither uses linear regression weights; their importance calculations are based on how features are used in tree splits.
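As a closing sketch of that similarity, both libraries populate a feature_importances_ attribute as a side effect of fitting; the dataset below is synthetic, and the hyperparameters are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X, y)
xgbc = XGBClassifier(n_estimators=50, verbosity=0).fit(X, y)

# Both estimators expose importance scores automatically after training,
# derived from how features are used in their tree splits
print("Random Forest:", rf.feature_importances_)
print("XGBoost      :", xgbc.feature_importances_)
```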