Explore the fundamentals of decision trees and random forest algorithms with this quiz, designed to assess your understanding of splitting criteria, overfitting, feature importance, and ensemble strategies. Strengthen your grasp of key machine learning techniques used for classification and regression tasks.
Which metric is commonly used by decision trees to determine the best attribute for splitting at each node when performing classification?
Explanation: Gini impurity is frequently used in decision trees for classification to measure how often a randomly chosen element would be incorrectly labeled. Variance is typically used for regression tasks, not classification. R-squared score evaluates model performance but does not guide splits during tree construction. Euclidean distance is a metric for measuring distance in clustering or nearest neighbor algorithms, not decision tree splitting.
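To make the criterion concrete, here is a minimal Python sketch (not part of the quiz itself) that computes the Gini impurity of a candidate binary split; the function names and example labels are purely illustrative.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return 1.0 - np.sum(probabilities ** 2)

def weighted_split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of a candidate binary split."""
    n_left, n_right = len(left_labels), len(right_labels)
    n_total = n_left + n_right
    return (n_left / n_total) * gini_impurity(left_labels) \
         + (n_right / n_total) * gini_impurity(right_labels)

# A perfectly pure split scores 0.0; a mixed split scores higher,
# so the tree prefers the split with the lowest weighted impurity.
print(weighted_split_gini(np.array([0, 0, 0]), np.array([1, 1, 1])))  # 0.0
print(weighted_split_gini(np.array([0, 1, 0]), np.array([1, 0, 1])))  # ~0.444
```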
What is an effective method random forests use to reduce overfitting, compared to a single decision tree?
Explanation: Random forests reduce overfitting by aggregating the predictions of many trees, which averages out individual errors and yields lower-variance, more robust results. Increasing the tree depth may actually lead to overfitting rather than prevent it. Using linear regression at each node is not typical in decision trees; instead, splits are made based on criteria like Gini impurity. Excluding pruning can make overfitting worse, not better.
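As a rough illustration of this variance-reduction effect, the sketch below compares a single unpruned tree with a forest using scikit-learn; the synthetic dataset and parameter values are arbitrary choices made only to demonstrate the idea.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A single unpruned tree tends to fit noise in the training data.
single_tree = DecisionTreeClassifier(random_state=0)

# Averaging many bootstrapped trees lowers variance without adding much bias.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```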
When constructing each tree in a random forest, how are features typically selected at each split?
Explanation: In random forests, each tree considers a random subset of features for each split, which introduces diversity among trees and improves ensemble performance. If all features were always used, the trees would be highly correlated, reducing the effectiveness of the ensemble. Selecting only the top correlated features would not add randomness, and choosing features alphabetically is not a meaningful or effective approach.
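In scikit-learn, for instance, this per-split feature subsampling is controlled by the max_features parameter; the minimal sketch below uses the classification default of the square root of the feature count, with the dataset chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_features limits how many features each split may consider.
# "sqrt" draws a fresh random subset of features at every split,
# which decorrelates the trees and strengthens the ensemble.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
```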
Why do random forests provide a reliable estimate of feature importance compared to individual decision trees?
Explanation: Random forests estimate feature importance by averaging scores over all trees in the forest, reducing the bias an individual tree may have toward certain features. They do not use support vector machines as part of their construction process. Redundant features are not explicitly ignored but may have lower importance. Consistently using the same root feature would prevent the assessment of other features' importance.
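One way to inspect these averaged importance scores in practice is scikit-learn's feature_importances_ attribute, as in the illustrative sketch below; the dataset and settings are arbitrary and only meant to show the mechanism.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ is the impurity-based importance averaged over all trees.
ranking = np.argsort(forest.feature_importances_)[::-1]
for idx in ranking[:5]:
    print(f"{data.feature_names[idx]}: {forest.feature_importances_[idx]:.3f}")
```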
In the context of random forests, what does 'out-of-bag' (OOB) error refer to?
Explanation: Out-of-bag error is an internal validation method where each tree in a random forest is evaluated on the samples left out of its bootstrapped training set, providing an unbiased estimate of model performance. Error after pruning refers to tree simplification, but OOB error does not involve pruning. Measuring error on the training set or on the in-sample (bootstrapped) data would be optimistically biased and is not what OOB error describes.
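In scikit-learn, OOB evaluation can be enabled with oob_score=True, as in the short sketch below; the synthetic data and parameter values are arbitrary and serve only to illustrate the concept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# With oob_score=True, each tree is scored on the samples left out of its
# bootstrap draw; aggregating these scores approximates held-out accuracy.
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)  # OOB error = 1 - oob_score_
```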