Decision Trees and Random Forests Quiz

Explore the fundamentals of decision trees and random forest algorithms with this insightful quiz, designed to assess your understanding of splitting criteria, overfitting, feature importance, and ensemble strategies. Strengthen your grasp on key machine learning techniques used for classification and regression tasks.

  1. Splitting Criteria in Decision Trees

    Which metric is commonly used by decision trees to determine the best attribute for splitting at each node when performing classification?

    1. Euclidean distance
    2. R-squared score
    3. Variance
    4. Gini impurity

    Explanation: Gini impurity is frequently used in decision trees for classification to measure how often a randomly chosen element would be incorrectly labeled. Variance is typically used for regression tasks, not classification. R-squared score evaluates model performance but does not guide splits during tree construction. Euclidean distance is a distance measure used in clustering or nearest-neighbor algorithms, not in decision tree splitting.
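
    To make the metric concrete, here is a minimal Python sketch of Gini impurity and of how a candidate split is scored; the function name and toy label arrays are invented for this example.

    ```python
    import numpy as np

    def gini_impurity(labels):
        """Gini impurity of a node: the probability that two samples drawn
        at random (with replacement) from the node belong to different classes."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    # A pure node has impurity 0; a 50/50 binary node has the maximum, 0.5.
    print(gini_impurity(np.array([1, 1, 1, 1])))  # 0.0
    print(gini_impurity(np.array([0, 1, 0, 1])))  # 0.5

    # A candidate split is scored by the weighted impurity of its children;
    # the attribute whose split lowers this the most is chosen at the node.
    left, right = np.array([0, 0, 0, 1]), np.array([1, 1, 1, 1])
    n = len(left) + len(right)
    print(len(left) / n * gini_impurity(left)
          + len(right) / n * gini_impurity(right))  # 0.1875
    ```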

  2. Handling Overfitting in Random Forests

    Which technique helps random forests reduce overfitting compared to a single decision tree?

    1. Aggregating predictions from multiple trees
    2. Increasing the tree depth
    3. Excluding pruning during training
    4. Using linear regression at each node

    Explanation: Random forests reduce overfitting by aggregating the predictions of many trees, averaging their results to provide more robust outcomes and less variance. Increasing the tree depth may actually lead to overfitting rather than prevent it. Using linear regression at each node is not typical in decision trees; instead, splits are made based on criteria like Gini impurity. Excluding pruning can make overfitting worse, not better.
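
    A minimal sketch of that effect, assuming scikit-learn is available: a single unpruned decision tree is compared with a forest that aggregates many bootstrapped trees. The synthetic dataset and parameter values are arbitrary choices for illustration.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic classification data (sizes chosen arbitrarily).
    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=0)

    # A single unpruned tree tends to fit noise in the training data (high variance).
    tree = DecisionTreeClassifier(random_state=0)
    # A forest grows 200 trees on bootstrap samples and aggregates their
    # predictions, averaging away much of that variance.
    forest = RandomForestClassifier(n_estimators=200, random_state=0)

    print("tree   CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
    print("forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
    ```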

  3. Feature Selection in Random Forests

    When constructing each tree in a random forest, how are features typically selected at each split?

    1. Features are selected based on alphabetical order
    2. Only the top correlated features are used
    3. A random subset of features is considered
    4. All features are used for every split

    Explanation: In random forests, each tree considers a random subset of features for each split, which introduces diversity among trees and improves ensemble performance. If all features were always used, the trees would be highly correlated, reducing the effectiveness of the ensemble. Selecting only the top correlated features would not add randomness, and choosing features alphabetically is not a meaningful or effective approach.
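
    As an illustrative sketch (assuming scikit-learn; the sizes and variable names are invented), the forest exposes this behavior through its `max_features` setting, and the underlying mechanism is simply a fresh random draw of candidate feature indices at every split.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # max_features controls how many features each split may consider;
    # "sqrt" means roughly sqrt(n_features) candidates per split.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0)

    # The core idea, stripped down: every split draws a fresh random subset of
    # feature indices, and only those features are evaluated as split candidates.
    rng = np.random.default_rng(0)
    n_features, subset_size = 20, 4
    candidate_features = rng.choice(n_features, size=subset_size, replace=False)
    print(candidate_features)  # a different subset is drawn at every split
    ```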

  4. Feature Importance Interpretation

    Why do random forests provide a reliable estimate of feature importance compared to individual decision trees?

    1. They average importance scores over many trees
    2. They ignore redundant features
    3. They always use the same feature for the root node
    4. They use support vector machines internally

    Explanation: Random forests estimate feature importance by averaging scores over all trees in the forest, reducing the bias an individual tree may have toward certain features. They do not use support vector machines as part of their construction process. Redundant features are not explicitly ignored but may simply receive lower importance. Always using the same root feature would bias the estimates toward that feature rather than giving every feature a fair assessment.
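
    A short sketch of this averaging, assuming scikit-learn (dataset and parameters are arbitrary): the forest-level importances can be compared with the per-tree scores they are aggregated from.

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                               random_state=0)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Importance of each feature, aggregated over all trees in the forest.
    print(forest.feature_importances_)

    # The same scores per tree vary a lot; averaging smooths out the preference
    # any single tree has for the features it happened to split on.
    per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
    print(per_tree.std(axis=0))   # spread across individual trees
    print(per_tree.mean(axis=0))  # close to the forest-level importances above
    ```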

  5. Out-of-Bag Error Concept

    In the context of random forests, what does 'out-of-bag' (OOB) error refer to?

    1. The error calculated after pruning the trees
    2. The prediction error estimated from data not used for building a particular tree
    3. The error measured using only the training set
    4. The in-sample error from bootstrapped data

    Explanation: Out-of-bag error is an internal validation method where each tree in a random forest is evaluated on samples not seen during its bootstrapped training, providing an unbiased estimate of model performance. Error after pruning refers to tree simplification, but OOB error does not involve pruning. Measuring error using only the training set or in-sample (bootstrapped) data may lead to optimistic bias and does not represent the OOB error concept.
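
    A minimal sketch, assuming scikit-learn: setting `oob_score=True` makes the forest score every training sample using only the trees that did not see it in their bootstrap sample, giving an internal estimate that can be checked against a held-out test set (dataset and parameters are arbitrary).

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each sample is predicted only by the trees whose bootstrap sample
    # excluded it ("out of bag"), so no separate validation set is needed.
    forest = RandomForestClassifier(n_estimators=300, oob_score=True,
                                    random_state=0).fit(X_train, y_train)

    print("OOB accuracy:     ", forest.oob_score_)             # internal estimate
    print("held-out accuracy:", forest.score(X_test, y_test))  # external check
    # The OOB *error* is simply 1 - forest.oob_score_.
    ```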