Fundamentals of Model Evaluation: Splits, Metrics & Overfitting — Questions & Answers

Test your understanding of essential concepts in model evaluation, including train/validation/test splits, cross-validation, choosing metrics such as accuracy, precision, recall, F1, and ROC-AUC, as well as methods to prevent overfitting and data leakage. This quiz helps you assess your knowledge of best practices for building robust machine learning models.

This quiz contains 16 questions. Below is a complete reference of all questions, answer choices, and correct answers. You can use this section to review after taking the interactive quiz above.

  1. Question 1: Train, Validation, and Test Splits

    Which dataset split should be used to adjust model hyperparameters before evaluating final performance?

    • Test set
    • Validation set
    • Training set
    • Production set

    Correct answer: Validation set

    Explanation: The validation set is used for tuning model hyperparameters and making model choices without peeking at the test set. The test set is reserved for final evaluation, so using it for tuning introduces bias. The training set is for learning model parameters, not for evaluation. A 'production set' is not a standard dataset split in model evaluation.
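As an illustration, a three-way split can be sketched in plain Python. The 70/15/15 fractions and the helper name `three_way_split` are illustrative choices, not something prescribed by the quiz:

```python
import random

def three_way_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle and split data into train/validation/test subsets."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]                # reserved for final evaluation only
    val = items[n_test:n_test + n_val]   # used to tune hyperparameters
    train = items[n_test + n_val:]       # used to fit model parameters
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

In practice the same effect is often achieved by calling a library splitter twice: once to carve off the test set, then again on the remainder to separate train from validation.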

  2. Question 2: Purpose of the Test Set

    Why is it important to keep the test set completely separate from model training and selection?

    • To avoid high model complexity
    • To speed up the training process
    • To provide an unbiased estimate of how the model performs on new, unseen data
    • Because the test set contains only training examples

    Correct answer: To provide an unbiased estimate of how the model performs on new, unseen data

    Explanation: The test set is meant to simulate real-world data the model hasn't seen before, giving an honest measure of generalization. Including it in training would bias results and overestimate performance. The test set does not contain training examples, and it's unrelated to the speed of training or directly controlling model complexity.

  3. Question 3: Cross-Validation

    In k-fold cross-validation with k=5, how many times is the model trained and evaluated?

    • 5
    • 1
    • 10
    • 2

    Correct answer: 5

Explanation: With k=5, the data is split into 5 folds and the model is trained and evaluated five times, each time using a different fold as the validation set. Training only once would not rotate through the folds. Ten rounds would correspond to k=10, not k=5. Two is not how k-fold cross-validation works for k=5.
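The fold rotation described above can be sketched in plain Python. The helper name `k_fold_indices` is illustrative; libraries such as scikit-learn provide equivalent utilities:

```python
def k_fold_indices(n, k=5):
    """Yield (train_idx, val_idx) pairs; each fold serves as validation exactly once."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

folds = list(k_fold_indices(10, k=5))
print(len(folds))  # 5 — the model is trained and evaluated 5 times
```

Each index appears in a validation fold exactly once, which is what "rotating the folds" means.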

  4. Question 4: Overfitting Basics

    Which situation best describes overfitting in a machine learning model?

    • The model performs well on training data but poorly on validation data
    • The model fails to learn any pattern from the training data
    • The model produces the same prediction for every input
    • The model performs equally well on both training and test sets

    Correct answer: The model performs well on training data but poorly on validation data

    Explanation: Overfitting occurs when a model memorizes noise or details specific to the training set and loses the ability to generalize, resulting in poor performance on unseen validation or test data. Not learning any pattern refers to underfitting. Performing equally well on both sets indicates good generalization. Outputting the same prediction always indicates underfitting, not overfitting.

  5. Question 5: Accuracy as a Metric

    In a dataset with 95% negatives and 5% positives, is accuracy a suitable metric for evaluating a classifier?

    • No, because accuracy ignores both false positives and false negatives
    • Yes, because accuracy penalizes all errors equally even in imbalanced data
    • Yes, because accuracy always reflects true performance
    • No, because accuracy can be misleading with imbalanced data

    Correct answer: No, because accuracy can be misleading with imbalanced data

Explanation: A classifier that always predicts the majority class would score 95% accuracy on this dataset while never identifying a single positive, so high accuracy can hide poor minority-class performance. Accuracy does count every kind of error, but it doesn't weight them meaningfully when data is imbalanced. The other options are incorrect because they misunderstand how accuracy behaves on imbalanced data.
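A quick sketch of why accuracy misleads here, using the 95/5 split from the question (pure Python, no libraries assumed):

```python
# 95 negatives, 5 positives; a degenerate classifier that always predicts "negative"
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)
print(accuracy)  # 0.95 — looks strong
print(recall)    # 0.0 — yet every positive case is missed
```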

  6. Question 6: Precision vs. Recall

    If it is more important to avoid false positives than false negatives, which metric should you focus on?

    • F1-score
    • Recall
    • Precision
    • Support

    Correct answer: Precision

Explanation: Precision measures the proportion of positive predictions that are actually correct, so it is the metric to watch when false positives are costly. Recall would be prioritized when missing actual positives is costly. F1-score balances both, but isn't specific to avoiding false positives. Support is simply the number of true instances of each class, not a performance metric.
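The two definitions can be made concrete with a small sketch; the confusion-matrix counts below are made up purely for illustration:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 8 true positives, 2 false positives, 4 false negatives
p, r = precision_recall(tp=8, fp=2, fn=4)
print(p)  # 0.8 — share of positive predictions that were correct
print(r)  # ~0.667 — share of actual positives that were caught
```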

  7. Question 7: Understanding Recall

    Which scenario is best for prioritizing recall over precision?

    • Classifying email spam when false positives are highly undesirable
    • Recommending movies when showing irrelevant titles is strongly penalized
    • Detecting rare diseases where missing a positive case is worse than a false alarm
    • Predicting weather for outdoor events where false positives are harmless

    Correct answer: Detecting rare diseases where missing a positive case is worse than a false alarm

    Explanation: In medical diagnosis, it's often more important to catch all possible cases (high recall), even at the risk of some false positives. In situations like spam classification and recommendation engines, avoiding false positives might be more important. For weather prediction, the cost of false positives and negatives can vary, but it's not typically as critical as in disease detection.

  8. Question 8: F1-Score Meaning

    What does the F1-score represent in classification metrics?

    • The harmonic mean of precision and recall
    • The geometric mean of sensitivity and specificity
    • The sum of true positives and true negatives
    • The arithmetic mean of accuracy and recall

    Correct answer: The harmonic mean of precision and recall

    Explanation: The F1-score is the harmonic mean of precision and recall, offering a balanced metric when both are important. It's not based on the geometric mean of sensitivity and specificity, nor an arithmetic mean. Summing true positives and negatives relates to accuracy, not the F1-score.
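A minimal sketch of the harmonic mean, showing how it punishes an imbalance between precision and recall far more than an arithmetic mean would (the helper name `f1` is illustrative):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(1.0, 1.0))  # 1.0
print(f1(0.9, 0.1))  # 0.18 — arithmetic mean would be 0.5
```

The second case is why F1 is preferred when both error types matter: a model cannot hide terrible recall behind excellent precision.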

  9. Question 9: ROC-AUC Score

    What does a ROC-AUC score of 1.0 indicate about a binary classifier?

    • The classifier performs no better than random guessing
    • The classifier perfectly distinguishes between the classes
    • The classifier always predicts the negative class
    • The classifier is overfitting

    Correct answer: The classifier perfectly distinguishes between the classes

Explanation: A ROC-AUC (Receiver Operating Characteristic – Area Under the Curve) score of 1.0 means the model ranks every positive above every negative, so some decision threshold separates the classes perfectly. Always predicting one class leads to poor discrimination and a low ROC-AUC. Random guessing gives a score of 0.5, not 1.0. Overfitting is not directly implied by the ROC-AUC score alone.
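ROC-AUC can equivalently be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A brute-force sketch of that interpretation, with illustrative scores:

```python
def roc_auc(scores_pos, scores_neg):
    """AUC = P(random positive scores higher than random negative); ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Perfectly separated scores -> every positive outranks every negative
print(roc_auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # 1.0
```

This O(n²) pairwise count is only for intuition; production code would use an optimized routine such as scikit-learn's `roc_auc_score`.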

  10. Question 10: Preventing Data Leakage

    Which action might cause data leakage during model development?

    • Splitting the data into training and validation sets before any preprocessing
    • Applying k-fold cross-validation correctly on the dataset
    • Using only features available at prediction time
    • Using future information from test data when training your model

    Correct answer: Using future information from test data when training your model

Explanation: Incorporating information from test data into training introduces data leakage, which leads to an overestimation of model performance. Splitting before preprocessing helps prevent leakage; it is not a cause of it. Correctly applied cross-validation doesn't create leakage. Using only features that are available at prediction time is also safe.
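A tiny sketch of one common leakage pattern: computing a preprocessing statistic on the full dataset lets a test row influence training. The values and names here are illustrative:

```python
def fit_mean(values):
    """Stand-in for any preprocessing statistic (e.g. a scaler's mean)."""
    return sum(values) / len(values)

full = [1.0, 2.0, 3.0, 100.0]   # the last row will be our "test" example
train, test = full[:3], full[3:]

leaky_mean = fit_mean(full)     # test value influences the statistic -> leakage
safe_mean = fit_mean(train)     # computed on training data only

print(leaky_mean, safe_mean)    # 26.5 2.0
```

The safe workflow fits preprocessing on the training split only, then applies the fitted transform to validation and test data.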

  11. Question 11: Random Splitting

    Why is randomly splitting data into train and test sets important?

    • To always achieve higher accuracy
    • To ensure the sets represent the overall data and avoid bias
    • To make training easier for the algorithm
    • To minimize overfitting by reducing model complexity

    Correct answer: To ensure the sets represent the overall data and avoid bias

    Explanation: Random splitting helps both sets contain data representative of the overall distribution, which avoids selection bias. It does not affect the ease of training the algorithm or directly reduce complexity. While it helps with fair evaluation, it does not guarantee higher accuracy.

  12. Question 12: Stratified Sampling

    In classification tasks with imbalanced classes, what is the main advantage of using stratified sampling for train/test splits?

    • It makes the dataset size smaller
    • It preserves the original proportion of classes in both sets
    • It eliminates all outliers
    • It guarantees higher model performance

    Correct answer: It preserves the original proportion of classes in both sets

    Explanation: Stratified sampling maintains the same class distribution in training and test sets, which is important for fair evaluation. It does not make datasets smaller, nor does it guarantee performance improvements. Outliers are not specifically dealt with by stratification.
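A sketch of stratified splitting in plain Python. The helper name `stratified_split` is illustrative; scikit-learn's `train_test_split` offers the same behavior via its `stratify` parameter:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split indices so each class keeps its original proportion in both sets."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    rng = random.Random(seed)
    for y, idx in by_class.items():
        rng.shuffle(idx)
        n_test = int(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

labels = [0] * 90 + [1] * 10  # 90/10 class imbalance
train, test = stratified_split(labels)
print(sum(labels[i] for i in test) / len(test))  # 0.1 — minority share preserved
```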

  13. Question 13: Hyperparameter Tuning and Data Leakage

    How can tuning hyperparameters on the test set lead to data leakage?

    • It allows the model to indirectly 'see' the test data, biasing final evaluation
    • It improves interpretability of results
    • It reduces the need for cross-validation
    • It always increases the size of the training set

    Correct answer: It allows the model to indirectly 'see' the test data, biasing final evaluation

Explanation: Using the test set for hyperparameter tuning means the model's configuration is influenced by performance on that supposedly unseen data, producing overoptimistic final results. It does not increase the training set size, replace cross-validation, or improve interpretability.

  14. Question 14: Early Stopping to Prevent Overfitting

    What is the primary purpose of early stopping in model training?

    • To shuffle the input data during each epoch
    • To prevent the model from seeing the test set
    • To halt training when the model's performance on the validation set stops improving
    • To select the largest possible model

    Correct answer: To halt training when the model's performance on the validation set stops improving

    Explanation: Early stopping ends training when improvements on validation performance plateau, helping prevent overfitting. It does not prevent the model from seeing the test set, select model size, or shuffle data (shuffling is usually a separate concern).
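The patience-based variant of early stopping can be sketched as follows; the loss sequence and patience setting are made-up illustrations:

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return the epoch at which training halts: when validation loss
    hasn't improved for `patience` consecutive epochs."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0   # improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch        # halt before overfitting sets in
    return len(val_losses) - 1      # ran out of epochs without triggering

# Validation loss improves, then plateaus and worsens (simulated)
print(train_with_early_stopping([0.9, 0.7, 0.6, 0.61, 0.62, 0.65]))  # 4
```

Real training loops typically also restore the weights from the best epoch, not just stop.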

  15. Question 15: Regularization Techniques

    How does regularization help prevent overfitting in machine learning models?

    • It splits the dataset into more folds
    • It forces the model to use accuracy as the only metric
    • It increases the size of test data
    • It adds a penalty for complexity to the loss function, discouraging overly complex models

    Correct answer: It adds a penalty for complexity to the loss function, discouraging overly complex models

    Explanation: Regularization penalizes model complexity, encouraging simpler models that generalize better. Increasing test size, using only accuracy, or splitting into more folds are unrelated to regularization's effect on overfitting.
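As a sketch, here is an L2 (ridge) penalty added to a squared-error loss; the weights and the `lam` strength are illustrative values:

```python
def ridge_loss(y_true, y_pred, weights, lam=0.1):
    """Mean squared error plus an L2 penalty on the weights (ridge regularization)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    penalty = lam * sum(w ** 2 for w in weights)  # complexity penalty
    return mse + penalty

# Identical predictions, but larger weights incur a larger regularized loss
print(ridge_loss([1, 2], [1, 2], [0.5], lam=0.1))  # 0.025
print(ridge_loss([1, 2], [1, 2], [5.0], lam=0.1))  # 2.5
```

Because large weights now cost extra, the optimizer is steered toward simpler functions that tend to generalize better. An L1 penalty (sum of absolute weights) works the same way but also drives some weights to exactly zero.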

  16. Question 16: Metric Selection Scenario

    Which metric would you choose to evaluate a fraud detection model where catching all fraud cases is more important than minimizing false positives?

    • Recall
    • Precision
    • Accuracy
    • ROC-AUC

    Correct answer: Recall

Explanation: Recall is ideal when missing positive cases (fraud) is particularly costly, even if it means accepting more false positives. Precision would be important if minimizing false positives were the goal. Accuracy and ROC-AUC offer broader measures, but do not specifically focus on catching all positives.