Stratified Sampling and Data Splits Essentials Quiz

Explore foundational concepts of stratified sampling and data splitting in statistics and machine learning. This quiz covers key definitions, practical examples, and best practices to reinforce your understanding of effective sampling and robust data splits.

  1. Understanding Stratified Sampling

    Which of the following best explains stratified sampling when collecting survey data?

    1. Dividing a population into groups based on shared characteristics and sampling from each group proportionally.
    2. Randomly selecting individuals from the entire population without grouping.
    3. Choosing the first available individuals until the sample size is reached.
    4. Selecting every nth person from a long list of names.

    Explanation: Stratified sampling involves separating a population into subgroups, or strata, based on shared attributes, then sampling proportionally from each stratum. This ensures that significant groups are represented in the sample. Simple random sampling ignores group differences, making the second option incorrect. Convenience sampling (option three) is less systematic and could bias results. Systematic sampling (option four) uses a set interval rather than grouping by characteristics.
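
A minimal sketch of proportional stratified sampling with pandas. The toy population DataFrame, the "region" column, and the 20% sampling fraction are illustrative assumptions, not part of the quiz.

```python
import pandas as pd

# Toy population: three regions of unequal size (60% / 30% / 10%).
population = pd.DataFrame({
    "region": ["north"] * 600 + ["south"] * 300 + ["west"] * 100,
    "income": range(1000),
})

# Sample 20% from each stratum so the sample mirrors the population's region mix.
sample = (
    population.groupby("region", group_keys=False)
    .sample(frac=0.2, random_state=0)
)
print(sample["region"].value_counts(normalize=True))  # ~0.60 / 0.30 / 0.10
```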

  2. Purpose of Stratified Sampling

    Why is stratified sampling especially useful when data categories are imbalanced, such as having 90% cats and 10% dogs in a pet dataset?

    1. It ensures that only rare categories are included.
    2. It guarantees that both cats and dogs will be represented proportionally in the sample.
    3. It increases the size of the majority class only.
    4. It automatically corrects all errors in the dataset.

    Explanation: Stratified sampling ensures that all key categories, such as both cats and dogs, are represented according to their proportion in the population. This prevents the sample from being skewed toward the dominant group. The method does not focus solely on rare categories (option one), does not increase only the majority class (option three), and does not automatically correct errors in the dataset (option four).
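
A small pure-Python sketch for the 90% cats / 10% dogs scenario; the toy records and the 20% sample size are assumptions for demonstration.

```python
import random

random.seed(0)
# 90 cat records and 10 dog records, each tagged with an id.
pets = [("cat", i) for i in range(90)] + [("dog", i) for i in range(10)]

# Group records by class, then draw 20% from each class separately.
sample = []
for label in ("cat", "dog"):
    group = [p for p in pets if p[0] == label]
    sample += random.sample(group, k=int(0.2 * len(group)))

counts = {label: sum(1 for species, _ in sample if species == label)
          for label in ("cat", "dog")}
print(counts)  # {'cat': 18, 'dog': 2} -- both classes kept in their 90/10 proportion
```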

  3. Data Splits in Machine Learning

    When splitting data into training and test sets, what is the main purpose of the test set?

    1. To remove duplicate entries from the dataset.
    2. To improve the model’s training accuracy.
    3. To evaluate how well the model performs on unseen data.
    4. To balance the number of examples in each class.

    Explanation: The test set is set aside during training to assess the model's ability to generalize to new, unseen examples. Removing duplicates is a data-cleaning step unrelated to splitting (option one). Training accuracy reflects only the data the model learned from (option two). Balancing class numbers is a sampling concern; the test set’s purpose is evaluation, not balancing (option four).
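
A minimal scikit-learn sketch of why the test set exists: it is held out during fitting and used only to estimate performance on unseen data. The iris dataset and logistic regression model are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 25% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))  # estimate of generalization
```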

  4. Stratification in Splitting Data

    If you are working with a dataset of students where only 5% are left-handed, how can stratified splitting help when creating train and test sets?

    1. It duplicates left-handed students to balance the classes.
    2. It ensures both sets will contain about 5% left-handed students.
    3. It randomly assigns all students without regard to handedness.
    4. It removes all left-handed students from the dataset.

    Explanation: Stratified splitting maintains the proportion of categories, like left-handed students, in both training and test sets, avoiding unintentional exclusion or underrepresentation. Duplicating data artificially (option one) is not the aim of stratification. Random splitting without regard to handedness (option three) may lead to disproportionate representation, and removing left-handed students (option four) discards valuable data.
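
A sketch of a stratified split with scikit-learn, assuming a synthetic label array in which 5% of students are left-handed; passing stratify=y keeps roughly that share in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 50 + [0] * 950)    # 1 = left-handed (5%), 0 = right-handed
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y preserves the 5% minority share in both train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean(), y_test.mean())  # both approximately 0.05
```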

  5. Simple vs. Stratified Random Sampling

    How is stratified random sampling different from simple random sampling using an example of city residents grouped by age?

    1. Simple random sampling always picks only the youngest residents.
    2. Simple random sampling is more likely to balance all age groups automatically.
    3. Stratified sampling first divides residents by age groups, then samples randomly within each group.
    4. Stratified sampling samples without forming any groups.

    Explanation: Stratified sampling ensures each age group is represented by first dividing residents into age-defined strata and then sampling randomly within each. Simple random sampling picks individuals at random rather than targeting the youngest, so option one is incorrect. It also does not guarantee balanced age groups, making option two incorrect. Stratified sampling always involves forming groups, so option four is wrong.
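
A toy pandas comparison, assuming an "age_group" column with three groups of different sizes: a plain random sample can drift from the population mix, while sampling within each group reproduces it.

```python
import pandas as pd

residents = pd.DataFrame({
    "age_group": ["18-30"] * 500 + ["31-60"] * 350 + ["61+"] * 150
})

# Simple random sample: proportions can drift from 0.50 / 0.35 / 0.15.
simple = residents.sample(n=100, random_state=1)

# Stratified sample: draw 10% within each age group, keeping the mix exact.
stratified = (
    residents.groupby("age_group", group_keys=False)
    .sample(frac=0.1, random_state=1)
)

print(simple["age_group"].value_counts(normalize=True))
print(stratified["age_group"].value_counts(normalize=True))  # 0.50 / 0.35 / 0.15
```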

  6. Application of Stratified Sampling

    In a medical study with equal numbers of men and women but different age ranges, which approach ensures both gender and age distribution are reflected in the sample?

    1. Divide by gender and age, then sample within each subgroup proportionally.
    2. Sample randomly from the entire participant pool.
    3. Select participants based only on gender.
    4. Organize participants alphabetically and select the first 50.

    Explanation: Stratifying by both gender and age guarantees that all combinations are included according to their representation, capturing the full structure of the population. Random sampling (option two) may miss minority subgroups. Sampling by gender alone (option three) ignores age variation. Selecting alphabetically (option four) has no relationship to meaningful attributes.
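
A sketch of stratifying on two attributes at once by treating each gender and age-band combination as its own stratum; the column names and group sizes are illustrative assumptions.

```python
import pandas as pd

participants = pd.DataFrame({
    "gender": ["M"] * 50 + ["F"] * 50,
    "age_band": (["18-40"] * 30 + ["41-65"] * 20) * 2,
})

# Each (gender, age_band) combination is sampled separately and proportionally.
sample = (
    participants.groupby(["gender", "age_band"], group_keys=False)
    .sample(frac=0.2, random_state=0)
)
print(sample.value_counts(["gender", "age_band"]))  # every combination is represented
```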

  7. Risks of Not Using Stratified Splits

    What could happen if you randomly split an imbalanced fraud detection dataset without stratification?

    1. Random splitting always maintains exact class distributions.
    2. The rare fraud class might be absent from the test set, causing misleading model evaluation.
    3. Fraud cases will automatically be oversampled.
    4. The test set will always include every possible class.

    Explanation: Without stratification, there's a risk that rare classes, like fraud, are missing from one of the splits, making it impossible to evaluate the model on those cases. Random splitting does not maintain exact class distributions (option one), oversampling is a separate technique rather than a side effect of splitting (option three), and nothing guarantees that every class appears in the test set (option four).
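
A sketch of the risk, assuming a toy dataset with only 5 fraud cases in 1000 transactions: an unstratified random split may leave no fraud in the test set, while stratify=y keeps a proportional share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 5 + [0] * 995)     # 1 = fraud (0.5% of transactions)
X = np.arange(len(y)).reshape(-1, 1)

# Plain random splits: the fraud count in the test set varies and can be zero.
for seed in range(5):
    _, _, _, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    print(f"seed {seed}: fraud cases in test set = {y_test.sum()}")

# Stratified split: the test set always receives its proportional fraud share.
_, _, _, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
print("with stratification:", y_test.sum())
```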

  8. Holdout Method

    Which statement accurately describes the 'holdout' strategy for data splitting?

    1. The dataset is divided into a training set and a separate test set, each used only for their own purpose.
    2. It requires prior stratification every time.
    3. Training and test sets use the same data points to maximize model performance.
    4. The data is split after the model is trained.

    Explanation: The holdout method assigns part of the data exclusively for training and another part exclusively for final evaluation. Stratification is helpful but not required for the holdout method (option two). Using the same data points in both sets (option three) biases the evaluation. Splitting must occur before model training, not after (option four).
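
A minimal sketch of a manual holdout split with NumPy: shuffle once, carve off a test portion before any training, and keep the two sets disjoint. The dataset size and 80/20 ratio are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = rng.permutation(n)          # shuffle all row indices once

test_size = int(0.2 * n)
test_idx, train_idx = indices[:test_size], indices[test_size:]

assert set(test_idx).isdisjoint(train_idx)  # no data point appears in both sets
print(len(train_idx), len(test_idx))        # 800 200
```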

  9. When Stratified Sampling is Less Necessary

    In which situation is stratified sampling usually less critical for splitting data?

    1. When there are rare and common categories.
    2. When every class or category has nearly the same number of examples.
    3. When one class makes up 95% of the data.
    4. When evaluating models on skewed datasets.

    Explanation: If the data is already balanced across classes, simple random sampling is usually sufficient because all groups are equally represented. When classes are imbalanced (options one and three), stratification is important to preserve the class distribution. Evaluating models on skewed datasets (option four) also benefits from stratification so the test set reflects the true class mix.
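
A quick check, assuming a perfectly balanced two-class toy dataset: an ordinary random split already lands close to 50/50 in both sets, so stratification adds little here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 500 + [1] * 500)   # perfectly balanced classes
X = np.arange(len(y)).reshape(-1, 1)

_, _, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(y_train.mean(), y_test.mean())  # both close to 0.5 even without stratify
```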

  10. Limitations of Stratified Sampling

    Which limitation sometimes affects stratified sampling when dealing with many small subgroups?

    1. Some strata might have too few members, making it hard to sample representatively.
    2. It eliminates the need for any randomization.
    3. It guarantees perfect predictions on new data.
    4. Stratified sampling always requires larger samples than other methods.

    Explanation: With many small subgroups, there may not be enough individuals to sample from each, which can lead to unreliable estimates for those strata. Random selection within strata is still necessary (option two), stratification does not guarantee prediction accuracy (option three), and it does not always require larger samples than other methods (option four).
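
A sketch of the small-strata problem, assuming toy labels where one stratum has a single member: a stratified split cannot place that stratum in both train and test, so scikit-learn refuses to split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array(["common"] * 99 + ["rare"])  # the "rare" stratum has only one member
X = np.arange(len(y)).reshape(-1, 1)

try:
    train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
except ValueError as err:
    # scikit-learn rejects the split because the smallest class is too small to stratify.
    print("stratified split failed:", err)
```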