Explore foundational concepts of stratified sampling and data splitting in statistics and machine learning. This quiz covers key definitions, practical examples, and best practices to reinforce understanding of effective sampling and robust data splitting.
Which of the following best explains stratified sampling when collecting survey data?
Explanation: Stratified sampling divides a population into subgroups, or strata, based on shared attributes, then samples proportionally from each stratum. This ensures that every important subgroup is represented in the sample. Simple random sampling ignores group differences, making the second option incorrect. Convenience sampling (option three) is less systematic and can bias results. Systematic sampling (option four) selects at a fixed interval rather than grouping by characteristics.
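The idea above can be sketched in plain Python. The helper below is illustrative (the function name, the regional strata, and the counts are all made up for this example): it partitions a population by an attribute, then samples the same fraction from each stratum.

```python
import random
from collections import defaultdict

def stratified_sample(population, key, fraction, seed=0):
    """Partition the population into strata by `key`, then draw
    the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical survey pool: 60 urban, 30 suburban, 10 rural respondents
respondents = ["urban"] * 60 + ["suburban"] * 30 + ["rural"] * 10
sample = stratified_sample(respondents, key=lambda r: r, fraction=0.2)
```

A 20% stratified sample of this pool yields 12 urban, 6 suburban, and 2 rural respondents, mirroring the population's 60/30/10 mix.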
Why is stratified sampling especially useful when data categories are imbalanced, such as having 90% cats and 10% dogs in a pet dataset?
Explanation: Stratified sampling ensures that all key categories, such as both cats and dogs, are represented according to their proportions in the population. This prevents the sample from being skewed toward the dominant group. The method does not correct labeling errors (option two), nor does it inflate the majority class (option three). It also does not focus solely on rare categories; it preserves the proportions of all groups (option four).
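To make the 90% cats / 10% dogs example concrete, here is a minimal sketch (the counts and the 20% sampling fraction are illustrative) showing that sampling each class separately carries the 90/10 ratio over into the sample:

```python
import random
from collections import Counter

labels = ["cat"] * 90 + ["dog"] * 10
rng = random.Random(42)

# Sample 20% from each class separately; the 90/10 ratio carries over
sample = []
for species, count in Counter(labels).items():
    members = [l for l in labels if l == species]
    sample.extend(rng.sample(members, round(count * 0.2)))
```

The resulting sample contains 18 cats and 2 dogs, the same 90/10 proportion as the full dataset.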
When splitting data into training and test sets, what is the main purpose of the test set?
Explanation: The test set is held out during training to assess the model's ability to generalize to new, unseen examples. Training accuracy reflects only the data the model learned from (option two). Removing duplicates is unrelated to splitting (option three). Balancing class counts is a sampling concern; the test set's purpose is evaluation, not balancing (option four).
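A toy sketch of that workflow, with a deliberately trivial "model" (the data and the majority-label rule are invented for illustration): the split happens first, and accuracy is measured only on the held-out portion.

```python
import random

rng = random.Random(0)
# Toy labeled data: (feature, label) pairs
data = [(x, x % 2) for x in range(100)]
rng.shuffle(data)

split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# "Training" here just memorizes the majority label; the held-out
# test set then estimates how that rule does on unseen examples
train_labels = [y for _, y in train]
majority = max(set(train_labels), key=train_labels.count)
test_accuracy = sum(y == majority for _, y in test) / len(test)
```

The key property is that `test` shares no examples with `train`, so `test_accuracy` estimates generalization rather than memorization.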
If you are working with a dataset of students where only 5% are left-handed, how can stratified splitting help when creating train and test sets?
Explanation: Stratified splitting maintains the proportion of categories, like left-handed students, in both training and test sets. This avoids unintentional exclusion or underrepresentation. Removing students (option two) loses valuable data, while random splitting (option three) may lead to disproportionate representation. Duplicating data artificially (option four) is not the aim of stratification.
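The left-handed example can be sketched with a small stratified-split helper (the function and the 5%/95% student data are illustrative; scikit-learn's `train_test_split(..., stratify=y)` offers the same behavior in practice):

```python
import random
from collections import defaultdict

def stratified_split(items, label, test_fraction, seed=0):
    """Split each label group separately so rare classes land in
    both the train and the test set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[label(item)].append(item)
    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# 5% left-handed students, as in the question
students = [("left", i) for i in range(5)] + [("right", i) for i in range(95)]
train, test = stratified_split(students, label=lambda s: s[0], test_fraction=0.2)
```

With a 20% test fraction, the 5 left-handed students split 4/1 between train and test, so neither set loses the rare group entirely.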
How is stratified random sampling different from simple random sampling using an example of city residents grouped by age?
Explanation: Stratified sampling ensures each age group is represented by separating the population into age-defined strata and sampling within each. Option two is incorrect: simple random sampling picks individuals at random regardless of age. Stratified sampling always involves forming groups, so option three is wrong. Simple random sampling does not guarantee balanced age groups, making option four incorrect.
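The contrast can be shown side by side (the age bands and counts are made up): a simple random draw may skip the small "60+" group entirely, while the stratified draw includes every band by construction.

```python
import random

ages = ["18-29"] * 70 + ["30-59"] * 25 + ["60+"] * 5
rng = random.Random(0)

# Simple random sampling: 10 picks from the whole pool; the small
# "60+" group can easily be missed entirely
simple = rng.sample(ages, 10)

# Stratified: sample ~10% inside each age stratum (at least one
# resident per stratum), so every group appears
stratified = []
for group in ("18-29", "30-59", "60+"):
    members = [a for a in ages if a == group]
    stratified.extend(rng.sample(members, max(1, round(len(members) * 0.1))))
```

Only the stratified sample carries a guarantee: it always contains all three age bands, regardless of the random seed.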
In a medical study with equal numbers of men and women but different age ranges, which approach ensures both gender and age distribution are reflected in the sample?
Explanation: Stratifying by both gender and age guarantees that all combinations are included according to their representation, capturing the full structure of the population. Random sampling (option two) may miss minority subgroups. Sampling by gender alone (option three) ignores age variation. Selecting alphabetically (option four) has no relationship to meaningful attributes.
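Stratifying on two attributes at once just means using the combination as the stratum key. A minimal sketch (the patient counts per gender/age cell are invented):

```python
import random
from collections import defaultdict

# Equal genders overall, but different age mixes within each gender
patients = ([("M", "young")] * 30 + [("M", "older")] * 20
            + [("F", "young")] * 10 + [("F", "older")] * 40)

rng = random.Random(0)
strata = defaultdict(list)
for p in patients:
    strata[p].append(p)          # stratify on the (gender, age) pair

# Draw 10% from every (gender, age) combination
sample = []
for group in strata.values():
    sample.extend(rng.sample(group, round(len(group) * 0.1)))
```

The 10% sample contains 3, 2, 1, and 4 patients from the four cells, matching each combination's share of the study population.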
What could happen if you randomly split an imbalanced fraud detection dataset without stratification?
Explanation: Without stratification, there's a risk that rare classes, like fraud, are missing from one or both splits, making it impossible to evaluate the model on those cases. Random splitting does not guarantee that every class appears in each split (option two), nor does it ensure balanced distributions (option three). Oversampling is a separate technique (option four), unrelated to how the data is split.
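A quick simulation makes the risk tangible (the 5-in-1000 fraud rate and the 100 trial seeds are illustrative): it counts how often a plain random 80/20 split produces a test set with no fraud cases at all.

```python
import random

transactions = ["fraud"] * 5 + ["legit"] * 995

# Count how often an unstratified random 80/20 split leaves the
# test set with no fraud cases at all
fraud_free_tests = 0
for seed in range(100):
    rng = random.Random(seed)
    shuffled = transactions[:]
    rng.shuffle(shuffled)
    test_set = shuffled[:200]
    if "fraud" not in test_set:
        fraud_free_tests += 1
```

With these numbers, a substantial fraction of seeds yields a fraud-free test set, meaning the model's fraud performance could not be evaluated at all on those splits. A stratified split would place at least one fraud case in the test set every time.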
Which statement accurately describes the 'holdout' strategy for data splitting?
Explanation: The holdout method reserves one part of the data exclusively for training and another part for final evaluation. Using overlapping data (option two) biases the evaluation. Stratification is helpful but not required for the holdout method (option three). The split must occur before model training, not after (option four).
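The defining property of a holdout split is that the two partitions are disjoint and carved out before training. A minimal sketch (the 50 records and 10-record holdout are arbitrary):

```python
import random

rng = random.Random(1)
records = list(range(50))
rng.shuffle(records)

# Carve off the evaluation partition *before* any training happens;
# the two parts never overlap
holdout = records[:10]
training = records[10:]
```

Every record lands in exactly one partition, so nothing the model is evaluated on was available during training.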
In which situation is stratified sampling usually less critical for splitting data?
Explanation: If the data is already balanced across classes, simple random sampling is usually sufficient because all groups are equally represented. When classes are highly imbalanced (options two and three), stratification is important to preserve the class distribution. Evaluating on skewed datasets (option four) benefits from stratification to ensure the test set reflects the true class mix.
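A quick illustration of why balance makes stratification less critical (the 500/500 class counts are invented): with equal classes, even an unstratified split lands close to the population mix.

```python
import random

labels = ["A"] * 500 + ["B"] * 500
rng = random.Random(0)
rng.shuffle(labels)

# With balanced classes, an unstratified 80/20 split still lands
# close to the 50/50 class mix in the test portion
test_part = labels[:200]
share_a = test_part.count("A") / len(test_part)
```

The test set's share of class "A" hovers near 0.5 without any stratification, which is why stratifying adds little here; with a 5/995 split, the same random cut could easily miss the rare class entirely.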
Which limitation sometimes affects stratified sampling when dealing with many small subgroups?
Explanation: With many small subgroups, there may not be enough individuals to sample from each, which can lead to unreliable estimates for those strata. Stratified sampling does not always require larger sample sizes (option two) and does not guarantee prediction accuracy (option three). Random selection within strata is still necessary, so option four is incorrect.
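The small-stratum problem shows up directly in the arithmetic (the 500/3 population split and the 10% target are illustrative): a 10% draw from a 3-member stratum rounds to zero, and forcing a minimum of one member distorts the sampling fraction for that group.

```python
import random
from collections import Counter

rng = random.Random(0)
population = ["common"] * 500 + ["rare"] * 3   # one very small stratum

# Aim for a 10% sample per stratum
sample = []
for label, n in Counter(population).items():
    members = [p for p in population if p == label]
    k = max(1, round(n * 0.10))   # "rare" rounds to 0, so force at least 1
    sample.extend(rng.sample(members, k))

rare_fraction_sampled = sample.count("rare") / 3
```

The "common" stratum is sampled at exactly 10%, but the "rare" stratum ends up sampled at one-in-three: with only 3 members there is no way to hit the 10% target, and any estimate for that stratum rests on a single individual.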