Assess your foundational knowledge of Scikit-learn concepts, including the use of pipelines, data preprocessing techniques, and the selection of appropriate machine learning models. This quiz is designed for those seeking to strengthen their practical skills in machine learning workflows using Scikit-learn's core features.
Why is it important to split a dataset into training and testing sets before fitting a machine learning model?
Explanation: Splitting the dataset allows you to assess the model's performance on data it has not seen before, giving a realistic measure of generalization. Reducing dataset size and preventing missing values are not the primary goals of splitting data. Removing features is unrelated to the purpose of splitting; feature selection is a different step.
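For illustration, a minimal sketch with made-up data showing the usual train/test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: 100 samples, 4 features, binary labels
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
# Scoring on held-out data estimates generalization, not memorization
print(model.score(X_test, y_test))
```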
What does the StandardScaler preprocessing technique do when applied to a feature matrix X?
Explanation: StandardScaler standardizes each feature by removing the mean and scaling to unit variance, which helps many algorithms perform better. Ranking would replace values with their relative positions rather than standardize them, scaling to [0, 1] is the job of MinMaxScaler, and StandardScaler does not specifically remove outliers.
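A short sketch of the effect, using arbitrary numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean ~0 and unit variance
print(X_scaled.mean(axis=0))  # approximately [0., 0.]
print(X_scaled.std(axis=0))   # approximately [1., 1.]
```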
What is the primary benefit of using a pipeline in a machine learning workflow?
Explanation: Pipelines allow you to encapsulate multiple steps such as preprocessing and modeling, ensuring the workflow is repeatable and easier to manage. Visualizing distributions is separate from pipeline construction, and increasing features or shuffling data are unrelated to the main purpose.
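A minimal sketch of a two-step pipeline, with synthetic data for illustration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.rand(50, 3)
y = np.random.randint(0, 2, size=50)

pipe = Pipeline([
    ("scale", StandardScaler()),      # step 1: standardize features
    ("model", LogisticRegression()),  # step 2: fit the classifier
])

pipe.fit(X, y)  # runs every step in order as one repeatable workflow
print(pipe.predict(X[:5]))
```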
If you choose SimpleImputer with the strategy 'mean', what happens to missing values in a numeric column?
Explanation: The 'mean' strategy fills missing values with the mean of the non-missing values in that column. Leaving values untouched is not an imputation strategy, and filling with a constant such as zero ('constant') or with the most frequent value ('most_frequent') are different strategies from 'mean'.
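A small example with one made-up column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# The NaN is replaced by the column mean: (1 + 2 + 4) / 3 ≈ 2.33
print(X_filled.ravel())
```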
What is usually the most appropriate preprocessing method for converting categorical text features into numerical form for machine learning?
Explanation: OneHotEncoder transforms categorical variables into binary vectors, suitable for most models. StandardScaler is intended for numerical data, SimpleImputer with mean strategy cannot be applied to text, and dropping the column removes useful information instead of encoding it.
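A brief sketch, assuming scikit-learn 1.2 or later for the sparse_output parameter:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["blue"], ["green"]])

# sparse_output=False returns a dense array instead of a sparse matrix
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X)

print(encoder.categories_)  # the categories learned per column
print(X_encoded)            # one binary column per category
```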
Which estimator is primarily used for binary classification tasks?
Explanation: LogisticRegression is designed for classification tasks, especially binary ones. KMeans is a clustering algorithm, LinearRegression handles regression tasks, and PCA is a dimensionality reduction method, not a classifier.
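A minimal sketch using the breast-cancer dataset bundled with scikit-learn, whose target is binary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # binary target: malignant/benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)  # a higher max_iter helps this unscaled data converge
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))  # probabilities for the two classes
```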
When using a pipeline, what ensures that the same transformations learned from training data are applied to new data during prediction?
Explanation: Pipelines keep track of all fitted transformers from the training phase so that exactly the same transformations are applied during prediction. Random re-fitting or skipping transformers can cause inconsistencies, and limiting transformations to initial fitting defeats their purpose.
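A sketch of that behavior with synthetic data: the scaler statistics fitted on the training set are reused unchanged at prediction time.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.random.rand(50, 2) * 100
y_train = np.random.randint(0, 2, size=50)

pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# The scaler's mean_ and scale_ were learned from X_train only;
# predict() reuses them as-is on new data -- no re-fitting happens
X_new = np.random.rand(5, 2) * 100
print(pipe.predict(X_new))
print(pipe.named_steps["scale"].mean_)  # statistics fixed at fit time
```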
Which types of algorithms are most affected when feature scaling is not performed?
Explanation: Distance-based methods rely heavily on feature scales because distance computations are sensitive to magnitude. Tree-based and Naive Bayes algorithms are generally scale-invariant, and random guessing does not involve features at all.
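A rough illustration with synthetic data, where one unscaled feature dominates the distance computation in k-nearest neighbors:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# One feature in [0, 1], one in [0, 10000]: the large one dominates distances
X = np.column_stack([rng.random(100), rng.random(100) * 10_000])
y = (X[:, 0] > 0.5).astype(int)  # labels depend only on the small feature

knn_raw = KNeighborsClassifier().fit(X, y)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X, y)

print(knn_raw.score(X, y))     # hurt: neighbors chosen by the noisy large feature
print(knn_scaled.score(X, y))  # scaling restores the informative feature's weight
```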
In supervised learning, what is the term for the variable that the model tries to predict?
Explanation: The target variable is what the model aims to predict. The feature matrix contains the input variables, pipeline step refers to an element of pipelines, and scaler parameter is a value used in preprocessing, not prediction.
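The usual X/y convention, shown on the iris dataset bundled with scikit-learn:

```python
from sklearn.datasets import load_iris

# Conventionally X is the feature matrix and y is the target variable
X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): 150 samples, 4 input features
print(y.shape)  # (150,): one target label per sample
```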
Which metric is most suitable for evaluating the performance of a classification model on balanced data?
Explanation: Accuracy score measures the proportion of correct predictions and is standard for balanced classification tasks. Mean squared error is for regression, adjusted R-squared is also used in regression, and silhouette score is relevant for clustering, not classification.
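A tiny example with made-up labels:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

# Proportion of correct predictions: 4 of 6 here
print(accuracy_score(y_true, y_pred))  # 0.666...
```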