Explore the essential concepts of Apache Spark MLlib with this foundational quiz, designed to assess your understanding of core components, algorithms, and functionalities within Spark’s machine learning library. Strengthen your knowledge of distributed processing, available algorithms, data formats, and feature engineering in MLlib through practical, scenario-based questions.
Which data structure does MLlib primarily use to represent labeled training data for supervised learning algorithms?
Explanation: LabeledPoint, from MLlib's RDD-based API, is designed specifically to store a feature vector together with its corresponding label, making it the natural representation for supervised learning tasks. While DataFrame and RDD are Spark's core distributed data structures, neither inherently associates features with labels. VectorAssembler is a feature-engineering transformer and does not store labels at all.
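Conceptually, a LabeledPoint is just a label paired with a feature vector. A minimal pure-Python sketch of that pairing (a stand-in for illustration, not the actual MLlib class):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LabeledPointSketch:
    """Pure-Python stand-in for MLlib's LabeledPoint: a label plus a feature vector."""
    label: float
    features: List[float]

# One labeled training example: label 1.0 with a three-element feature vector.
point = LabeledPointSketch(label=1.0, features=[0.5, 2.0, -1.3])
print(point.label)     # the target value a supervised algorithm learns to predict
print(point.features)  # the inputs the model trains on
```

Bundling the two fields in one record is what lets supervised algorithms consume a single distributed collection of examples.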
If you want to predict continuous numeric values, which MLlib algorithm should you use?
Explanation: Linear Regression predicts continuous numeric values, making it the correct choice for regression tasks. Logistic Regression, despite its name, is used for classification. KMeans is a clustering algorithm and assigns cluster memberships rather than predicting numeric values. Decision Tree Classifier predicts discrete classes rather than continuous targets.
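To make the distinction concrete, here is a tiny pure-Python least-squares fit (the closed form for one feature, not MLlib's distributed solver), showing that regression outputs a continuous number rather than a class:

```python
# Fit y = w*x + b on toy data with the single-feature closed-form solution.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Slope: covariance of x and y divided by variance of x.
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

prediction = w * 5.0 + b  # a continuous numeric prediction for a new input
print(round(w, 2), round(b, 2), round(prediction, 2))  # 1.96 0.15 9.95
```

A classifier on the same data would instead return one of a fixed set of class labels.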
Which of the following programming languages can you use to write MLlib applications?
Explanation: Java is natively supported for developing MLlib applications. PHP, Perl, and Ruby are not supported by MLlib, so you cannot use them to access its features directly. The other supported languages are Scala, Python, and R.
Which MLlib component is commonly used to convert categorical string variables into numerical indices before machine learning?
Explanation: StringIndexer is used to encode categorical string values as numerical indices, facilitating machine learning tasks that require numerical input. HashingTF is intended for text feature extraction, not categorical indexing. Normalizer is utilized for scaling feature values, while ChiSqSelector is a feature selector based on statistical tests.
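The computation StringIndexer performs can be sketched in pure Python. MLlib's default ordering assigns index 0.0 to the most frequent value; the alphabetical tie-break below is an assumption of this sketch, added only for determinism:

```python
from collections import Counter

def string_index(values):
    """Sketch of StringIndexer's default behavior: map each distinct string
    to a float index, most frequent value first (index 0.0)."""
    freq = Counter(values)
    # Descending frequency; alphabetical tie-break (assumption of this sketch).
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    mapping = {v: float(i) for i, v in enumerate(ordered)}
    return [mapping[v] for v in values], mapping

indexed, mapping = string_index(["cat", "dog", "cat", "bird", "cat", "dog"])
print(mapping)   # {'cat': 0.0, 'dog': 1.0, 'bird': 2.0}
print(indexed)   # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```

The resulting numeric column is what downstream estimators, which require numerical input, can consume.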
Which algorithm in MLlib is generally used for unsupervised grouping of data into clusters based on feature similarity?
Explanation: KMeans is a popular unsupervised clustering algorithm used to group data points by similarity without needing labels. Naive Bayes and Random Forest Classifier are used for classification tasks. Linear Regression is used for regression, not clustering.
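The core k-means loop is simple enough to sketch in pure Python (a 1-D toy version of the idea, not MLlib's distributed implementation): alternate between assigning points to their nearest centroid and moving each centroid to its cluster's mean.

```python
def kmeans(points, centroids, iters=10):
    """Toy 1-D k-means: assign each point to the nearest centroid, then
    recompute each centroid as the mean of its cluster. No labels needed."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep a centroid in place if its cluster ended up empty.
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

# Two obvious groups in the data; the algorithm discovers them unsupervised.
data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
print(kmeans(data, centroids=[0.0, 5.0]))  # converges to [1.0, 10.0]
```

Note that the data carries no labels at all, which is exactly what distinguishes clustering from the classification algorithms listed.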
When evaluating a binary classifier in MLlib, which metric gives you the fraction of correctly predicted positive and negative samples out of all samples?
Explanation: Accuracy measures the proportion of correct predictions—both positives and negatives—over the entire dataset. RMSE, or Root Mean Squared Error, is used for regression tasks. Log Loss evaluates probabilistic classification predictions, while Explained Variance is specific to regression model evaluation.
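The metric itself is a one-line computation; this pure-Python sketch shows the quantity, counting correct positives and negatives alike over all samples:

```python
def accuracy(predictions, labels):
    """Fraction of samples (positive and negative alike) predicted correctly."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# 4 of the 5 predictions match the true labels.
preds  = [1, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 0]
print(accuracy(preds, labels))  # 0.8
```

In Spark itself this is the value an evaluator reports for the accuracy metric; the sketch above just makes the definition explicit.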
Which MLlib feature helps you chain multiple data processing stages, such as feature transformation and model training, into a single workflow?
Explanation: A Pipeline in MLlib allows you to sequence multiple stages, including preprocessing and model fitting, into a single streamlined workflow. GridSearch is not an MLlib component; hyperparameter grid search in MLlib is performed with ParamGridBuilder together with CrossValidator or TrainValidationSplit. An Evaluator assesses model performance, and Binarizer is a single transformation, not a workflow manager.
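The chaining idea can be sketched in a few lines of pure Python (the stage functions here are hypothetical placeholders, not MLlib transformers): each stage's output feeds the next stage's input.

```python
class PipelineSketch:
    """Pure-Python sketch of MLlib's Pipeline idea: run stages in order,
    feeding the output of one stage into the next."""
    def __init__(self, stages):
        self.stages = stages

    def run(self, data):
        for stage in self.stages:
            data = stage(data)
        return data

# Hypothetical stages: index strings to numbers, then count each class.
def index(rows):
    return [0.0 if r == "yes" else 1.0 for r in rows]

def count(rows):
    return {v: rows.count(v) for v in set(rows)}

pipeline = PipelineSketch(stages=[index, count])
print(pipeline.run(["yes", "no", "yes"]))  # {0.0: 2, 1.0: 1}
```

In MLlib the same structure pays off twice: calling fit on the pipeline trains every estimator stage in order, and the fitted PipelineModel replays the whole sequence on new data.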
What MLlib function is commonly used to randomly divide data into training and test datasets for model evaluation?
Explanation: The randomSplit function splits a dataset into multiple subsets according to the weights you provide, such as an 80/20 training/test split. crossJoin creates a cartesian product of datasets, which is unrelated to splitting. sampleBy performs stratified sampling by key, and groupBy aggregates rows by grouping keys, not splitting.
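A pure-Python sketch of randomSplit's behavior (not Spark's implementation): each row is independently assigned to a split with probability proportional to the weights, so the resulting sizes are approximate rather than exact, just as in Spark.

```python
import random

def random_split(rows, weights, seed=42):
    """Assign each row to one split, with probability proportional to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative weight boundaries normalized to [0, 1).
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r < b:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point edge cases
    return splits

train, test = random_split(list(range(100)), weights=[0.8, 0.2])
print(len(train), len(test))  # roughly 80 / 20; every row lands in exactly one split
```

Seeding the generator (Spark's randomSplit also accepts a seed parameter) makes the split reproducible across runs.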
When using MLlib’s machine learning algorithms, which of the following is the main requirement for input data format in the newer DataFrame-based API?
Explanation: MLlib expects input data as a DataFrame containing columns for features (typically as vectors) and labels for supervised learning. CSV files and JSON files must be read and converted into the required DataFrame structure. A single integer per row cannot convey both features and labels, making it unsuitable for most algorithms.
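The target shape can be sketched in pure Python: raw records (the column names here are made up for illustration) are reshaped into rows with a feature-vector column and a numeric label column, which is the conversion VectorAssembler and friends perform after reading CSV or JSON in Spark.

```python
# Hypothetical raw records, as they might arrive from a CSV or JSON source.
raw_rows = [
    {"age": 34, "income": 52000.0, "bought": 1},
    {"age": 22, "income": 31000.0, "bought": 0},
]

# Reshape into the (features, label) column layout MLlib estimators expect.
prepared = [
    {"features": [float(r["age"]), r["income"]], "label": float(r["bought"])}
    for r in raw_rows
]
print(prepared[0])  # {'features': [34.0, 52000.0], 'label': 1.0}
```

Note that a single integer per row could encode the label or one feature, but never both, which is why it fails the requirement.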
Which MLlib transformer is used to scale features to have unit norm, often needed before applying certain machine learning algorithms?
Explanation: Normalizer scales each data point so that its vector has a unit norm, which can improve performance for algorithms sensitive to feature magnitudes. PCA reduces dimensionality by projecting data onto principal components. OneHotEncoder converts categorical variables into binary vectors, and VectorIndexer identifies categorical features but doesn't scale feature values.
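The per-row computation is easy to show in pure Python; this sketch implements L2 (unit-norm) scaling, which corresponds to Normalizer's default p=2 setting:

```python
import math

def normalize_l2(vector):
    """Scale a vector to unit L2 norm, the per-row computation Normalizer
    performs with its default p=2. Zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else vector

v = normalize_l2([3.0, 4.0])
print(v)  # [0.6, 0.8], and 0.6**2 + 0.8**2 == 1
```

After normalization every row has the same magnitude, so distance-based algorithms compare directions of feature vectors rather than their raw scales.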