Apache Spark MLlib Fundamentals Quiz

Explore the essential concepts of Apache Spark MLlib with this foundational quiz, designed to assess your understanding of core components, algorithms, and functionalities within Spark’s machine learning library. Strengthen your knowledge of distributed processing, available algorithms, data formats, and feature engineering in MLlib through practical, scenario-based questions.

  1. MLlib Data Structure

    Which data structure does MLlib primarily use to represent labeled training data for supervised learning algorithms?

    1. RDD
    2. DataFrame
    3. LabeledPoint
    4. VectorAssembler

    Explanation: LabeledPoint is the MLlib class (from the RDD-based API) that pairs a feature vector with its corresponding label, making it the natural container for supervised learning data. RDD and DataFrame are general-purpose distributed data structures; on their own they do not associate features with labels, although the DataFrame-based API expresses the same pairing as separate feature and label columns. VectorAssembler is a feature-engineering transformer that combines columns into a single feature vector and does not store labels.

  2. MLlib Linear Models

    If you want to predict continuous numeric values, which MLlib algorithm should you use?

    1. Linear Regression
    2. Logistic Regression
    3. KMeans
    4. Decision Tree Classifier

    Explanation: Linear Regression is suitable for predicting continuous numeric values, making it the correct choice for regression tasks. Logistic Regression is used for classification, not regression. KMeans is a clustering algorithm and does not predict specific values. Decision Tree Classifier is designed for discrete class prediction rather than regression.
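
To make the idea of predicting a continuous value concrete, here is a minimal plain-Python sketch of ordinary least squares for a single feature. It is not MLlib code: the function name `fit_line` and the sample data are hypothetical, and MLlib's own LinearRegression supports many features, regularization, and distributed solvers that this sketch leaves out.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# The fitted model outputs a continuous number, not a class label.
prediction = slope * 5.0 + intercept
```

The output is an unbounded real number, which is what separates regression from the classification and clustering options listed above.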

  3. Supported Languages

    Which of the following programming languages can you use to write MLlib applications?

    1. PHP
    2. Java
    3. Perl
    4. Ruby

    Explanation: Java is one of the languages Spark supports for developing MLlib applications. PHP, Perl, and Ruby are not supported, so you cannot use them to access MLlib's features directly. The other supported languages are Scala, Python, and R.

  4. Feature Engineering

    Which MLlib component is commonly used to convert categorical string variables into numerical indices before machine learning?

    1. Normalizer
    2. ChiSqSelector
    3. StringIndexer
    4. HashingTF

    Explanation: StringIndexer is used to encode categorical string values as numerical indices, facilitating machine learning tasks that require numerical input. HashingTF is intended for text feature extraction, not categorical indexing. Normalizer is utilized for scaling feature values, while ChiSqSelector is a feature selector based on statistical tests.
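
The mapping StringIndexer produces can be illustrated without Spark. The sketch below is plain Python, not the MLlib API: `fit_string_index` is a hypothetical helper, and it assumes the default frequency-descending ordering, in which the most frequent label receives index 0 and ties are broken alphabetically.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # Indices are stored as floats, mirroring the numeric column MLlib emits.
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical categorical column.
colors = ["red", "blue", "red", "green", "blue", "red"]
mapping = fit_string_index(colors)
indexed = [mapping[c] for c in colors]
```

Here "red" occurs three times, "blue" twice, and "green" once, so they map to 0.0, 1.0, and 2.0 respectively.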

  5. Clustering Algorithms

    Which algorithm in MLlib is generally used for unsupervised grouping of data into clusters based on feature similarity?

    1. Linear Regression
    2. Naive Bayes
    3. KMeans
    4. Random Forest Classifier

    Explanation: KMeans is a popular unsupervised clustering algorithm used to group data points by similarity without needing labels. Naive Bayes and Random Forest Classifier are used for classification tasks. Linear Regression is used for regression, not clustering.

  6. Model Evaluation Metrics

    When evaluating a binary classifier in MLlib, which metric gives you the fraction of correctly predicted positive and negative samples out of all samples?

    1. Accuracy
    2. RMSE
    3. Explained Variance
    4. Log Loss

    Explanation: Accuracy measures the proportion of correct predictions (both true positives and true negatives) over the entire dataset. RMSE, or Root Mean Squared Error, is used for regression tasks. Log Loss evaluates probabilistic classification predictions, while Explained Variance is specific to regression model evaluation.
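
As a worked example of the definition, accuracy is (TP + TN) / total. The following plain-Python sketch computes it directly; the `accuracy` helper and the label lists are hypothetical and stand in for the evaluator classes MLlib provides.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical binary labels and classifier outputs.
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

# 3 true positives + 3 true negatives out of 8 samples.
score = accuracy(labels, predictions)
```

Two of the eight predictions are wrong (one false negative, one false positive), so the score is 6 / 8 = 0.75.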

  7. Pipeline Components

    Which MLlib feature helps you chain multiple data processing stages, such as feature transformation and model training, into a single workflow?

    1. GridSearch
    2. Pipeline
    3. Evaluator
    4. Binarizer

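    Explanation: A Pipeline in MLlib lets you sequence multiple stages, including preprocessing and model fitting, into a single workflow that is fitted and applied as one unit. GridSearch is not an MLlib component; hyperparameter grid search is done in MLlib with ParamGridBuilder together with CrossValidator or TrainValidationSplit. An Evaluator assesses model performance, and Binarizer is a single transformer that thresholds numeric features, not a workflow manager.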

  8. Split Data for Validation

    What MLlib function is commonly used to randomly divide data into training and test datasets for model evaluation?

    1. sampleBy
    2. randomSplit
    3. crossJoin
    4. groupBy

    Explanation: The randomSplit method, available on DataFrames and RDDs, randomly splits data into multiple subsets according to the weights you supply, such as 80% for training and 20% for testing. crossJoin creates a Cartesian product of two datasets, which is unrelated to splitting data. sampleBy performs stratified sampling using a fraction per key, and groupBy groups rows for aggregation; neither partitions the data into disjoint training and test sets.
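
The behavior of a weighted random split can be sketched in plain Python. This is an illustration of the idea, not Spark's implementation: `random_split` is a hypothetical helper that normalizes the weights and assigns each row independently, which is why the resulting sizes only approximate the requested proportions.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    total = sum(weights)
    # Build cumulative upper bounds in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    cumulative = 0.0
    for w in weights:
        cumulative += w / total
        bounds.append(cumulative)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


# Hypothetical dataset of 100 row ids, split roughly 80/20.
rows = list(range(100))
train, test = random_split(rows, [0.8, 0.2], seed=42)
```

Every row lands in exactly one subset, and reusing the same seed reproduces the same split, which matters when you want evaluation results to be comparable across runs.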

  9. MLlib Input Formats

    When using MLlib’s machine learning algorithms, which of the following is the main requirement for input data format in the newer DataFrame-based API?

    1. JSON file with nested objects
    2. DataFrame with feature and label columns
    3. CSV text file with headers only
    4. Single integer per row

    Explanation: MLlib expects input data as a DataFrame containing columns for features (typically as vectors) and labels for supervised learning. CSV files and JSON files must be read and converted into the required DataFrame structure. A single integer per row cannot convey both features and labels, making it unsuitable for most algorithms.

  10. Scaling Features

    Which MLlib transformer is used to scale features to have unit norm, often needed before applying certain machine learning algorithms?

    1. Normalizer
    2. PCA
    3. VectorIndexer
    4. OneHotEncoder

    Explanation: Normalizer scales each data point so that its vector has a unit norm, which can improve performance for algorithms sensitive to feature magnitudes. PCA reduces dimensionality by projecting data onto principal components. OneHotEncoder converts categorical variables into binary vectors, and VectorIndexer identifies categorical features but doesn't scale feature values.
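
Unit-norm scaling operates on each row's feature vector independently, dividing every component by the vector's p-norm. The plain-Python sketch below shows the arithmetic; `normalize` is a hypothetical helper, it assumes a finite p with p = 2 as the default, and it assumes a zero vector is returned unchanged.

```python
def normalize(vector, p=2.0):
    """Scale a vector so that its p-norm equals 1 (assumes finite p >= 1)."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        # Assumption of this sketch: a zero vector is left as-is.
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical feature vector with L2 norm 5 (a 3-4-5 triangle).
unit_l2 = normalize([3.0, 4.0])          # approximately [0.6, 0.8]

# With p = 1 the absolute values of the components sum to 1 instead.
unit_l1 = normalize([1.0, 3.0], p=1.0)   # approximately [0.25, 0.75]
```

Because the scaling is per row, this differs from column-wise scalers: two rows pointing in the same direction end up identical regardless of their original magnitudes.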