Dimensionality Reduction: PCA, t-SNE, and UMAP Explained Quiz

Explore core concepts of dimensionality reduction with this quiz focused on PCA, t-SNE, and UMAP techniques. Assess your understanding of key applications, differences, and practical uses in unsupervised learning and data visualization.

  1. Objective of Dimensionality Reduction

    What is the main goal of dimensionality reduction methods like PCA, t-SNE, and UMAP when working with high-dimensional datasets?

    1. To represent data with fewer features while preserving important structure
    2. To remove all outliers from the dataset
    3. To increase the number of features for better accuracy
    4. To convert categorical data to numerical form

    Explanation: Dimensionality reduction techniques aim to reduce the number of features or dimensions in a dataset while retaining as much of the relevant information as possible. Increasing features is the opposite of their purpose and may lead to overfitting. Removing outliers is not the primary goal of these methods, and converting categorical data to numerical form is a preprocessing step rather than dimensionality reduction.
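
    To see the idea concretely, here is a minimal sketch (assuming scikit-learn is installed) that compresses the 64-pixel digits features down to 10 dimensions and reports how much variance survives; the dataset and the number of components are illustrative choices, not part of the quiz.

    ```python
    # Reduce the 64-dimensional digits features to 10 dimensions with PCA
    # and check how much of the original variance is retained.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, y = load_digits(return_X_y=True)       # X has shape (1797, 64)
    pca = PCA(n_components=10)
    X_reduced = pca.fit_transform(X)          # shape (1797, 10)

    print(X.shape, "->", X_reduced.shape)
    print("variance retained:", pca.explained_variance_ratio_.sum())
    ```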

  2. PCA's Approach

    Which statement best describes how Principal Component Analysis (PCA) reduces dimensionality?

    1. It encodes data using random projections
    2. It projects data onto directions of maximum variance
    3. It clusters data points based on similarity
    4. It uses nonlinear optimization to embed data in two dimensions

    Explanation: PCA finds new axes, called principal components, along which the variance in the data is maximized, then projects the data onto these axes to reduce dimensions. Clustering is not the focus of PCA, and nonlinear optimization is more related to t-SNE or UMAP. Random projection is a separate dimensionality reduction technique and is not how PCA chooses its directions.
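
    A quick numerical check of this claim (a sketch assuming NumPy and scikit-learn, on synthetic correlated data): the first component that PCA finds matches the leading eigenvector of the sample covariance matrix, i.e. the direction of maximum variance.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)

    pca = PCA(n_components=2).fit(X)

    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    leading = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue

    # Eigenvectors are defined only up to sign, so compare absolute values.
    print(np.allclose(np.abs(pca.components_[0]), np.abs(leading), atol=1e-6))
    ```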

  3. t-SNE Purpose

    In which scenario would t-SNE be especially suitable compared to PCA, for example when visualizing how clusters of different handwritten digits separate?

    1. When normalizing feature scales
    2. When encoding categorical variables before modeling
    3. When visualizing complex, nonlinear relationships in data
    4. When performing linear regression modeling

    Explanation: t-SNE is designed to capture and visualize nonlinear relationships, making it ideal for tasks like exploring handwritten digit clusters. It is not used for linear regression or for encoding categorical variables. While feature normalization is an important preprocessing step, it is not specific to t-SNE's purpose.
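
    The sketch below illustrates this use case (assuming scikit-learn and matplotlib are installed); the perplexity value is just a common default, and the resulting scatter plot typically shows the ten digit classes as distinct clusters.

    ```python
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)
    # Embed the 64-dimensional digits in 2-D for visualization.
    X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
    plt.title("t-SNE embedding of the digits dataset")
    plt.show()
    ```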

  4. UMAP vs t-SNE

    Which key advantage does UMAP have over t-SNE when applied to large datasets for visualization?

    1. It eliminates the need for initial data preprocessing
    2. It is generally faster and preserves more global structure
    3. It uses supervised learning for all embeddings
    4. It guarantees perfect class separation every time

    Explanation: UMAP tends to run faster on large datasets and better preserves global data structure compared to t-SNE. UMAP does not guarantee perfect class separation. Proper data preprocessing is important for most dimensionality reduction techniques. Both UMAP and t-SNE can be used in unsupervised ways; supervised learning isn't required for their embeddings.
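
    A usage sketch with the third-party umap-learn package (installable via pip as umap-learn); the parameter values are illustrative, and the API follows scikit-learn's fit_transform convention.

    ```python
    from sklearn.datasets import load_digits
    import umap

    X, y = load_digits(return_X_y=True)

    # n_neighbors trades off local detail against global structure;
    # larger values emphasize the broader layout of the data.
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
    X_2d = reducer.fit_transform(X)
    print(X_2d.shape)   # (1797, 2)
    ```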

  5. PCA Linearity

    Why is PCA considered a linear method of dimensionality reduction?

    1. Because it always sorts features alphabetically
    2. Because it generates random clusters of data
    3. Because it only captures linear relationships between features
    4. Because it can model complex nonlinear patterns in data

    Explanation: PCA analyzes the variance in data based on linear relationships between features, making it a linear method. It cannot model complex nonlinear patterns, which are addressed by methods like t-SNE or UMAP. Sorting features alphabetically has no relation to PCA, and generating random clusters is not its function.
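
    One way to see the linearity directly (a small check assuming NumPy and scikit-learn): the PCA transform is nothing more than centering followed by a matrix multiplication.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))

    pca = PCA(n_components=3).fit(X)
    by_sklearn = pca.transform(X)
    by_hand = (X - pca.mean_) @ pca.components_.T   # a purely linear projection

    print(np.allclose(by_sklearn, by_hand))          # True
    ```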

  6. Distance Preservation

    When reducing data to two dimensions, which technique aims to preserve local distances or neighborhoods rather than global distances?

    1. Linear Regression
    2. One-Hot Encoding
    3. Random Forest
    4. t-SNE

    Explanation: t-SNE is specifically designed to preserve local structure and neighborhoods in the reduced space, making it effective for visualizing local cluster relationships. Random Forest and Linear Regression are predictive modeling methods, not dimensionality reduction techniques. One-Hot Encoding is used for categorical feature transformation, not for preserving distances.
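
    Local neighborhood preservation can even be measured. A rough sketch using scikit-learn's trustworthiness score, which checks whether nearest neighbors in the original space remain nearest neighbors in the embedding (values near 1.0 indicate well-preserved local structure):

    ```python
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE, trustworthiness

    X, y = load_digits(return_X_y=True)
    X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

    # Fraction of local neighborhoods preserved in the 2-D embedding.
    print(trustworthiness(X, X_2d, n_neighbors=5))
    ```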

  7. Data Preprocessing for PCA

    Before applying PCA, which preprocessing step is usually recommended for datasets with features on different scales?

    1. Using k-means clustering first
    2. Converting all columns to text strings
    3. Standardizing features to have zero mean and unit variance
    4. Sorting the dataset randomly

    Explanation: Standardization ensures that features contribute equally to the calculation of principal components, which is vital when scales differ. Converting columns to text makes them unusable for PCA. Employing k-means clustering before PCA is unrelated. Random sorting does not affect feature scaling or PCA performance.
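
    A typical preprocessing pipeline sketch (assuming scikit-learn; the wine dataset is chosen only because its features are on very different scales): standardize each feature before fitting PCA so that large-scale features do not dominate the components.

    ```python
    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)        # features range from ~0.1 to ~1500

    pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
    X_2d = pipeline.fit_transform(X)
    print(X_2d.shape)   # (178, 2)
    ```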

  8. UMAP's Core Principle

    What is a fundamental principle behind UMAP's approach to dimensionality reduction?

    1. Constructing a graph to represent local relationships
    2. Generating polynomial regression models
    3. Assembling a confusion matrix
    4. Sorting labels alphabetically

    Explanation: UMAP creates a graph that captures local similarities among data points and then optimizes its low-dimensional embedding to preserve these local connections. Confusion matrices are evaluation tools for classification, not for dimensionality reduction. Polynomial regression and label sorting are unrelated to UMAP's core method.
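
    UMAP's actual graph uses fuzzy membership weights, but the core idea (connect each point to its nearest neighbors and work with that graph) can be sketched with a plain k-nearest-neighbor graph from scikit-learn; the neighbor count below mirrors UMAP's default n_neighbors and is only illustrative.

    ```python
    from sklearn.datasets import load_digits
    from sklearn.neighbors import kneighbors_graph

    X, _ = load_digits(return_X_y=True)

    # Sparse adjacency matrix: one row per point, nonzero entries mark its neighbors.
    graph = kneighbors_graph(X, n_neighbors=15, mode="connectivity")
    print(graph.shape, graph.nnz)            # (1797, 1797), 1797 * 15 edges
    ```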

  9. Choosing PCA Limitation

    Why might PCA perform poorly on data with a nonlinear structure, such as concentric circles or curved manifolds?

    1. Because PCA cannot capture nonlinear patterns or shapes
    2. Because PCA always removes all outliers automatically
    3. Because PCA can only work with text data
    4. Because PCA adds noise to the data

    Explanation: PCA is limited to capturing only linear dependencies; thus, it struggles with data that has nonlinear structure, like concentric circles. It does not automatically remove outliers or add noise, nor is it designed to process text data directly. Handling nonlinear patterns is a strength of methods like t-SNE and UMAP.
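
    An illustrative sketch of this limitation (assuming scikit-learn, with arbitrary noise and size settings): on two concentric circles, the 1-D PCA projection of the inner ring falls entirely inside the range of the outer ring, so no single threshold along a straight-line direction can separate the classes.

    ```python
    from sklearn.datasets import make_circles
    from sklearn.decomposition import PCA

    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

    proj = PCA(n_components=1).fit_transform(X).ravel()
    # Label 1 is the inner circle, label 0 the outer circle.
    print("inner ring range:", proj[y == 1].min(), proj[y == 1].max())
    print("outer ring range:", proj[y == 0].min(), proj[y == 0].max())
    ```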

  10. Interpreting Principal Components

    After applying PCA, what does the first principal component represent in the reduced feature space?

    1. The mean value of all features
    2. Random noise added for robustness
    3. The smallest variance direction in the dataset
    4. The direction of maximum variance in the data

    Explanation: The first principal component is defined as the direction along which the data varies the most. It is not related to minimal variance or to noise addition. While the mean of features is informative, it does not represent any principal component.
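
    As a quick check of this definition (a sketch with NumPy and scikit-learn on synthetic data with an arbitrary diagonal covariance), the variance of the data projected onto the first principal component is at least as large as the variance along any other single direction:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 2.0, 0.5]), size=1000)

    pca = PCA(n_components=3).fit(X)
    pc1 = pca.components_[0]

    random_dir = rng.normal(size=3)
    random_dir /= np.linalg.norm(random_dir)

    print("variance along PC1:        ", (X @ pc1).var())
    print("variance along random axis:", (X @ random_dir).var())
    ```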