Clustering Fundamentals: K-Means and Hierarchical Clustering Quiz

Test your grasp of essential clustering algorithms such as K-Means and hierarchical clustering. This quiz covers principles, differences, applications, and interpretation of results, helping data enthusiasts assess their understanding of foundational clustering techniques.

  1. K-Means Objective Function

    Which metric does the K-Means algorithm aim to minimize when clustering a dataset of customer purchase patterns?

    1. Sum of absolute errors
    2. Euclidean distance between all points
    3. Number of clusters formed
    4. Total within-cluster sum of squares

    Explanation: K-Means minimizes the total within-cluster sum of squares (WCSS): the sum of squared distances from each point to its assigned centroid. It does not minimize the Euclidean distance between all pairs of points; only distances to centroids within each cluster count. The number of clusters K is fixed in advance, not reduced by the algorithm. The sum of absolute errors underlies alternatives such as K-medoids, not classic K-Means. The sketch below illustrates the objective in code.
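
    A minimal sketch of this objective, assuming scikit-learn is installed; the two-dimensional "purchase pattern" data is synthetic and purely illustrative. It checks that the fitted model's inertia_ attribute equals the WCSS computed by hand:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 2))  # stand-in for customer purchase features

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # inertia_ is the total within-cluster sum of squares (WCSS):
    # the sum of squared distances from each point to its assigned centroid.
    wcss = sum(
        np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
        for k in range(3)
    )
    print(km.inertia_, wcss)  # the two values agree (up to floating point)
    ```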

  2. Types of Hierarchical Clustering

    When performing document clustering, what distinguishes agglomerative from divisive hierarchical clustering?

    1. Agglomerative needs the number of clusters in advance; divisive does not
    2. Agglomerative requires labeled data; divisive does not
    3. Agglomerative produces overlapping clusters; divisive creates non-overlapping clusters
    4. Agglomerative starts with individual points; divisive starts with one big cluster

    Explanation: Agglomerative hierarchical clustering starts with each data point as its own cluster and merges the closest pairs step by step; divisive clustering starts with all data in one cluster and recursively splits it. Neither requires the number of clusters in advance (the tree can be cut afterwards), both produce non-overlapping clusters, and both are unsupervised, so labeled data is not needed. The sketch below traces the bottom-up merges.
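
    A minimal sketch of the agglomerative (bottom-up) direction, assuming SciPy is available. Divisive (top-down) clustering has no standard SciPy routine, so only the merge-based variant is shown, and the toy "document" vectors are invented:

    ```python
    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # Four toy points: two tight pairs far apart from each other.
    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

    # The procedure starts with 4 singleton clusters and merges until one remains.
    # Each row of Z records one merge: (cluster_i, cluster_j, distance, new_size).
    Z = linkage(X, method="average")
    print(Z)  # the two cheap merges come first; the expensive cross-pair merge last
    ```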

  3. Interpreting Dendrograms

    If a dendrogram from hierarchical clustering shows two branches joining at a high vertical distance, what does this indicate about the grouped data?

    1. The two clusters are identical
    2. The two clusters being merged are dissimilar
    3. The merging step occurred early
    4. The data set has no outliers

    Explanation: A high vertical distance on the dendrogram means the two merged clusters are far apart, i.e., dissimilar. Identical or very similar clusters would merge at a low height. The merge height says nothing about outliers, and a merge at a large height happens late in the agglomeration, not early. The correct answer is therefore that the merged clusters are dissimilar. The sketch below reads these heights directly from the linkage matrix.
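
    A minimal sketch, assuming SciPy and Matplotlib; the two well-separated blobs are synthetic. The third column of the linkage matrix holds the merge heights drawn on the dendrogram's vertical axis:

    ```python
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    # Two tight blobs centered at 0 and 5: dissimilar groups, similar members.
    X = np.vstack([rng.normal(0, 0.2, (10, 2)), rng.normal(5, 0.2, (10, 2))])

    Z = linkage(X, method="complete")
    print(Z[0, 2])   # first merge height: small, joining near-identical points
    print(Z[-1, 2])  # final merge height: large, because the two blobs are dissimilar

    dendrogram(Z)
    plt.show()
    ```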

  4. Choosing K in K-Means

    In customer segmentation, which technique is commonly used to select the optimal number of clusters K in K-Means?

    1. Gradient descent
    2. Elbow method
    3. Decision trees
    4. Silhouette coloring

    Explanation: The elbow method is widely used to pick K: plot inertia (the within-cluster sum of squares) against K and look for the bend, or 'elbow,' beyond which adding clusters yields diminishing returns. 'Silhouette coloring' is not a standard technique; silhouette analysis assesses how well points fit their clusters, but that is not the option named here. Gradient descent optimizes continuous model parameters and is not a method for choosing K. Decision trees are a supervised model and do not apply. A sketch of the elbow plot follows.
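
    A minimal sketch of an elbow plot, assuming scikit-learn and Matplotlib; the blob data is synthetic, so the bend conveniently appears at the true center count:

    ```python
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # Fit K-Means for each candidate K and record the inertia (WCSS).
    ks = range(1, 10)
    inertias = [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
    ]

    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("K (number of clusters)")
    plt.ylabel("Inertia (within-cluster sum of squares)")
    plt.show()  # the bend near K=4 marks the point of diminishing returns
    ```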

  5. Cluster Shapes and Algorithm Choice

    If your data shows long, irregularly shaped clusters, which clustering method is more appropriate than K-Means?

    1. Linear regression
    2. Hierarchical clustering
    3. Principal component analysis
    4. K-Means clustering

    Explanation: Hierarchical clustering can handle clusters of varying shapes and sizes, including long or irregular structures, because different linkage criteria (single linkage in particular) define cluster proximity without assuming a compact, round shape. K-Means works best for spherical, similarly sized clusters, making it less suitable here. Linear regression is a prediction method, not a clustering one, and principal component analysis reduces dimensionality rather than grouping points. The sketch below contrasts the two methods on crescent-shaped clusters.
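
    A minimal sketch, assuming scikit-learn; the crescent data comes from make_moons, and the adjusted Rand index against the known generating labels is used only to quantify the difference:

    ```python
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.datasets import make_moons
    from sklearn.metrics import adjusted_rand_score

    X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    ag_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

    # Single linkage chains along each crescent and recovers it; K-Means,
    # biased toward compact spherical clusters, typically splits the moons.
    print("K-Means ARI:      ", adjusted_rand_score(y, km_labels))
    print("Agglomerative ARI:", adjusted_rand_score(y, ag_labels))
    ```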