Explore fundamental concepts of clustering algorithms, including K-Means, Hierarchical, and DBSCAN, focusing on their characteristics, use cases, and differences. This quiz helps you reinforce your knowledge of clustering techniques, parameters, and key principles essential for data science and unsupervised learning.
Which step is performed first in the K-Means clustering algorithm when grouping a set of data points?
Explanation: The first step in K-Means clustering is to randomly initialize the cluster centers (also known as centroids). This serves as the starting point before the algorithm iteratively reassigns points and updates the centroids. Calculating distances occurs after initialization, not before it. Sorting data points is not a standard part of the algorithm, and merging is associated with hierarchical clustering, not K-Means.
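As an illustration of this initialization step, here is a minimal sketch (assuming NumPy and a 2-D data array; the function name `init_centroids` is ours, not part of any library) that picks K random data points as the starting centroids:

```python
import numpy as np

def init_centroids(X, k, seed=0):
    """Pick k distinct random data points to serve as the initial centroids."""
    rng = np.random.default_rng(seed)
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices]

# Example: 100 random 2-D points, 3 initial centroids
X = np.random.default_rng(42).normal(size=(100, 2))
print(init_centroids(X, k=3))
```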
What is a key limitation of K-Means clustering when applied to data with complex, non-spherical cluster shapes?
Explanation: K-Means works best with clusters that are roughly spherical (circular in 2D), because it assigns each point to its nearest centroid by Euclidean distance, which carves the space into convex regions that cannot follow elongated or crescent-shaped clusters. It does not inherently cause all data to merge into one cluster, nor does it always produce overlapping clusters. While K-Means can be sensitive to outliers, it does not ignore them entirely.
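One quick way to see this limitation, sketched here under the assumption that scikit-learn is available, is to run K-Means on two crescent-shaped clusters where the spherical assumption breaks down:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving crescents: connected and dense, but far from spherical
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true crescents is typically well below 1, because
# K-Means splits the plane into two roughly circular regions instead
print("adjusted Rand index:", adjusted_rand_score(y_true, labels))
```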
In the DBSCAN algorithm, which feature differentiates it from K-Means and Hierarchical clustering?
Explanation: DBSCAN can identify noise points that don't belong to any cluster by analyzing density, a capability not present in traditional K-Means or agglomerative hierarchical clustering. Unlike K-Means, DBSCAN does not require specifying the number of clusters. The result is independent of data sorting, and DBSCAN does not use centroids.
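In scikit-learn, for instance, DBSCAN marks such noise points with the label -1; the following sketch uses illustrative parameter values and synthetic data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight blobs plus one far-away outlier
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.1, size=(50, 2)),
    [[10.0, 10.0]],  # an isolated point that should be flagged as noise
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("noise points:", int(np.sum(labels == -1)))  # the outlier gets label -1
```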
What does a dendrogram represent in hierarchical clustering?
Explanation: A dendrogram visually displays how clusters are merged step by step in hierarchical clustering, resembling a tree structure. It is not simply a chart of distances, although the merge distances do appear on the vertical axis. A table of centroids would be relevant for K-Means, and a density map pertains more to DBSCAN.
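A dendrogram can be drawn with SciPy and Matplotlib, for example; this is a small illustrative sketch on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))

# Each merge recorded in the linkage matrix becomes one junction in the tree
Z = linkage(X, method="ward")
dendrogram(Z)
plt.ylabel("merge distance")
plt.show()
```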
Which two main parameters must be defined when using the DBSCAN algorithm?
Explanation: DBSCAN requires Epsilon (maximum distance for neighborhood search) and MinPts (minimum points to form a dense cluster). Parameters like Alpha, Beta, Gamma, and Delta are unrelated to DBSCAN. 'Iterations' and 'K' pertain to iterative algorithms and K-Means specifically.
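These two parameters map directly onto keyword arguments in common implementations; in scikit-learn (assumed here) they are called `eps` and `min_samples`, and varying Epsilon changes how many clusters and noise points are found:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)

# Epsilon -> eps (neighborhood radius), MinPts -> min_samples (density threshold)
for eps in (0.3, 0.6, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {list(labels).count(-1)} noise points")
```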
Which method is commonly used to select an appropriate value of K in the K-Means algorithm?
Explanation: The Elbow method is frequently used to decide the optimal number of clusters by plotting the sum of squared errors against the number of clusters and looking for the point where the curve bends. The Silhouette method is also useful but is typically a complementary check rather than the default starting point. Centroid swap is not a standard method, and dendrograms are used for hierarchical clustering, not K-Means.
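A minimal elbow-method sketch, assuming scikit-learn and Matplotlib; the "elbow" itself is read off the plot by eye:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Sum of squared errors (inertia) for each candidate K
ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("sum of squared errors")
plt.show()  # the bend ('elbow') in the curve suggests a reasonable K
```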
What is the main difference between agglomerative and divisive hierarchical clustering?
Explanation: Agglomerative hierarchical clustering begins with each point as its own cluster and merges them bottom-up, whereas divisive clustering starts with all points in a single cluster and recursively splits them top-down. Sorting by size or density does not differentiate these methods, and neither relies on the Elbow criterion or on DBSCAN-style density estimation to form clusters.
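Agglomerative (bottom-up) clustering is the variant most libraries implement; a sketch with scikit-learn is shown below (divisive clustering has no direct scikit-learn equivalent, so it is omitted):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up: every point starts as its own cluster, then pairs are merged
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print("cluster sizes:", [int((labels == c).sum()) for c in range(3)])
```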
After running K-Means clustering, you receive a set of centroids and labels for each data point. What does each centroid represent?
Explanation: Each centroid corresponds to the mean position (average) of all points assigned to its cluster in feature space. It is not the point farthest from the cluster; centroids are, by construction, centrally located. A highest-density area is closer to DBSCAN's core-point concept, and a smallest region is not a property of centroids.
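This can be checked directly once the algorithm has converged: the fitted centroids equal the per-cluster means of the assigned points. A sketch assuming scikit-learn and NumPy:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Recompute each centroid as the mean of the points labelled with that cluster;
# after convergence this should match the fitted cluster_centers_
recomputed = np.array([X[km.labels_ == c].mean(axis=0) for c in range(3)])
print(np.allclose(recomputed, km.cluster_centers_))
```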
How does DBSCAN determine if a point should be added to a cluster?
Explanation: A point in DBSCAN is added to a cluster if it has enough neighboring points (at least MinPts) within a given radius (Epsilon), which is how the algorithm identifies dense regions. Centroid assignment is a feature of K-Means, not DBSCAN. Random initialization is likewise relevant to K-Means, and DBSCAN does not involve sorting clusters by size.
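The density rule itself is simple to express by hand; this pure-NumPy sketch (with illustrative parameter values and a helper name of our own, `is_dense`) counts neighbors within radius Epsilon to decide whether a point sits in a dense region:

```python
import numpy as np

def is_dense(X, i, eps=0.5, min_pts=5):
    """Return True if point i has at least min_pts neighbors within eps
    (counting itself), i.e. it lies in a region DBSCAN would treat as dense."""
    distances = np.linalg.norm(X - X[i], axis=1)
    return np.sum(distances <= eps) >= min_pts

X = np.random.default_rng(0).normal(size=(100, 2))
print(is_dense(X, 0))
```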
Which statement best describes how K-Means deals with data points that are equally close to two centroids?
Explanation: If a data point is equally close to two or more centroids in K-Means, it is assigned to one of them arbitrarily, typically whichever centroid comes first under the implementation's tie-breaking rule. K-Means does not support fractional memberships as soft clustering methods do, nor does it flag such points as noise; merging centroids is not a standard operation in this algorithm.
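In practice the tie is resolved by the implementation's argmin rule; with NumPy, for example, `argmin` returns the first index when distances are exactly equal, as this small illustrative sketch shows:

```python
import numpy as np

point = np.array([0.0, 0.0])
centroids = np.array([[1.0, 0.0], [-1.0, 0.0]])  # both exactly distance 1 away

distances = np.linalg.norm(centroids - point, axis=1)
print(distances)             # [1. 1.] -- an exact tie
print(np.argmin(distances))  # 0: the first centroid wins the tie-break
```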