30 K-Means
SLIDE DECKS
Some of the material presented in this chapter will be discussed in class. It is your responsibility to ensure you cover all the concepts presented both in class and in this textbook.
K-means is an unsupervised learning method that allows us to group data points based on how "close" they are to one another. Specifically, it is an iterative clustering algorithm used to partition a dataset into K distinct, non-overlapping subsets (or clusters). The goal is to minimize the within-cluster variance, i.e., the sum of squared distances from each point to the centroid of its cluster.
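Formally, with clusters C_1, ..., C_K and mu_k the mean (centroid) of the points in C_k, K-means minimizes the within-cluster sum of squared distances:

```latex
J \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```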
Steps
- Initialization: Choose K initial centroids, where K is the number of clusters you want. This can be done randomly or using some heuristic.
- Assignment: Assign each data point to the nearest centroid. All data points assigned to a specific centroid become members of that cluster.
- Update: Calculate the new centroid (mean) of each cluster based on the members of that cluster.
- Repeat: Repeat the assignment and update steps until convergence (i.e., centroids do not change significantly).
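The four steps above can be sketched directly in base R. This is an illustrative implementation (function and variable names are our own), not the optimized `kmeans()` built-in you would use in practice:

```r
# Minimal K-means sketch: initialize, assign, update, repeat.
simple_kmeans <- function(x, k, max_iter = 100, tol = 1e-6) {
  x <- as.matrix(x)
  # Initialization: pick k distinct rows at random as starting centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # Assignment: each point goes to its nearest centroid (Euclidean).
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k, drop = FALSE]
    cluster <- max.col(-d)  # column index of the smallest distance
    # Update: new centroid = mean of the points assigned to it.
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Repeat until centroids stop moving (within tolerance).
    if (max(abs(new_centroids - centroids)) < tol) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

set.seed(42)
fit <- simple_kmeans(iris[, 1:4], k = 3)
table(fit$cluster)
```

The distance computation via `dist()` on the stacked matrix is simple but quadratic in the number of rows; it is fine for small teaching examples.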
Best Practices
- Initialization:
- Use the K-means++ initialization method to select the initial centroids in a way that spreads them apart and accelerates convergence.
- Run the algorithm multiple times with different initializations and pick the best result.
- Choosing K:
- The Elbow Method: Plot the cost (within-cluster variance) as a function of K and look for the ‘elbow’ point where the rate of decrease sharply changes.
- Silhouette analysis can be used to study the separation distance between the resulting clusters.
- Scaling:
- Always scale the features before clustering, especially if they are on different scales; K-means uses distances, so features with larger ranges would otherwise dominate the result.
- Convergence:
- Set a maximum number of iterations to prevent the algorithm from running indefinitely.
- Use a tolerance for centroid movement to determine convergence.
- Outliers:
- K-means is sensitive to outliers. Consider preprocessing your data to remove or handle outliers.
- Dimensionality:
- If the dataset has a high dimensionality, consider using dimensionality reduction techniques like Principal Components Analysis (PCA) before clustering.
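Several of these practices can be combined in a short R session. The sketch below (using the built-in iris data and the cluster package, which is one of several ways to compute silhouettes) scales the features, runs `kmeans()` with multiple random starts, and computes both the elbow and silhouette curves over candidate values of K:

```r
library(cluster)  # for silhouette(); install.packages("cluster") if needed

set.seed(123)
x <- scale(iris[, 1:4])  # scale features that are on different scales

ks  <- 2:6
wss <- numeric(length(ks))  # within-cluster sum of squares (elbow)
sil <- numeric(length(ks))  # average silhouette width

for (i in seq_along(ks)) {
  # nstart = 25 re-runs the algorithm with 25 random initializations
  # and keeps the best (lowest within-cluster variance) result.
  fit <- kmeans(x, centers = ks[i], nstart = 25, iter.max = 100)
  wss[i] <- fit$tot.withinss
  sil[i] <- mean(silhouette(fit$cluster, dist(x))[, "sil_width"])
}

par(mfrow = c(1, 2))
plot(ks, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS",
     main = "Elbow method")
plot(ks, sil, type = "b", xlab = "K", ylab = "Avg. silhouette width",
     main = "Silhouette analysis")
```

Look for the K where the elbow curve bends and the average silhouette width peaks; the two diagnostics often, but not always, agree.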
Extensions
- K-means++: An improved initialization method to avoid the sometimes poor clusterings found by the standard K-means algorithm.
- Mini-Batch K-means: An adaptation of the K-means algorithm that uses mini-batches to reduce computation time.
- Bisecting K-means: A divisive variation that repeatedly splits an existing cluster into two, producing a binary tree (hierarchy) of clusters.
- Spherical K-means: Used when clustering text data or other data in high-dimensional spaces where cluster assignment is based on angles between vectors.
- Fuzzy K-means (or C-means): Each point has a degree of belonging to clusters, rather than belonging completely to just one cluster.
- K-medoids (or PAM, Partitioning Around Medoids): Uses actual data points as cluster centers (medoids) instead of centroids. This can make the algorithm more robust to outliers.
- K-modes: Used for categorical data, where modes (most common values) are used instead of means.
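As one concrete example of these extensions, K-medoids is available in R through `pam()` in the cluster package. A minimal sketch, again on the scaled iris features:

```r
library(cluster)  # provides pam(); install.packages("cluster") if needed

set.seed(1)
x <- scale(iris[, 1:4])

# Partitioning Around Medoids: cluster centers are actual data points
# (medoids), which makes the result more robust to outliers than means.
fit <- pam(x, k = 3)
fit$id.med           # row indices of the medoid observations
table(fit$clustering)
```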
K-means is a powerful and widely used clustering algorithm, but its performance can vary based on initialization, the choice of K, and data quality. Understanding its assumptions and limitations, as well as being aware of advanced techniques and extensions, can help in effectively clustering datasets.
Activity
Explore the following R code.