KMeans
E97072
clustering algorithm
iterative optimization algorithm
partition-based clustering method
unsupervised learning algorithm
KMeans is a popular unsupervised machine learning algorithm used for partitioning data into a specified number of clusters based on feature similarity.
Statements (50)
| Predicate | Object |
|---|---|
| instanceOf |
clustering algorithm
→
iterative optimization algorithm → partition-based clustering method → unsupervised learning algorithm → |
| advantage |
computationally efficient for large datasets
→
scales linearly with number of samples and clusters in practice → simple to implement → |
| alsoKnownAs |
Lloyd’s algorithm
→
k-means clustering → |
| assumes |
Euclidean feature space in standard form
→
clusters are roughly spherical → clusters have similar size → |
| basedOn |
minimization of within-cluster sum of squares
→
|
| canUseDistanceMetric |
other Lp distances with modifications
→
|
| commonlyUsedIn |
customer segmentation
→
document clustering → image compression → pattern recognition → |
| convergesWhen |
change in objective function is below a threshold
→
cluster assignments no longer change → |
| distanceMetric |
Euclidean distance (standard)
→
|
| implementedIn |
Apache Spark MLlib
→
MATLAB Statistics and Machine Learning Toolbox → R stats and cluster packages → scikit-learn → |
| input |
number of clusters k
→
set of data points → |
| limitation |
cannot automatically determine optimal number of clusters
→
may converge to local minima → not robust to noise and outliers → performs poorly on non-spherical clusters → |
| objectiveFunction |
minimize sum of squared distances between points and their assigned cluster centroid
→
|
| optimizationProblem |
NP-hard in general
→
|
| relatedAlgorithm |
Gaussian mixture models
→
fuzzy c-means → k-medoids → |
| requires |
numerical feature representation
→
predefined number of clusters k → |
| sensitiveTo |
feature scaling
→
initialization → outliers → |
| step |
assign each point to nearest centroid
→
iterate assignment and update until convergence → recompute centroids as mean of assigned points → |
| typicalInitialization |
k-means++ initialization
→
random selection of initial centroids → |
| usedFor |
data compression
→
partitioning data into k clusters → prototype-based clustering → vector quantization → |
Referenced by (1)
| Subject (surface form when different) | Predicate |
|---|---|
|
scikit-learn
→
|
hasConcept |