K-Means Algorithm [duplicate]

2023-03-13 06:32 问答作者：

This question already has answers here: 开发者_运维技巧 Closed 11 years ago.

Possible Duplicates:
How to optimal K in K - Means Algorithm
How do I determine k when using k-means clustering?

Depending on the statistical measures can we decide on the K. Like Standard Deviation, Mean, Variance etc., Or

Is there any simple method to choose the K in K-means Algorithm?

Thanks in advance Navin

If you explicitly want to use k-means you could study the article describing x-means. When using an implementation of x-means the only difference compared to k-means, is that rather than specifying a single k, you specify a range for k. The "best" choice, wrt. some measure, in the range will be part of the output from x-means. You can also look into the Mean Shift clustering algorithm.

If it is computationally feasible with your given data (possibly using sampling as yura suggests), you could do clustering with various k's and evalute the quality of the resulting clusters using some of the standard cluster validity measures. Some of the classic measures are described here: measures.

@doug It is not correct that k-means++ determines an optimal k for the number of clusters before cluster assignments start. k-means++ differs from k-means only by instead of randomly choosing the initial k centroids, it chooses one initial centroid randomly and successively chooses centers until k has been chosen. After the initial completely random choice, data points are chosen as a new centroid with a probability that is determined by a potential function which depends on the datapoint's distance to the already chosen centers. The standard reference for k-means++ is k-means++: The Advantages of Careful Seeding by Arthur and Vassilvitskii.

Also, I don't think that in general choosing k to be the number of principal components will improve your clustering. Imagine data points in three-dimensional space all lying in a plane passing through the origo. You will then get 2 principal components, but the "natural" clustering of the points could have any number of clusters.

Unfortunately not. There isn't a principled statistical method, simple or complex that can set the "right K". There are heuristics, rules of thumb that sometimes work, sometimes don't.

The situation is more general as many clustering methods have these type of parameters.

Well there are two practical solutions to the the problem of intelligent selection of the number of centroids (k) in common use.

The first is to PCA your data, and the output from PCA--which is the principal components (eigenvectors) and their cumulate contribution to the variation observed in the data--obviously suggests an optimal number of centroids. (E.g., if 95% of the variability in your data is explained by the first three principal components, then k=3 is a wise choice for k-means.)

The second commonly used practical solution to intelligently estimate k is is a revised implementation of the k-means algorithm, called k-means++. In essence, k-means++ just differs from the original k-means by the additional of a pre-processing step. During this step, the number and initial position of the centroids and estimated.

The algorithm that k-means++ relies on to do this is straightforward to understand and to implement in code. A good source for both is a 2007 Post in the LingPipe Blog, which offers an excellent explanation of k-means++ as well as includes a citation to the original paper that first introduced this technique.

Aside from providing an optimal choice for k, k-means++ is apparently superior to the original k-means in both performance (roughly 1/2 processing time compared with k-means in one published comparison) and accuracy (three orders of magnitude improvement in error in the same comparison study).

Bayesian k-means may be a solution when you don't know the number of clusters. There's a related paper given in the website and the corresponding MATLAB code is also given.

The best solution for unkown(by statistical paramters model etc) ML problem is to sample data and find parameters thet best for sub problem, then use them on full problem. In that case select best K for 5% of data.

继续阅读：cluster-analysis data-mining k-means machine-learning

K-Means Algorithm [duplicate]

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？