How can I weight features for better clustering with a very small data set?
I'm working on a program that takes in several (<50) high-dimensional points (1000+ dimensions) in feature space and performs hierarchical clustering on them by recursively applying standard k-clustering.
My problem is that in any one k-clustering pass, different parts of the high-dimensional representation are redundant. I know this problem falls under the umbrella of feature extraction, selection, or weighting.
In general, what does one take into account when selecting a particular feature extraction/selection/weighting algorithm? And specifically, what algorithm would be the best way to prepare my data for clustering in my situation?
Check out this paper:
Witten, D. M. and Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490): 713-726.
Also see the related COSA paper by Friedman. Both discuss these issues in depth.
I would suggest a combination of PCA-based feature selection and k-means.
Find your principal components and order them by weight, then consume those weights at each depth of your hierarchy.
For example, let's assume you have a cluster hierarchy of four depths and you obtain component weights like this (a short sketch of how to compute such weights follows the list):
W1: 0.32
W2: 0.20
W3: 0.18
W4: 0.09
...
W1000: 0.00
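Concretely, the weights above can be taken to be the explained-variance ratios from a PCA. Here is a minimal sketch, assuming scikit-learn and a hypothetical data matrix X (your <50 points as rows); nothing here is prescribed by the answer beyond "find the principal components and their weights":

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for your data: <50 points, 1000+ dimensions.
X = np.random.rand(40, 1000)

pca = PCA()                               # keep all components
scores = pca.fit_transform(X)             # points projected onto the PCs
weights = pca.explained_variance_ratio_   # W1 >= W2 >= ... (sums to 1)

# Note: with n samples you get at most n components, so with <50
# points the tail entries like "W1000: 0.00" above simply don't exist.
print(weights[:4])
```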
We want to consume a weight of 1/N from the top at each depth, where N is the depth count, taking N as 4 here. So 0.25 of the first component's weight gets consumed and we reach:
W1: 0.07*
W2: 0.20
W3: 0.18
W4: 0.09
...
W1000: 0.00
The new score for the first component becomes 0.32 - 0.25 = 0.07. In the second iteration, we consume the top 0.25 again:
W1: 0.00*
W2: 0.02*
W3: 0.18
W4: 0.09
...
W1000: 0.00
The third iteration is:
W1: 0.00
W2: 0.00*
W3: 0.00*
W4: 0.04*
...
W1000: 0.00
And the fourth iteration uses the rest, where the remaining weights sum up to 0.25.
At each iteration, we use only the features whose weight we consume. For example, on the second iteration we only use PC1 and PC2 of the features after the KLT, since those are the only components whose weights we consume. Thus, the components to cluster with at each iteration become (a code sketch of this schedule follows the list):
Iteration 1: PC1
Iteration 2: PC1, PC2
Iteration 3: PC2, PC3, PC4
Iteration 4: PC4, ... PC1000
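The consumption schedule above can be written down directly. Here is a hedged sketch (the helper name consumption_schedule is my own, not from any library) that reproduces the per-iteration PC sets listed above from the descending weights:

```python
def consumption_schedule(weights, n_depths):
    """Return, per depth, the indices of PCs whose weight gets consumed."""
    budget_per_depth = sum(weights) / n_depths  # 1/N of the total weight
    remaining = list(weights)                   # mutable working copy
    schedule = []
    idx = 0                                     # first not-yet-exhausted PC
    for _ in range(n_depths):
        budget = budget_per_depth
        used = []
        while budget > 1e-12 and idx < len(remaining):
            used.append(idx)
            take = min(remaining[idx], budget)  # consume from the top
            remaining[idx] -= take
            budget -= take
            if remaining[idx] <= 1e-12:
                idx += 1                        # this PC is fully spent
        schedule.append(used)
    return schedule

# Toy weights matching the walkthrough (tail sums to 0.21):
weights = [0.32, 0.20, 0.18, 0.09] + [0.21 / 996] * 996
for depth, pcs in enumerate(consumption_schedule(weights, 4), start=1):
    print(f"Iteration {depth}: PC{pcs[0] + 1}..PC{pcs[-1] + 1}")
# Prints (up to floating-point rounding at the tail):
# Iteration 1: PC1..PC1, Iteration 2: PC1..PC2,
# Iteration 3: PC2..PC4, Iteration 4: PC4..PC1000
```

At depth d, you would then run k-means on the score columns listed for iteration d.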
You may target a total weight consumption of less than 1.0 and consume a smaller amount of weight at each iteration. This is effectively the same as filtering out all components beyond your target weight, i.e., dimensionality reduction prior to clustering.
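That cut-off variant is even simpler to sketch: scikit-learn's PCA accepts a float n_components as an explained-variance target. The 0.90 target and n_clusters=4 below are illustrative choices, not prescribed by this answer:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Keep just enough leading PCs to cover 90% of the variance, then
# cluster once on the reduced representation.
pca = PCA(n_components=0.90)       # float in (0, 1) = variance target
reduced = pca.fit_transform(X)     # X as in the first sketch
labels = KMeans(n_clusters=4, n_init=10).fit_predict(reduced)
```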
Finally, I don't know if there is a name for this approach. It just feels natural to use PCA for unsupervised problems. You may also try supervised feature selection after the first iteration, since you have cluster labels at hand.