How can I weight features for better clustering with a very small data set?
I'm working on a program that takes in several (<50) high-dimensional points (1000+ dimensions) in feature space and performs hierarchical clustering on them by recursively applying standard k-clustering.
My problem is that in any one k-clustering pass, different parts of the high-dimensional representation are redundant. I know this problem falls under the umbrella of feature extraction, selection, or weighting.
In general, what does one take into account when selecting a particular feature extraction/selection/weighting algorithm? And specifically, what algorithm would be the best way to prepare my data for clustering in my situation?
Check out this paper:
Witten, D. M. and Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490): 713-726.
Also see the related COSA paper by Friedman. Both discuss these issues in depth.
I would suggest a combination of PCA-based feature selection and k-means.
Find your principal components and order them by weight, then consume those weights at each depth of your hierarchy.
For example, let's assume you have a cluster hierarchy of four depths and you obtain component weights like this (a short sketch of how to compute such weights follows the list):
W1: 0.32
W2: 0.20
W3: 0.18
W4: 0.09
...
W1000: 0.00
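Concretely, the weights above can be taken to be the explained-variance ratios from a PCA. Here is a minimal sketch, assuming scikit-learn and a hypothetical data matrix X (your <50 points as rows); nothing here is prescribed by the answer beyond "find the principal components and their weights":

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for your data: <50 points, 1000+ dimensions.
X = np.random.rand(40, 1000)

pca = PCA()                               # keep all components
scores = pca.fit_transform(X)             # points projected onto the PCs
weights = pca.explained_variance_ratio_   # W1 >= W2 >= ... (sums to 1)

# Note: with n samples you get at most n components, so with <50
# points the tail entries like "W1000: 0.00" above simply don't exist.
print(weights[:4])
```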
We want to consume a weight of 1/N from the top at each depth, where N is the depth count, taking N as 4 here. So 0.25 of the first component's weight gets consumed and we reach:
W1: 0.07*
W2: 0.20
W3: 0.18
W4: 0.09
...
W1000: 0.00
The new score for the first component becomes 0.32 - 0.25 = 0.07. In the second iteration, we consume the top 0.25 again:
W1: 0.00*
W2: 0.02*
W3: 0.18
W4: 0.09
...
W1000: 0.00
The third iteration is:
W1: 0.00
W2: 0.00*
W3: 0.00*
W4: 0.04*
...
W1000: 0.00
And the fourth iteration uses the rest, where the remaining weights sum up to 0.25.
At each iteration, we use only the features whose weight we consume. For example, on the second iteration we only use PC1 and PC2 of the features after the KLT, since those are the only components whose weights we consume. Thus, the components to cluster with at each iteration become (a code sketch of this schedule follows the list):
Iteration 1: PC1
Iteration 2: PC1, PC2
Iteration 3: PC2, PC3, PC4
Iteration 4: PC4, ... PC1000
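The consumption schedule above can be written down directly. Here is a hedged sketch (the helper name consumption_schedule is my own, not from any library) that reproduces the per-iteration PC sets listed above from the descending weights:

```python
def consumption_schedule(weights, n_depths):
    """Return, per depth, the indices of PCs whose weight gets consumed."""
    budget_per_depth = sum(weights) / n_depths  # 1/N of the total weight
    remaining = list(weights)                   # mutable working copy
    schedule = []
    idx = 0                                     # first not-yet-exhausted PC
    for _ in range(n_depths):
        budget = budget_per_depth
        used = []
        while budget > 1e-12 and idx < len(remaining):
            used.append(idx)
            take = min(remaining[idx], budget)  # consume from the top
            remaining[idx] -= take
            budget -= take
            if remaining[idx] <= 1e-12:
                idx += 1                        # this PC is fully spent
        schedule.append(used)
    return schedule

# Toy weights matching the walkthrough (tail sums to 0.21):
weights = [0.32, 0.20, 0.18, 0.09] + [0.21 / 996] * 996
for depth, pcs in enumerate(consumption_schedule(weights, 4), start=1):
    print(f"Iteration {depth}: PC{pcs[0] + 1}..PC{pcs[-1] + 1}")
# Prints (up to floating-point rounding at the tail):
# Iteration 1: PC1..PC1, Iteration 2: PC1..PC2,
# Iteration 3: PC2..PC4, Iteration 4: PC4..PC1000
```

At depth d, you would then run k-means on the score columns listed for iteration d.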
You may target a total weight consumption of less than 1.0 and consume a smaller amount of weight at each iteration. This is effectively the same as filtering out all components beyond your target weight, i.e., dimensionality reduction prior to clustering.
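That cut-off variant is even simpler to sketch: scikit-learn's PCA accepts a float n_components as an explained-variance target. The 0.90 target and n_clusters=4 below are illustrative choices, not prescribed by this answer:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Keep just enough leading PCs to cover 90% of the variance, then
# cluster once on the reduced representation.
pca = PCA(n_components=0.90)       # float in (0, 1) = variance target
reduced = pca.fit_transform(X)     # X as in the first sketch
labels = KMeans(n_clusters=4, n_init=10).fit_predict(reduced)
```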
Finally, I don't know if there is a name for this approach. It just feels natural to use PCA for unsupervised problems. You may also try supervised feature selection after the first iteration, since you have cluster labels at hand.