Clustering problem
I've been tasked to find N clusters containing the most points for 开发者_JAVA百科a certain data set given that the clusters are bounded by a certain size. Currently, I am attempting to do this by plugging in my data into a kd-tree, iterating over the data and finding its nearest neighbor, and then merging the points if the cluster they make does not exceed a limit. I'm not sure this approach will give me a global solution so I'm looking for ways to tweak it. If you can tell me what type of problem this would go under, that'd be great too.
Check out scipy.clustering for a start. Key word searches can then give a lot of info on the different algorithms that are used there. Clustering is a big field, with a lot of research and practical applications, and a number of simple approaches that have been found to work fairly well, so you may not want to start by rolling your own.
This said, clustering algorithms are generally fairly easy to program, and if you do want to program your own, k-means and agglomerative clustering are some of the favorites that are quick to do.
Finally, I'm not sure that your idea of exactly N clusters that are bounded by a certain size is self-consistent, but it depends on exactly what you mean by "size" and "cluster" (are single points a cluster?).
Update:
Following the OP's comments below, I think that the standard clustering methods won't give an optimal solution to this problem because there's not a continuous metric for the "distance" between points that can be optimized. Although they may give a good solution or approximation in some cases. For a clustering approach I'd try k-means since the premise of this method is having a fixed N.
But instead of clustering, this seems more like a covering problem (i.e., you have N rectangles of fixed size and you're trying to cover all of the points with them), but I don't know much about these, so I'll leave it to someone else.
If your number of clusters is fixed and you only want to maximize the number of points that are in these clusters then I think a greedy solution would be good :
- find the rectangle that can contains the maximum number of points,
- remove these points,
- find the next rectangle
- ...
So how to find the rectangle of maximum area A (in fact each rectangle will have this area) that contains the maximum number of points ?
A rectangle is not really common for euclidean distance, before trying to solve this, could you precise if you really need rectangle or just some king of limit on the cluster size ? Would a circle/ellipse work ?
EDIT : greedy will not work (see comment below) and it really need to be rectangles...
link textActually, I think this is really pretty easy with two key assumptions.
1) Assume the by "a certain size" we can say "any cluster must be contained completely within a circle with radius, r".
2) All your points are candidate "seed" points at the center of the cluster.
First calculate all the distances less than r among all points. Now solve a set covering problem using only the feasible edges that are less than r. If any point has a nearest neighbor greater than r distance away, it forms its own cluster.
精彩评论