
Steps to perform document clustering using the k-means algorithm in Java

I need steps to perform document clustering using the k-means algorithm in Java. It would be very helpful if you could outline the steps. Thanks in advance.


You need to count the words in each document and build a feature representation generally called a bag of words. Before that you need to remove stop words (very common words such as "the" and "a" that carry little information). You can generally take the top n most common words across your documents, count the frequency of these words in each document, and store the counts in an n-dimensional vector.
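A minimal sketch of building such term-frequency vectors, assuming simple whitespace/punctuation tokenization and a tiny hand-picked stop-word list (both are simplifications, not part of the original answer):

import java.util.*;

class BagOfWords {
    // Simplified stop-word list; a real one would be much larger.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "and", "to"));

    // Pick the n most common non-stop words across all documents.
    static List<String> vocabulary(List<String> docs, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            for (String w : doc.toLowerCase().split("\\W+")) {
                if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        }
        List<String> vocab = new ArrayList<>(counts.keySet());
        vocab.sort((a, b) -> counts.get(b) - counts.get(a));
        return vocab.subList(0, Math.min(n, vocab.size()));
    }

    // Count how often each vocabulary word occurs in one document.
    static double[] vector(String doc, List<String> vocab) {
        double[] v = new double[vocab.size()];
        for (String w : doc.toLowerCase().split("\\W+")) {
            int i = vocab.indexOf(w);
            if (i >= 0) v[i]++;
        }
        return v;
    }
}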

For the distance measure you can use cosine similarity (or 1 minus it, as a distance).
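A small sketch of cosine similarity between two term-frequency vectors; the zero-vector guard is an assumption I added:

// Cosine similarity between two vectors of equal length;
// 1 - similarity can serve as the distance for clustering.
static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) return 0;  // avoid division by zero for empty vectors
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}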

Here is a simple algorithm for 2-means on 1-dimensional data points. You can extend it to k-means and n-dimensional data points easily. Let me know if you want an n-dimensional implementation.


import java.util.ArrayList;
import java.util.Random;

double[] x = {1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7, 8, 8.5, 9, 9.5, 10};

double[] center = new double[2];
double[] precenter = new double[2];
ArrayList<Double>[] cluster = new ArrayList[2];
cluster[0] = new ArrayList<Double>();
cluster[1] = new ArrayList<Double>();

// Pick 2 distinct random indices from 0 to x.length - 1 as the initial centers.
// (There is a better way to generate k random numbers without replacement; just search.)
int[] idx = new int[2];
Random random = new Random();
idx[0] = random.nextInt(x.length);
idx[1] = random.nextInt(x.length);
while (idx[0] == idx[1]) {
    idx[1] = random.nextInt(x.length);
}
center[0] = x[idx[0]];
center[1] = x[idx[1]];

do {
    cluster[0].clear();
    cluster[1].clear();
    // Assign each point to its nearest center.
    for (int i = 0; i < x.length; ++i) {
        if (Math.abs(x[i] - center[0]) <= Math.abs(x[i] - center[1])) {
            cluster[0].add(x[i]);
        } else {
            cluster[1].add(x[i]);
        }
    }
    // Remember the old centers, then move each center to the mean of its cluster.
    precenter[0] = center[0];
    precenter[1] = center[1];
    center[0] = mean(cluster[0]);
    center[1] = mean(cluster[1]);
} while (precenter[0] != center[0] || precenter[1] != center[1]);

double mean(ArrayList<Double> list) {
    double sum = 0;
    for (int index = 0; index < list.size(); ++index) {
        sum += list.get(index);
    }
    return sum / list.size();
}

cluster[0] and cluster[1] contain the points in each cluster, and center[0] and center[1] are the 2 means. You may still need to do some debugging because I originally wrote the code in R and just converted it to Java for you :)


Does this help you? The Wikipedia article on k-means also has links to implementations in other languages that are ready to be ported to Java.

Steps of the algorithm (a rough Java sketch for n-dimensional points follows the list):

  1. Define the number of clusters you want to have.
  2. Distribute the cluster points randomly in your problem space.
  3. Link every observation to the nearest cluster point.
  4. Calculate the center of mass for each cluster and move its point to that middle.
  5. Link the observations to the center points again and repeat until the points don't move any more.
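A minimal sketch of these steps for k clusters of n-dimensional vectors (such as the term-frequency vectors above). Euclidean distance and the fixed iteration cap are my own assumptions here; you could swap in 1 minus cosine similarity instead:

import java.util.Random;

// Rough k-means sketch for n-dimensional points; not a polished implementation.
class SimpleKMeans {
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // points: one row per document vector; returns the cluster index of each point.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points[0].length;
        double[][] centers = new double[k][];
        Random random = new Random();
        for (int c = 0; c < k; c++) {                      // step 2: random initial centers
            centers[c] = points[random.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {      // step 3: link to nearest center
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(points[p], centers[c]) < distance(points[p], centers[best])) best = c;
                }
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            double[][] sums = new double[k][n];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {      // step 4: centers of mass
                counts[assignment[p]]++;
                for (int d = 0; d < n; d++) sums[assignment[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < n; d++) centers[c][d] = sums[c][d] / counts[c];
                }
            }
            if (!changed) break;                           // step 5: stop when nothing moves
        }
        return assignment;
    }
}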


What do you want to cluster the documents based on? If it's by similarity you'll need to do some natural language processing first, and then you'll need a metric (some kind of assignment algorithm) to place the documents into clusters (CRP, the Chinese restaurant process, works and is relatively straightforward).

The hardest part will be the NLP (natural language processing) if you're not clustering them based on something simple like length. I can provide more info on all of these, but I won't dive down the rabbit hole if you don't need it.

