
Steps to perform document clustering using the k-means algorithm in Java

I need steps to perform document clustering using the k-means algorithm in Java. It would be very helpful if you could outline the steps. Thanks in advance.


You need to count the words in each document and build a feature representation generally called a bag of words. Before that you need to remove stop words (very common words such as "the" and "a" that carry little information). You can generally take the top n most common words across your documents, count the frequency of these words in each document, and store the counts in an n-dimensional vector.
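A minimal sketch of building such term-frequency vectors, assuming simple whitespace/punctuation tokenization and a tiny hand-picked stop-word list (both are simplifications, not part of the original answer):

import java.util.*;

class BagOfWords {
    // Simplified stop-word list; a real one would be much larger.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "and", "to"));

    // Pick the n most common non-stop words across all documents.
    static List<String> vocabulary(List<String> docs, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            for (String w : doc.toLowerCase().split("\\W+")) {
                if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        }
        List<String> vocab = new ArrayList<>(counts.keySet());
        vocab.sort((a, b) -> counts.get(b) - counts.get(a));
        return vocab.subList(0, Math.min(n, vocab.size()));
    }

    // Count how often each vocabulary word occurs in one document.
    static double[] vector(String doc, List<String> vocab) {
        double[] v = new double[vocab.size()];
        for (String w : doc.toLowerCase().split("\\W+")) {
            int i = vocab.indexOf(w);
            if (i >= 0) v[i]++;
        }
        return v;
    }
}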

For the distance measure you can use cosine similarity (or 1 minus it, as a distance).
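A small sketch of cosine similarity between two term-frequency vectors; the zero-vector guard is an assumption I added:

// Cosine similarity between two vectors of equal length;
// 1 - similarity can serve as the distance for clustering.
static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) return 0;  // avoid division by zero for empty vectors
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}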

Here is a simple algorithm for 2-means on 1-dimensional data points. You can extend it to k-means and n-dimensional data points easily. Let me know if you want an n-dimensional implementation.


import java.util.ArrayList;
import java.util.Random;

double[] x = {1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7, 8, 8.5, 9, 9.5, 10};

double[] center = new double[2];
double[] precenter = new double[2];
ArrayList<Double>[] cluster = new ArrayList[2];
cluster[0] = new ArrayList<Double>();
cluster[1] = new ArrayList<Double>();

// Pick 2 distinct random indices from 0 to x.length - 1 as the initial centers.
// (There is a better way to generate k random numbers without replacement; just search.)
int[] idx = new int[2];
Random random = new Random();
idx[0] = random.nextInt(x.length);
idx[1] = random.nextInt(x.length);
while (idx[0] == idx[1]) {
    idx[1] = random.nextInt(x.length);
}
center[0] = x[idx[0]];
center[1] = x[idx[1]];

do {
    cluster[0].clear();
    cluster[1].clear();
    // Assign each point to its nearest center.
    for (int i = 0; i < x.length; ++i) {
        if (Math.abs(x[i] - center[0]) <= Math.abs(x[i] - center[1])) {
            cluster[0].add(x[i]);
        } else {
            cluster[1].add(x[i]);
        }
    }
    // Remember the old centers, then move each center to the mean of its cluster.
    precenter[0] = center[0];
    precenter[1] = center[1];
    center[0] = mean(cluster[0]);
    center[1] = mean(cluster[1]);
} while (precenter[0] != center[0] || precenter[1] != center[1]);

double mean(ArrayList<Double> list) {
    double sum = 0;
    for (int index = 0; index < list.size(); ++index) {
        sum += list.get(index);
    }
    return sum / list.size();
}

cluster[0] and cluster[1] contain the points in each cluster, and center[0] and center[1] are the 2 means. You may still need to do some debugging because I originally wrote the code in R and just converted it to Java for you :)


Does this help you? The Wikipedia article on k-means also has links to implementations in other languages that are ready to be ported to Java.

Steps of the algorithm (a rough Java sketch for n-dimensional points follows the list):

  1. Define the number of clusters you want to have.
  2. Distribute the cluster points randomly in your problem space.
  3. Link every observation to the nearest cluster point.
  4. Calculate the center of mass for each cluster and move its point to that middle.
  5. Link the observations to the center points again and repeat until the points don't move any more.
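A minimal sketch of these steps for k clusters of n-dimensional vectors (such as the term-frequency vectors above). Euclidean distance and the fixed iteration cap are my own assumptions here; you could swap in 1 minus cosine similarity instead:

import java.util.Random;

// Rough k-means sketch for n-dimensional points; not a polished implementation.
class SimpleKMeans {
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // points: one row per document vector; returns the cluster index of each point.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points[0].length;
        double[][] centers = new double[k][];
        Random random = new Random();
        for (int c = 0; c < k; c++) {                      // step 2: random initial centers
            centers[c] = points[random.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {      // step 3: link to nearest center
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(points[p], centers[c]) < distance(points[p], centers[best])) best = c;
                }
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            double[][] sums = new double[k][n];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {      // step 4: centers of mass
                counts[assignment[p]]++;
                for (int d = 0; d < n; d++) sums[assignment[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < n; d++) centers[c][d] = sums[c][d] / counts[c];
                }
            }
            if (!changed) break;                           // step 5: stop when nothing moves
        }
        return assignment;
    }
}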


What do you want to cluster the documents based on? If it's by similarity you'll need to do some natural language processing first, and then you'll need a metric (some kind of assignment algorithm) to place the documents into clusters (CRP, the Chinese restaurant process, works and is relatively straightforward).

The hardest part will be the NLP (natural language processing) if you're not clustering them based on something simple like length. I can provide more info on all of these, but I won't dive down the rabbit hole if you don't need it.

