Interpreting output from mahout clusterdumper

2023-02-28 21:56

I ran a clustering test on crawled pages (more than 25K documents; a personal data set) and then ran clusterdump:

$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt

The cluster dumper output shows 25 elements of the form "VL-xxxxx{}":

VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]}
...
VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ...] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ...]}

How should I interpret this output?

In short: I am looking for the document ids that belong to a particular cluster.

What is the meaning of :

VL-x ?
n=y c=[z:z', ...]
r=[z'':z''', ...]

Does 0:0.017 mean that "0" is the id of a document which belongs to this cluster?

I have already read on the Mahout wiki pages what CL, n, c and r mean. But can someone please explain them to me better, or point to a resource where they are explained in more detail?

Sorry if I am asking stupid questions, but I am a newbie with Apache Mahout and am using it for clustering as part of my course assignment.

By default, k-means clustering uses WeightedVector, which does not include the data point name, so you need to build the sequence files yourself using NamedVector. There is a one-to-one correspondence between the number of sequence files and the number of map tasks, so if your mapping capacity is 12, chop your data into 12 pieces when writing the sequence files. Create each NamedVector like this:
```
vector = new NamedVector(new SequentialAccessSparseVector(Cardinality),arrField[0]);
```
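The "chop your data into 12 pieces" step can be sketched without any Hadoop or Mahout dependency. The following is only an illustration of a round-robin split; the class name `ChunkSplitter`, the method `split` and the sample record names are hypothetical, and each resulting chunk would still have to be written to its own sequence file with your own code.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {

    /** Distributes the records round-robin over the requested number of pieces. */
    public static <T> List<List<T>> split(List<T> records, int pieces) {
        if (pieces <= 0) {
            throw new IllegalArgumentException("pieces must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < pieces; i++) {
            chunks.add(new ArrayList<>());
        }
        for (int i = 0; i < records.size(); i++) {
            // Record i goes to chunk i mod pieces, so chunk sizes differ by at most one.
            chunks.get(i % pieces).add(records.get(i));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            docs.add("doc-" + i); // hypothetical document names
        }
        List<List<String>> chunks = split(docs, 12);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println("chunk " + i + ": " + chunks.get(i).size() + " records");
        }
    }
}
```

Round-robin keeps the pieces balanced, which matters if each piece ends up as the input of one map task.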

Basically you need to download the clusteredPoints from your HDFS system and write your own code to output the results. Here is the code that I wrote to output the cluster point membership.

```
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.math.NamedVector;

public class ClusterOutput {

    /**
     * @param args args[0]: local folder with the downloaded clusteredPoints part files,
     *             args[1]: output file for "vector name, cluster id" lines,
     *             args[2]: output file for "cluster id, point count" lines
     */
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            File pointsFolder = new File(args[0]);
            File[] files = pointsFolder.listFiles();
            BufferedWriter bw = new BufferedWriter(new FileWriter(new File(args[1])));
            // Number of points seen per cluster id.
            Map<String, Integer> clusterIds = new HashMap<String, Integer>(5000);
            for (File file : files) {
                // Skip anything that is not a part-m-* file.
                if (file.getName().indexOf("part-m") < 0) {
                    continue;
                }
                SequenceFile.Reader reader =
                        new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
                IntWritable key = new IntWritable();
                WeightedVectorWritable value = new WeightedVectorWritable();
                while (reader.next(key, value)) {
                    // The key is the cluster id. The cast only works if the
                    // input vectors were written as NamedVectors.
                    NamedVector vector = (NamedVector) value.getVector();
                    String vectorName = vector.getName();
                    String clusterId = key.toString();
                    bw.write(vectorName + "\t" + clusterId + "\n");
                    Integer count = clusterIds.get(clusterId);
                    clusterIds.put(clusterId, count == null ? 1 : count + 1);
                }
                bw.flush();
                reader.close();
            }
            bw.close();
            // Write the per-cluster point counts.
            bw = new BufferedWriter(new FileWriter(new File(args[2])));
            for (String key : clusterIds.keySet()) {
                bw.write(key + " " + clusterIds.get(key) + "\n");
            }
            bw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
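The first output file written by the code above holds one "vector name, tab, cluster id" line per point. To get from there to "which documents belong to a particular cluster", the lines can be grouped by cluster id. This is a minimal sketch using only the JDK; the class name `ClusterMembership`, the method `group` and the sample document names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterMembership {

    /** Groups "name<TAB>clusterId" lines into clusterId -> list of names. */
    public static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> byCluster = new TreeMap<>();
        for (String line : lines) {
            // Split on the last tab so that names containing tabs still work.
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // not a membership line
            }
            String name = line.substring(0, tab);
            String clusterId = line.substring(tab + 1).trim();
            byCluster.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(name);
        }
        return byCluster;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines in the format produced by ClusterOutput.
        List<String> lines = Arrays.asList(
                "doc-a\t24130",
                "doc-b\t24868",
                "doc-c\t24130");
        for (Map.Entry<String, List<String>> e : group(lines).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

For a real run the lines would be read from the file passed as args[1] to ClusterOutput, for example with `java.nio.file.Files.readAllLines`.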

To complete the answer:

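VL-x: the identifier of the cluster.
n=y: the number of elements in the cluster.
c=[z:z', ...]: the centroid of the cluster. Each z is the index of a dimension and z' is the centroid's weight in that dimension, so 0:0.017 refers to dimension 0, not to a document id.
r=[z'':z''', ...]: the radius of the cluster, given per dimension in the same index:value form.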
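To make the fields above concrete, here is a small sketch that parses one clusterdump line into its identifier, point count, centroid and radius. It assumes the sparse "index:value" layout shown in the question (other Mahout versions or dense vectors may print differently), and the class name `ClusterDumpLine` and its fields are hypothetical, not part of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClusterDumpLine {
    // Matches e.g. VL-24130{n=1312 c=[0:0.017, ...] r=[0:0.442, ...]}
    private static final Pattern LINE = Pattern.compile(
            "(\\w+-\\d+)\\{n=(\\d+)\\s+c=\\[(.*?)\\]\\s*r=\\[(.*?)\\]\\s*\\}");

    public final String id;                     // cluster identifier, e.g. VL-24130
    public final int n;                         // number of points in the cluster
    public final Map<Integer, Double> centroid; // dimension index -> weight
    public final Map<Integer, Double> radius;   // dimension index -> radius

    private ClusterDumpLine(String id, int n,
                            Map<Integer, Double> centroid, Map<Integer, Double> radius) {
        this.id = id;
        this.n = n;
        this.centroid = centroid;
        this.radius = radius;
    }

    public static ClusterDumpLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a cluster line: " + line);
        }
        return new ClusterDumpLine(m.group(1), Integer.parseInt(m.group(2)),
                parseSparse(m.group(3)), parseSparse(m.group(4)));
    }

    /** Parses "0:0.017, 10:0.007, ..." and skips entries without a colon. */
    static Map<Integer, Double> parseSparse(String body) {
        Map<Integer, Double> out = new LinkedHashMap<>();
        for (String entry : body.split(",")) {
            String e = entry.trim();
            int colon = e.indexOf(':');
            if (colon < 0) {
                continue; // e.g. an elided "..." or an empty body
            }
            out.put(Integer.parseInt(e.substring(0, colon).trim()),
                    Double.parseDouble(e.substring(colon + 1).trim()));
        }
        return out;
    }

    public static void main(String[] args) {
        ClusterDumpLine c = parse(
                "VL-24130{n=1312 c=[0:0.017, 10:0.007, ... ] r=[0:0.442, 10:0.271, ... ]}");
        System.out.println(c.id + " has " + c.n + " points and "
                + c.centroid.size() + " parsed centroid dimensions");
    }
}
```

Note that nothing in such a line identifies individual documents; the membership has to come from the clusteredPoints output.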

More info here: https://mahout.apache.org/users/clustering/cluster-dumper.html

I think you need to read the source code -- download from http://mahout.apache.org. VL-24130 is just a cluster identifier for a converged cluster.

You can use mahout clusterdump https://cwiki.apache.org/MAHOUT/cluster-dumper.html
