
How to classify a set of samples via a continuous feature?

For example, I have the table below, which is simply a coarse distribution of 20 persons by age:

         age        number of persons

  •     2                  1
  •     5                  5
  •     8                  2
  •     10                3
  •     15                1
  •     16                2
  •     17                1
  •     20                4
  •     21                1

Then, using the same dataset, I could build another, 'better' table:

         age         number of persons

  •    under 10            8
  •    10-19               7
  •    20+                 5

In fact, I could make many more tables from the same dataset, each with a different combination of age ranges.

Now I wonder how I could find the best combination. A possible "goodness function" for measuring whether a combination is good or not might follow these three principles:

  • There should not be too many or too few classes.
  • The ranges of the classes should not vary too much.
  • The distribution should be smooth enough; that is, the number of items covered by each class should not vary too much.

Since this question describes a fairly common kind of problem, some sophisticated solutions to it should already exist, but I failed to find them. Could anyone give some suggestions, please?

I have gone through some algorithms such as PCA, k-means, and "max entropy based" algorithms, but they seem too general to cover this specific problem while following all three of the above principles.


I would do the following:

Construct an evaluation function:

double goodness(double firstThreshold, double bucketWidth, int numBuckets)

which returns a goodness score based on your principles. I would then brute force a number of combinations of the parameters and pick the combination with the best goodness score. If we try 4-10 values for each parameter, that is at most 10 × 10 × 10 = 1000 evaluations, so brute force will work, and it will probably give you nice round numbers for the cutoffs. If you want to get more sophisticated or have it run faster, you can try other search methods such as hill-climbing, beam search, or simulated annealing, but I think that might be overkill for your situation.
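
To make the idea concrete, here is a minimal sketch in Python (chosen only for brevity; the signature above translates directly). The scoring formula is my own assumption, not a standard one: it penalizes uneven bucket counts (principle 3) and distance from a preferred number of classes (principle 1). Principle 2 is handled by construction, because all inner buckets share the same width. The names `bucket_counts`, `best_binning`, `target_buckets`, and `class_weight`, as well as the candidate grids, are hypothetical and should be tuned to your data.

```python
import bisect
import itertools
import statistics

# Age -> number of persons, taken from the table in the question.
AGE_COUNTS = {2: 1, 5: 5, 8: 2, 10: 3, 15: 1, 16: 2, 17: 1, 20: 4, 21: 1}


def bucket_counts(samples, first_threshold, bucket_width, num_buckets):
    """Count persons per bucket.

    Buckets are: below first_threshold, then equal-width ranges, and a
    final open-ended bucket, so there are num_buckets - 1 cutoffs.
    """
    cutoffs = [first_threshold + i * bucket_width for i in range(num_buckets - 1)]
    counts = [0] * num_buckets
    for age, persons in samples.items():
        # bisect_right returns how many cutoffs are <= age, i.e. the bucket index.
        counts[bisect.bisect_right(cutoffs, age)] += persons
    return counts


def goodness(samples, first_threshold, bucket_width, num_buckets,
             target_buckets=4, class_weight=0.1):
    """Higher is better. The formula and the defaults are assumptions."""
    counts = bucket_counts(samples, first_threshold, bucket_width, num_buckets)
    mean = sum(counts) / len(counts)
    # Coefficient of variation: 0 when every bucket holds the same count.
    unevenness = statistics.pstdev(counts) / mean
    # Penalty for having too many or too few classes.
    class_penalty = class_weight * abs(num_buckets - target_buckets)
    return -(unevenness + class_penalty)


def best_binning(samples,
                 first_thresholds=(4, 6, 8, 10, 12),
                 bucket_widths=(2, 4, 5, 8, 10),
                 bucket_numbers=range(2, 7)):
    """Brute force: score every parameter combination and keep the best."""
    candidates = itertools.product(first_thresholds, bucket_widths, bucket_numbers)
    return max(candidates, key=lambda params: goodness(samples, *params))


if __name__ == "__main__":
    best = best_binning(AGE_COUNTS)
    print("best (first_threshold, bucket_width, num_buckets):", best)
    print("counts per bucket:", bucket_counts(AGE_COUNTS, *best))
```

With these grids the search evaluates only 5 × 5 × 5 = 125 combinations. Restricting the candidate thresholds and widths to round values is what keeps the resulting cutoffs readable.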
