Find best attribute in decision tree
I've come across the following question:
Color   Flavor   Edibility
Red     Grape    Yes
Red     Cherry   Yes
Green   Grape    Yes
Green   Cherry   No
Blue    Grape    No
Blue    Cherry   No
The question says that, just by analyzing the data without any calculation, you should guess the best attribute (either Color or Flavor).
Can someone explain how to guess this without calculating the entropy and so on?
I know this question is kind of old, but in case you're still interested: in general, a shorter, wider tree is "better." Consider the fact that it takes additional decisions to reach a node in a tall tree.
What you really have to look at is the entropy and gain at each inner decision node.
Entropy is the amount of uncertainty or randomness associated with a particular variable. Thought about another way, it is a measure of how homogeneous the training samples at a particular node are. For example, consider a classifier with two classes, YES and NO (true or false in your case). If a particular variable or attribute, say x, has three training examples of class YES and three training examples of class NO (for a total of six), the entropy would be 1. This is because there is an equal number of both classes for this variable, which is the most "mixed up" you can get. Likewise, if x had all six training examples of a particular class, say YES, then the entropy would be 0, because this particular variable would be pure, thus making it a leaf node in our decision tree.
Entropy may be calculated in the following way:

Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

where p(+) and p(-) are the proportions of positive and negative examples in S (source: dms.irb.hr).
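For concreteness, here is a minimal Python sketch of that calculation (the helper name entropy and the list-of-labels input are my own choices for illustration, not part of the question); it reproduces the two cases described above:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a list of class labels, in bits."""
        total = len(labels)
        counts = Counter(labels)
        return sum(-(c / total) * log2(c / total) for c in counts.values())

    print(entropy(["yes", "yes", "yes", "no", "no", "no"]))  # 1.0 (maximally mixed)
    print(entropy(["yes"] * 6))                              # 0.0 (pure node)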
Now consider gain. Note that at each level of the decision tree, we choose the attribute that presents the best gain for that node. The gain is simply the expected reduction in entropy achieved by learning the value of the attribute x. Gain is also known as Kullback-Leibler divergence. Gain can be calculated in the following way:
Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)

where Sv is the subset of S for which attribute A has value v (source: dms.irb.hr).
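Continuing the sketch above (again, the function name information_gain and the list-of-pairs representation are my own, purely for illustration):

    def information_gain(examples, attribute_index):
        """Expected reduction in entropy from splitting on the given attribute.

        `examples` is a list of (attribute_values, label) pairs.
        """
        total = len(examples)
        labels = [label for _, label in examples]
        # Group labels by the value the attribute takes in each example.
        partitions = {}
        for values, label in examples:
            partitions.setdefault(values[attribute_index], []).append(label)
        weighted = sum(len(part) / total * entropy(part) for part in partitions.values())
        return entropy(labels) - weighted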
While the question asks you not to calculate either gain or entropy, the explanation was necessary to show why I chose a particular attribute. In your case, I will assume that edibility is the target attribute.
If you choose either flavor or color, notice that you have an entropy of 1 at the root in both cases, because you have an equal number of training instances with an edibility of "yes" and "no" regardless of the attribute. At this point, you should look at gain. If you anchor your tree with the attribute "color", you will have less remaining entropy, because the subsets it produces are closer to pure. Notice that the branches for "Red" and "Blue" are already pure leaf nodes, with all "yes" and all "no" respectively; only the "Green" branch is still mixed. From that point on, you have one attribute left to use, flavor. Obviously, if you had more than one left, you would have to calculate the gain for each attribute to figure out which does best and use that as the next "layer".
Also, try drawing it out, anchoring the tree with the Color attribute and calculating the gain - you'll find you converge to your answer (a pure node) faster.
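To verify the intuition (not required by the question, but a useful check), the two helpers sketched above give roughly Gain(Color) ≈ 0.667 and Gain(Flavor) ≈ 0.082 on this data:

    data = [
        (("Red",   "Grape"),  "Yes"),
        (("Red",   "Cherry"), "Yes"),
        (("Green", "Grape"),  "Yes"),
        (("Green", "Cherry"), "No"),
        (("Blue",  "Grape"),  "No"),
        (("Blue",  "Cherry"), "No"),
    ]

    print(information_gain(data, 0))  # Color  -> about 0.667
    print(information_gain(data, 1))  # Flavor -> about 0.082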
Entropy and gain are definitely the better heuristics from information theory for selecting the best feature. However, one way to choose the best attribute that is less accurate than entropy but works well is a one-step lookahead, which works as follows:
CHOOSEBESTATTRIBUTE(S)
  choose j to minimize Jj, computed as follows:
    S0 = all <x, y> in S with xj = 0
    S1 = all <x, y> in S with xj = 1
    y0 = the most common value of y in S0
    y1 = the most common value of y in S1
    J0 = number of examples <x, y> in S0 with y not equal to y0
    J1 = number of examples <x, y> in S1 with y not equal to y1
    Jj = J0 + J1 (total errors if we split on this feature)
  return j
Source: Machine Learning: Tom M. Mitchell, Chapter 3
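A minimal Python sketch of that one-step lookahead (the function name choose_best_attribute and the (x, y) tuple representation are my own assumptions, not from Mitchell's text) might look like this:

    from collections import Counter

    def choose_best_attribute(examples):
        """One-step lookahead: pick the binary feature whose split makes the
        fewest errors when each branch predicts its majority label.

        `examples` is a list of (x, y) pairs, where x is a sequence of 0/1
        feature values and y is the class label.
        """
        num_features = len(examples[0][0])
        best_j, best_errors = None, None
        for j in range(num_features):
            errors = 0
            for split_value in (0, 1):
                branch = [y for x, y in examples if x[j] == split_value]
                if branch:
                    majority_count = Counter(branch).most_common(1)[0][1]
                    errors += len(branch) - majority_count  # misclassified by majority vote
            if best_errors is None or errors < best_errors:
                best_j, best_errors = j, errors
        return best_j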