Decision Tree Learning and Impurity
There are three ways to measure impurity:

Gini index: 1 - sum_i p_i^2
Entropy: - sum_i p_i log2(p_i)
Classification error: 1 - max_i p_i
What are the differences and appropriate use cases for each method?
If the p_i values are very small, then multiplying very small numbers (as in the Gini index) can lead to rounding error. Because of that, it is better to add the logs (entropy). Classification error, following your definition, provides only a gross estimate, since it uses the single largest p_i to compute its value.
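As a concrete illustration of the three measures being discussed, here is a minimal sketch (not from the original answer; the function names are my own) computing each one for a vector of class proportions at a node:

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum_i p_i^2 for class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy: -sum_i p_i * log2(p_i), treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def classification_error(p):
    """Misclassification error: 1 - max_i p_i."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.max(p)

# Example: a node with class proportions [0.7, 0.2, 0.1]
proportions = [0.7, 0.2, 0.1]
print(gini(proportions))                  # 0.46
print(entropy(proportions))               # ~1.157
print(classification_error(proportions))  # 0.3
```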
The difference between entropy and other impurity measures, and in fact often the difference between information-theoretic approaches in machine learning and other approaches, is that entropy has been mathematically proven to capture the concept of 'information'. There are many classification theorems (theorems that prove a particular function or mathematical object is the only one that satisfies a set of criteria) for entropy measures, which formalize the philosophical arguments justifying their meaning as measures of 'information'.
Contrast this with other approaches (especially statistical methods) that are chosen not for their philosophical justification but primarily for their empirical justification - that is, they seem to perform well in experiments. The reason they perform well is that they contain additional assumptions that may happen to hold at the time of the experiment.
In practical terms this means entropy measures (A) can't over-fit when used properly, as they are free from any assumptions about the data, (B) are more likely to perform better than random because they generalize to any dataset, but (C) might not perform as well on specific datasets as measures that adopt assumptions.
When deciding which measures to use in machine learning, it often comes down to long-term vs. short-term gains, and maintainability. Entropy measures often work well long-term because of (A) and (B), and if something goes wrong it is easier to track down and explain why (e.g. a bug in obtaining the training data). Other approaches, by (C), might give short-term gains, but if they stop working it can be very hard to distinguish, say, a bug in the infrastructure from a genuine change in the data where the assumptions no longer hold.
A classic example where models suddenly stopped working is the global financial crisis. Bankers were being given bonuses for short-term gains, so they wrote statistical models that would perform well short-term and largely ignored information-theoretic models.
I found this description of impurity measures to be quite useful. Unless you are implementing from scratch, most existing implementations use a single, predetermined impurity measure. Note also that the Gini index is not a direct measure of impurity, at least not in its original formulation, and that there are many more impurity measures than the ones you list above.
I'm not sure that I understand the concern about small numbers and the Gini impurity measure... I can't imagine how this would happen when splitting a node.
I have seen various efforts at informal guidance on this, ranging from "if you use one of the usual metrics, there won't be much difference" to much more specific recommendations. In reality, the only way to know with certainty which measure works best is to try all of the candidates.
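For instance, a minimal sketch of "try all of the candidates" using scikit-learn's DecisionTreeClassifier (the dataset and evaluation setup here are placeholders I chose, not something the answer specifies):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compare the split criteria exposed by scikit-learn via cross-validation.
for criterion in ["gini", "entropy"]:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{criterion}: mean accuracy = {scores.mean():.3f}")
```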
Anyway, here is some perspective from Salford Systems (the CART vendor):
Do Splitting Rules Really Matter?