Decision Tree Learning and Impurity
There are three ways to measure impurity:

Gini index: 1 - sum_i p_i^2
Entropy: - sum_i p_i log2(p_i)
Classification error: 1 - max_i p_i
What are the differences and appropriate use cases for each method?
If the p_i values are very small, then multiplying very small numbers (as in the Gini index) can lead to rounding error. Because of that, it is better to add the logs (entropy). Classification error, following your definition, provides only a gross estimate, since it uses the single largest p_i to compute its value.
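As a concrete illustration of the three measures being discussed, here is a minimal sketch (not from the original answer; the function names are my own) computing each one for a vector of class proportions at a node:

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum_i p_i^2 for class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy: -sum_i p_i * log2(p_i), treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def classification_error(p):
    """Misclassification error: 1 - max_i p_i."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.max(p)

# Example: a node with class proportions [0.7, 0.2, 0.1]
proportions = [0.7, 0.2, 0.1]
print(gini(proportions))                  # 0.46
print(entropy(proportions))               # ~1.157
print(classification_error(proportions))  # 0.3
```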
The difference between entropy and other impurity measures, and in fact often the difference between information-theoretic approaches in machine learning and other approaches, is that entropy has been mathematically proven to capture the concept of 'information'. There are many classification theorems (theorems that prove a particular function or mathematical object is the only one that satisfies a set of criteria) for entropy measures, which formalize the philosophical arguments justifying their meaning as measures of 'information'.
Contrast this with other approaches (especially statistical methods) that are chosen not for their philosophical justification but primarily for their empirical justification - that is, they seem to perform well in experiments. The reason they perform well is that they contain additional assumptions that may happen to hold at the time of the experiment.
In practical terms this means entropy measures (A) can't over-fit when used properly, as they are free from any assumptions about the data, (B) are more likely to perform better than random because they generalize to any dataset, but (C) might not perform as well on specific datasets as measures that adopt assumptions.
When deciding which measures to use in machine learning, it often comes down to long-term vs. short-term gains, and maintainability. Entropy measures often work well long-term because of (A) and (B), and if something goes wrong it is easier to track down and explain why (e.g. a bug in obtaining the training data). Other approaches, by (C), might give short-term gains, but if they stop working it can be very hard to distinguish, say, a bug in the infrastructure from a genuine change in the data where the assumptions no longer hold.
A classic example where models suddenly stopped working is the global financial crisis. Bankers were being given bonuses for short-term gains, so they wrote statistical models that would perform well short-term and largely ignored information-theoretic models.
I found this description of impurity measures to be quite useful. Unless you are implementing from scratch, most existing implementations use a single, predetermined impurity measure. Note also that the Gini index is not a direct measure of impurity, at least not in its original formulation, and that there are many more impurity measures than the ones you list above.
I'm not sure that I understand the concern about small numbers and the Gini impurity measure... I can't imagine how this would happen when splitting a node.
I have seen various efforts at informal guidance on this, ranging from "if you use one of the usual metrics, there won't be much difference" to much more specific recommendations. In reality, the only way to know with certainty which measure works best is to try all of the candidates.
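For instance, a minimal sketch of "try all of the candidates" using scikit-learn's DecisionTreeClassifier (the dataset and evaluation setup here are placeholders I chose, not something the answer specifies):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compare the split criteria exposed by scikit-learn via cross-validation.
for criterion in ["gini", "entropy"]:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{criterion}: mean accuracy = {scores.mean():.3f}")
```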
Anyway, here is some perspective from Salford Systems (the CART vendor):
Do Splitting Rules Really Matter?