Can the value of information gain be negative? [closed]
Is there any chance that the value of information gain can be negative?
IG(Y|X) = H(Y) - H(Y|X) >= 0
since H(Y) >= H(Y|X). The worst case is that X and Y are independent, in which case H(Y|X) = H(Y).
Another way to think about it: by observing the random variable X take some value, we either gain some information about Y or none at all (we never lose any).
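To make this concrete, here is a minimal Python sketch (my own illustration, not part of the original argument) that computes H(Y), H(Y|X) and the resulting gain from a small made-up joint distribution p(X,Y); whatever joint table you plug in, the gain stays non-negative:

import math

# A made-up joint distribution p(X, Y) for two binary variables,
# purely to illustrate IG(Y|X) = H(Y) - H(Y|X) >= 0.
p_xy = {(0, 0): 0.3, (0, 1): 0.1,
        (1, 0): 0.2, (1, 1): 0.4}

def entropy(probs):
    # Shannon entropy (in bits) of a probability distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

# marginals p(Y) and p(X)
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}

h_y = entropy(p_y.values())

# H(Y|X) = sum over x of p(x) * H(Y | X = x)
h_y_given_x = sum(p_x[x] * entropy([p_xy[(x, y)] / p_x[x] for y in (0, 1)])
                  for x in (0, 1))

print(h_y, h_y_given_x, h_y - h_y_given_x)   # the last value is IG(Y|X), never below zero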
EDIT
Let me clarify information gain in the context of decision trees (which is actually what I had in mind in the first place, since I come from a machine learning background).
Assume a classification problem where we are given a set of instances and labels (discrete classes).
The idea when choosing which attribute to split on at each node of the tree is to select the feature that splits the instances into the two purest possible groups with respect to the class attribute (i.e. lowest entropy).
This is in turn equivalent to picking the feature with the highest information gain since
InfoGain = entropyBeforeSplit - entropyAfterSplit
where the entropy after the split is the sum of the entropies of the branches, each weighted by the proportion of instances going down that branch.
Now there exists no possible split of the instances that produces worse purity (higher entropy) than before splitting.
Take this simple example of a binary classification problem. At a certain node we have 5 positive instances and 4 negative ones (total of 9). Therefore the entropy (before the split) is:
H([4,5]) = -4/9*lg(4/9) -5/9*lg(5/9) = 0.99107606
Now let's consider some cases of splits. The best-case scenario is that the current attribute splits the instances perfectly (i.e. one branch is all positive, the other all negative):
    [4+,5-]
    /     \        H([4,0],[0,5]) = 4/9*( -4/4*lg(4/4) ) + 5/9*( -5/5*lg(5/5) )
   /       \                      = 0 // zero entropy, perfect split
[4+,0-]   [0+,5-]
then
IG = H([4,5]) - H([4,0],[0,5]) = H([4,5]) // highest possible in this case
Imagine that a second attribute is the worst case possible, where one of the branches created doesn't get any instances; rather, all instances go down the other branch (this could happen if, for example, the attribute is constant across instances, and thus useless):
    [4+,5-]
    /     \        H([4,5],[0,0]) = 9/9 * H([4,5]) + 0
   /       \                      = H([4,5]) // the entropy as before split
[4+,5-]   [0+,0-]
and
IG = H([4,5]) - H([4,5],[0,0]) = 0 // lowest possible in this case
Now somewhere in between these two extremes, you will see any number of cases like:
    [4+,5-]
    /     \        H([3,2],[1,3]) = 5/9 * ( -3/5*lg(3/5) -2/5*lg(2/5) )
   /       \                      + 4/9 * ( -1/4*lg(1/4) -3/4*lg(3/4) )
[3+,2-]   [1+,3-]
and
IG = H([4,5]) - H([3,2],[1,3]) = [...] = 0.09109101
So no matter how you split those 9 instances, you never lose information; in the worst case the gain is zero. I realize this is not a mathematical proof (go to MathOverflow for that!); I just thought an actual example could help.
(Note: All calculations according to Google)
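If you want to double-check these numbers yourself, here is a short Python sketch (mine, not part of the original answer) that reproduces the entropies and gains for the three splits above:

import math

def entropy(counts):
    # entropy (in bits) of a node, given its per-class instance counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(branches):
    # entropy after a split: branch entropies weighted by branch size
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches if sum(b) > 0)

before = entropy([4, 5])                                  # 0.99107606

ig_perfect = before - weighted_entropy([[4, 0], [0, 5]])  # 0.99107606 (best case)
ig_useless = before - weighted_entropy([[4, 5], [0, 0]])  # 0.0        (worst case)
ig_typical = before - weighted_entropy([[3, 2], [1, 3]])  # 0.09109101 (in between)

print(ig_perfect, ig_useless, ig_typical)                 # all of them >= 0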
First, the answer is no, it cannot be negative. The absolute worst possibility is no change, i.e. an IG of zero. If you want a proof, look up the full proof on MathOverflow, as Amro pointed out.
Now for the advice. If you only do the first level of a decision tree, it seems obvious that it would never come up negative. However, when building my first tree using Information Gain, I found myself with a negative gain by my third branching. This didn't seem useful or possible, so I scrambled to check my math. The math was fine. The part I had wrong was the first part of the base formula. I was using the answer from the level above as my starting entropy, but this is wrong because it includes information from other datasets. You need to make sure that for your starting entropy you determine the entropy for that branch alone! This means your "starting entropy" could actually be higher than it was at the level before this one.
In other words, when calculating IG, make sure you are only using the current dataset.
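To make that concrete, here is a rough sketch (my own illustration; the helper names are made up, not from any particular library) where the starting entropy is computed only from the labels that actually reached the current branch:

import math
from collections import Counter

def entropy(labels):
    # entropy (in bits) of the class labels that reached this branch
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attribute):
    # IG of splitting THIS branch's own rows/labels on the given attribute.
    # The starting entropy comes from `labels` (the current branch only),
    # not from the parent node or the full training set.
    before = entropy(labels)
    after = 0.0
    for value in set(row[attribute] for row in rows):
        branch = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        after += len(branch) / len(labels) * entropy(branch)
    return before - after   # never negative when computed this way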
Sure it can.
Information gain is just the change in information entropy from one state to another:
IG(Ex, a) = H(Ex) - H(Ex | a)
That state change can go in either direction--it can be positive or negative.
This is easy to see by example:
Decision tree algorithms work like this: at a given node, you calculate its information entropy (for the target variable).
You can think of it like this: information entropy is to categorical/discrete variables what variance is to continuous variables. Variance, of course, is just the square of the standard deviation. So for instance, if we are predicting price based on various criteria, and we have arbitrarily grouped our data set into two groups, in which the prices for group A are (50, 60, and 70) and the prices for group B are (50, 55, 60), group B has the lower variance--i.e., its values are closer together. Of course variance cannot be negative (because each point's distance from the mean is squared before being summed), but the difference in variance between two groupings certainly can be.
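That arithmetic is easy to verify; here is a quick sketch using the numbers from the example above:

def variance(values):
    # population variance: mean of the squared distances from the mean
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

group_a = [50, 60, 70]
group_b = [50, 55, 60]

print(variance(group_a))   # 66.67
print(variance(group_b))   # 16.67 -- group B is the tighter, lower-variance group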
To see how this relates to information entropy/information gain, suppose we aren't predicting price but something else, like whether a visitor to our site will become a registered user, a premium subscriber, or neither. The target variable here is discrete, not continuous like price, so you can't calculate variance in a meaningful way. Information entropy is what is used instead. (If you doubt the close analogy here between variance and IE, you should know that most decision tree algorithms that can handle both discrete and continuous variables will, in the continuous case, use variance as the splitting criterion rather than IG.)
In any event, after you calculate the information entropy for a given node, you then split the data at that node (which is the entire data set if you are at the root node) on every value of every variable; for each candidate split, you calculate the IE of the two resulting groups and take their weighted average. Next, take the split that results in the lowest weighted average IE and compare it with the node's own IE (the node being just a single group). If that weighted average IE for the split is lower than the node IE, you split the data at that node (form a branch); if not, you stop, i.e., that node can't be further split--you are at the bottom.
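In rough Python (a sketch of the procedure described above, not any specific library's implementation; the names are mine), the split decision looks something like this:

import math
from collections import Counter

def entropy(labels):
    # entropy (in bits) of a list of class labels
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, attributes):
    # Return the (attribute, value) split with the lowest weighted average
    # entropy, or None if no split beats the node's own entropy (i.e. stop).
    node_ie = entropy(labels)        # the node treated as a single group
    best, best_after = None, node_ie
    for attribute in attributes:
        for value in set(row[attribute] for row in rows):
            left  = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
            right = [lab for row, lab in zip(rows, labels) if row[attribute] != value]
            if not left or not right:
                continue
            after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if after < best_after:   # strictly better than not splitting
                best, best_after = (attribute, value), after
    return best                      # None means: this node becomes a leaf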
In sum, at the heart of a decision tree algorithm is the criterion that determines whether to split a node--that's how the tree gets constructed. That criterion is whether IG is positive or negative.