Questions on some data-mining algorithms
Recently I have studied k-nearest neighbour and decision trees, and I am quite curious about the difference between the two. For example, for a task like learning the target function "return 1 if x2 > x1, return 0 otherwise", nearest neighbour would be a good choice, since a decision tree would involve too many splits. So in what kind of cases would choosing a decision tree be more appropriate than k-nearest neighbour?
The other question concerns k-nearest neighbour itself. I understand that when k = 1 it is just a baseline classifier (classify the instance to its nearest neighbour's class). Could anyone give me an idea of what kind of classification task a 3-nearest neighbour classifier would definitely outperform a 1-nearest neighbour classifier on?
Thanks in advance!
k-NN vs. Decision Tree
I always find a picture is the best way to gain an intuition of an algorithm. The target function you suggest would give rise to a data set a bit like this:
Where the function separating the data is x1 - x2 = 0. The trouble is that, normally, decision trees only use functions of one variable at each node, so the decision boundaries at the nodes are axis-aligned. I imagine a decision tree learned on this data set would do something like this:
Hopefully you get the idea. Obviously you can approximate the optimal decision boundary with enough nodes in a decision tree, but doing so means you run the risk of overfitting the data.
Actually, I said that decision trees normally use single variable functions at nodes, but there is another approach, described in a StackOverflow question about multivariate decision trees (that I failed to answer).
By the way, the best classifier for this kind of data would be a linear classifier, maybe logistic regression, which would find the optimal decision boundary.
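If you want to see this concretely, here's a quick sketch (using scikit-learn; the synthetic data and settings are just my own illustration, not anything from the question) that fits a decision tree, 1-NN and logistic regression to data labelled by "1 if x2 > x1":

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(1000, 2))
    y = (X[:, 1] > X[:, 0]).astype(int)          # target: 1 if x2 > x1, else 0

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    models = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "1-NN": KNeighborsClassifier(n_neighbors=1),
        "logistic regression": LogisticRegression(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f"{name:20s} test accuracy: {model.score(X_te, y_te):.3f}")

    # The tree can only approximate the diagonal boundary with a staircase of
    # axis-aligned splits -- inspect how many leaves it needed to do so:
    print("tree leaves:", models["decision tree"].get_n_leaves())

All three usually score well on this easy problem, but the tree only gets there by building that staircase of axis-aligned splits, which is exactly the overfitting risk mentioned above.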
The effect of k in k-NN
The best description I can give for k in k-nearest neighbour is that high values of k smooth the decision boundary. It is also not the case that a higher k is always better than a lower one.
To think about k-NN we need a bit more of a complicated data set. For k=1, a k-NN model might make decisions a bit like this:
If we increased the value of k, the decisions would be affected by a larger neighbourhood of points and so the decision boundaries would become smoother. In particular, those little red and blue islands would be overwhelmed by the surrounding data points:
Whether using a high k is better depends on the level of noise in the dataset. Were those little islands really important, so that we learned too simple a model that doesn't fit the data very well, or were they just noise that we rightly avoided overfitting?
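As a rough illustration (entirely my own toy example, not tied to any particular dataset), you can flip a fraction of the labels and compare cross-validated accuracy for a few values of k:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(600, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # a curved "true" boundary

    noise = rng.random(len(y)) < 0.15                     # 15% label noise (the "islands")
    y_noisy = np.where(noise, 1 - y, y)

    for k in (1, 3, 9, 25):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y_noisy, cv=5)
        print(f"k={k:2d}  mean CV accuracy: {scores.mean():.3f}")

With noisy labels, k=1 tends to memorise the flipped points while a larger k averages them out; rerun it with clean labels and the ranking can reverse.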
A practical perspective
Unfortunately, given some large, complex, real-world data set you probably don't have a very good basis for deciding which algorithm is going to work best (unless you draw on previous work on the same or similar data). What most people do is carefully segment the data into training, parameter-tuning and test sets and then run as many algorithms as they can think of. You might also find that your particular situation determines some properties that the algorithm must have (fast, incremental, probabilistic, etc.).
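For what it's worth, here's one way (just a sketch of the workflow, with made-up data standing in for the real thing) to set up that training / tuning / test split with scikit-learn:

    import numpy as np
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(2)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)     # stand-in for real data

    # Hold out a test set that is touched only once, at the very end.
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    candidates = {
        "kNN": GridSearchCV(KNeighborsClassifier(),
                            {"n_neighbors": [1, 3, 9, 25]}, cv=5),
        "tree": GridSearchCV(DecisionTreeClassifier(random_state=0),
                             {"max_depth": [2, 4, 8, None]}, cv=5),
    }
    for name, search in candidates.items():
        search.fit(X_dev, y_dev)                          # training + parameter tuning
        print(f"{name}: best params {search.best_params_}, "
              f"test accuracy {search.score(X_test, y_test):.3f}")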
This is an answer to the second question.
(I assume that by definitely outperform you mean always outperform.)
I'm not sure it's possible--because, given a data set and a kNN algorithm, for every instance in which the prediction is better with k=3 (vs. k=1) it's easy to flip that result by either changing how the model is configured or varying the data description (in particular the data density in the solution space).
Here's a simple example. Even though kNN is probably the simplest machine learning algorithm, there are still a few crucial configuration details beyond calculating a distance matrix and then finding minimum distances against it. One of these configuration parameters is weighting--i.e., how the contribution of each neighboring point to the predicted value is weighted. Some common weighting functions are Gaussian and inverse-distance. Another common weighting function is the 'subtraction function', which, for each neighbor, simply subtracts the distance from a constant, provided the distance is no greater than the constant (otherwise the weight is zero). While this function nicely avoids over-weighting data points very close to the unknown point (the point whose value you are trying to predict), a point's weight approaches zero as its distance from the unknown point approaches the chosen constant. In other words, predictions using k=3 could be much better than with k=1 using this function, but they can also be very nearly the same if two of the three neighboring points are far enough away that their weights approach zero.
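To make that concrete, here's a toy numpy sketch (the function name, the constant c, and the fallback to an unweighted vote are all my own choices for illustration) of k-NN classification with the subtraction weighting:

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3, c=1.0):
        """Weighted k-NN classification of a single point x."""
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]                  # indices of the k nearest points
        weights = np.maximum(c - dists[nearest], 0.0)    # the 'subtraction' weighting
        if weights.sum() == 0:                           # all neighbours beyond c: fall back
            weights = np.ones(k)                         # to an unweighted vote
        # weighted vote over the neighbours' class labels
        classes = np.unique(y_train)
        votes = [weights[y_train[nearest] == cls].sum() for cls in classes]
        return classes[int(np.argmax(votes))]

    # With k=3 but two neighbours farther away than c, the two distant points get
    # weight zero and the prediction collapses to the single nearest neighbour.
    X_train = np.array([[0.0, 0.0], [0.9, 0.9], [0.95, 0.9]])
    y_train = np.array([0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.05, 0.0]), k=3, c=1.0))

Here the k=3 prediction is effectively the same as the k=1 prediction, which is the situation described above.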
Or it might be the data. Suppose a k=3 model gives the same predictions as k=1 for the reason I just mentioned. Now suppose the data set is enlarged so that there is a greater data density, which in turn means that the three neighboring points are more likely than before to contribute approximately equally to the predicted value.
Of course, the same applies for other primary configuration parameters in a kNN algorithm--e.g., distance metric, dimension scaling, probability distribution, etc.
Good question, by the way.