k-nearest neighbour classifier but using a distribution?
I am building a classifier for some 2D data.
I have some training data for which I know the classes and have plotted these on a graph to see the clustering.
To the observer there are obvious, separate clusters, but unfortunately they are spread out along lines rather than in tight clumps. One line-spread goes up at about an 80-degree angle, another at 45 degrees, and another at about 10 degrees from horizontal, but all three seem to point back to the origin.
I want to perform a nearest-neighbour classification on some test data. From the looks of things, if the test data is very similar to the training data, a 3-nearest-neighbour classifier should work fine, except near the origin of the graph, where the three clusters come close together and a few errors seem likely.
Should I be estimating a Gaussian distribution for each cluster? If so, I'm not sure how to combine that with a nearest-neighbour classifier.
I'd be grateful for any input.
Cheers
Transform all your points to [r, angle], and scale r down to the range 0 to 90 too, before running nearest-neighbor.
Why? k-NN uses the Euclidean distance between points and centres (in most implementations), but you want

distance( point, centre ) = sqrt( (point.r - centre.r)^2 + (point.angle - centre.angle)^2 )

rather than

sqrt( (point.x - centre.x)^2 + (point.y - centre.y)^2 ).
Scaling r down to a smaller range, say 0 to 30 or even 0 to 10, would weight angle more than r, which seems to be what you want.
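A minimal sketch of this transform, assuming NumPy and scikit-learn are available; the synthetic data here is just a stand-in for the three line-shaped clusters described in the question:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def to_scaled_polar(X, r_scale):
    """Map 2D Cartesian points (n, 2) to [scaled r, angle in degrees]."""
    r = np.hypot(X[:, 0], X[:, 1])
    angle = np.degrees(np.arctan2(X[:, 1], X[:, 0]))
    return np.column_stack([r * r_scale, angle])

# Synthetic stand-in for the questioner's data: three line-shaped
# clusters through the origin at roughly 80, 45, and 10 degrees.
rng = np.random.default_rng(0)
r = rng.uniform(1.0, 50.0, 300)
theta = np.radians(np.repeat([80, 45, 10], 100)) + rng.normal(0, 0.03, 300)
X_train = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y_train = np.repeat([0, 1, 2], 100)

# Scale r so its range matches the 0-90 degree angle range; shrink the
# numerator (e.g. 30.0 or 10.0) to weight angle even more heavily.
# Reuse the training-set scale for any test data.
r_scale = 90.0 / np.hypot(X_train[:, 0], X_train[:, 1]).max()

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(to_scaled_polar(X_train, r_scale), y_train)

X_test = np.array([[5.0, 5.0], [40.0, 7.0]])  # example query points
print(knn.predict(to_scaled_polar(X_test, r_scale)))
```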
Why use k-NN for that purpose? Any linear classifier would do the trick; try solving it with an SVM and you'll get much better results. If you insist on using k-NN, you clearly have to scale the features and transform them into polar coordinates, as described in the other answer.
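A rough sketch of the SVM route, again assuming scikit-learn and reusing the hypothetical X_train / y_train / X_test arrays from the example above:

```python
from sklearn.svm import SVC

# On raw (x, y) coordinates a linear kernel can carve the plane into
# wedges separating line-shaped clusters that radiate from the origin;
# an RBF kernel (SVC's default) is a safe fallback if they turn out
# not to be linearly separable.
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)
print(svm.predict(X_test))
```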