'Probability' of a K-nearest-neighbor-like classification
I have a small set of data points (around 10) in a 2D space, and each of them has a category label. I wish to classify a new data point based on the existing data points' labels and also associate a 'probability' of belonging to any particular label class.
Is it appropriate to label the new point based on the label of its nearest neighbor (like a K-nearest-neighbor classifier with K=1)? To get the probability, I was thinking of permuting all the labels, each time computing the minimum distance from the unknown point to the rest, and taking the fraction of cases where that minimum distance is less than or equal to the distance that was used to label it.
Thanks
The nearest-neighbour method already uses Bayes' theorem to estimate the probability, using the points in a ball containing your chosen K points. There is no need for any transformation: the number of points in that ball belonging to each label, divided by the total number of points in the ball, is already an approximation of the posterior probability of that label. In other words:
P(label|z) = P(z|label)P(label) / P(z) = K(label)/K
This is obtained by applying Bayes' rule to density estimates computed from subsets of the data. In particular, using:
P(x)V = K/N (this gives you the probability mass in a ball of volume V around x that contains K of the N points)
P(x) = K/(NV) (from above)
P(x|label) = K(label)/(N(label)V) (where K(label) and N(label) are the number of points of that class inside the ball and in the total sample, respectively)
and
P(label) = N(label)/N.
Substituting these into Bayes' rule gives P(label|x) = K(label)/K. Therefore, just pick a K, calculate the distances, take the K nearest points, and count how many of them carry each label; those counts divided by K are your probabilities.
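A minimal Python sketch of that recipe (the function name and the toy data are mine, not from the answer):

    import numpy as np

    def knn_label_probabilities(X, y, z, k=3):
        # Estimate P(label | z) as the fraction of the k nearest
        # neighbours of z carrying each label, i.e. K(label) / K.
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        dists = np.linalg.norm(X - np.asarray(z, dtype=float), axis=1)
        nearest = np.argsort(dists)[:k]          # indices of the "ball" of k points
        labels, counts = np.unique(y[nearest], return_counts=True)
        return {str(lab): float(c) / k for lab, c in zip(labels, counts)}

    # Toy example: 10 labelled 2D points and one query point.
    X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 4],
         [5, 5], [6, 5], [5, 6], [6, 6], [3, 3]]
    y = ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']
    print(knn_label_probabilities(X, y, z=[0.5, 0.5], k=3))   # {'a': 1.0} for this toy data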
Roweis uses a probabilistic framework with KNN in his publication Neighbourhood Component Analysis. The idea is to use a "soft" nearest neighbour classification, where the probability that a point i uses another point j as its neighbour is defined by
p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2),
where d_ij is the Euclidean distance between points i and j.
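A small Python sketch of that soft-neighbour idea, applied directly in the input space (NCA itself additionally learns a linear transform of the space before taking the distances; the function name and toy data here are mine):

    import numpy as np

    def soft_neighbour_probabilities(X, y, z):
        # Each stored point j is picked as the neighbour of the query z with
        # probability proportional to exp(-d_zj^2); the class probability is
        # the total probability mass of the points carrying that class.
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        d2 = np.sum((X - np.asarray(z, dtype=float)) ** 2, axis=1)
        w = np.exp(-(d2 - d2.min()))              # shift by the min for numerical stability
        p = w / w.sum()                           # p_j: probability of picking point j
        return {str(lab): float(p[y == lab].sum()) for lab in np.unique(y)}

    print(soft_neighbour_probabilities(
        [[0, 0], [1, 1], [5, 5], [6, 6]], ['a', 'a', 'b', 'b'], z=[0.5, 0.5]))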
There are no inherent probabilities in such a K-nearest-neighbour classifier, because it is a discriminative classifier, just like an SVM. You need a post-processing step that learns probabilities on unseen data with a probabilistic model such as logistic regression: 1. learn the K-nearest-neighbour classifier; 2. train a logistic regression on the distance and the average distance to the K nearest neighbours, using validation data.
For details, check the LibSVM article.
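A hedged sketch of one way to read that recipe, using scikit-learn. The data split, the feature choice, and the target (here, whether the kNN prediction turns out to be correct) are my assumptions; the answer does not spell them out:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical random data; substitute your own train/validation split.
    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(60, 2)), rng.integers(0, 2, size=60)
    X_val, y_val = rng.normal(size=(30, 2)), rng.integers(0, 2, size=30)

    # 1. Learn the K-nearest-neighbour classifier.
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

    def distance_features(X):
        # Distance to the nearest neighbour and mean distance to the K nearest.
        d, _ = knn.kneighbors(X)
        return np.column_stack([d[:, 0], d.mean(axis=1)])

    # 2. On validation data, fit a logistic regression mapping those distance
    #    features to the probability that the kNN label is correct.
    correct = (knn.predict(X_val) == y_val).astype(int)
    calib = LogisticRegression().fit(distance_features(X_val), correct)

    X_new = rng.normal(size=(5, 2))
    print(knn.predict(X_new))                                   # hard labels
    print(calib.predict_proba(distance_features(X_new))[:, 1])  # confidence in them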
Sort the distances to the 10 centres; they could be
1 5 6 ... — one near, others far
1 1 1 5 6 ... — 3 near, others far
... lots of possibilities.
You could combine the 10 distances to a single number, e.g. 1 - (nearest / average) ** p,
but that's throwing away information.
(Different powers p make the hills around the centres steeper or flatter.)
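For illustration, a toy version of that single-number combination (the distance lists are made up):

    def closeness_score(distances, p=2.0):
        # 1 - (nearest / average) ** p: near 1 when one centre is much closer
        # than the average, near 0 when all centres are about equally far.
        nearest = min(distances)
        average = sum(distances) / len(distances)
        return 1 - (nearest / average) ** p

    print(closeness_score([1, 5, 6, 7, 8, 9, 10, 11, 12, 13]))  # one near, others far
    print(closeness_score([1, 1, 1, 5, 6, 7, 8, 9, 10, 11]))    # 3 near, others far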
If your centres are really Gaussian hills though, take a look at Multivariate kernel density estimation.
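A possible sketch of that suggestion with SciPy's gaussian_kde, fitting one kernel density per class and combining them via Bayes' rule (the toy points are mine, and with only about 10 samples the estimates will be rough):

    import numpy as np
    from scipy.stats import gaussian_kde

    X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1],
                  [5, 5], [6, 5], [5, 6], [6, 6], [4, 5]], dtype=float)
    y = np.array(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'])
    z = np.array([0.5, 0.5])

    # P(label | z) is proportional to P(z | label) * P(label),
    # with P(z | label) estimated by a per-class kernel density.
    scores = {}
    for lab in np.unique(y):
        pts = X[y == lab]
        kde = gaussian_kde(pts.T)        # gaussian_kde expects shape (n_dims, n_points)
        scores[str(lab)] = kde(z)[0] * (len(pts) / len(X))
    total = sum(scores.values())
    print({lab: s / total for lab, s in scores.items()})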
Added:
There are zillions of functions that go smoothly between 0 and 1,
but that doesn't make them probabilities of something.
"Probability" means either that chance, likelihood, is involved,
as in probability of rain;
or that you're trying to impress somebody.
Added again: scholar.google.com "(single|1) nearest neighbor classifier" gets > 300 hits;
"k nearest neighbor classifier" gets almost 3000.
It seems to me (non-expert) that, out of 10 different ways of mapping k-NN distances to labels,
each one might be better than the 9 others — for some data, with some error measure.
Anyway, you could try asking on stats.stackexchange.com.
The answer is: it depends.
Imagine your labels are people's surnames, and the X,Y coordinates represent some essential characteristics of a person's DNA sequence. Clearly, a closer DNA description increases the probability of sharing the same surname.
Now suppose X,Y are the lat/long of that person's work office. Working close together is not related to sharing a label (surname).
So, it depends on the semantics of your labels and axes.
HTH!