How to calculate distance when we have sparse dataset in K nearest neighbour

2023-04-05 13:12 问答作者：

I am implementing K nearest neighbour algorithm for a very sparse data. I want to calculate the distance between a test instance and each sample in the training set, but I am confused.

Because most of the features in training samples don't exist in test instance or vice versa (missing feature开发者_如何学Pythons).

How can I compute the distance in this situation?

To make sure I'm understanding the problem correctly: each sample forms a very sparsely filled vector. The missing data is different between samples, so it's hard to use any Euclidean or other distance metric to gauge similarity of samples.

If that is the scenario, I have seen this problem show up before in machine learning - in the Netflix prize contest, but not specifically applied to KNN. The scenario there was quite similar: each user profile had ratings for some movies, but almost no user had seen all 17,000 movies. The average user profile was quite sparse.

Different folks had different ways of solving the problem, but the way I remember was that they plugged in dummy values for the missing values, usually the mean of the particular value across all samples with data. Then they used Euclidean distance, etc. as normal. You can probably still find discussions surrounding this missing value problem on that forums. This was a particularly common problem for those trying to implement singular value decomposition, which became quite popular and so was discussed quite a bit if I remember right.

You may wish to start here: http://www.netflixprize.com//community/viewtopic.php?id=1283

You're going to have to dig for a bit. Simon Funk had a little different approach to this, but it was more specific to SVDs. You can find it here: http://www.netflixprize.com//community/viewtopic.php?id=1283 He calls them blank spaces if you want to skip to the relevant sections.

Good luck!

If you work in very high dimension space. It is better to do space reduction using SVD, LDA, pLSV or similar on all available data and then train algorithm on trained data transformed that way. Some of those algorithms are scalable therefor you can find implementation in Mahout project. Especially I prefer using more general features then such transformations, because it is easier debug and feature selection. For such purpose combine some features, use stemmers, think more general.

继续阅读：distance knn machine-learning multidimensional-array nearest-neighbor

How to calculate distance when we have sparse dataset in K nearest neighbour

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？