How to compute the similarity between lists of features?

2023-03-27 22:00 问答作者：

I have users and resources. Each resource is described by a set of features and each user is related to a different set of resources. In my particular case, the resources are web pages, and the features information about the location of the visit, the time of the visit, the number of visit etc, which are tied to a specific user each time.

I want to get a similarity measure between my users regarding those features but I can't find a way to aggregate the resource features together. I've done it with text features, as it is possible to add the documents toget开发者_如何学Cher and then extract features (say TF-IDF), but I don't know how to proceed with this configuration.

To be as clear as possible, here is what I have:

>>> len(user_features)
13 # that's my number of users
>>> user_features[0].shape
(2374, 17) # 2374 documents for this user, and 17 features

I'm able to get a similarity matrix of the documents using euclidean distances for instance:

>>> euclidean_distance(user_features[0], user_features[0])

But I don't know how do I compare the users against each other. I should somehow aggregate the features together to end up with a N_Users X N_Features matrix, but I don't know how.

Any hints on how to proceed?

Some more information about the features I'm using:

The features I have here are not completely fixed. What I've got so far is 13 different features, already aggregated from "views". What I have is standard deviation, mean, etc. for each of the views, in order to have something "flat", to be able to compare them. One of the feature I have is: was the location changed since the last view? And what about one hour ago? Two hours ago?

If each user is represented as a set of document-interaction vectors you can define the similarity of a pair of users as the similarity of the pair of document-interaction vector sets that represent the users.

You say you can get a similarity matrix of the documents. Then assume that user U1 visited documents D1, D2, D3, and user U2 visited documents D1,D3,D4. You would have two sets of vectors S1 = {U1(D1), U1(D2), U1(D3)} for user 1 and S2 = {U2(D1), U2(D3), U2(D4)}. Note that because each user's interaction with a document is different they are represented as such. If I understand correctly, the elements of these sets should correspond to the respective lines in the matrix of each user.

The similarity between these two sets can be computed in many different ways. One option is the average pair-wise similarity: You iterate over all pairings of the elements from each set, compute the document similarity of the pair, and average over all pairs.

You could use the mean of the features in each user's set of resources seems a natural way to summarize a user. numpy.mean with an appropriate axis argument should get you the mean, then compute the Euclidean distance between the resulting "user vectors" (of length n_features) as you did before between document vectors.

I would look at creating multiple dimensions of documents, so those documents that are visited at certain times of day, divide up by morning and night, and then plot users that are nite owls and early birds.

With any number of dimensions you can create a matrix of users, and use distance between users to help.

继续阅读：machine-learning numpy python

How to compute the similarity between lists of features?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？