Discovering synonyms from a set of documents using an LSA transform in Ruby
After applying the LSA transform to a document array, how can this be used to generate synonyms? For instance, I have the following sample documents:
D1 = Mobilization
D2 = Reflective Pavement
D3 = Maintenance of Traffic
D4 = Special Detour
D5 = Commercial Materials for Driveway

D1 D2 D3 D4 D5
commerci[ +0.00 +0.00 +0.00 +0.00 +1.00 ]
materi[ +0.00 +0.00 +0.00 +0.00 +1.00 ]
drivewai[ +0.00 +0.00 +0.00 +0.00 +1.00 ]
special[ +0.00 +0.00 +0.00 +1.00 +0.00 ]
detour[ +0.00 +0.00 +0.00 +1.00 +0.00 ]
mainten[ +0.00 +0.00 +1.00 +0.00 +0.00 ]
traffic[ +0.00 +0.00 +1.00 +0.00 +0.00 ]
reflect[ +0.00 +1.00 +0.00 +0.00 +0.00 ]
pavement[ +0.00 +1.00 +0.00 +0.00 +0.00 ]
mobil [ +1.00 +0.00 +0.00 +0.00 +0.00 ]
Applying TFIDF transform
D1 D2 D3 D4 D5
commerci[ +0.00 +0.00 +0.00 +0.00 +0.54 ]
materi[ +0.00 +0.00 +0.00 +0.00 +0.54 ]
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.54 ]
special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]
detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]
mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]
traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]
reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]
mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]
Applying LSA transform
D1 D2 D3 D4 D5
commerci[ +0.00 +0.00 +0.00 +0.00 +0.00 ]
materi[ +0.00 +0.00 +0.00 +0.00 +0.00 ]
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.00 ]
special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]
detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]
mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]
traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]
reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]
mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]
Firstly, this example won't work as given. The principle behind LSA is that the more frequently words occur in similar contexts, the more related they are in meaning, so there needs to be some overlap between the input documents; in your sample every term appears in exactly one document, so there is no co-occurrence to exploit. Paragraph-length documents are ideal, since they contain a reasonable number of words and there tends to be a single topic per paragraph.
To understand how LSA is useful for synonym recognition, you first need to understand how a vector-space representation of word occurrences (the first matrix you've got there) is useful for synonym recognition in the first place: you can calculate the distance between two items in this high-dimensional vector space as a measure of their similarity, since that distance reflects how often they occur together. The magic of LSA is that it reshuffles the dimensions of the vector space, so that items which don't occur together but do occur in similar contexts are brought closer together by collapsing similar dimensions into each other.
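For concreteness, here is a minimal plain-Ruby sketch of that first matrix: a term-document count matrix where each term's row is its vector in "document space". It assumes naive lower-cased whitespace tokenization rather than the stemming and stop-word removal your output shows, so the terms won't match yours exactly.

    # Minimal sketch: build a term-document count matrix from raw strings,
    # assuming naive tokenization (no stemming, no stop-word removal).
    docs = [
      "Mobilization",
      "Reflective Pavement",
      "Maintenance of Traffic",
      "Special Detour",
      "Commercial Materials for Driveway"
    ]

    tokenized = docs.map { |d| d.downcase.split(/\W+/) }
    terms     = tokenized.flatten.uniq

    # Each term's row is its vector in "document space".
    term_vectors = terms.each_with_object({}) do |term, vectors|
      vectors[term] = tokenized.map { |tokens| tokens.count(term) }
    end

    # => { "mobilization" => [1, 0, 0, 0, 0],
    #      "reflective"   => [0, 1, 0, 0, 0], ... }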
The idea of the TFIDF weighting function is to highlight the differences between documents, by giving higher weightings to words that appear frequently but only in a small subset of the corpus, and lower weightings to words that are used everywhere. (A more thorough explanation of TFIDF is easy to find elsewhere.)
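Continuing from the term_vectors hash above, here is a rough sketch of one common TF-IDF variant: term count divided by document length, times ln(N / df). Treat the exact formula as an assumption; it happens to be consistent with the 1.61, 0.80 and 0.54 values in your output (ln(5) divided by the number of stemmed, stop-word-filtered terms per document), but your library may differ in details such as log base or normalization.

    # Rough sketch of one TF-IDF variant applied to the counts above:
    # tf = count / document length, idf = ln(number of docs / doc frequency).
    doc_lengths = tokenized.map(&:length)
    num_docs    = docs.length

    tfidf_vectors = term_vectors.each_with_object({}) do |(term, counts), weighted|
      doc_freq = counts.count { |c| c > 0 }
      idf      = Math.log(num_docs.to_f / doc_freq)
      weighted[term] = counts.each_with_index.map do |count, i|
        (count.to_f / doc_lengths[i]) * idf
      end
    end

    # "mobilization" (the only word of a one-word document) gets ln(5) ~= 1.61,
    # matching the largest value in your TFIDF matrix. Values for D3 and D5
    # will differ slightly here because this sketch keeps stop words.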
The "LSA" transformation is actually a singular-value decomposition (SVD) – conventionally Latent Semantic Analysis or Latent Semantic Indexing refers to the combination TFIDF with SVD – and it serves to reduce the dimensionally of the vector space, or in other words, it reduces the number of columns into a smaller, more concise description (as described above).
So, to get to the nub of your question: you can tell how similar two words are by applying a distance function to the two corresponding row vectors. There are several distance functions to choose from, but the most commonly used is the cosine distance (which measures the angle between the two vectors).
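As a minimal sketch, assuming you have a hash mapping each term to its (reduced) row vector as a plain array of numbers, like the term_vectors / tfidf_vectors hashes sketched above:

    # Cosine similarity between two term vectors (rows of the reduced
    # matrix), represented as plain arrays of numbers.
    def cosine_similarity(a, b)
      dot    = a.zip(b).sum { |x, y| x * y }
      norm_a = Math.sqrt(a.sum { |x| x * x })
      norm_b = Math.sqrt(b.sum { |x| x * x })
      return 0.0 if norm_a.zero? || norm_b.zero?
      dot / (norm_a * norm_b)
    end

    # Rank every other term by similarity to a target term; the top of the
    # list is your synonym candidates.
    def synonym_candidates(term, term_vectors, limit = 5)
      target = term_vectors[term]
      term_vectors
        .reject  { |other, _| other == term }
        .map     { |other, vec| [other, cosine_similarity(target, vec)] }
        .sort_by { |_, score| -score }
        .first(limit)
    end

    # e.g. synonym_candidates("traffic", reduced_term_vectors)
    # (reduced_term_vectors is whatever hash of term => reduced row vector
    # your pipeline produces)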
Hope this makes things clearer.