Discovering synonyms from a set of documents using an LSA transform in Ruby
After applying the LSA transform to a document array, how can this be used to generate synonyms? For instance, I have the following sample documents:
D1 = Mobilization
D2 = Reflective Pavement
D3 = Maintenance of Traffic
D4 = Special Detour
D5 = Commercial Materials for Driveway

D1 D2 D3 D4 D5
commerci[ +0.00 +0.00 +0.00 +0.00 +1.00 ]
materi[ +0.00 +0.00 +0.00 +0.00 +1.00 ]
drivewai[ +0.00 +0.00 +0.00 +0.00 +1.00 ]
special[ +0.00 +0.00 +0.00 +1.00 +0.00 ]
detour[ +0.00 +0.00 +0.00 +1.00 +0.00 ]
mainten[ +0.00 +0.00 +1.00 +0.00 +0.00 ]
traffic[ +0.00 +0.00 +1.00 +0.00 +0.00 ]
reflect[ +0.00 +1.00 +0.00 +0.00 +0.00 ]
pavement[ +0.00 +1.00 +0.00 +0.00 +0.00 ]
mobil [ +1.00 +0.00 +0.00 +0.00 +0.00 ]
Applying TFIDF transform
D1 D2 D3 D4 D5
commerci[ +0.00 +0.00 +0.00 +0.00 +0.54 ]
materi[ +0.00 +0.00 +0.00 +0.00 +0.54 ]
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.54 ]
special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]
detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]
mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]
traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]
reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]
mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]
Applying LSA transform
D1 D2 D3 D4 D5
commerci[ +0.00 +0.00 +0.00 +0.00 +0.00 ]
materi[ +0.00 +0.00 +0.00 +0.00 +0.00 ]
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.00 ]
special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]
detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]
mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]
traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]
reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]
mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]
Firstly, this example won't work as given. The principle behind LSA is that the more frequently words occur in similar contexts, the more related they are in meaning, so there needs to be some overlap between the input documents; in your sample every term appears in exactly one document, so there is no co-occurrence to exploit. Paragraph-length documents are ideal, since they contain a reasonable number of words and there tends to be a single topic per paragraph.
To understand how LSA is useful for synonym recognition, you first need to understand how a vector-space representation of word occurrences (the first matrix you've got there) is useful for synonym recognition in the first place: you can calculate the distance between two items in this high-dimensional vector space as a measure of their similarity, since that distance reflects how often they occur together. The magic of LSA is that it reshuffles the dimensions of the vector space, so that items which don't occur together but do occur in similar contexts are brought closer together by collapsing similar dimensions into each other.
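For concreteness, here is a minimal plain-Ruby sketch of that first matrix: a term-document count matrix where each term's row is its vector in "document space". It assumes naive lower-cased whitespace tokenization rather than the stemming and stop-word removal your output shows, so the terms won't match yours exactly.

    # Minimal sketch: build a term-document count matrix from raw strings,
    # assuming naive tokenization (no stemming, no stop-word removal).
    docs = [
      "Mobilization",
      "Reflective Pavement",
      "Maintenance of Traffic",
      "Special Detour",
      "Commercial Materials for Driveway"
    ]

    tokenized = docs.map { |d| d.downcase.split(/\W+/) }
    terms     = tokenized.flatten.uniq

    # Each term's row is its vector in "document space".
    term_vectors = terms.each_with_object({}) do |term, vectors|
      vectors[term] = tokenized.map { |tokens| tokens.count(term) }
    end

    # => { "mobilization" => [1, 0, 0, 0, 0],
    #      "reflective"   => [0, 1, 0, 0, 0], ... }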
The idea of the TFIDF weighting function is to highlight the differences between documents, by giving higher weightings to words that appear frequently but only in a small subset of the corpus, and lower weightings to words that are used everywhere. (A more thorough explanation of TFIDF is easy to find elsewhere.)
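Continuing from the term_vectors hash above, here is a rough sketch of one common TF-IDF variant: term count divided by document length, times ln(N / df). Treat the exact formula as an assumption; it happens to be consistent with the 1.61, 0.80 and 0.54 values in your output (ln(5) divided by the number of stemmed, stop-word-filtered terms per document), but your library may differ in details such as log base or normalization.

    # Rough sketch of one TF-IDF variant applied to the counts above:
    # tf = count / document length, idf = ln(number of docs / doc frequency).
    doc_lengths = tokenized.map(&:length)
    num_docs    = docs.length

    tfidf_vectors = term_vectors.each_with_object({}) do |(term, counts), weighted|
      doc_freq = counts.count { |c| c > 0 }
      idf      = Math.log(num_docs.to_f / doc_freq)
      weighted[term] = counts.each_with_index.map do |count, i|
        (count.to_f / doc_lengths[i]) * idf
      end
    end

    # "mobilization" (the only word of a one-word document) gets ln(5) ~= 1.61,
    # matching the largest value in your TFIDF matrix. Values for D3 and D5
    # will differ slightly here because this sketch keeps stop words.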
The "LSA" transformation is actually a singular-value decomposition (SVD) – conventionally Latent Semantic Analysis or Latent Semantic Indexing refers to the combination TFIDF with SVD – and it serves to reduce the dimensionally of the vector space, or in other words, it reduces the number of columns into a smaller, more concise description (as described above).
So, to get to the nub of your question: you can tell how similar two words are by applying a distance function to the two corresponding row vectors. There are several distance functions to choose from, but the most commonly used is the cosine distance (which measures the angle between the two vectors).
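As a minimal sketch, assuming you have a hash mapping each term to its (reduced) row vector as a plain array of numbers, like the term_vectors / tfidf_vectors hashes sketched above:

    # Cosine similarity between two term vectors (rows of the reduced
    # matrix), represented as plain arrays of numbers.
    def cosine_similarity(a, b)
      dot    = a.zip(b).sum { |x, y| x * y }
      norm_a = Math.sqrt(a.sum { |x| x * x })
      norm_b = Math.sqrt(b.sum { |x| x * x })
      return 0.0 if norm_a.zero? || norm_b.zero?
      dot / (norm_a * norm_b)
    end

    # Rank every other term by similarity to a target term; the top of the
    # list is your synonym candidates.
    def synonym_candidates(term, term_vectors, limit = 5)
      target = term_vectors[term]
      term_vectors
        .reject  { |other, _| other == term }
        .map     { |other, vec| [other, cosine_similarity(target, vec)] }
        .sort_by { |_, score| -score }
        .first(limit)
    end

    # e.g. synonym_candidates("traffic", reduced_term_vectors)
    # (reduced_term_vectors is whatever hash of term => reduced row vector
    # your pipeline produces)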
Hope this makes things clearer.