Perceptual hash function for text [closed]
Does anyone know a simple perceptual hash algorithm for text? I took a look at the pHash function ph_texthash,
but I want a simpler algorithm.
Preferably in Python. Thank you!
A blog post about perceptual hash functions (in the imaging context):
- http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
and some related Python code (it deals with images, not text, but may be adaptable):
- http://sprunge.us/WcVJ?py (53 LOC)
As I understand it from this short presentation on Perceptual Hashing of Textual Content, there are numerous approaches, which differ along several dimensions (the level of the text considered, a linguistic versus a statistical approach, the model chosen to represent the text, ...). The right one will depend on your domain and the problem you are trying to solve.
You might also look into locality-sensitive hashing (LSH), which
is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).
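To make the locality-sensitive idea concrete for text, here is a minimal sketch of SimHash, one well-known LSH-style fingerprint. It is not what pHash's ph_texthash does. The function names (shingles, simhash, hamming_distance), the 3-word shingle size, the 64-bit width and the use of MD5 as the per-feature hash are all my own assumptions for illustration. Similar texts should tend to produce fingerprints that differ in only a few bits.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: use it as a single feature.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, bits=64, k=3):
    """Compute a SimHash fingerprint; bits must be a multiple of 8 and at most 128 (MD5 size)."""
    counts = [0] * bits
    for feature in shingles(text, k):
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set when the votes for it are positive.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

To compare two documents, compute simhash() for each and treat a small hamming_distance() as "probably similar". The threshold is something you would have to tune on your own data. Because the text is lowercased and split on word characters, changes in case and punctuation alone do not affect the fingerprint.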