Image-hashing algorithm to produce natural primary-key values that work well in PostgreSQL table indices?
I'm building out a set of cooperative data stores with images, and I'm starting to implement some simple/trivial content-based search and sort algorithms: SIFT, sparse color-histogram distance, basic SVD, etc.
I am currently using sha1 hashes of binary data as indices in PostgreSQL tables. These hashes are 'dumb' -- they're calculated by feeding the data in question* straight to Python's hashlib.sha1, and stored in nullable char columns that are exactly as long as the sha1's base64 representation.
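For reference, the current 'dumb' key scheme can be sketched roughly as follows. The function name `sha1_key` is made up for illustration; the only fixed facts are that a SHA-1 digest is 20 bytes and that its padded base64 text is 28 characters, which is the width the char column would need.

```python
import base64
import hashlib


def sha1_key(data: bytes) -> str:
    """Return the base64 text of the SHA-1 digest of `data` (hypothetical helper)."""
    digest = hashlib.sha1(data).digest()  # 20 raw bytes
    # 20 bytes encode to 28 base64 characters, including one '=' pad character.
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    key = sha1_key(b"example image bytes")
    print(key, len(key))
```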
It would be ideal to have a hash algorithm that yields hashes suitable for indexing Postgres tables, but that also describes the image in some way, à la pHash or a Hamming-distance comparison. While pHash looks like a good candidate, it turns out to require the use of a proprietary storage engine and API... I'm looking for something less 'turn-key' that will play nice with my existing Python/PostgreSQL/Solr/Redis-based ecosystem.
It doesn't have to be the fastest -- it's more important for me to implement an algorithm (or algorithms) that can be hacked up a bit and stay somewhat cogent.
( * ) mostly this consists of untransformed or lightly transformed harvests from my images -- things like: JPEG/PNG/DNG image file content, ICC profile data structures, JSON dumps of EXIF/IPTC tagsets, and the like.
A quite interesting approach is described in http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/.
Basically, the image is scaled to 15x15 px, then the intensity is calculated for each pixel (0.299 * red + 0.587 * green + 0.114 * blue). This array of 225 values (15 x 15) is stored in a PostgreSQL table column with a GIN/GiST index for fast search of similar images.
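A rough pure-Python sketch of that signature step, under some assumptions: the image is already decoded into rows of (red, green, blue) tuples, the downscale is a plain box average (the linked article may resize differently), and the name `intensity_signature` is invented here. A 15x15 grid gives 225 intensity values, which could then go into an integer array column.

```python
def luma(r, g, b):
    """Weighted intensity of one RGB pixel, using the weights quoted above."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def intensity_signature(pixels, size=15):
    """Reduce `pixels` (rows of (r, g, b) tuples) to size*size intensities.

    Assumption: a simple box average stands in for the resize step.
    """
    height, width = len(pixels), len(pixels[0])
    signature = []
    for gy in range(size):
        y0 = gy * height // size
        y1 = max((gy + 1) * height // size, y0 + 1)  # at least one source row
        for gx in range(size):
            x0 = gx * width // size
            x1 = max((gx + 1) * width // size, x0 + 1)  # at least one source column
            total, count = 0.0, 0
            for y in range(y0, y1):
                for x in range(x0, x1):
                    total += luma(*pixels[y][x])
                    count += 1
            signature.append(round(total / count))
    return signature


if __name__ == "__main__":
    white = [[(255, 255, 255)] * 30 for _ in range(30)]
    print(len(intensity_signature(white)))
```

Similar images should then produce arrays that differ in only a few positions, which is what the array index is meant to exploit.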
What about a space-filling curve, for example a Hilbert curve or a Moore curve?
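To make the idea concrete, here is a minimal sketch of the standard conversion from a 2D grid cell to its distance along a Hilbert curve. Points that are close on the curve are also close in the grid, so the resulting integer could serve as a sortable, B-tree-friendly key. Which two image features to feed in as `x` and `y` is left open; the function name `hilbert_index` is chosen only for this example.

```python
def hilbert_index(order, x, y):
    """Distance along the Hilbert curve of cell (x, y) in a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


if __name__ == "__main__":
    # Walk a 4x4 grid and print each cell's position on the curve.
    for y in range(4):
        print([hilbert_index(2, x, y) for x in range(4)])
```

Note that a curve like this only preserves locality for a small number of dimensions packed into one key, so it suits a compact descriptor better than a full 225-value signature.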