Intentional Hashing Collisions

2022-12-11 03:05 问答作者：

I'm trying to write some code that will do "fuzzy hashing". That is: I want several inputs to hash to the same output so that I can do searches etc quickly and easily. If A hashes to 1 and C hashes to 1, it will be trivial for me to find out that A is equivalent to C.

Designing such a hash function seems hard, so I was wondering if anyone had experience with CMPH or GPERF and could walk me through creating a function that would result in this hash function.

Thanks in advance! Stefan

@Ben

In this case, matrixes of booleans, but I can easily pack them into 64 bit integers. Rotations, translations, etc in the inp开发者_StackOverflow社区ut are irrelevant and need to be weeded out. Thus:

000
111
000

Is equivalent to

111
000
000

and

001
001
001

(simplification)

@Kinopiko

My best bet thus far would be to determine some sort of "canonical" representation and design code that terminates when the transformations reach such a representation (say...packing all the bits at the bottom). Yet this is slow and I'm looking for a better way. My data set is large.

@Jason

These two would not hash to the same value.

Let's take SOUNDEX, as an illustration of what you may be looking for...

it does hash distinct but similar keys with the same value
it would allow asserting that entity A (say, "McDonald") is possibly the same as entity C ("MacDonnel")

One characteristic of SOUNDEX is that it works well in a specific domain (that of names, particular family names) and that it leverages rules associated with the pronunciation of words [in the English language, and by extension many languages of the same origin].

Your hashing (or is it almost a form of indexing?) algorithm will only be successful if a [relatively simple] set of rules exist (or can be discovered) to express the "sameness" of the items/records considered. For example if the underlying database pertains to automobiles, and if the criteria for "sameness" is that of size and general performance, maybe the hashing algorithm should be based on attributes from the record such as price (converted to a range), number of cylinders, number of doors, and maybe estimated gas mileage (too converted to a range).

In a nutshell, I hope the above illustrates the need to tailor the algorithm to the semantics we wish to associate with hash value identity (or proximity... hence looking more and more like an index...), as well as these semantics are represented within the available data from the items.

Items may be similar along many very dimensions. It is a matter of defining what these dimensions are are how the attributes "on" this dimension can be used as keys to a hashing function.

On CMPH and gperf...
These are implementations of perfect, optionally mimimal, hash functions. This kind of function would allow you to prevent collisions. Not what is needed here (I think)

You can check out MinHash, which is a probablistic method for cases where items are defined by set membership.

I know you want to design your own hash function, but maybe this is what you are looking for.

The chd algorithm implemented by cmph can do k-perfect hashing, which I think it is exactly what you are looking for:

http://cmph.sourceforge.net/chd.html

However, it would be better to know what you mean by "large input". If you are talking about hundreds of thousands entries, there are more straightforward solutions. If you have hundreds of millions then chd is probably your best bet.

In your example matrices, rotations and translations, etc seem to me to cover pretty much everything such that your algorithm would say that all matrices are equal. Can you give an example of sets of matrices that are different?

Also, it looks like you're doing something similar to image searching - algorithms for identifying images that are the same but possibly scaled, rotated (possibly arbitrarily), flipped, etc. I don't know much about any of this stuff, but maybe you could look to that research area for inspiration.

Also, I seem to remember something from Vector Calc about Eigenvalues which you might find useful.

How collision-free does the hash have to be (in terms of false-positives)?

What about:

sorting the numbers, then running a standard hash on that sequence?
Or, treat each rows/columns as a set and hash the contents of each using addition. Then sort those results and run a standard hash across the sorted sums?
Or, the product of 1 and 2.

For example, here it is in Java (quick & dirty, but you get the idea):

import java.util.*;

/**
 *
 * @author Mark Bolusmjak
 */
public class MatrixTest {


  LinkedList<LinkedList<Integer>> randomMatrix(int size){
    LinkedList<LinkedList<Integer>> rows = new LinkedList<LinkedList<Integer>>();
    for (int i=0; i<size; i++){
      LinkedList<Integer> newRow = new LinkedList<Integer>();
      for (int j=0; j<size; j++){
        newRow.add((int)(5*Math.random()));
      }
      rows.add(newRow);
    }
    return rows;
  }

  LinkedList<LinkedList<Integer>> trans(LinkedList<LinkedList<Integer>> m){
    if (Math.random()<0.5){ //column translation
      for (LinkedList<Integer> integers : m) {
        integers.addFirst(integers.removeLast());
      }
    } else { //row translation
      m.addFirst(m.removeLast());
    }
    return m;
  }

  LinkedList<LinkedList<Integer>> flipDiagonal(LinkedList<LinkedList<Integer>> m){
    LinkedList<LinkedList<Integer>> flipped = new LinkedList<LinkedList<Integer>>();
    for (int i=0; i<m.size(); i++){
      flipped.add(new LinkedList<Integer>());
    }

    for (LinkedList<Integer> mRows : m) {
      Iterator<Integer> listIterator = mRows.iterator();
      for (LinkedList<Integer> flippedRows : flipped) {
        flippedRows.add(listIterator.next());
      }
    }
    return flipped;
  }


  public static void main(String[] args) {
    MatrixTest mt = new MatrixTest();
    LinkedList<LinkedList<Integer>> m = mt.randomMatrix(4);
    mt.display(m);

    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.trans(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.flipDiagonal(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.trans(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

    m = mt.flipDiagonal(m);
    mt.display(m);
    System.out.println(mt.hash1(m));
    System.out.println(mt.hash2(m));

  }


  private void display(LinkedList<LinkedList<Integer>> m){
    for (LinkedList<Integer> integers : m) {
      System.out.println(integers);
    }
    System.out.println("");
  }

  int hash1(LinkedList<LinkedList<Integer>> m){
    ArrayList<Integer> sorted = new ArrayList<Integer>();

    for (LinkedList<Integer> integers : m) {
      for (Integer integer : integers) {
        sorted.add(integer);
      }
    }
    Collections.sort(sorted);
    return sorted.hashCode();
  }

  int hash2(LinkedList<LinkedList<Integer>> m){
    List<Integer> rowColumnHashes = new ArrayList<Integer>();
    for (LinkedList<Integer> row : m) {
      int hash = 0;
      for (Integer integer : row) {
        hash += integer;
      }
      rowColumnHashes.add(hash);
    }

    m = flipDiagonal(m);
    for (LinkedList<Integer> row : m) {
      int hash = 0;
      for (Integer integer : row) {
        hash += integer;
      }
      rowColumnHashes.add(hash);
    }

    Collections.sort(rowColumnHashes);
    return rowColumnHashes.hashCode();
  }



} // end of class

继续阅读：hash math

Intentional Hashing Collisions

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？