
How to check (non-trivial) equivalence of lists of numbers, fast?

I have a list of integers, for example 1,2,2,3,4,1. I need to be able to check for equivalence (==) between different lists.

However, I do not mean a simple number wise comparison. Each of these lists actually denotes a set partition, where the position in the list denotes the index of an element and the number denotes an index of the group. For example in the former, element 0 and element 5 are in the same group, element 1 and 2 are in the same group and element 3 and 4 are both in their own individual groups. The actual index of the group is not important, only the grouping.

I need to be able to test equivalence in this sense, so for example the previous list would be equivalent to 5,3,3,2,9,5, since they have the same grouping.

The way I have been doing this is by reducing the array to a kind of normal form. I find all numbers having the same value as the first number and set them all to 0. I then continue through the list until I find a new number, find all numbers with the same value as this one, and set them all to 1. I continue in this manner.

In my example, both lists would reduce to 0,1,1,2,3,0, and of course I can then just use a simple comparison to see if they are equivalent.

However this is quite slow, as I have to make several linear passes over the list. So to cut to the chase, is there any more efficient manner of reducing these numbers to this normal form?

However, more generally, can I avoid this reduction altogether and compare arrays in a different and perhaps more efficient manner?

Implementation details

  • These arrays are actually implemented as bitsets to save space, so I really do have to iterate over the whole list every time as there is no rb_tree esque hashing going on.
  • Large numbers of these arrays will be stored in an STL unordered_set, hence the requirement for a hash should be taken into consideration.


Try iterating through the two sequences in parallel, keeping a map (either std::map or an array) from values in the first array to values in the second and vice versa. If you get to a pair that is not in your table, add it, unless there is something in the table for either that first or second number (since that would indicate inequality). For example:

1,2,2,3,4,1
5,3,3,2,9,5

You would add 1->5, 2->3, 3->2, and 4->9 and the comparison would pass. For something slightly different:

5,3,3,2,9,5
1,2,2,3,2,1

you would add 5->1, 3->2, 2->3, then 9->2 would fail since there is already a binding for 2 in the second sequence; thus, you would know that the sequences were not equivalent.
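The two-way table described above could be sketched in C++ roughly as follows. This is only an illustration: the function name `samePartition` and the use of `std::vector<int>` are assumptions made for the example (the question stores its lists as bitsets), and `std::unordered_map` stands in for whichever map or array fits the actual label range.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: true when `a` and `b` group their positions identically,
// regardless of which numbers are used as group labels.
bool samePartition(const std::vector<int>& a, const std::vector<int>& b) {
    if (a.size() != b.size()) return false;

    std::unordered_map<int, int> aToB;  // label in a -> label in b
    std::unordered_map<int, int> bToA;  // label in b -> label in a

    for (std::size_t i = 0; i < a.size(); ++i) {
        // insert leaves an existing binding untouched and returns an
        // iterator to whichever entry is now stored for that key.
        auto fwd = aToB.insert({a[i], b[i]}).first;
        auto bwd = bToA.insert({b[i], a[i]}).first;

        // Any disagreement with an earlier binding, in either direction,
        // means the groupings differ.
        if (fwd->second != b[i] || bwd->second != a[i]) return false;
    }
    return true;
}
```

With the sequences from the examples, `samePartition({1,2,2,3,4,1}, {5,3,3,2,9,5})` succeeds, while `samePartition({5,3,3,2,9,5}, {1,2,2,3,2,1})` stops at the 9->2 pair.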

For creating a hash function, you would probably need to do the normalization that you are doing, but it should require only one pass through the sequence. Again, keep maps in both directions, but if you find an unknown element in the input sequence, map it to the next available number, and otherwise use the map to transform the input sequence into a normalized one.


For an alphabet of K symbols and an array of N of these symbols, you should be able to produce the signature (or canonical representation) of the array in O(N), using a hash table, or in O(N log K) using a binary search tree.

The trick is to perform the conversion of all symbols in one pass:

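#include <cstddef>
#include <unordered_map>
#include <vector>

// `array` is assumed here to be a std::vector<std::size_t>;
// adapt the parameter type to the actual container.
std::vector<std::size_t> signatureOf(const std::vector<std::size_t>& array) {
  std::unordered_map<std::size_t, std::size_t> map;

  std::vector<std::size_t> signature;
  signature.reserve(array.size());

  for (std::size_t i : array) {
    // insert only inserts if the key is not already present;
    // it returns std::pair<iterator, bool> with the iterator pointing
    // to the pair {key: i, value: index}
    std::size_t index = map.insert({i, map.size()}).first->second;
    signature.push_back(index);
  }
  return signature;
}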

The hash of the array is then the hash of its signature.

But more fundamentally, there is no reason not to put all arrays in their canonical representation once and for all.
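As a rough sketch of storing canonical forms once and for all, the set could hold only normalized arrays, so that the default `==` and a plain hash over the signature are sufficient. Everything named here (`Labels`, `canonicalize`, `LabelsHash`, `PartitionSet`, `insertPartition`) is hypothetical, and the mixing step in the hash is just one common combine formula, not something prescribed by the standard library.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Labels = std::vector<std::size_t>;  // assumed container type

// One pass: renumber labels 0, 1, 2, ... in order of first appearance.
Labels canonicalize(const Labels& labels) {
    std::unordered_map<std::size_t, std::size_t> seen;
    Labels out;
    out.reserve(labels.size());
    for (std::size_t v : labels)
        out.push_back(seen.insert({v, seen.size()}).first->second);
    return out;
}

// Hash over an already-canonical array (arbitrary illustrative mixing).
struct LabelsHash {
    std::size_t operator()(const Labels& s) const {
        std::size_t h = s.size();
        for (std::size_t v : s)
            h ^= std::hash<std::size_t>{}(v) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

// The set only ever contains canonical forms, so vector's own == applies.
using PartitionSet = std::unordered_set<Labels, LabelsHash>;

// Returns true if this grouping was not in the set before.
bool insertPartition(PartitionSet& set, const Labels& labels) {
    return set.insert(canonicalize(labels)).second;
}
```

The cost of normalization is then paid once per array at insertion time rather than on every comparison.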


You could override the hashing algorithm and create a hash key that uniquely encodes the grouping. That way when each array is inserted into the hash table, all arrays that have the same group encoding will be chained into the same hash location. Once all arrays are inserted, the arrays will already be grouped.

A possible encoding could be: [1 2 2 3 4 1] would hash to 162345. (Um... sorry that encoding is non-unique).

We need a unique encoding, so we need to record both the position and count of the grouping in the array. So how about

[1 2 2 3 4 1] -> 1622324151 (from left to right: the 1-based positions of each group's elements, followed by the group's cardinality)

[ 5 5 5 9 9 9] -> 12334563

[ 1 2 3 4 5 6] -> 112131415161

I'm sure there are more clever encodings, but this would be a very fast hash.
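To make the encoding concrete, here is one way it might be computed. The function name `groupEncoding` and the choice of a `std::string` result are assumptions for illustration. Note also that, as written, the digits are concatenated without a separator, so the result can become ambiguous once positions or counts reach two digits; a delimiter would be needed for longer lists.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: for each group, in order of first appearance, emit the
// 1-based positions of its elements followed by the size of the group.
std::string groupEncoding(const std::vector<int>& list) {
    std::vector<int> order;                             // labels by first appearance
    std::map<int, std::vector<std::size_t>> positions;  // label -> 1-based positions

    for (std::size_t i = 0; i < list.size(); ++i) {
        std::vector<std::size_t>& p = positions[list[i]];
        if (p.empty()) order.push_back(list[i]);  // first time this label is seen
        p.push_back(i + 1);
    }

    std::string out;
    for (int label : order) {
        const std::vector<std::size_t>& p = positions[label];
        for (std::size_t pos : p) out += std::to_string(pos);
        out += std::to_string(p.size());  // cardinality of the group
    }
    return out;
}
```

Because the labels themselves never appear in the output, `groupEncoding({1,2,2,3,4,1})` and `groupEncoding({5,3,3,2,9,5})` produce the same string.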

Paul


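If you know the maximum possible "group", then you can do something like this (a C++ sketch, assuming labels lie in [0, maxGroup) and using -1 for "no binding yet"). A reverse table is kept as well, so that two different groups in the first list cannot both map to the same group in the second:

#include <cstddef>
#include <vector>
bool sameGrouping(const std::vector<int>& list1, const std::vector<int>& list2, int maxGroup) {
    if (list1.size() != list2.size()) return false;
    std::vector<int> mapping(maxGroup, -1), reverse(maxGroup, -1);
    for (std::size_t i = 0; i < list1.size(); i++) {
        int a = list1[i], b = list2[i];
        if (mapping[a] == -1 && reverse[b] == -1) { mapping[a] = b; reverse[b] = a; }
        if (mapping[a] != b || reverse[b] != a) return false;
    }
    return true;
}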