Counting number of unique pairs and instances of non-unique pairs in unsorted data

2023-04-07 04:49

I have data in the form of:

ID   ATTR
3    10
1    20
1    20
4    30
...  ...

The ID and ATTR columns are unsorted and may contain duplicates. IDs range from 1 to about 20,000, and ATTR values are unsigned ints. I need to process anywhere between 100,000 and 500,000 pairs at a time.

I am looking for:

The number of unique pairs.
The number of times a non-unique pair pops up.

So in the above data, I'd want to know that (1,20) appeared twice and that there were 3 unique pairs.

I'm currently using a hash table in my naive approach. I keep a counter of unique pairs, and decrement the counter if the item I am inserting is already there. I also keep an array of IDs of the non-unique pairs. (All on first encounters.)

Performance and size are about equal concerns. I'm actually OK with a relatively high (say 0.5%) rate of false positives given the performance and size concerns. (I've also implemented this using a spectral Bloom filter.)
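To make the approximate route concrete, here is a minimal sketch of a plain Bloom filter (not the spectral variant mentioned above, which also tracks counts). The names `bloom`, `bloom_init`, `bloom_test_and_set`, and `bloom_free` are made up for illustration, and the code assumes ATTR fits in 32 bits. By the standard sizing formulas, a 0.5% false-positive rate needs roughly 11 bits per distinct pair and 8 hash functions, so 500,000 pairs come to about 5.5 million bits (around 690 KB).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical plain Bloom filter over packed (id, attr) keys. */
typedef struct {
    uint8_t *bits;     /* bit array, nbits bits long */
    uint64_t nbits;    /* total number of bits */
    unsigned nhashes;  /* number of probe positions per key */
} bloom;

/* 64-bit mixing step (the finalizer from MurmurHash3). */
static uint64_t bloom_mix64(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

/* Returns 0 on success, -1 if the bit array cannot be allocated. */
int bloom_init(bloom *b, uint64_t nbits, unsigned nhashes) {
    assert(nbits > 0 && nhashes > 0);
    b->bits = calloc((size_t)((nbits + 7) / 8), 1);
    if (b->bits == NULL) return -1;
    b->nbits = nbits;
    b->nhashes = nhashes;
    return 0;
}

/* Marks the pair as seen. Returns 1 if every bit was already set
 * ("probably seen before", may be a false positive) and 0 if the
 * pair is definitely new. */
int bloom_test_and_set(bloom *b, uint32_t id, uint32_t attr) {
    uint64_t key = ((uint64_t)id << 32) | attr;
    /* Double hashing: derive all probe positions from two hashes. */
    uint64_t h1 = bloom_mix64(key);
    uint64_t h2 = bloom_mix64(key ^ 0x9e3779b97f4a7c15ULL) | 1;
    int seen = 1;
    for (unsigned i = 0; i < b->nhashes; i++) {
        uint64_t bit = (h1 + i * h2) % b->nbits;
        uint8_t mask = (uint8_t)(1u << (bit & 7));
        if (!(b->bits[bit >> 3] & mask)) {
            seen = 0;
            b->bits[bit >> 3] |= mask;
        }
    }
    return seen;
}

void bloom_free(bloom *b) {
    free(b->bits);
    b->bits = NULL;
}
```

Counting the calls that return 0 gives the number of unique pairs, slightly underestimated by the false-positive rate. A plain filter like this cannot say how many times a given pair repeated; that needs per-key counts, as in a counting or spectral variant or a real hash table.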

I'm not that smart, so I'm sure there's a better solution out there, and I'd like to hear about your favorite hash table implementations/any other ideas. :)

A hash table with keys like <id>=<attr> is an excellent solution to this problem. If you can tolerate errors, you can get smaller/faster with a Bloom filter, I guess. But do you really need to do that?
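One possible shape for such a table, as a rough sketch rather than a tuned implementation: pack the pair into a single 64-bit key and store a count per key in a fixed-size open-addressing table with linear probing. The names `pair_table`, `pair_table_init`, `pair_table_add`, `pair_table_count`, and `pair_table_free` are invented here, and the code assumes ATTR fits in 32 bits and that the capacity is chosen up front.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One slot: the packed (id, attr) key and how often it was seen. */
typedef struct {
    uint64_t key;    /* (id << 32) | attr */
    uint32_t count;  /* 0 marks an empty slot */
} pair_slot;

typedef struct {
    pair_slot *slots;
    size_t capacity;  /* must be a power of two */
    size_t distinct;  /* pairs seen at least once */
    size_t repeated;  /* pairs seen more than once */
} pair_table;

/* 64-bit mixing step (the finalizer from MurmurHash3). */
static uint64_t pair_mix64(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

/* Returns 0 on success, -1 if allocation fails. */
int pair_table_init(pair_table *t, size_t capacity) {
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    t->slots = calloc(capacity, sizeof *t->slots);
    if (t->slots == NULL) return -1;
    t->capacity = capacity;
    t->distinct = 0;
    t->repeated = 0;
    return 0;
}

/* Finds the slot holding key, or the empty slot where it would go.
 * Returns NULL only if the table is full and the key is absent. */
static pair_slot *pair_find(const pair_table *t, uint64_t key) {
    size_t mask = t->capacity - 1;
    size_t i = (size_t)pair_mix64(key) & mask;
    for (size_t probes = 0; probes < t->capacity; probes++) {
        pair_slot *s = &t->slots[i];
        if (s->count == 0 || s->key == key) return s;
        i = (i + 1) & mask;  /* linear probing */
    }
    return NULL;
}

/* Records one pair; returns its updated count, or 0 if the table is full. */
uint32_t pair_table_add(pair_table *t, uint32_t id, uint32_t attr) {
    uint64_t key = ((uint64_t)id << 32) | attr;
    pair_slot *s = pair_find(t, key);
    if (s == NULL) return 0;
    if (s->count == 0) {
        s->key = key;
        t->distinct++;
    } else if (s->count == 1) {
        t->repeated++;  /* first time this pair shows up again */
    }
    return ++s->count;
}

/* Returns how many times the pair has been recorded (0 if never). */
uint32_t pair_table_count(const pair_table *t, uint32_t id, uint32_t attr) {
    pair_slot *s = pair_find(t, ((uint64_t)id << 32) | attr);
    return s ? s->count : 0;
}

void pair_table_free(pair_table *t) {
    free(t->slots);
    t->slots = NULL;
}
```

For the sample data in the question, adding (3,10), (1,20), (1,20), (4,30) leaves `distinct` at 3 and `pair_table_count` reporting 2 for (1,20). With at most 500,000 pairs, a capacity of `1 << 20` keeps the table under half full; at 16 bytes per slot on typical 64-bit platforms that is about 16 MiB.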
