Finding data gaps with bit masking

2023-01-29 00:57 问答作者：

I'm faced with a problem of finding discontinuities (gaps) of a given length in a sequence of numbers. So, for example, given [1,2,3,7,8,9,10] and a gap of length=3, I'll find [4,5,6]. If the gap is length=4, I'll find nothing. The real sequence is, of course, much longer. I've seen this problem in quite a few posts, and it had various applications and possible implementations.

One way I thought might work and should be relatively quick is to represent the complete set as a bit array containing 1 for available number and 0 for missing - so the above will look like [1,1,1,0,0,0,1,1,1,1]. Then possibly run a window function that'll XOR mask an array of the given length with the complete set until all locations result in 1. This will require a single pass over the whole sequence in roughly ~O(n), plus the cost of masking in each run.

Here's what I managed to come up with:

def find_gap(array, start=0, length=10):
    """
    array:  assumed to be of length MAX_NUMBER and contain 0 or 1 
            if the value is actually present
    start:  indicates what value to start looking from
    length: what the length the gap should be
    """

    # create the bitmask to check against
    mask = ''.join( [1] * length )

    # convert the input 0/1 mapping to bit开发者_运维问答 string
    # e.g - [1,0,1,0] -> '1010'
    bits =''.join( [ str(val) for val in array ] )

    for i in xrange(start, len(bits) - length):

        # find where the next gap begins
        if bits[i] != '0': continue

        # gap was found, extract segment of size 'length', compare w/ mask
        if (i + length < len(bits)):
            segment = bits[i:i+length]

            # use XOR between binary masks
            result  = bin( int(mask, 2) ^ int(segment, 2) )

            # if mask == result in base 2, gap found
            if result == ("0b%s" % mask): return i

    # if we got here, no gap exists
    return -1

This is fairly quick for ~100k (< 1 sec). I'd appreciate tips on how to make this faster / more efficient for larger sets. thanks!

Find the differences between adjacent numbers, and then look for a difference that's large enough. We find the differences by constructing two lists - all the numbers but the first, and all the numbers but the last - and subtracting them pairwise. We can use zip to pair the values up.

def find_gaps(numbers, gap_size):
    adjacent_differences = [(y - x) for (x, y) in zip(numbers[:-1], numbers[1:])]
    # If adjacent_differences[i] > gap_size, there is a gap of that size between
    # numbers[i] and numbers[i+1]. We return all such indexes in a list - so if
    # the result is [] (empty list), there are no gaps.
    return [i for (i, x) in enumerate(adjacent_differences) if x > gap_size]

(Also, please learn some Python idioms. We prefer direct iteration, and we have a real boolean type.)

You could use XOR and shift and it does run in roughly O(n) time.

However, in practice, building an index (hash list of all gaps greater then some minimum length) might be a better approach.

Assuming that you start with a sequence of these integers (rather than a bitmask) then you build an index by simply walking over the sequence; any time you find a gap greater than your threshold you add that gap size to your dictionary (instantiate it as an empty list if necessary, and then append the offset in the sequence.

At the end you have a list of every gap (greater than your desired threshold) in your sequence.

One nice thing about this approach is that you should be able to maintain this index as you modify the base list. So the O(n*log(n)) initial time spent building the index is amortized by O(log(n)) cost for subsequent queries and updates to the indexes.

Here's a very crude function to build the gap_index():

def gap_idx(s, thresh=2):
    ret = dict()
    lw = s[0]  # initial low val.
    for z,i in enumerate(s[1:]):
        if i - lw < thresh:
            lw = i
            continue
        key = i - lw
        if key not in ret:
            ret[key] = list()
        ret[key].append(z)
        lw = i
    return ret

A class to maintain both a data set and the index might best be built around the built-in 'bisect' module and its insort() function.

Pretty much what aix did ... but getting only the gaps of the desired length:

def findGaps(mylist, gap_length, start_idx=0):
    gap_starts = []
    for idx in range(start_idx, len(mylist) - 1):
        if mylist[idx+1] - mylist[idx] == gap_length + 1:
            gap_starts.append(mylist[idx] + 1)

    return gap_starts

EDIT: Adjusted to the OP's wishes.

These provide a single walk of your input list.

List of gap values for given length:

from itertools import tee, izip
def gapsofsize(iterable, length):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x == length+1 for p in xrange(x+1,y) )

print list(gapsofsize([1,2,5,8,9], 2))

[3, 4, 6, 7]

All gap values:

def gaps(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x > 1 for p in xrange(x+1,y) )

print list(gaps([1,2,4,5,8,9,14]))

[3, 6, 7, 10, 11, 12, 13]

List of gaps as vectors:

def gapsizes(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( (x+1, y-x-1) for x, y in izip(a, b) if y-x > 1 )

print list(gapsizes([1,2,4,5,8,9,14]))

[(3, 1), (6, 2), (10, 4)]

Note that these are generators and consume very little memory. I would love to know how these perform on your test dataset.

If it's efficiency you're after, I'd do something along the following lines (where x is the list of sequence numbers):

for i in range(1, len(x)):
  if x[i] - x[i - 1] == length + 1:
    print list(range(x[i - 1] + 1, x[i]))

继续阅读：arrays bitmask performance python

Finding data gaps with bit masking

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？