Is there lexographical version of searchsorted in numpy?

2023-03-24 16:11 问答作者：

I have two arrays which are lex-sorted.

In [2]: a = np.array([1,1,1,2,2,3,5,6,6])
In [3]: b = np.array([10,20,30,5,10,100,10,30,40])
In [4]: ind = np.lexsort((b, a)) # sorts elements first by a and then by b
In [5]: print a[ind]
[1 1 1 2 2 3 5开发者_JAVA技巧 6 6]
In [7]: print b[ind]
[ 10  20  30   5  10 100  10  30  40]

I want to do a binary search for (2, 7) and (5, 150) expecting (4, 7) as the answer.

In [6]: np.lexsearchsorted((a,b), ([2, 5], [7,150]))

We have searchsorted function but that works only on 1D arrays.

EDIT: Edited to reflect comment.

def comp_leq(t1,t2):
    if (t1[0] > t2[0]) or ((t1[0] == t2[0]) and (t1[1] > t2[1])):
        return 0
    else:
        return 1

def bin_search(L,item):
    from math import floor
    x = L[:]
    while len(x) > 1:
        index = int(floor(len(x)/2) - 1)
        #Check item
        if comp_leq(x[index], item):
            x = x[index+1:]
        else:
            x = x[:index+1]
    out = L.index(x[0])
    #If greater than all
    if item >= L[-1]:
        return len(L)
    else:
        return out

def lexsearch(a,b,items):
    z = zip(a,b)
    return [bin_search(z,item) for item in items]

if __name__ == '__main__':
    a = [1,1,1,2,2,3,5,6,6]
    b = [10,20,30,5,10,100,10,30,40]
    print lexsearch(a,b,([2,7],[5,150])) #prints [4,7]

This code seems to do it for a set of (exactly) 2 lexsorted arrays

You might be able to make it faster if you create a set of values[-1], and than create a dictionary with the boundries for them.

I haven't checked other cases apart from the posted one, so please verify it's not bugged.

def lexsearchsorted_2(arrays, values, side='left'):
    assert len(arrays) == 2
    assert (np.lexsort(arrays) == range(len(arrays[0]))).all()
    # here it will be faster to work on all equal values in 'values[-1]' in one time
    boundries_l = np.searchsorted(arrays[-1], values[-1], side='left')
    boundries_r = np.searchsorted(arrays[-1], values[-1], side='right')
    # a recursive definition here will make it work for more than 2 lexsorted arrays
    return tuple([boundries_l[i] + 
                  np.searchsorted(arrays[-2[boundries_l[i]:boundries_r[i]], 
                                  values[-2][i], 
                                  side=side) 
                  for i in range(len(boundries_l))])

Usage:

import numpy as np
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
lexsearchsorted_2((b, a), ([7,150], [2, 5]))  # return (4, 7)

I ran into the same issue and came up with a different solution. You can treat the multi-column data instead as single entries using a structured data type. A structured data type will allow one to use argsort/sort on the data (instead of lexsort, although lexsort appears faster at this stage) and then use the standard searchsorted. Here is an example:

import numpy as np
from itertools import repeat

# Setup our input data
# Every row is an entry, every column what we want to sort by
# Unlike lexsort, this takes columns in decreasing priority, not increasing
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
data = np.transpose([a,b])

# Sort the data
data = data[np.lexsort(data.T[::-1])]

# Convert to a structured data-type
dt = np.dtype(zip(repeat(''), repeat(data.dtype, data.shape[1]))) # the structured dtype
data = np.ascontiguousarray(data).view(dt).squeeze(-1) # the dtype change leaves a trailing 1 dimension, ascontinguousarray is required for the dtype change

# You can also first convert to the structured data-type with the two lines above then use data.sort()/data.argsort()/np.sort(data)

# Search the data
values = np.array([(2,7),(5,150)], dtype=dt) # note: when using structured data types the rows must be a tuple
pos = np.searchsorted(data, values)

# pos is (4,7) in this example, exactly what you would want

This works for any number of columns, uses the built-in numpy functions, the columns remain in the "logical" order (decreasing priority), and it should be quite fast.

A compared the two two numpy-based methods time-wise.

#1 is the recursive method from @j0ker5 (the one below extends his example with his suggestion of recursion and works with any number of lexsorted rows)

#2 is the structured array from me

They both take the same inputs, basically like searchsorted except a and v are as per lexsort.

import numpy as np

def lexsearch1(a, v, side='left', sorter=None):
    def _recurse(a, v):
        if a.shape[1] == 0: return 0
        if a.shape[0] == 1: return a.squeeze(0).searchsorted(v.squeeze(0), side)
        bl = np.searchsorted(a[-1,:], v[-1], side='left')
        br = np.searchsorted(a[-1,:], v[-1], side='right')
        return bl + _recurse(a[:-1,bl:br], v[:-1])
    a,v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    if sorter is not None: a = a[:,sorter]
    bl = np.searchsorted(a[-1,:], v[-1,:], side='left')
    br = np.searchsorted(a[-1,:], v[-1,:], side='right')
    for i in xrange(len(bl)): bl[i] += _recurse(a[:-1,bl[i]:br[i]], v[:-1,i])
    return bl

def lexsearch2(a, v, side='left', sorter=None):
    from itertools import repeat
    a,v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    a_dt = np.dtype(zip(repeat(''), repeat(a.dtype, a.shape[0])))
    v_dt = np.dtype(zip(a_dt.names, repeat(v.dtype, a.shape[0])))
    a = np.asfortranarray(a[::-1,:]).view(a_dt).squeeze(0)
    v = np.asfortranarray(v[::-1,:]).view(v_dt).squeeze(0)
    return a.searchsorted(v, side, sorter).ravel()


a = np.random.randint(100, size=(2,10000)) # Values to sort, rows in increasing priority
v = np.random.randint(100, size=(2,10000)) # Values to search for, rows in increasing priority

sorted_idx = np.lexsort(a)
a_sorted = a[:,sorted_idx]

And the timing results (in iPython):

# 2 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 33.4 ms per loop

%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14 ms per loop

# 10 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 103 ms per loop

%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14.7 ms per loop

Overall the structured array approach is faster, and can be made even faster if you design it to work with the flipped and transposed versions of a and v. It gets even faster as the numbers of rows/keys goes up, barely slowing down when going from 2 rows to 10 rows.

I did not notice any significant timing difference between using a_sorted or a and sorter=sorted_idx so I left those out for clarity.

I believe that a really fast method could be made using Cython, but this is as fast as it is going to get with pure pure Python and numpy.

继续阅读：numpy python search

Is there lexographical version of searchsorted in numpy?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？