Longest string in numpy object_ array

I'm using a numpy object_ array to store variable-length strings, e.g.

a = np.array(['hello','world','!'],dtype=np.object_)

Is there an easy way to find the length of the longest string in the array without looping over all elements?


max(a, key=len) gives you the longest string (and len(max(a, key=len)) gives you its length) without requiring you to code an explicit loop, but of course max will do its own looping internally, as it couldn't possibly identify "the longest string" in any other way.
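For reference, here is a minimal sketch of that idea, using the array from the question. The names `longest`, `longest_len`, `longest_len_direct` and `lengths` are only illustrative. Every variant still visits each element once; the loop is just hidden.

```python
import numpy as np

a = np.array(['hello', 'world', '!'], dtype=np.object_)

# Longest string itself, then its length
longest = max(a, key=len)
longest_len = len(longest)

# If only the length is needed, take the max over the lengths directly
longest_len_direct = max(map(len, a))

# np.vectorize hides the loop, but still calls len() once per element
lengths = np.vectorize(len, otypes=[int])(a)
print(longest, longest_len, longest_len_direct, lengths.max())
```

When several strings tie for the longest, max returns the first one it encounters.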


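If you store the strings in a numpy array of dtype object, then you can't get at the sizes of the objects (strings) without looping. However, if you let np.array decide the dtype, it picks a fixed-width string dtype wide enough for the longest element, so you can find the length of the longest string by peeking at the dtype. (The session below shows a byte-string dtype, where itemsize equals the character count; for a Unicode dtype such as '<U21', itemsize is 4 bytes per character.)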

In [64]: a = np.array(['hello','world','!','Oooh gaaah booo gaah?'])

In [65]: a.dtype
Out[65]: dtype('|S21')

In [72]: a.dtype.itemsize
Out[72]: 21
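On Python 3 the same strings produce a Unicode dtype, so the raw itemsize has to be divided by the per-character size. The following is a sketch under that assumption, starting from an object_ array as in the question and converting it with astype(str). The names `a_obj`, `a_fixed`, `char_size` and `max_len` are illustrative.

```python
import numpy as np

a_obj = np.array(['hello', 'world', '!', 'Oooh gaaah booo gaah?'], dtype=np.object_)

# astype(str) builds a fixed-width Unicode array sized for the longest element
a_fixed = a_obj.astype(str)

# Bytes per character for this dtype kind (4 for 'U', 1 for 'S')
char_size = np.dtype(a_fixed.dtype.char + '1').itemsize
max_len = a_fixed.dtype.itemsize // char_size
print(a_fixed.dtype, max_len)
```

Note that the conversion itself has to scan and copy every element, so this does not avoid the loop; it only moves it into NumPy.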


No, because the length of each string is stored only in the string itself. So you have to ask every string what its length is.


Say I want to get the longest string in the second column:

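import numpy as np

data_array = np.array([
    ['BFNN', 'Forested bog without permafrost or patterning, no internal lawns'],
    ['BONS', 'Nonpatterned, open, shrub-dominated bog'],
])


def get_max_len_column_value(data_array, column):
    # Index the column as 1-D so that len() measures each string,
    # not a one-element row (which always has length 1)
    return len(max(data_array[:, column], key=len))

get_max_len_column_value(data_array, 1)

>>> 64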


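I would also like to mention a C-like method, which reads the fixed item width from the dtype (so it applies to fixed-width 'S' and 'U' string arrays, not to object_ arrays):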

int(string_array.dtype.itemsize/np.dtype(string_array.dtype.char+'1').itemsize)

It seems to be more efficient than the accepted answer:

codes_len = 10000
codes_size = 10000
string_array = np.random.choice(np.array([b'a', b'b']), [codes_size, codes_len])
string_array = np.array([b"".join(string_array[i]).decode('utf-8') for i in range(codes_size)])

%time res = int(string_array.dtype.itemsize/np.dtype(string_array.dtype.char+'1').itemsize)
print('result is:', str(res) + '\n')
>>> CPU times: user 21 µs, sys: 4 µs, total: 25 µs
>>> Wall time: 29.1 µs
>>> result is: 10000

%time res = len(max(string_array, key=len))
print('result is:', res)
>>> CPU times: user 333 ms, sys: 8.21 ms, total: 342 ms
>>> Wall time: 341 ms
>>> result is: 10000
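One caveat worth illustrating: the dtype-based value is the width of the storage slot, which matches the longest string only when NumPy chose the width itself. A small sketch, assuming an array whose width was declared explicitly (the names `padded`, `width` and `actual` are illustrative):

```python
import numpy as np

# Width declared explicitly, wider than any element
padded = np.array(['a', 'bb'], dtype='U10')

# Slot width in characters, taken from the dtype
width = padded.dtype.itemsize // np.dtype(padded.dtype.char + '1').itemsize

# Actual longest length, computed per element by NumPy
actual = int(np.char.str_len(padded).max())
print(width, actual)
```

Here the dtype reports 10 while the longest element has length 2, so the dtype shortcut should be treated as an upper bound in such cases.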