Numpy table - advanced multiple criteria selection
I have a table that goes something like this:
IDs   Timestamp   Values
124   300.6        1.23
124   350.1       -2.4
309   300.6       10.3
 12   123.4        9.00
 18   350.1        2.11
309   350.1        8.3
...
and I'd like to select all the rows that belong to a group of IDs. I know that I can do something like
table[table.IDs == 124]
to select all rows for one ID, and I could do
table[(table.IDs == 124) | (table.IDs == 309)]
to get two IDs' rows. But imagine I have ~100,000 rows with over 1,000 unique IDs (which are distinct from row indices), and I want to select all the rows that match a set of 10 IDs. Intuitively I'd like to do this:
# id_list: a list of 10 IDs
table[ table.IDs in id_list ]
but this fails: the in operator doesn't broadcast over a NumPy array, so it raises an error instead of returning a mask. The only way I can think of is to do the following:
table[ (table.IDs == id_list[0]) |
(table.IDs == id_list[1]) |
(table.IDs == id_list[2]) |
(table.IDs == id_list[3]) |
(table.IDs == id_list[4]) |
(table.IDs == id_list[5]) |
(table.IDs == id_list[6]) |
(table.IDs == id_list[7]) |
(table.IDs == id_list[8]) |
(table.IDs == id_list[9]) ]
which seems very inelegant to me - too much code, and no flexibility for lists of different lengths. Is there a way around my problem, such as a list comprehension or the .any() function? Any help is appreciated.
You can do it like this:
subset = table[np.array([i in id_list for i in table.IDs])]
If you have a reasonably recent version of NumPy, you can use the in1d function to make it more compact:
subset = table[np.in1d(table.IDs, id_list)]
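Here's a minimal, self-contained sketch of the in1d approach. The sample data mirrors the table in the question (the values are made up), using a structured array, so fields are accessed as table["IDs"] rather than the recarray attribute syntax table.IDs:

```python
import numpy as np

# Sample data modeled on the question's table.
table = np.array(
    [(124, 300.6, 1.23),
     (124, 350.1, -2.4),
     (309, 300.6, 10.3),
     ( 12, 123.4, 9.00),
     ( 18, 350.1, 2.11),
     (309, 350.1, 8.3)],
    dtype=[("IDs", "i8"), ("Timestamp", "f8"), ("Values", "f8")],
)

id_list = [124, 309]

# Boolean mask: True for rows whose ID appears anywhere in id_list.
mask = np.in1d(table["IDs"], id_list)
subset = table[mask]
```

This works for an id_list of any length, which removes the need for the chained | expressions.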
See also this question: numpy recarray indexing based on intersection with external array
Here's a solution that will probably profile faster than any Python for loop, though I don't think it will beat in1d. Use it only if you can afford a temporary 2D integer array of shape ids.size by table.IDs.size. Here, ids is id_list converted to a NumPy array.
result = table[~np.all(table.IDs[None]-ids[None].T, 0)]
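To make the broadcasting explicit, here's a small sketch on a bare ID column (table_ids and ids are made-up sample arrays standing in for table.IDs and id_list). Subtracting a (1, n) row from an (m, 1) column produces an (m, n) difference matrix with a zero exactly where a table ID matches one of the wanted ids; np.all over axis 0 is then False in the matching columns:

```python
import numpy as np

# Sample stand-ins for table.IDs and the id array.
table_ids = np.array([124, 124, 309, 12, 18, 309])
ids = np.array([124, 309])

# (2, 6) difference matrix: zero entries mark matches.
diff = table_ids[None] - ids[None].T

# np.all treats nonzero as True, so a column is False
# iff some difference in it is zero, i.e. the ID matched.
mask = ~np.all(diff, axis=0)

# Same mask as the in1d approach.
assert np.array_equal(mask, np.in1d(table_ids, ids))
```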