
How to select rows from matrix with unique entry in specific column?

I tried to solve this in a functional way, but without much success.

Suppose there is a list of lists, and we want to pick only those sublists that have a unique entry at a specific position.

For example, given a matrix, we want to select only the rows whose elements in the first column are unique.

Here is an example:

INPUT:

list = {{1, 2}, {1, 3}, {4, 5}}

I'd like the output to be

list = {{1, 2}, {4, 5}}

It does not matter which duplicate row is removed; keeping the first one is fine, but any choice works.

I tried Select, DeleteCases, DeleteDuplicates, Union, and a few other things, but can't get it to work. I do not know how to tell Mathematica to look only for a 'unique' element. Union comes close, but it operates on the complete sublists, not on a single column. That is, I do not know what to write for the criterion, as in

DeleteDuplicates[list, <now what?> ]

For reference, this is how I do the above in Matlab:

EDU>> A=[1 2;1 3;4 5]

A =
     1     2
     1     3
     4     5

EDU>> [B,I,J]=unique(A(:,1));
EDU>> A(I,:)

ans =
     1     3
     4     5

Thanks.


Here is one way:

DeleteDuplicates[list, First@#1 === First@#2 &]
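On the sample list from the question, this keeps the first row for each first-column value:

    list = {{1, 2}, {1, 3}, {4, 5}};
    DeleteDuplicates[list, First@#1 === First@#2 &]

    (* {{1, 2}, {4, 5}} *)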

EDIT

Note that the timings and discussion below are based on M7

Upon reflecting a bit, I found a solution which will be at least an order of magnitude faster for large lists, and sometimes two orders of magnitude faster, for this particular case (probably a better way to put it is that the solution below has a different computational complexity):

Clear[delDupBy];
delDupBy[nested_List, n_Integer] :=
  Module[{parts = nested[[All, n]], ord, unpos},
    ord = Ordering[parts];
    unpos = Most@Accumulate@Prepend[Map[Length, Split@parts[[ord]]], 1];
    nested[[Sort@ord[[unpos]]]]];
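A quick check on the small list from the question. To see why it works: Ordering sorts the n-th column, Split groups equal entries in the sorted column, and Accumulate of the group lengths recovers the positions of the first occurrences:

    delDupBy[{{1, 2}, {1, 3}, {4, 5}}, 1]

    (* {{1, 2}, {4, 5}} *)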

Benchmarks:

In[406]:= 
largeList = RandomInteger[{1,15},{50000,2}];

In[407]:= delDupBy[largeList,1]//Timing
Out[407]= {0.016,{{13,4},{12,1},{1,6},{6,13},{10,12},{7,15},{8,14},
            {14,4},{4,1},{11,9},{5,11},{15,4},{2,7},{3,2},{9,12}}}

In[408]:= DeleteDuplicates[largeList,First@#1===First@#2&]//Timing
Out[408]= {1.265,{{13,4},{12,1},{1,6},{6,13},{10,12},{7,15},{8,14},{14,4},
      {4,1},{11,9},{5,11},{15,4},{2,7},{3,2},{9,12}}}

This is particularly remarkable because DeleteDuplicates is a built-in function. My blind guess is that DeleteDuplicates with a user-defined test uses a quadratic-time pairwise comparison algorithm, while delDupBy is O(n log n) in the size of the list.

I think this is an important lesson: one should pay attention to the performance of built-in functions such as Union, Sort, DeleteDuplicates etc. when supplying custom tests. I discussed this more extensively in a MathGroup thread, where there are also other insightful replies.

Finally, let me mention that exactly this question has been asked (with an emphasis on efficiency) before. I will reproduce here a solution I gave for the case when the first (or, generally, n-th) elements are positive integers (generalizing to arbitrary integers is straightforward):

Clear[sparseArrayElements];
sparseArrayElements[HoldPattern[SparseArray[u___]]] := {u}[[4, 3]]

Clear[deleteDuplicatesBy];
Options[deleteDuplicatesBy] = {Ordered -> True, Threshold -> 1000000};
deleteDuplicatesBy[data_List, n_Integer, opts___?OptionQ] := 
  Module[{fdata = data[[All, n]], parr, 
  rlen = Range[Length[data], 1, -1], 
  preserveOrder =  Ordered /. Flatten[{opts}] /. Options[deleteDuplicatesBy], 
  threshold =  Threshold /. Flatten[{opts}] /. Options[deleteDuplicatesBy], dim},
  dim = Max[fdata];
  parr = If[dim < threshold, Table[0, {dim}], SparseArray[{}, dim, 0]];
  parr[[fdata[[rlen]]]] = rlen;
  parr = sparseArrayElements@If[dim < threshold, SparseArray@parr, parr];
  data[[If[preserveOrder, Sort@parr, parr]]]
];

The way this works is to use the first (or, generally, n-th) elements as positions into a large preallocated table, exploiting the fact that they are positive integers. This can give crazy performance in some cases. Observe:
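On the small example from the question, the result matches the other methods. Note how the reversed range rlen makes the earliest duplicate row win the repeated assignment into parr:

    deleteDuplicatesBy[{{1, 2}, {1, 3}, {4, 5}}, 1]

    (* {{1, 2}, {4, 5}} *)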

In[423]:= hugeList = RandomInteger[{1,1000},{500000,2}];

In[424]:= delDupBy[hugeList,1]//Short//Timing
Out[424]= {0.219,{{153,549},{887,328},{731,825},<<994>>,{986,150},{92,581},{988,147}}}

In[430]:= deleteDuplicatesBy[hugeList,1]//Short//Timing
Out[430]= {0.032,{{153,549},{887,328},{731,825},<<994>>,{986,150},{92,581},{988,147}}}


Leonid provided a long and thorough answer, as he often does. However, I believe it is worth pointing out that an efficient and concise solution may be had with:

First /@ GatherBy[hugeList, #[[1]] &]

where 1 is the column index to compare.

On my system this is faster than delDupBy but not as fast as deleteDuplicatesBy.
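For the first column specifically, First itself can serve as the gathering function; on the question's example:

    list = {{1, 2}, {1, 3}, {4, 5}};
    First /@ GatherBy[list, First]

    (* {{1, 2}, {4, 5}} *)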

