开发者

R data frame and which.min() equivalent in c++

开发者_Python百科

I'm translating R code to c++ and I'd like to find an equivalent (optimal) structure which would allow the same kind of operations than a data frame, but in c++.

The operations are :

  • add elements (rows)
  • remove elements (rows) from index
  • get the index of the lowest value

e.g. :

a <- data.frame(i = c(4, 9, 3, 1, 8, 2, 7, 10, 6, 6), 
                j = c(8, 8, 8, 4, 3, 9, 1, 4,  8, 9) , 
                v = c(1.9, 18, 1.3, 17, 1.5, 14, 11, 1.4, 18, 2.0), 
                o = c(3, 3, 3, 3, 1, 2, 1, 2, 3, 3))

a[which.min(a$v), c('i', 'j')] # find lowest v value and get i,j value
a <- a[-which.min(a$v)] # remove row from index
a <- cbind(a, data.frame(i = 3, j = 9, v = 2, o = 2)) # add a row

As I'm using Rcpp, Rcpp::DataFrame might be an option (I don't know how I would which.min it however), but I guess it's quite slow for the task as these operations need to be repeated a lot and I don't need to ship it back to R.

EDIT:

Target. Just to make it clear the goal here is to gain speed. It is the obvious reason why one would translate code from R to C++ (there might be others, that's why I clarify). However, maintenance and easy implementation comes second.

More precision on the operations. The algorithm is: add lots of data to the array (multiple lines), then extract the lowest value and delete it. Repeat. That's why I wouldn't go for a sorted vector, but instead always search the lowest data on demand as the array is updated (addition) frequently. I think it's faster, but maybe I'm wrong.


I think a vector of vectors should do what you want. You would need to implement the min-finding manually (two nested loops), which is the fastest you can do without adding overhead. You can speed up the min-finding by keeping track of the position of the smallest element in each row along with the row.


This question is a bit stale, but I thought I would offer some general observations pertaining to this kind of task.

If you are keeping the collection of rows in an ordered state, which might be an assumption of your which.min strategy, the most difficult operation to support efficiently is row insert, if this is a common operation. You'd be hard pressed not to use a list<> data structure, with the likely consequence that which.min turns into a linear operation, since lists aren't great for bisection search.

If you keep an unordered collection, you can deal with deletes by copying records off the end of the frame to the row vacated by the deletion, and subtracting 1 from your row count. Alternatively, you can just flag deletions with another vector of bool, until the delete count hits a threshold such as sqrt(N), at which time you perform a coalescence copy. You'll come out better than amortized O(N^2) for insert/delete, but which.min will be a linear search through the entire vector each time.

The normal thing to do when you need to identify the min/max element as a common operation is to employ a priority queue of some kind, sometimes duplicating the data for the column indexed. Off the top of my head, it would be tricky to synchronize a priority queue over one data column with rows of a data frame that are moving around as a result of delete operations in the non-list implementation.

If the rows are merely marked as deleted, the priority queue would stay in sync (though you would have to discard elements popped off the queue corresponding to subsequently deleted rows until you get a good one); after the coalescence copy, you would re-index the priority queue, which is pretty fast if you're not doing it too often. Usually if you had enough memory to grow the structure to a large size, you're not overly pressed to give the memory back again if the structure shrinks; it's not obvious you ever need to coalesce if your structure tends to persist at a size anywhere near the high water mark, but beware of the case where your priority queue has both expired and fresh references to the same storage row, because you wrote new data to a row previously deleted. For efficiency, sometimes you end up using an auxiliary list to keep track of rows marked deleted so you can find storage for inserted rows in less than linear time.

It can be hard to extract stale items from the bowels of a priority queue, since these tend to be designed only for removal at the top of the queue; often you have to leave the stale items in there and arrange to ignore them if they surface at some later time.

When you get into C++ with performance objectives, there are many ways to skin the cat, and you need to be far more precise about performance trade-offs than what the original R code expressed to obtain good execution time for all required operations.


A data.frame is really just a list of vectors. In C++, we really only have those lists of vectors which makes adding rows hard.

Idem for removing rows -- as Rcpp works on the original R representation you always need copy all remaining values.

As for the which.min() equivalent: I think that come up once on the list and you can do something simple with STL idioms. I don't recall us having that in the API.


An R data frame in C++ terms is a container of objects (An R matrix might be a vector of vectors, but if you care about efficiency you are unlikely to implement it that way.)

So, represent the data frame with this class:

class A{
public:
  int i,j,o;
  double v;
public:
  A(int i_,int j_,int v_,int o_):i(i_),j(j_),v(v_),o(o_){}
}

And prepare this algorithm parameter function to help find the minimum:

bool comp(A &x,A &y){
return x.v<y.v;
}

(In production code I'd more likely use a functor object (see Meyer's Effective STL, item 46), or boost lambda or, best of all, C++0x's lambdas)

And then this code body:

std::vector<A> a;
a.push_back(A(4,8,1.9,3));
a.push_back(A(9,8,18,3));
a.push_back(A(1,4,1.3,3));
//...

std::vector<A>::iterator lowest=std::min_element(a.begin(),a.end(),comp);
std::cout<< lowest->i << ',' << lowest->j <<"\n";

a.erase(lowest);

a.push_back( A(3,9,2,0) );

Depending on what you are really doing, it may be more efficient to sort a first, greatest first. Then if you wish to erase the lowest item(s) you simply truncate the vector. If you are actually deleting all over the place, and which.min() was just for the sake of example, you may find a linked list more efficient.


No, A data.frame is a bit more complex than a vector of vector.

I might say the simpliest design for speed in all case is to store each columns in a typed vector and create a list as header for rows. Then on top of it create an intrusive list.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜