How to efficiently generate random subsets of rows from a matrix
I have a large matrix M implemented as vector<vector<double> > with m rows, i.e. the matrix is a vector of m vectors of n column elements.
I have to create two subsets of the rows of this matrix, i.e. A holds k rows, and B the other m-k rows. The rows must be selected at random.
I do not want to use any libraries other than STL, so no Boost either.
Two approaches that I considered are:
- apply std::random_shuffle to a vector of row indices, then copy the rows indicated by the first k indices to A and the rows indicated by the other m-k indices to B
- apply std::random_shuffle to M itself, then copy the first k rows to A and the remaining m-k rows to B
Are there other options, and how do the two options above compare in terms of memory consumption and processing time?
Thanks!
If you don't need B to be in random order, then random_shuffle does more work than you need.
If by "STL" you mean SGI's STL, then use random_sample.
If by "STL" you mean the C++ standard libraries, then you don't have random_sample. You might want to copy the implementation, except stop after the first k steps, where k is the number of rows you want in A. This will reduce the time.
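A rough sketch of the "stop early" idea, assuming a C++11 compiler for <random>. The name partial_shuffle_rows and the Matrix typedef are made up for illustration; this is not a standard algorithm, just a Fisher-Yates shuffle cut off after k steps:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// Hypothetical helper: after the call, rows [0, k) of M are a uniform random
// sample of the original rows and rows [k, m) hold the remaining rows.
// Only k swaps are performed, instead of the m - 1 of a full shuffle.
template <class Rng>
void partial_shuffle_rows(Matrix& M, std::size_t k, Rng& rng)
{
    const std::size_t m = M.size();
    assert(k <= m);
    for (std::size_t i = 0; i < k; ++i) {
        // Pick j uniformly from [i, m - 1] and bring row j into slot i.
        std::uniform_int_distribution<std::size_t> pick(i, m - 1);
        const std::size_t j = pick(rng);
        M[i].swap(M[j]);  // exchanges the rows' buffers; no elements are copied
    }
}
```

Called as partial_shuffle_rows(M, k, rng) with, say, a std::mt19937 rng, it modifies M in place, so it only fits if you own M or are working on a copy.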
Note that these both modify a sequence in place. Depending where you actually want A and B to end up, and who owns the original, this might mean that you end up doing 2 copies of each row: once to get it into a mutable container for the shuffle, then again to get it into its final destination. This is more memory and processing time than is required. To fix this you could maybe swap rows out of the temporary container, and into A and B. Or copy the algorithm, but adapt it to:
- Make a list of the row indexes of the original matrix
- Partially shuffle the list of indexes
- Copy the rows corresponding to the first k indexes to A, and the rest to B.
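The three steps above could look roughly like this. It is only a sketch, assuming C++11 for <random>; split_rows and the Matrix typedef are names invented here, and k is the size of A as in the question:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// Hypothetical helper: copies k randomly chosen rows of M into A and the
// remaining m - k rows into B. M itself is left untouched.
template <class Rng>
void split_rows(const Matrix& M, std::size_t k, Matrix& A, Matrix& B, Rng& rng)
{
    const std::size_t m = M.size();
    assert(k <= m);

    // Step 1: list of row indexes 0, 1, ..., m - 1.
    std::vector<std::size_t> idx(m);
    for (std::size_t i = 0; i < m; ++i) idx[i] = i;

    // Step 2: partial Fisher-Yates shuffle, touching only the first k slots.
    for (std::size_t i = 0; i < k; ++i) {
        std::uniform_int_distribution<std::size_t> pick(i, m - 1);
        std::swap(idx[i], idx[pick(rng)]);
    }

    // Step 3: each row is copied exactly once, straight into its destination.
    A.clear();
    B.clear();
    A.reserve(k);
    B.reserve(m - k);
    for (std::size_t i = 0; i < k; ++i) A.push_back(M[idx[i]]);
    for (std::size_t i = k; i < m; ++i) B.push_back(M[idx[i]]);
}
```

The extra memory is one size_t per row for the index list, and since M is taken by const reference there is no intermediate mutable copy of the matrix.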
I'm not certain this is faster or uses less memory, but I suspect so.
The standard for random_shuffle says that it performs "swaps". I hope that means it's efficient for vectors, but you might want to check that it is actually using an optimised swap, not doing any copying. I think it should mean that, especially since the natural implementation is as Fisher-Yates, but I'm not sure whether the language in the standard should be taken to guarantee it. If it is copying, then your second approach is going to be very slow. If it's using swap then they're roughly comparable. swap on a vector is going to be slightly slower than swap on an index, but there's not a whole lot in it. Swapping either a vector or an index is very quick compared with copying a row, and there are about m of each operation, so I doubt it will make a huge difference to total run time.
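One way to check the swap-versus-copy question on your own implementation is to see whether each row's heap buffer travels with its contents. The sketch below uses std::shuffle as a stand-in, since std::random_shuffle was deprecated in C++14 and removed in C++17; shuffle_swaps_buffers is a made-up name, and the function builds its own test matrix:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// Hypothetical check: returns true if shuffling moved every row's buffer
// together with its contents, i.e. rows were swapped rather than copied.
template <class Rng>
bool shuffle_swaps_buffers(std::size_t m, std::size_t n, Rng& rng)
{
    assert(n > 0);

    // Row i is filled with the value i so it can be recognised afterwards.
    Matrix M;
    for (std::size_t i = 0; i < m; ++i)
        M.push_back(std::vector<double>(n, static_cast<double>(i)));

    // Remember where each row's elements live before shuffling.
    std::vector<const double*> before(m);
    for (std::size_t i = 0; i < m; ++i) before[i] = M[i].data();

    std::shuffle(M.begin(), M.end(), rng);

    // If rows were swapped, the row tagged i still owns the buffer it had.
    for (std::size_t r = 0; r < m; ++r) {
        const std::size_t tag = static_cast<std::size_t>(M[r][0]);
        if (M[r].data() != before[tag]) return false;
    }
    return true;
}
```

If this returns false, rows are being copied element by element and shuffling M directly will cost far more than shuffling indexes.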
[Edit: Alex Martelli was complaining recently about misuse of the term "STL" to mean the C++ standard libraries. In this case it does make a difference :-)]
I think that the random_shuffle of indices makes sense.
If you need to avoid the overhead of copying the individual rows, and don't mind sharing data, you might be able to make the A and B matrices be vectors of pointers to rows in the original matrix.
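A minimal sketch of the pointer variant, assuming C++11 for std::shuffle and <random>. Row, RowView and split_row_pointers are names invented for this example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

typedef std::vector<double> Row;
typedef std::vector<Row> Matrix;
typedef std::vector<const Row*> RowView;  // hypothetical name: non-owning rows

// Hypothetical helper: A receives pointers to k randomly chosen rows of M,
// B receives pointers to the other m - k rows. No row data is copied.
template <class Rng>
void split_row_pointers(const Matrix& M, std::size_t k,
                        RowView& A, RowView& B, Rng& rng)
{
    assert(k <= M.size());

    RowView all;
    all.reserve(M.size());
    for (std::size_t i = 0; i < M.size(); ++i) all.push_back(&M[i]);

    // Shuffling pointers is cheap: each swap moves one pointer.
    std::shuffle(all.begin(), all.end(), rng);

    const RowView::difference_type split =
        static_cast<RowView::difference_type>(k);
    A.assign(all.begin(), all.begin() + split);
    B.assign(all.begin() + split, all.end());
}
```

Element j of the i-th row of A is then (*A[i])[j]. The pointers stay valid only while M is alive and its outer vector is not resized, which is the price of sharing the data.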
Easiest way: use a random whole number generator, and queue up the offsets of each row in a separate container (assuming that a row is at the same offset in each column vector). Which container you use will depend mostly on its eventual use. (Remember to stay within the size_t limit, and to tie the offset container's lifetime to that of the matrix itself.)
Edit: replaced pointers with offsets - makes more sense and is safer.
Orig: Quick Q: is each (inner) vector a row or a column? I.e. is M a vector of columns or a vector of rows?