How to efficiently generate random subsets of rows from a matrix

2022-12-12 02:51

I have a large matrix M implemented as vector<vector<double> > with m rows, i.e. the matrix is a vector of m vectors, each holding n column elements.

I have to create two subsets of the rows of this matrix, i.e. A holds k rows, and B the other m-k rows. The rows must be selected at random.

I do not want to use any libraries other than STL, so no boost either.

Two approaches that I considered are:

Run std::random_shuffle on a vector of row indices, then copy the rows named by the first k indices to A and the rows named by the remaining m-k indices to B.
Run std::random_shuffle on M itself, then copy the first k rows to A and the remaining m-k rows to B.

Are there other options, and how do the two options above compare in terms of memory consumption and processing time?

Thanks!

If you don't need B to be in random order, then random_shuffle does more work than you need.

If by "STL" you mean SGI's STL, then use random_sample.

If by "STL" you mean the C++ standard libraries, then you don't have random_sample. You might want to copy the implementation of random_shuffle, except stop after the first k steps, where k is the number of rows you want in A. This will reduce the time.

Note that these both modify a sequence in place. Depending where you actually want A and B to end up, and who owns the original, this might mean that you end up doing 2 copies of each row - once to get it into a mutable container for the shuffle, then again to get it into its final destination. This is more memory and processing time than is required. To fix this you could maybe swap rows out of the temporary container, and into A and B. Or copy the algorithm, but adapt it to:

Make a list of the indexes of the first vector
Partially shuffle the list of indexes
Copy the rows corresponding to the first k indexes to A (k being the number of rows wanted in A), and the rest to B.
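
The three steps above can be sketched roughly as follows. This is only an illustration under some assumptions: split_rows and the Matrix typedef are names made up for the example, and it draws random numbers with C++11 <random> instead of random_shuffle (which was deprecated in C++14 and removed in C++17). Only the first k positions of the index list are randomised, so B comes out in no particular random order.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// Hypothetical helper (not from the thread): copies k randomly chosen rows
// of M into A and the remaining m-k rows into B. Each row is copied once.
template <class Rng>
void split_rows(const Matrix& M, std::size_t k, Rng& rng, Matrix& A, Matrix& B) {
    const std::size_t m = M.size();
    assert(k <= m);

    // Step 1: list of row indexes 0, 1, ..., m-1.
    std::vector<std::size_t> idx(m);
    std::iota(idx.begin(), idx.end(), std::size_t(0));

    // Step 2: partial Fisher-Yates shuffle. After iteration i, idx[0..i]
    // holds a uniformly random selection; we stop after k steps.
    for (std::size_t i = 0; i < k; ++i) {
        std::uniform_int_distribution<std::size_t> pick(i, m - 1);
        std::swap(idx[i], idx[pick(rng)]);
    }

    // Step 3: first k indexes go to A, the rest to B.
    A.clear();
    B.clear();
    A.reserve(k);
    B.reserve(m - k);
    for (std::size_t i = 0; i < k; ++i) A.push_back(M[idx[i]]);
    for (std::size_t i = k; i < m; ++i) B.push_back(M[idx[i]]);
}
```

A caller would construct a generator once, e.g. std::mt19937 rng(seed), and pass it in. The extra memory is one vector of m indexes, which is small next to the rows themselves.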

I'm not certain this is faster or uses less memory, but I suspect so.

The standard for random_shuffle says that it performs "swaps". I hope that means it's efficient for vectors, but you might want to check that it is actually using an optimised swap, not doing any copying. I think it should mean that, especially since the natural implementation is as Fisher-Yates, but I'm not sure whether the language in the standard should be taken to guarantee it. If it is copying, then your second approach is going to be very slow. If it's using swap then they're roughly comparable. swap on a vector is going to be slightly slower than swap on an index, but there's not a whole lot in it. Swapping either a vector or an index is very quick compared with copying a row, and there are m of each operation (one per row), so I doubt it will make a huge difference to total run time.
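
One rough way to check this on a given implementation is to see whether each row's heap buffer travels with its contents during the shuffle. The sketch below is an assumption-laden illustration: shuffle_keeps_row_buffers is a made-up name, and it tests std::shuffle from C++11 rather than random_shuffle, since the latter is gone from C++17. If rows were copied element by element, the contents would move but the buffers would stay put, and the function would return false.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

// Hypothetical check (not from the thread): shuffles an m-by-n matrix and
// reports whether every row still owns the buffer it started with, which is
// what happens when rows are swapped or moved instead of copied.
bool shuffle_keeps_row_buffers(std::size_t m, std::size_t n, unsigned seed) {
    assert(n >= 1);  // each row needs one element to hold its tag
    Matrix M(m, std::vector<double>(n, 0.0));

    // Tag each row with its original index and remember its buffer address.
    std::vector<const double*> buffer_of(m);
    for (std::size_t i = 0; i < m; ++i) {
        M[i][0] = static_cast<double>(i);
        buffer_of[i] = M[i].data();
    }

    std::mt19937 rng(seed);
    std::shuffle(M.begin(), M.end(), rng);

    // The row now at position j was originally row M[j][0]; its buffer
    // should be the one recorded for that original row.
    for (std::size_t j = 0; j < m; ++j) {
        const std::size_t original = static_cast<std::size_t>(M[j][0]);
        if (M[j].data() != buffer_of[original]) return false;
    }
    return true;
}
```

A true result only says that this particular library shuffles without copying row contents; it is an observation about one implementation, not a statement of what the standard guarantees.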

[Edit: Alex Martelli was complaining recently about misuse of the term "STL" to mean the C++ standard libraries. In this case it does make a difference :-)]

I think that the random_shuffle of indices makes sense.

If you need to avoid the overhead of copying the individual rows, and don't mind sharing data, you might be able to make the A and B matrices be vectors of pointers to rows in the original matrix.
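
A minimal sketch of that idea, assuming the original matrix outlives A and B and is not resized while they are in use (the pointers would dangle otherwise). RowView and split_row_views are names invented for this example, and the selection uses a partial Fisher-Yates shuffle with C++11 <random>.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;
typedef std::vector<const std::vector<double>*> RowView;  // hypothetical name

// Hypothetical helper: A receives pointers to k randomly chosen rows of M,
// B receives pointers to the other m-k rows. No row data is copied.
template <class Rng>
void split_row_views(const Matrix& M, std::size_t k, Rng& rng,
                     RowView& A, RowView& B) {
    const std::size_t m = M.size();
    assert(k <= m);

    // Start with one pointer per row, in original order.
    RowView rows(m);
    for (std::size_t i = 0; i < m; ++i) rows[i] = &M[i];

    // Randomise only the first k positions.
    for (std::size_t i = 0; i < k; ++i) {
        std::uniform_int_distribution<std::size_t> pick(i, m - 1);
        std::swap(rows[i], rows[pick(rng)]);
    }

    const std::ptrdiff_t split = static_cast<std::ptrdiff_t>(k);
    A.assign(rows.begin(), rows.begin() + split);
    B.assign(rows.begin() + split, rows.end());
}
```

Element (i, j) of A is then read as (*A[i])[j]. The cost is one pointer per row instead of n doubles per row, at the price of A and B depending on the lifetime of M.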

Easiest way: use a random integer generator, and queue up the offsets of the chosen rows in a separate container (assuming that a row is at the same offset in each column vector). Which container you use will depend mostly on its eventual use. (Remember to keep the offsets within the range of size_t, and to tie the offset container's lifetime to that of the matrix itself.)

Edit: replaced pointers with offsets - makes more sense and is safer.

Orig: Quick Q: is each (inner) vector a row or a column?

i.e. is M a vector of columns or a vector of rows?
