What's the fastest way to generate a random sequence from a list of data?
Let's say that I have a list of data: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} where n = 10 elements
I'd like to randomly choose k elements of this set to form a sublist, say k = 5.
In t开发者_StackOverflow社区hat case, I could end up with a sublist that looks like {9, 3, 5, 2, 7}
I could accomplish this by:
- Randomly determining an offset within the list, between 0 and the current size of the list minus 1
- Appending that element to my sublist
- Erasing that element from the original list
- Repeat until the desired size is found
The problem with this is that as the original list grows the offset and deletion time grows as well, and for any significantly large list (say over 1,000,000 elements), it takes quite a long time to perform this algorithm.
Is there a faster way to generate a random sequence from a list of given data? The implementation of the random number generator should be set aside for this problem, instead, focusing on how the RNG result is used in a proposed algorithm.
Any thoughts?
Right now I'm using the C++ STL list
I would use random_shuffle
. You can change the generator by supplying a third parameter.
It requires random access iterators, so you can either switch to a std::vector
(which is generally far superior and preferred over std::list
, arguably the worse container), or just operate on some array. I'll demonstrate both:
int data[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
std::random_shuffle(data, data + 10);
// or
std::vector data; // populate it
std::random_shuffle(data.begin(), data.end());
Now everything is in random order, just treat the fist k
elements as your subset:
// now treat data[0] through data[k] as your random subset, or:
std::vector subset(data, data + k);
// or
data.resize(k); // shrink vector
Note that in another question, Jerry shares an excellent way of doing what you want.
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm
Look under Examples > Modern method
You don't need to shuffle your entire list. O(k) (better than O(n))
A minimal example using OutputIterators and std::random_shuffle
. Notice that the algorithm will modify your original input, so it could be reasonable to make a copy before you call the function.
#include <iostream>
#include <algorithm>
#include <vector>
#include <iterator>
template<class It, class OutIt>
void take_random_n(It begin, It end, OutIt out, size_t n) {
std::random_shuffle(begin, end);
It end2 = begin;
std::advance(end2, n);
std::copy(begin, end2, out);
}
int main() {
std::vector<int> a;
int b[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
take_random_n(b, b + 10, std::back_inserter(a), 4);
for(std::vector<int>::iterator it = a.begin(); it != a.end(); ++it)
std::cout << *it << " ";
}
Or you could accomplish this by:
- Randomly determining an offset within the list, between 0 and the current size of the list.
- Appending that element to your sublist.
- Repeat until the sublist is probably long enough to contain the right number of elements. For example, if you are choosing 10 out of 1,000,000 elements a sublist of 10 is probably long enough. You don't need to be hyper-accurate in your calculation of what number of extra elements you have to choose
- Now check that all elements in the sublist are different. If not, delete the duplicates. If your sublist is now too short choose some more from the main list. If not, you're done.
I'm not sure why you want to delete the chosen elements from the main list, but if that is essential you could do it after constructing the sublist.
And I haven't a clue about how the performance of this approach will rate against the performance of the of the suggested random_shuffle of a list of 10^6 elements.
Shuffle the list, then take the first (or last) k elements. If you use a O(n) algorithm like the Fisher-Yates shuffle, then the whole process is O(n).
You could shuffle it with std::random_shuffle and then just copy the first however many elements you want into a new list.
Shuffle your array using some algorithm Then you can peek random elements from the beginning of array.
Assign a random number to each entry in your list, then sort the list by random number. Pick off the first n entries you want.
Most answers propose to shuffle the initial container. If you don't want it to be modified, you can still use this approach, but you first need to copy the container. The solution of @pmr (which is nice because he makes it into a function) would then become:
template <typename InputIterator, typename Size, typename OutputIterator>
void take_random_n(InputIterator first, InputIterator last,
Size n, OutputIterator result)
{
typedef typename std::iterator_traits<InputIterator>::value_type value_type;
std::vector<value_type> shufflingVec(first, last);
std::random_shuffle(shufflingVec.begin(), shufflingVec.end());
std::copy(shufflingVec.begin(), shufflingVec.begin() + n, result);
}
However, copying the entire container can be quite expensive if the elements contained are heavy and take some time to copy. In this case, you can be better off shuffling a list of indexes:
template <typename InputIterator, typename Size, typename OutputIterator>
void take_random_n(InputIterator first, InputIterator last,
Size n, OutputIterator result)
{
typedef typename
std::iterator_traits<InputIterator>::value_type value_type;
typedef typename
std::iterator_traits<InputIterator>::difference_type difference_type;
difference_type size = std::distance(first, last);
std::vector<value_type> indexesVec(
boost::counting_iterator<size_t>(0),
boost::counting_iterator<size_t>(size));
// counting_iterator generates incrementing numbers. Easy to implement if you
// can't use Boost
std::random_shuffle(indexesVec.begin(), indexesVec.end());
for (Size i = 0 ; i < n ; ++i)
{
*result++ = *std::advance(first, indexesVec[i]);
}
}
// Disclaimer: I have not tested the code above!
You'll notice that the latter solution will perform very differently depending on the kind of iterators you use: with random access iterators (like pointers or vector<T>::iterator
), it will be ok, but with other types of iterators, the use of std::distance
and the numerous calls to std::advance
can induce quite an overhead.
My 2 cents (using stl only & needing at most forward iterators):
//-----------------------------------------------------------------------------
#include <cstdlib>
//-----------------------------------------------------------------------------
#include <iostream>
#include <list>
#include <iterator>
#include <algorithm>
//-----------------------------------------------------------------------------
// random generator
template< typename DiffType >
struct RandomlyRandom{
DiffType operator()( DiffType i ){
return std::rand() % i;
}
};
//-----------------------------------------------------------------------------
// we'll have two iterators:
// - the first starts at the begining of the range
// and moves one element at a time for n times
// - the second starts at random in the middle of the range
// and will move a random number of elements inside the range
//
// then we swap their values
template< typename FwdIter, typename Fn >
void random_shuffle_n( FwdIter begin, FwdIter end, Fn& Func, size_t n ){
typedef typename std::iterator_traits<FwdIter>::difference_type difference_type;
FwdIter first = begin;
FwdIter second = begin;
difference_type dist = std::distance( begin, end );
difference_type offset = Func( dist ) % dist;
difference_type index = offset;
std::advance( second, offset ); // try to put some distance between first & second
do{
offset = Func( dist ) % dist;
index += offset;
if( index >= dist ){
second = begin;
index = offset = index % dist;
}
std::advance( second, offset );
std::swap( *first++, *second );
}while( n-- > 0 );
}
//-----------------------------------------------------------------------------
int main( int argc, char* argv[] ){
int arr[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
std::list< int > lst( arr, arr + sizeof( arr ) / sizeof( arr[ 0 ] ) );
std::copy( lst.begin(), lst.end(), std::ostream_iterator< int >( std::cout, " " ) );
std::cout << std::endl;
RandomlyRandom< std::list< int >::difference_type > rand;
for( int i = 0; i < 100; i++ ){
random_shuffle_n( lst.begin(), lst.end(), rand, 5 );
std::copy( lst.begin(), lst.end(), std::ostream_iterator< int >( std::cout, " " ) );
std::cout << std::endl;
}
return 0;
}
//-----------------------------------------------------------------------------
精彩评论