MySQL: selecting random rows in a novel way, help me evaluate how good it is?
I have again run into the problem of selecting a random subset of rows. As many probably know, ORDER BY RAND() is quite inefficient, or at least that's the consensus. I have read that MySQL generates a random value for every row in the table, filters, orders by these random values, and then returns the set. The biggest performance impact seems to come from the fact that as many random numbers are generated as there are rows in the table. So I was looking for a possibly better way to return a random subset of results for a query like this:
SELECT id FROM <table> WHERE <some conditions> LIMIT 10;
Of course the simplest and easiest way to do what I want would be the one which I am trying to avoid:
SELECT id FROM <table> WHERE <some conditions> ORDER BY RAND() LIMIT 10; (a)
Now, after some thinking, I came up with another option for this task:
SELECT id FROM <table> WHERE (<some conditions>) AND RAND() > x LIMIT 10; (b)
(Of course we can use < instead of >.) Here we take x from the range 0.0 - 1.0. Now I'm not exactly sure how MySQL handles this, but my guess is that it first selects rows matching <some conditions> (using indexes?) and then generates a random value per row and decides whether to return or discard it. But what do I know :) that's why I'm asking here. Some observations about this method:
- First, it does not guarantee that the requested number of rows will be returned even if there are many more matching rows than needed; this is especially true for x values close to 1.0 (or close to 0.0 if we use <).
- The returned rows don't really have random ordering; they are just rows selected randomly, ordered by the index used or by the way they are stored(?). (Of course this might not matter at all in some cases.)
- We probably need to choose x according to the size of the result set: if we have a large result set and x is, let's say, 0.1, it is very likely that only some of the first rows will be returned most of the time; on the other hand, if we have a small result set and choose a large x, it is likely that we get fewer rows than we want, even though there are enough of them.
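These two observations are easy to sanity-check outside MySQL. The sketch below is a plain Python simulation (the row count is made up to roughly match the test table described below, and the guess that MySQL evaluates RAND() per scanned row and stops at the LIMIT is exactly that, a guess): each matching row survives with probability 1 - x, and the scan stops once the LIMIT is filled.

```python
import random

def rand_filter(rows, x, limit):
    """Simulate `WHERE <conds> AND RAND() > x LIMIT limit`: scan matching
    rows in storage order, keep each with probability 1 - x, stop once
    `limit` rows have been kept."""
    picked = []
    for r in rows:
        if random.random() > x:
            picked.append(r)
            if len(picked) == limit:
                break
    return picked

random.seed(42)
rows = list(range(2000))  # stand-in for the ~2-3k rows matching the conditions

# Small x: LIMIT 10 fills after scanning only ~11 rows on average, so the
# "random" subset clusters heavily at the start of the scan order.
trials = [rand_filter(rows, 0.1, 10) for _ in range(1000)]
print("furthest row ever reached with x=0.1:", max(max(t) for t in trials))

# Large x: with x = 0.999 only ~2 of 2000 rows survive on average, so the
# query usually returns fewer than the 10 rows asked for.
short = sum(len(rand_filter(rows, 0.999, 10)) < 10 for _ in range(1000))
print("trials returning fewer than 10 rows:", short, "/ 1000")
```

Running this shows both effects: with x = 0.1 the scan essentially never gets past the first few dozen rows, and with x = 0.999 the result set is almost always under-filled.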
Now some words about performance. I did a little testing using JMeter. <table> has about 20k rows, and there are about 2-3k rows matching <some conditions>. I wrote a simple PHP script that executes the query and print_r's the result. Then I set up a JMeter test that starts 200 threads, i.e. 200 requests per second, requesting said PHP script. I ran it until about 3k requests were done (the average response time stabilizes well before this). I also executed all queries with SQL_NO_CACHE to prevent the query cache from having any effect. Average response times were:
- ~30ms for query (a)
- 13-15ms for query (b) with x = 0.1
- 17-20ms for query (b) with x = 0.9; as expected, a larger x is slower since more rows have to be discarded
So my questions are: what do you think about this method of selecting random rows? Maybe you have used or tried it and saw that it did not work out? Maybe you can explain better how MySQL handles such a query? What could be some caveats that I'm not aware of?
EDIT: I probably was not clear enough: the point is that I need random rows of a query, not simply of a table, thus I included <some conditions>, and there are quite a few of them. Moreover, the table is guaranteed to have gaps in id; not that it matters much, since this is not random rows from the table but from a query, and there will be quite a lot of such queries, so suggestions involving querying the table multiple times do not sound appealing. <some conditions> will vary at least a bit between requests, meaning that there will be requests with different conditions.
From my own experience, I've always used ORDER BY RAND(), but this has its own performance implications on larger datasets. For example, if you had a table that was too big to fit in memory, MySQL will create a temporary table on disk and then perform a filesort to randomize the dataset (storage engine permitting). Your LIMIT 10 clause will have no effect on the execution time of the query AFAIK, but it will obviously reduce the size of the data sent back to the client.
Basically, the LIMIT and ORDER BY happen after the query has been executed (a full table scan to find matching records, which are then ordered and then limited). Any rows outside of your LIMIT 10 clause are discarded.
As a side note, adding SQL_NO_CACHE will disable MySQL's internal query cache, but it does not prevent your operating system from caching the data (due to the random nature of this query, I don't think it would have much of an effect on your execution times anyway).
Hopefully someone can correct me here if I have made any mistakes but I believe that is the general idea.
An alternative way which probably would not be faster, but might be, who knows :)
Either use a table status query to determine the next auto_increment or the row count, or use SELECT COUNT(*). Then you can decide ahead of time the auto_increment value of a random item and then select that unique item.
This will fail if you have gaps in the auto_increment field, but if it is faster than your other methods, you could loop a few times or fall back to a failsafe method if zero rows are returned. The best case might be a big saving; the worst case would be comparable to your current method.
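A minimal sketch of this retry-and-fallback idea, in Python rather than SQL, with the table simulated by a plain id list (the membership test stands in for a SELECT ... WHERE id = :guess, and both the gap pattern and the attempt count are made-up values):

```python
import random

def pick_random_by_id(present_ids, max_id, attempts=5):
    """Guess a random id in [1, max_id]; hitting a gap means the SELECT
    returns zero rows, so retry a few times, then fall back to a failsafe.
    `present_ids` simulates the table; the `in` test stands in for
    SELECT ... WHERE id = :guess LIMIT 1."""
    table = set(present_ids)
    for _ in range(attempts):
        guess = random.randint(1, max_id)
        if guess in table:            # the SELECT found a row
            return guess
    # failsafe: e.g. fall back to ORDER BY RAND() LIMIT 1
    return random.choice(present_ids)

random.seed(1)
ids = [i for i in range(1, 20001) if i % 7 != 0]  # ~20k-row table with gaps
row = pick_random_by_id(ids, max_id=20000)
```

With few gaps the first guess almost always hits, which is the "big savings" best case; the fallback bounds the worst case at roughly the cost of the original method.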
You're using the wrong data structure.
The usual method is something like this:
- Find out the number of elements n, with something like SELECT count(id) FROM tablename.
- Choose r distinct randomish numbers in the interval [0,n). I usually recommend an LCG with suitably chosen parameters, but simply picking r randomish numbers and discarding repeats also works.
- Return those elements. The hard bit.
MySQL appears to support indexed lookups with something like SELECT id FROM tablename ORDER BY id LIMIT :i,1 where :i is a bound parameter (I forget what syntax mysqli uses); an alternative syntax is LIMIT 1 OFFSET :i. This means you have to make r queries, but this might be fast enough (it depends on per-statement overheads and how efficiently OFFSET is handled).
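The first two steps and the per-offset queries can be sketched as follows (a Python sketch, not the answer's exact code: tablename is the placeholder table from above, and real code would use a bound parameter rather than string interpolation):

```python
import random

def random_offsets(n, r):
    """Step 2: r distinct randomish numbers in [0, n); random.sample
    draws without replacement, so repeats are discarded by construction."""
    return random.sample(range(n), r)

def offset_queries(n, r):
    """Step 3: one indexed lookup per chosen offset."""
    return ["SELECT id FROM tablename ORDER BY id LIMIT 1 OFFSET %d" % i
            for i in random_offsets(n, r)]

random.seed(7)
queries = offset_queries(2500, 10)  # ~2-3k matching rows, want 10 of them
for q in queries:
    print(q)
```

Each of the r queries is a cheap indexed scan to a single row, so the total cost is r small statements instead of one ORDER BY RAND() over the whole result set.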
An alternative method, which should work for most databases, is a bit like interval-bisection:
SELECT count(id),max(id),min(id) FROM tablename. Then you know rows [0,n-1] have ids [min,max]. So suppose rows [a,b] have ids [min,max] and you want row i. If i == a, return min. If i == b, return max. Otherwise, bisect:
- Choose split = min+(max-min)/2 (avoiding integer overflow).
- SELECT count(id),max(id) WHERE :min < id AND id < :split and SELECT count(id),min(id) WHERE :split <= id AND id < :max. The two counts should sum to b-a-1 if the table hasn't been modified...
- Figure out which range i is in, and update a, b, min, and max appropriately. Repeat.
There are plenty of edge cases (I've probably included some off-by-one errors) and a few potential optimizations (you can do this for all the indexes at once, and you don't really need to do two queries per iteration if you don't assume that i == b
implies id = max
). It's not really worth doing if SELECT ... OFFSET
is even vaguely efficient.
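For what it's worth, here is a Python sketch of the bisection, with the sorted id column simulated by a list and each list comprehension standing in for one aggregate query. The boundary conditions (id <= split vs. split < id) are one valid way to partition the open interval; the off-by-one details are my own working, not the answer's:

```python
def ith_id(ids, i):
    """Return the id of row i (0-based, in id order) using only the
    count/min/max aggregates described above.  `ids` is the sorted,
    gappy id column; each comprehension simulates one SELECT."""
    a, b = 0, len(ids) - 1
    lo, hi = ids[0], ids[-1]            # invariant: rows [a, b] have ids [lo, hi]
    while True:
        if i == a:
            return lo
        if i == b:
            return hi
        split = lo + (hi - lo) // 2     # midpoint, written to avoid overflow
        # SELECT count(id), max(id) WHERE :lo < id AND id <= :split
        left = [x for x in ids if lo < x <= split]
        # SELECT count(id), min(id) WHERE :split < id AND id < :hi
        right = [x for x in ids if split < x < hi]
        # len(left) + len(right) == b - a - 1 unless the table changed
        if i <= a + len(left):          # row i falls in the left half
            b, hi = a + len(left), max(left)
        else:                           # row i falls in the right half
            a, lo = b - len(right), min(right)

ids = [3, 7, 8, 15, 16, 23, 42, 99, 100, 101]  # gappy auto_increment column
print(ith_id(ids, 3))  # row 3 is id 15
```

Each iteration at least halves the id range or pins one endpoint, so the number of round trips is logarithmic in the id span, not in the row count.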