Gapless secondary ID for fetching random rows in MySQL
In several projects I've been working on I have encountered the need to fetch random rows from large (>1M rows) tables. With tables this large, ORDER BY rand() LIMIT 1
is no option as it will quickly bring the database to it's knees.
The usual solution has been to generat a random number between MIN(id)
and MAX(id)
and select that row directly. However, if there are big gaps in the id sequence this will require either lots of re-rolls or using WHERE id >= :myrandomnumber
which will lead to rows that succeed large gaps getting significantly more hits than average.
I've been thinking to solve this problem by creating a new indexed column solely for randomizing purposes, say id2
. This column would always be a gapless sequence between 1 and the number of rows in the table.
Question: What would be the best way to keep this sequence gapless?
The first solution that comes to mind is creating a helper table recycled_ids
that will contain columns tablename
and id2
. Whenever a row is deleted from tablename
, the id2
of that row is inserted to recycled_ids
. When new rows are inserted, the id2
is either selected from recycled_ids
or if none are available, a new one is created. Is there a simpler way?
Bonus quest开发者_如何学编程ions: Are there ORMs or frameworks that already do this or have otherwise efficient random row selection? Is this an existing pattern, does it have a name?
Update: I wrote a quick benchmark for this and ran it against a table with 125,000 rows and 30,000 gaps between them. The results are pretty promising:
Fetch a random row 100 times using id2: 0.0234689712524 seconds
Fetch a random row 100 times using ORDER BY rand() LIMIT 1: 54.992347002 seconds
When inserting the test data, I removed one random row for every five rows inserted. The sequence stays gapless the whole time.
for($i=1; $i<=$amount; $i++) {
insert_row();
if($i % 5 == 0)
delete_random_row();
}
Running that loop again with $amount = 10000
takes 9 seconds on my low-end vserver. That's 0.009 seconds per row and it includes deleting a random row every five iterations. It does get slower as the table grows, but fetching a random row does not.
My original questions still apply.
Here's how I'd do it -
- Select the MAX(id) from your table
- In PHP (or whatever language you're using), generate a random integer between 1 and MAX(id)
SELECT * FROM table WHERE id >= $random ORDER BY id ASC LIMIT 1
- If 3 returns nothing,
SELECT * FROM table WHERE id < $random ORDER BY id DESC LIMIT 1
Avoids running any queries that would be brutally slow. It also avoids the extra column which, keeping gapless, would be a nasty job indeed!
Ranking to the rescue I'd say.
SET @rank:= 1;
SELECT * FROM
(
SELECT @rank:= @rank + 1 as rank, * FROM table1
) s
WHERE s.rank = $random;
精彩评论