Gapless secondary ID for fetching random rows in MySQL

2023-03-03 12:21 问答作者：

In several projects I've been working on I have encountered the need to fetch random rows from large (>1M rows) tables. With tables this large, ORDER BY rand() LIMIT 1 is no option as it will quickly bring the database to it's knees.

The usual solution has been to generat a random number between MIN(id) and MAX(id) and select that row directly. However, if there are big gaps in the id sequence this will require either lots of re-rolls or using WHERE id >= :myrandomnumber which will lead to rows that succeed large gaps getting significantly more hits than average.

I've been thinking to solve this problem by creating a new indexed column solely for randomizing purposes, say id2. This column would always be a gapless sequence between 1 and the number of rows in the table.

Question: What would be the best way to keep this sequence gapless?

The first solution that comes to mind is creating a helper table recycled_ids that will contain columns tablename and id2. Whenever a row is deleted from tablename, the id2 of that row is inserted to recycled_ids. When new rows are inserted, the id2 is either selected from recycled_ids or if none are available, a new one is created. Is there a simpler way?

Bonus quest开发者_如何学编程ions: Are there ORMs or frameworks that already do this or have otherwise efficient random row selection? Is this an existing pattern, does it have a name?

Update: I wrote a quick benchmark for this and ran it against a table with 125,000 rows and 30,000 gaps between them. The results are pretty promising:

Fetch a random row 100 times using id2: 0.0234689712524 seconds
Fetch a random row 100 times using ORDER BY rand() LIMIT 1: 54.992347002 seconds

When inserting the test data, I removed one random row for every five rows inserted. The sequence stays gapless the whole time.

for($i=1; $i<=$amount; $i++) {
    insert_row();
    if($i % 5 == 0)
        delete_random_row();
}

Running that loop again with $amount = 10000 takes 9 seconds on my low-end vserver. That's 0.009 seconds per row and it includes deleting a random row every five iterations. It does get slower as the table grows, but fetching a random row does not.

My original questions still apply.

Here's how I'd do it -

Select the MAX(id) from your table
In PHP (or whatever language you're using), generate a random integer between 1 and MAX(id)
SELECT * FROM table WHERE id >= $random ORDER BY id ASC LIMIT 1
If 3 returns nothing, SELECT * FROM table WHERE id < $random ORDER BY id DESC LIMIT 1

Avoids running any queries that would be brutally slow. It also avoids the extra column which, keeping gapless, would be a nasty job indeed!

Ranking to the rescue I'd say.

SET @rank:= 1;  

SELECT * FROM
  (
  SELECT @rank:= @rank + 1 as rank, * FROM table1  
  ) s
WHERE s.rank = $random;

继续阅读：design-patterns orm query-optimization random

Gapless secondary ID for fetching random rows in MySQL

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？