
Category with lots of pages (huge offsets) (how does Stack Overflow work?)

I think my question can be answered just by knowing how, for example, Stack Overflow works.

For example, this page loads in well under a second (< 300 ms): https://stackoverflow.com/questions?page=61440&sort=newest

The only query I can think of for that page is something like SELECT * FROM stuff ORDER BY date DESC LIMIT {pageNumber}*{stuffPerPage}, {stuffPerPage}

A query like that might take several seconds to run, but the Stack Overflow page loads almost instantly. It can't be a cached query, since questions are posted over time and rebuilding the cache every time a question is posted would be madness.

So, how does this work, in your opinion?

(To make the question easier, let's forget about the ORDER BY.) Example (the table is fully cached in RAM and stored on an SSD):

mysql> select * from thread limit 1000000, 1;
1 row in set (1.61 sec)

mysql> select * from thread limit 10000000, 1;
1 row in set (16.75 sec)

mysql> describe select * from thread limit 1000000, 1;
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows     | Extra |
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+
|  1 | SIMPLE      | thread | ALL  | NULL          | NULL | NULL    | NULL | 64801163 |       |
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+

mysql> select * from thread ORDER BY thread_date DESC limit 1000000, 1;
1 row in set (1 min 37.56 sec)


mysql> SHOW INDEXES FROM thread;
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| thread |          0 | PRIMARY  |            1 | newsgroup_id | A         |      102924 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          0 | PRIMARY  |            2 | thread_id    | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          0 | PRIMARY  |            3 | postcount    | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          0 | PRIMARY  |            4 | thread_date  | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          1 | date     |            1 | thread_date  | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
5 rows in set (0.00 sec)


Create a BTREE index on the date column and the query will run in a breeze.

CREATE INDEX date ON stuff(date) USING BTREE

UPDATE: Here is a test I just did:

CREATE TABLE test( d DATE, i INT, INDEX(d) );

I filled the table with 2,000,000 rows, each with distinct values of i and d.

mysql> SELECT * FROM test LIMIT 1000000, 1;
+------------+---------+
| d          | i       |
+------------+---------+
| 1897-07-22 | 1000000 |
+------------+---------+
1 row in set (0.66 sec)

mysql> SELECT * FROM test ORDER BY d LIMIT 1000000, 1;
+------------+--------+
| d          | i      |
+------------+--------+
| 1897-07-22 | 999980 |
+------------+--------+
1 row in set (1.68 sec)

And here is an interesting observation:

mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 1000, 1;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+
|  1 | SIMPLE      | test  | index | NULL          | d    | 4       | NULL | 1001 |       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+

mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 10000, 1;
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
|  1 | SIMPLE      | test  | ALL  | NULL          | NULL | NULL    | NULL | 2000343 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+

MySQL uses the index for offset 1000 but not for 10000.

Even more interesting: if I use FORCE INDEX, the query takes more time:

mysql> SELECT * FROM test FORCE INDEX(d) ORDER BY d LIMIT 1000000, 1;
+------------+--------+
| d          | i      |
+------------+--------+
| 1897-07-22 | 999980 |
+------------+--------+
1 row in set (2.21 sec)


I think Stack Overflow doesn't need to reach rows at offset 10000000. The query below should be fast enough if you have an index on date and the numbers in the LIMIT clause come from real-world page numbers, not millions :)

SELECT * 
FROM stuff 
ORDER BY date DESC 
LIMIT {pageNumber}*{stuffPerPage}, {stuffPerPage}
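
For example, with a hypothetical 30 items per page and pageNumber = 3, the query expands to:

SELECT *
FROM stuff
ORDER BY date DESC
LIMIT 90, 30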

UPDATE:

If records in a table are relatively rarely deleted (as on Stack Overflow), then you can use the following solution:

SELECT * 
FROM stuff 
WHERE id between 
    {stuffCount}-{pageNumber}*{stuffPerPage}+1 AND 
    {stuffCount}-{pageNumber-1}*{stuffPerPage}
ORDER BY id DESC 

Where {stuffCount} is:

SELECT MAX(id) FROM stuff
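
With hypothetical numbers, say MAX(id) = 10000, stuffPerPage = 30 and pageNumber = 2, the range works out to:

SELECT *
FROM stuff
WHERE id BETWEEN 9941 AND 9970   -- 10000 - 2*30 + 1  and  10000 - 1*30
ORDER BY id DESC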

If you have some deleted records in the database, then some pages will have fewer than {stuffPerPage} records, but that should not be a problem. Stack Overflow uses an approximate approach too: for instance, go to the first page and then to the last page and you'll see that both return 30 records per page, even though the total count is almost never an exact multiple of 30.

Solutions designed to work with large databases often use small hacks that are usually unnoticeable to regular users.


Nowadays, paging through millions of records is out of fashion, because it's impractical. Currently it's popular to use infinite scrolling (automatic, or manual with a button click). It makes more sense, and pages load faster because the whole page doesn't need to be reloaded. If you think old records can be useful for your users too, then it's a good idea to create a page with random records (with infinite scrolling as well). That's my opinion :)
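
Infinite scrolling pairs naturally with keyset ("seek") pagination, which avoids large offsets entirely. A minimal sketch, reusing the question's stuff/date names and a hypothetical {lastSeenDate} placeholder for the date of the last record already shown:

SELECT *
FROM stuff
WHERE date < {lastSeenDate}   -- seek past what the user has already seen
ORDER BY date DESC
LIMIT {stuffPerPage}

With an index on date, this reads only {stuffPerPage} rows regardless of how far the user has scrolled, instead of scanning and discarding an ever-growing offset.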
