
Category with lots of pages (huge offsets) (how does Stack Overflow work?)

I think my question can be answered just by knowing how, for example, Stack Overflow works.

For example, this page loads in well under a second (< 300 ms): https://stackoverflow.com/questions?page=61440&sort=newest

The only query I can think of for that page is something like SELECT * FROM stuff ORDER BY date DESC LIMIT {pageNumber}*{stuffPerPage}, {stuffPerPage}

A query like that might take several seconds to run, but the Stack Overflow page loads almost instantly. It can't be a cached query, since questions are posted over time and rebuilding the cache every time a question is posted would be madness.

So, how does this work, in your opinion?

(To make the question easier, let's forget about the ORDER BY.) Example (the table is fully cached in RAM and stored on an SSD):

mysql> select * from thread limit 1000000, 1;
1 row in set (1.61 sec)

mysql> select * from thread limit 10000000, 1;
1 row in set (16.75 sec)

mysql> describe select * from thread limit 1000000, 1;
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows     | Extra |
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+
|  1 | SIMPLE      | thread | ALL  | NULL          | NULL | NULL    | NULL | 64801163 |       |
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+

mysql> select * from thread ORDER BY thread_date DESC limit 1000000, 1;
1 row in set (1 min 37.56 sec)


mysql> SHOW INDEXES FROM thread;
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| thread |          0 | PRIMARY  |            1 | newsgroup_id | A         |      102924 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          0 | PRIMARY  |            2 | thread_id    | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          0 | PRIMARY  |            3 | postcount    | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          0 | PRIMARY  |            4 | thread_date  | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          1 | date     |            1 | thread_date  | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
5 rows in set (0.00 sec)


Create a BTREE index on the date column and the query will run in a breeze.

CREATE INDEX date ON stuff(date) USING BTREE

UPDATE: Here is a test I just did:

CREATE TABLE test( d DATE, i INT, INDEX(d) );

I filled the table with 2,000,000 rows, each with distinct values of i and d.

mysql> SELECT * FROM test LIMIT 1000000, 1;
+------------+---------+
| d          | i       |
+------------+---------+
| 1897-07-22 | 1000000 |
+------------+---------+
1 row in set (0.66 sec)

mysql> SELECT * FROM test ORDER BY d LIMIT 1000000, 1;
+------------+--------+
| d          | i      |
+------------+--------+
| 1897-07-22 | 999980 |
+------------+--------+
1 row in set (1.68 sec)

And here is an interesting observation:

mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 1000, 1;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+
|  1 | SIMPLE      | test  | index | NULL          | d    | 4       | NULL | 1001 |       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+

mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 10000, 1;
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
|  1 | SIMPLE      | test  | ALL  | NULL          | NULL | NULL    | NULL | 2000343 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+

MySQL uses the index for offset 1000 but not for 10000.

Even more interesting: if I use FORCE INDEX, the query takes more time:

mysql> SELECT * FROM test FORCE INDEX(d) ORDER BY d LIMIT 1000000, 1;
+------------+--------+
| d          | i      |
+------------+--------+
| 1897-07-22 | 999980 |
+------------+--------+
1 row in set (2.21 sec)


I think Stack Overflow doesn't need to reach rows at offset 10000000. The query below should be fast enough if you have an index on date and the numbers in the LIMIT clause come from real-world page numbers, not millions :)

SELECT * 
FROM stuff 
ORDER BY date DESC 
LIMIT {pageNumber}*{stuffPerPage}, {stuffPerPage}
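
For example, with a hypothetical 30 items per page and pageNumber = 3, the query expands to:

SELECT *
FROM stuff
ORDER BY date DESC
LIMIT 90, 30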

UPDATE:

If records in a table are relatively rarely deleted (as on Stack Overflow), then you can use the following solution:

SELECT * 
FROM stuff 
WHERE id between 
    {stuffCount}-{pageNumber}*{stuffPerPage}+1 AND 
    {stuffCount}-{pageNumber-1}*{stuffPerPage}
ORDER BY id DESC 

Where {stuffCount} is:

SELECT MAX(id) FROM stuff
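
With hypothetical numbers, say MAX(id) = 10000, stuffPerPage = 30 and pageNumber = 2, the range works out to:

SELECT *
FROM stuff
WHERE id BETWEEN 9941 AND 9970   -- 10000 - 2*30 + 1  and  10000 - 1*30
ORDER BY id DESC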

If you have some deleted records in the database, then some pages will have fewer than {stuffPerPage} records, but that should not be a problem. Stack Overflow uses an approximate approach too: for instance, go to the first page and then to the last page and you'll see that both return 30 records per page, even though the total count is almost never an exact multiple of 30.

Solutions designed to work with large databases often use small hacks that are usually unnoticeable to regular users.


Nowadays, paging through millions of records is out of fashion, because it's impractical. Currently it's popular to use infinite scrolling (automatic, or manual with a button click). It makes more sense, and pages load faster because the whole page doesn't need to be reloaded. If you think old records can be useful for your users too, then it's a good idea to create a page with random records (with infinite scrolling as well). That's my opinion :)
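
Infinite scrolling pairs naturally with keyset ("seek") pagination, which avoids large offsets entirely. A minimal sketch, reusing the question's stuff/date names and a hypothetical {lastSeenDate} placeholder for the date of the last record already shown:

SELECT *
FROM stuff
WHERE date < {lastSeenDate}   -- seek past what the user has already seen
ORDER BY date DESC
LIMIT {stuffPerPage}

With an index on date, this reads only {stuffPerPage} rows regardless of how far the user has scrolled, instead of scanning and discarding an ever-growing offset.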
