Pagination Strategies for Complex (slow) Datasets
What are some of the strategies being used for pagination of data sets that involve complex queries? count(*) takes ~1.5 sec so we don't want to hit the DB for every page view. Currently there are ~45k rows returned by this query.
Here are 开发者_StackOverflow社区some of the approaches I've considered:
- Cache the row count and update it every X minutes
- Limit (and offset) the rows counted to 41 (for example) and display the page picker as "1 2 3 4 ..."; then recompute if anyone actually goes to page 4 and display "... 3 4 5 6 7 ..."
- Get the row count once and store it in the user's session
- Get rid of the page picker and just have a "Next Page" link
I've had to engineer a few pagination strategies using PHP and MySQL for a site that does over a million page views a day. I persued the strategy in stages:
Multi-column indexes I should have done this first before attempting a materialized view.
Generating a materialized view. I created a cron job that did a common denormalization of the document tables I was using. I would SELECT ... INTO OUTFILE ...
and then create the new table, and rotate it in:
SELECT ... INTO OUTFILE '/tmp/ondeck.txt' FROM mytable ...;
CREATE TABLE ondeck_mytable LIKE mytable;
LOAD DATA INFILE '/tmp/ondeck.txt' INTO TABLE ondeck_mytable...;
DROP TABLE IF EXISTS dugout_mytable;
RENAME TABLE atbat_mytable TO dugout_mytable, ondeck_mytable TO atbat_mytable;
This kept the lock time on the write contended mytable
down to a minimum and the pagination queries could hammer away on the atbat
materialized view. I've simplified the above, leaving out the actual manipulation, which are unimportant.
Memcache I then created a wrapper about my database connection to cache these paginated results into memcache. This was a huge performance win. However, it was still not good enough.
Batch generation I wrote a PHP daemon and extracted the pagination logic into it. It would detect changes mytable
and periodically regenerate the from the oldest changed record to the most recent record all the pages to the webserver's filesystem. With a bit of mod_rewrite
, I could check to see if the page existed on disk, and serve it up. This also allowed me to take effective advantage of reverse proxying by letting Apache detect If-Modified-Since
headers, and respond with 304
response codes. (Obviously, I removed any option of allowing users to select the number of results per page, an unimportant feature.)
Updated:
RE count(*)
: When using MyISAM tables, COUNT
didn't create a problem when I was able to reduce the amount of read-write contention on the table. If I were doing InnoDB, I would create a trigger that updated an adjacent table with the row count. That trigger would just +1 or -1 depending on INSERT or DELETE statements.
RE page-pickers (thumbwheels) When I moved to agressive query caching, thumb wheel queries were also cached, and when it came to batch generating the pages, I was using temporary tables--so computing the thumbwheel was no problem. A lot of thumbwheel calculation simplified because it became a predictable filesystem pattern that actually only needed the largest page numer. The smallest page number was always 1.
Windowed thumbweel The example you give above for a windowed thumbwheel (<< 4 [5] 6 >>) should be pretty easy to do without any queries at all so long as you know your maximum number of pages.
My suggestion is ask MySQL for 1 row more than you need in each query, and decide based on the number of rows in the result set whether or not to show the next page
-link.
MySQL has a specific mechanism to compute an approximated count of a result set without the LIMIT
clause: FOUND_ROWS()
.
MySQL
is quite good in optimizing LIMIT
queries.
That means it picks appropriate join buffer, filesort buffer etc just enough to satisfy LIMIT
clause.
Also note that with 45k
rows you probably don't need exact count. Approximate counts can be figured out using separate queries on the indexed fields. Say, this query:
SELECT COUNT(*)
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
can be approximated by this one:
SELECT COUNT(*) *
(
SELECT COUNT(*)
FROM mytable
) / 1000
FROM (
SELECT 1
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
LIMIT 1000
)
, which is much more efficient in MyISAM
.
If you give an example of your complex query, probably I can say something more definite on how to improve its pagination.
I'm by no means a MySQL expert, but perhaps giving up the COUNT(*) and going ahead with COUNT(id)?
精彩评论