Why does the following Postgres SQL query take so long?
The raw query is as follows
SELECT "TIME", "TRADEPRICE"
FROM "YEAR" where "DATE"='2010-03-01'
and "SECURITY"='STW.AX'
AND "TIME" < '10:16:00'
AND "TYPE" = 'TRADE'
ORDER BY "TIME" ASC LIMIT 3
I have built three indices as follows
Columns "DATE" DESC NULLS LAST
Columns "SECURITY" DESC NULLS LAST
Columns "TIME" DESC NULLS LAST
I do not index TYPE because it only takes one of two possible values
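For reference, the definitions presumably look something like this (date_desc and security_desc are the index names that appear in the plan below; time_desc is a guess):
CREATE INDEX date_desc ON "YEAR" ("DATE" DESC NULLS LAST);
CREATE INDEX security_desc ON "YEAR" ("SECURITY" DESC NULLS LAST);
CREATE INDEX time_desc ON "YEAR" ("TIME" DESC NULLS LAST);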
Explain analyze produces the following
"Limit (cost=50291.28..50291.28 rows=3 width=16) (actual time=1794484.566..1794484.567 rows=3 loops=1)"
" -> Sort (cost=50291.28..50291.29 rows=4 width=16) (actual time=1794484.562..1794484.563 rows=3 loops=1)"
" Sort Key: "TIME""
" Sort Method: top-N heapsort Memory: 25kB"
" -> Bitmap Heap Scan on "YEAR" (cost=48569.54..50291.24 rows=4 width=16) (actual time=1794411.662..1794484.498 rows=20 loops=1)"
" Recheck Cond: (("SECURITY" = 'STW.AX'::bpchar) AND ("DATE" = '2010-03-01'::date))"
" Filter: (("TIME" < '10:16:00'::time without time zone) AND ("TYPE" = 'TRADE'::bpchar))"
" -> BitmapAnd (cost=48569.54..48569.54 rows=430 width=0) (actual time=1794411.249..1794411.249 rows=0 loops=1)"
" -> Bitmap Index Scan on security_desc (cost=0.00..4722.94 rows=166029 width=0) (actual time=1793917.506..1793917.506 rows=1291933 loops=1)"
" Index Cond: ("SECURITY" = 'STW.AX'::bpchar)"
" -> Bitmap Index Scan on date_desc (cost=0.00..43846.35 rows=2368764 width=0) (actual time=378.698..378.698 rows=2317130 loops=1)"
" Index Cond: ("DATE" = '2010-03-01'::date)"
"Total runtime: 1794485.224 ms"
The database is about 1 billion rows, running on a Core2Quad with 8 GB of RAM on 64-bit Ubuntu. Surely this query should not take half an hour?
It's taking half an hour because of the way you set up your indexes.
Your query has no multi-column indexes that it can use to head straight to the needed rows. It does the next best thing, which is a bitmap index scan on barely selective indexes, and top-3 sorting the resulting set.
The two indexes in question, on the security and on the date, yield 1.3M and 2.3M rows respectively. Combining them will be excruciatingly slow because you're randomly looking up over a million rows and filtering each one.
Adding insult to injury, your data structure is such that two highly correlated fields (date and time) are stored and manipulated separately. This confuses the query planner, because Postgres does not collect cross-column correlation statistics. Your queries will thus almost always end up filtering through enormous sets of rows and ordering the filtered set on a separate criterion.
I'd suggest the following changes:
Alter the table and add a datetime column of type timestamp with time zone, and combine your date and time columns into it. Drop the date and time fields accordingly, as well as the indexes on them. Also drop the index on security. (A sketch of the migration appears after this list.)
Create an index on (security, datetime). (And don't mess around with nulls first/nulls last unless your ordering criteria contain those clauses too.)
At your option, add a separate index on (datetime) or on (datetime, security), if you ever need to execute queries that do stats on all trades in a day or date range.
vacuum analyze the whole mess once you're done with the above.
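A sketch of that migration, assuming the new column is called "DATETIME" (the index name is also made up; security_desc is the index name from your plan):
-- "DATE" + "TIME" yields a timestamp; the cast to timestamptz uses the session time zone
ALTER TABLE "YEAR" ADD COLUMN "DATETIME" timestamp with time zone;
UPDATE "YEAR" SET "DATETIME" = "DATE" + "TIME";
-- dropping the columns also drops any indexes built on them
ALTER TABLE "YEAR" DROP COLUMN "DATE", DROP COLUMN "TIME";
DROP INDEX security_desc;
CREATE INDEX year_security_datetime ON "YEAR" ("SECURITY", "DATETIME");
VACUUM ANALYZE "YEAR";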
You'll then be able to rewrite your query like so:
SELECT "TIME", "TRADEPRICE"
FROM "YEAR"
WHERE '2010-03-01 00:00:00' <= "DATETIME" AND "DATETIME" < '2010-03-01 10:16:00'
AND "SECURITY"='STW.AX'
AND "TYPE" = 'TRADE'
ORDER BY "DATETIME" ASC LIMIT 3
This will yield the optimal plan: retrieving the top-3 rows from a filtered index scan on (security, datetime), which I'd expect (since you've a billion rows) to take 25 ms at most.
Add a composite index covering many of the search terms together, e.g. ON "YEAR" ("TYPE", "SECURITY", "DATE", "TIME"). Then the database can look in a single index to match all of them, instead of having to search multiple indexes and collate all the results together (the BitmapAnd of Bitmap Index Scans in your plan).
Exactly which columns to include (e.g. TYPE or not?) and what order to put them in depends on your data characteristics and on what other queries you run (since you get to re-use any left prefix of a composite index for free), so experiment a bit; but to encourage order optimisation, keep the ORDER BY column as the last used index column/direction.
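For example, a single covering index matching all the equality predicates plus the range/sort column (the index name is made up):
-- equality columns first, the range / ORDER BY column ("TIME") last
CREATE INDEX year_type_security_date_time
    ON "YEAR" ("TYPE", "SECURITY", "DATE", "TIME");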
You may also want to ANALYZE the table to update the statistics for the query planner, as some of the row-count estimates in your plan seem to be a bit off.
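For example:
ANALYZE "YEAR";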
I do not index TYPE because it only takes one of two possible values
You have to understand how indexes work, and why they work. Indexes duplicate the indexed data into lean, small index blocks that contain only the specified index columns; from your X GB of raw data, only about X/20 (a guesstimate) remains in the index. If a query references data which is not indexed, then for each record that satisfies the other criteria the DBMS has to read the corresponding raw data block in addition to the index block, just to determine whether the row matches the query.
The optimal case is that at least one index contains every single column the query needs, so there is no need to do lookups into the data blocks at all.
Another hint: It is usually a good idea to list columns that take values that are usually queried as a range (in your case "TIME") last.
My suggestion: Drop all indexes. Create a single index on the fields DATE, SECURITY, TYPE, TIME (ASC), in that order, so the range-queried column comes last (a sketch of the statement follows below). Use the query
SELECT "TIME", "TRADEPRICE"
FROM "YEAR"
WHERE "TIME" < '10:16:00'
AND "DATE"='2010-03-01'
AND "SECURITY"='STW.AX'
AND "TYPE" = 'TRADE'
And watch the incredible speed.
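A sketch of that index (the name is an assumption):
-- equality columns first, the range-queried column "TIME" last, per the hint above
CREATE INDEX year_date_security_type_time
    ON "YEAR" ("DATE", "SECURITY", "TYPE", "TIME");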