Very slow bitmap heap scan in Postgres
I have the following simple table that contains traffic measurement data:
CREATE TABLE "TrafficData"
(
"RoadID" character varying NOT NULL,
"DateID" numeric NOT NULL,
"ExactDateTime" timestamp NOT NULL,
"CarsSpeed" numeric NOT NULL,
"CarsCount" numeric NOT NULL
)
CREATE INDEX "RoadDate_Idx" ON "TrafficData" USING btree ("RoadID", "DateID");
The column RoadID uniquely identifies the road whose data is being recorded, while DateID identifies the day of the year (1..365) of the data - basically a rounded off representation of ExactDateTime.
I have about 100.000.000 rows; there are 1.000 distinct values in the column "RoadID" and 365 distinct values in the column "DateID".
I then run the following query:
SELECT * FROM "TrafficData"
WHERE "RoadID"='Station_1'
AND "DateID">20100610 AND "DateID"<20100618;
This takes up to three mind-boggling seconds to finish, and I cannot for the life of me figure out WHY.
EXPLAIN ANALYZE gives me the following output:
Bitmap Heap Scan on "TrafficData" (cost=104.84..9743.06 rows=2496 width=47) (actual time=35.112..2162.404 rows=2016 loops=1)
Recheck Cond: ((("RoadID")::text = 'S开发者_运维技巧tation_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
-> Bitmap Index Scan on "RoadDate_Idx" (cost=0.00..104.22 rows=2496 width=0) (actual time=1.637..1.637 rows=2016 loops=1)
Index Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
Total runtime: 2163.985 ms
My specs:
- Windows 7
- Postgres 9.0
- 4GB RAM
I'd greatly appreciate any helpful pointers!
The slow part is obviosly fetching the data from the tables, since the index access seems to be very fast. You might either optimize your RAM usage parameters (see http://wiki.postgresql.org/wiki/Performance_Optimization and http://www.varlena.com/GeneralBits/Tidbits/perf.html), or optimize the layout of the data in the table by issuing a CLUSTER command (see http://www.postgresql.org/docs/8.3/static/sql-cluster.html).
CLUSTER "TrafficData" USING "RoadDate_Idx";
should do it.
Adding to Daniel's answer, the cluster operation is a one off process that rearranged the data on disk. The intent is to get your 2000 results rows from fewer disk blocks.
As this is dummy data, being used to find out how you can quickly query it, I'd recommend reloading it, in a pattern closer to how it will be loaded as it is generated. I imagine that the data is generated one day at a time, which will effectively result in strong correlation between DateID
and the location on disk. If that is the case, then I'd either cluster by DateID
, or split your test data into 365 separate loads, and reload it.
Without that, and having randomly generated data, you're most likely having to perform over 2000 seeks of your disk head.
I'd also check that anything else you're running on Windows 7 isn't adding time to those reads that you don't need, such as ensuring that the blocks read do not contain virus signatures, or concurrently performing an automatically scheduled disk defragmentation (resulting in the disk head hardly ever being anywhere close to where it was last time a database block was read).
- 4GB RAM -> 6+ you have 100M records, which isn't large but for a desktop machine memory could matter. If this is not a desktop, I'm not sure why you would have such a small amount of memory
AND "DateID">20100610 AND "DateID"<20100618;
->DateID BETWEEN 20100611 AND 20100617;
- Create an index on the DateID
- Get rid of all the double quotes around field names
- Instead of a VarChar, make
RoadID
a text field
精彩评论