ClickHouse bloom filter index seems too slow
I executed the following query, and it processed ~1B rows and took 75 seconds in total for a simple count.
SELECT count(*)
FROM events_distributed
WHERE (orgid = '174a4727-1116-4c5c-8234-ab76f2406c4a') AND (timestamp >= '2022-12-05 00:00:00.000000000')
Query id: e4312ff5-6add-4757-8deb-d68e0f3e29d9
┌──count()─┐
│ 13071204 │
└──────────┘
1 row in set. Elapsed: 74.951 sec. Processed 979.00 million rows, 8.26 GB (13.06 million rows/s., 110.16 MB/s.)
I am wondering how I can speed this up. My events table has the following PARTITION BY and ORDER BY columns, plus a bloom filter index on orgid:
PARTITION BY toDate(timestamp)
ORDER BY (timestamp);
INDEX idx_orgid orgid TYPE bloom_filter(0.01) GRANULARITY 1,
Below is the execution plan:
EXPLAIN indexes = 1
SELECT count(*)
FROM events_distributed
WHERE (orgid = '174a4727-1116-4c5c-8234-ab76f240fc4a') AND (timestamp >= '2022-12-05 00:00:00.000000000') AND (timestamp <= '2022-12-06 00:00:00.000000000')
Query id: 879c2ce5-c4c7-4efc-b0e2-25613848afad
┌─explain────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ MergingAggregated │
│ Union │
│ Aggregating │
│ Expression (Before GROUP BY) │
│ Filter (WHERE) │
│ ReadFromMergeTree (users.events) │
│ Indexes: │
│ MinMax │
│ Keys: │
│ timestamp │
│ Condition: and((timestamp in (-Inf, '1670284800']), (timestamp in ['1670198400', +Inf))) │
│ Parts: 12/342 │
│ Granules: 42122/407615 │
│ Partition │
│ Keys: │
│ toDate(timestamp) │
│ Condition: and((toDate(timestamp) in (-Inf, 19332]), (toDate(timestamp) in [19331, +Inf))) │
│ Parts: 12/12 │
│ Granules: 42122/42122 │
│ PrimaryKey │
│ Keys: │
│ timestamp │
│ Condition: and((timestamp in (-Inf, '1670284800']), (timestamp in ['1670198400', +Inf))) │
│ Parts: 12/12 │
│ Granules: 30696/42122 │
│ Skip │
│ Name: idx_orgid │
│ Description: bloom_filter GRANULARITY 1 │
│ Parts: 8/12 │
│ Granules: 20556/30696 │
│ ReadFromRemote (Read from remote replica) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
32 rows in set. Elapsed: 0.129 sec.
How can I speed up this query? Processing 1B rows to produce a count of 13M sounds like something is totally off. Would creating a SET index on orgid be any better? I will have at most 10K orgs.
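For reference, the set index I have in mind would look something like this (a sketch only; idx_orgid_set is a placeholder name, and set(10000) assumes the ~10K-org upper bound):

-- add a set skip index on orgid and build it for existing parts
ALTER TABLE events ADD INDEX idx_orgid_set orgid TYPE set(10000) GRANULARITY 1;
ALTER TABLE events MATERIALIZE INDEX idx_orgid_set;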
You don't use the primary index: the table is ordered by timestamp only, so the orgid filter is handled entirely by the bloom-filter skip index, which still leaves most granules to be read (20556 of 30696 in the plan above). I suggest using

PARTITION BY toDate(timestamp)
ORDER BY (orgid, timestamp)

https://kb.altinity.com/engines/mergetree-table-engine-family/pick-keys/

and removing the bloom_filter index.
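A minimal sketch of what the reordered table could look like (column types are assumptions: the nanosecond-precision literals in the question suggest DateTime64(9), and all other event columns are omitted):

CREATE TABLE events
(
    timestamp DateTime64(9),
    orgid     String
    -- remaining event columns ...
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (orgid, timestamp);

With orgid first in the sorting key, the PrimaryKey section of EXPLAIN indexes = 1 should prune granules by org directly instead of scanning every granule in the matching partitions. Note that existing data would have to be reinserted into a new table, since a MergeTree sorting key cannot be reordered in place.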