Help speeding up PostgreSQL query
I'm trying to speed this query up as much as possible. It's not terribly slow, but I need it to be as fast as it can be.
SELECT name
FROM (
SELECT cities.name || ', ' || regions.name || ', ' || countries.code AS name
FROM cities
INNER JOIN regions ON regions.id = cities.region_id
INNER JOIN countries ON countries.id = regions.country_id
) AS t1
GROUP BY name
HAVING LOWER(name) ILIKE 'asheville%'
ORDER BY name ASC
LIMIT 10;
These indexes exist:
UNIQUE INDEX index_cities_on_name_and_region_id ON cities USING btree (name, region_id)
UNIQUE INDEX index_countries_on_code ON countries USING btree (code)
UNIQUE INDEX index_countries_on_name ON countries USING btree (name)
UNIQUE INDEX index_regions_on_code_and_country_id ON regions USING btree (code, country_id)
cities table contains 248016 records. countries table contains 252 records. regions table contains 4005 records.
Here's the explain output of the query: http://explain.depesz.com/s/fWe
Any help would be greatly appreciated. Basically I'm just looking for suggestions, or for someone to point out something I may have missed.
In your subquery, you should return both the name you are already returning and cities.name as cname. Then, you should do your ilike on cname instead of on name. The issue is that right now PostgreSQL can't be expected to infer that, as 'asheville%' doesn't contain any commas, it could just look at the city name in the subquery; so it really has to (and does, based on your explain) iterate and build every single possible concatenated string in order to apply that final filter. If you return cities.name back to the upper query it will dramatically improve the performance, as right now the query seriously can't use any of those indexes you have.
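A sketch of that intermediate step, keeping the shape of the original query (I've also moved the filter from HAVING to WHERE, which is equivalent here since no aggregates are involved):

SELECT name
FROM (
    SELECT cities.name || ', ' || regions.name || ', ' || countries.code AS name,
           cities.name AS cname  -- expose the raw city name for the filter
    FROM cities
    INNER JOIN regions ON regions.id = cities.region_id
    INNER JOIN countries ON countries.id = regions.country_id
) AS t1
WHERE cname ILIKE 'asheville%'
GROUP BY name
ORDER BY name ASC
LIMIT 10;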
Really, you should go all the way here, and just remove the string concatenation inside of the query and return what you really intended: select cities.name as city, regions.name as region, countries.code as country, and modify the sort to order by t1.city, t1.region, t1.country.
Additionally, are you really asking for cities that match 'asheville%', or is this just an indirect way of looking for cities that are exactly 'asheville', where you only have to deal with the comma delineation internally? If it's the latter, then outside, use lower(t1.city) = 'asheville' (note the =: lower(x) ilike 'pattern' is pointlessly slow, as ilike redoes the case-folding that lower already performed).
Also, you need to fix these indexes: what you really want is create index whatever on cities((lower(name))), as that's what you are actually searching by, not name: there is no way you can ever get these indexes to be used if you are searching for something that is unrelated to what you have in the index.

(You might look at the order by name later and be concerned that it won't be accelerated anymore, but that's OK: the goal here is to rapidly filter from the tons of possible locations down to the tiny set of ones that you are going to operate on; what's left over can be rapidly sorted in memory, as you are probably dealing with 10-20 results.)
Due to this, as regions.id and countries.id are probably primary keys, the other indexes can be deleted if they are only there for this query.
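For example (a sketch only, using the index names from the question; be sure nothing else depends on them first):

-- Dropping a unique index also drops the uniqueness guarantee it enforces;
-- note that the edit below relies on cities (name, region_id) being unique,
-- so you would likely keep that one.
drop index index_countries_on_code;
drop index index_countries_on_name;
drop index index_regions_on_code_and_country_id;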
Finally, flatten the query to a single level, remove the group by, and replace it with a distinct. The issue is that we want to make certain that we don't force PostgreSQL to generate the full set before attempting the filter: we want to make certain it has enough knowledge of the goal to be able to use the city index to rapidly scan directly to the cities that can match, and then get around to filling in the region and country information.

(PostgreSQL is generally very, very good at doing this, even through a sub-query, but as we have a group by clause through a having, I can see situations where it will no longer be able to infer.)
(edit) Actually, wait: you have a unique index on cities (name, region_id), so you don't even need the distinct... all it is doing is making the query pointlessly more complex. I just went ahead and removed it from the query: the result will be the same, as you cannot possibly end up with a result where you have the same city in the same region/country being returned twice.
select
cities.name as city,
regions.name as region,
countries.code as country
from cities
join regions on
regions.id = cities.region_id
join countries on
countries.id = regions.country_id
where
lower(cities.name) = 'asheville'
order by
cities.name,
regions.name,
countries.code
limit 10;
create index "cities(lower(name))" on cities ((lower(name)));
(edit) If, by the way, you actually are meaning to do a prefix match, then you will want to change the = 'asheville' back to a like 'asheville%' (note the like: no i), and change the index to be specified as follows:
create index "cities(lower(name))" on cities ((lower(name)) text_pattern_ops);
If you really need it to be as fast as it can be, then my suggestion is to avoid querying the database at all when searching. As the names of cities and countries are more static than not - they don't change that often - I'd suggest doing the joins offline and storing the result in a format that is optimized for the kind of searching you want to do.
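If you want to stay inside PostgreSQL, a middle ground is to precompute the join once, e.g. with a materialized view (available since PostgreSQL 9.3; this is only a sketch, and the view/index names are made up):

create materialized view city_search as
select cities.name as city,
       regions.name as region,
       countries.code as country
from cities
join regions on regions.id = cities.region_id
join countries on countries.id = regions.country_id;

-- text_pattern_ops lets the index serve left-anchored LIKE prefix searches
create index city_search_lower_city
    on city_search ((lower(city)) text_pattern_ops);

-- re-run whenever the underlying data changes
refresh materialized view city_search;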
New Indices (sketched as SQL after this list):
- regions.id
- cities.region_id
- regions.country_id
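A sketch of those as create index statements (the index names here are my own):

create index idx_regions_id on regions (id);
create index idx_cities_region_id on cities (region_id);
create index idx_regions_country_id on regions (country_id);

(If regions.id is already the primary key, as the other answers assume, the first of these is redundant.)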
Less Work
- string concatenation is time consuming; instead of doing it in the query, consider doing it in your script that uses the result
- instead of doing concatenation and then filtering results, you should do your filtering first and then perform the concatenation/functions, as sketched after this list (credit: to what Jay Freeman is saying in comments)
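A sketch of filtering first and concatenating afterwards (the concatenation could equally move into your script, per the first point):

select cities.name || ', ' || regions.name || ', ' || countries.code as name
from cities
join regions on regions.id = cities.region_id
join countries on countries.id = regions.country_id
where lower(cities.name) like 'asheville%'  -- filter on the bare column first
order by name
limit 10;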
Scrap Indices
- None of your current indexes are being used. You should scrap them, since they're just slowing your query down.
I assume you do have indexes on regions.id and countries.id, as they sound as if they were the primary keys.
As far as I can tell, the two inner joins are not using an index because cities.region_id and regions.country_id are not part of an index that can be used here (in the indexes that do contain them, they are listed at the end).

You can either swap the columns in the existing two indexes ((region_id, name) instead of (name, region_id)) or simply create new indexes for only those columns, as sketched below. I would assume the joins will use those indexes then.