Speeding up SQL query when sorting on foreign keys
This is more of a generic SQL problem but I'm using Firebird 2.5 if anyone knows of a Firebird/Interbase specific optimization. First, the following is a simplified example schema to illustrate the issue I'm trying to solve:
CREATE TABLE users
(
id INTEGER PRIMARY KEY,
name VARCHAR(16)
);
CREATE TABLE data_set
(
id INTEGER PRIMARY KEY,
name VARCHAR(64)
);
CREATE UNIQUE INDEX data_set_name_idx ON data_set(name);
CREATE TABLE data
(
user_id INTEGER,
data_set_id INTEGER,
data BLOB,
PRIMARY KEY(user_id, data_set_id)
);
CREATE INDEX data_user_id_idx ON data(user_id);
CREATE INDEX data_data_set_id_idx ON data(data_set_id);
The query I'm trying to run is as follows:
SELECT users.name, data_set.name, data FROM users, data_set, data
WHERE user_id=XXX AND user_id=users.id AND data_set_id=data_set.id
ORDER BY data_set.name;
With 'XXX' being filled in with the user_id I want. So what I'm doing is selecting all the rows from the data table that are owned by a particular user and I'm sorting the results b开发者_如何学运维ased on the data_set name.
This works as it is but the problem is the data table has over a billion rows in it and the data_set table is not small either. The result set for a single user id may be many hundreds of millions of rows. What happens is that in order for the ORDER BY to work the database has to create a massive amount of temporary data which is incredibly slow and uses a lot of disk space. Without the ORDER BY it's fast but obviously not sorted like I need.
One solution would be to take the data_set.name values and just put them in a varchar column in data. Then that could be indexed and would be quick to sort. The problem with this approach is that it will have a lot of duplicate data and make the database absolutely massive.
Another solution would be something like an Indexed View or an indexed Computed Column. As far as I know neither of those is supported by Firebird.
Any other ideas?
This is rather speculative, but I wonder if this could be restructured as:
- A cartesian product between users and data set, including the predicate on user.
- Order by data set name
- Joining to data
... would be more efficient, especially if you were only interested in the forst rows of the query.
In Oracle I'd think that it would not be because the nested loop join would be much less efficient than a hash join, but I'm not familiar with firebird at all I'm afraid.
Why not index data_set.name?
Also, I would skip primary key definition for facts table (data) and put two separate indexes for foreign keys, to speed up joins. (of course, indexing may have impact on insertions, if you have a large number of records to be inserted).
If you need to ensure unique constraint of the facts table, you could do it from the job transferring data into that table (I don't have details about that :).
Try to define an index on data_set(id, name) and experiment with it - maybe in combination with other suggestions here. You can change your existing UNIQUE index to a UNIQUE CONSTRAINT if your requirements dictate it and Firebird supports unique constraints.
精彩评论