Beginner to sqlite trying to improve performance through database design, 'join' and 'where'
I have a large table (potentially millions of rows) that in its naive form would contain lots of repetition, e.g.:
CREATE TABLE sales(
id INTEGER PRIMARY KEY AUTOINCREMENT,
salesperson TEXT,
customer TEXT
);
INSERT INTO sales VALUES(NULL, "Rod", "Acme");
INSERT INTO sales VALUES(NULL, "Jane", "Xyz Corp");
INSERT INTO sales VALUES(NULL, "Freddy", "Acme");
<... many more lines containing significant repetitions of each salesperson and each customer >
I think I know enough that this is not the ideal approach when I want to perform queries such as:
SELECT count(*) FROM sales WHERE salesperson="Jane";
So my database should be rearranged as:
CREATE TABLE sales(
id INTEGER PRIMARY KEY AUTOINCREMENT,
salesperson INTEGER,
customer INTEGER
);
CREATE TABLE salespeople(
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT
);
CREATE TABLE customers(
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT
);
In reality there are many more columns in the single-table solution, and therefore more distinct tables in the normalized (if that's even the right use of the term) version.
I've timed tests of a dumb query against the single-table v.s. a JOIN with a WHERE aga开发者_如何学Pythoninst the split-tables version and the "smart" version is about 25% slower, so clearly I'm missing something:
SELECT count(*) FROM sales INNER JOIN salespeople ON sales.salesperson=salespeople.id WHERE salespeople.name="Freddy";
If I lookup the salesperson.id first as a separate query, I see times ~33% faster than the dumb implementation (which is less than I was hoping for).
SELECT id FROM salespeople WHERE name="Freddy";
...then
SELECT count(*) FROM sales WHERE salesperson=previouslyLookedUpId;
I get the impression that the WHERE (in the multi-table solution) is being evaluated for every row, rather than once to determine the relevant salesperson id, as intended. Clearly my queries (or constraints, or indices?) aren't allowing the db engine to work efficiently.
I only know enough to know that I don't know enough... What is the right approach here?
First, see my comment about indexes.
Second, the goal of "normalizing" a database design is not performance. Often it is the case that a normalized design will perform more slowly than a denormalized or "flattened" table. Joins can be expensive in terms of resource consumption.
The goal of normalizing is the proper representation of the entities and their relationships; if your domain is accurately mapped to a normalized design it is impossible to get different answers to the same question. That's the goal.
Also, a normalized design is extensible. That is another major virtue and goal.
Since you use the word "smart". A normalized design is "simple-minded", not smart, though it takes smarts to attain simplicity.
P.S.
What kind of performance do you get, relative to a standard join, with this approach:
from sales
where salespersonid = (select id from SALESPEOPLE where name = 'Joe')
when SALESPEOPLE.name and salespersonid are indexed?
or this:
... from sales
inner join
(select id from SALESPEOPLE where name = 'Joe') as MYPEEPS
on sales.salespersonid = MYPEEPS.id
精彩评论