Nested subquery is too slow - outer join equivalent?

2023-02-10 16:09

I'm collecting some basic statistics on our codebase and am trying to write a query against the following schema:

A files table holding all the files (synthetic primary key id, unique path, and a region column which holds who the file belongs to).
A file_stats table holding data for the files on a specific date (primary key is the combination of date and file_id).

CREATE TABLE files (
    id INT PRIMARY KEY,
    path VARCHAR(255) NOT NULL UNIQUE,
    region VARCHAR(4) CHECK (region IN ('NYK', 'LDN', 'CORE', 'TKY'))
);

CREATE TABLE file_stats (
    date DATE NOT NULL,
    file_id INT NOT NULL REFERENCES files,
    num_lines INT NOT NULL,

    CONSTRAINT file_stats__pk PRIMARY KEY(date, file_id)
);

I'm trying to create a query which will return all combinations of dates and regions in the tables and the number of files for that combination.

The simple approach of

SELECT date, region, COUNT(*) FROM file_stats fs, files f WHERE fs.file_id = f.id
GROUP BY date, region

doesn't work, as not all regions are represented on all dates. I've tried

SELECT
    d.date,
    r.region,
    (SELECT COUNT(*) FROM file_stats fs, files f
     WHERE fs.file_id = f.id AND fs.date = d.date AND f.region = r.region
    ) AS num_files
FROM
    (SELECT DISTINCT date FROM file_stats) AS d,
    (SELECT DISTINCT region FROM files) AS r

but the performance is unacceptable because of the nested subquery.

I've tried LEFT OUTER JOINs, but never seem to be able to make them work. The database is SQLite.

Can anyone suggest a better query?

SELECT date, region, COUNT(*) FROM file_stats fs, files f WHERE fs.file_id = f.id
GROUP BY date, region

doesn't work, as not all regions are represented on all dates.

Assuming you mean that the query runs correctly, but you need every date to show up whether or not a region appears on it, then you need two things.

A calendar table.
A left join on the calendar table.

After you have a calendar table, something like this . . .

-- COUNT(f.id) counts only matched files, so a date with no stats shows 0
SELECT c.cal_date, f.region, COUNT(f.id)
FROM calendar c
LEFT JOIN file_stats fs ON (fs.date = c.cal_date)
LEFT JOIN files f ON (f.id = fs.file_id)
GROUP BY c.cal_date, f.region;

I used cal_date above. The name you use depends on your calendar table. This will get you started. You can use a spreadsheet to generate the dates.

CREATE TABLE calendar (cal_date date primary key);
INSERT INTO "calendar" VALUES('2011-01-01');
INSERT INTO "calendar" VALUES('2011-01-02');
INSERT INTO "calendar" VALUES('2011-01-03');
INSERT INTO "calendar" VALUES('2011-01-04');
INSERT INTO "calendar" VALUES('2011-01-05');
INSERT INTO "calendar" VALUES('2011-01-06');
INSERT INTO "calendar" VALUES('2011-01-07');
INSERT INTO "calendar" VALUES('2011-01-08');
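If a spreadsheet is inconvenient, the same calendar rows can probably be generated inside SQLite with a recursive common table expression. The following is only a sketch, run here through Python's sqlite3 module against an in-memory database; the CTE name days and the date range are assumptions chosen to match the inserts above, and it needs an SQLite version that supports WITH RECURSIVE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calendar (cal_date DATE PRIMARY KEY)")

# Recursive CTE: start at the first day and keep adding one day
# until the (assumed) last day, 2011-01-08, has been produced.
conn.execute(
    """
    WITH RECURSIVE days(d) AS (
        SELECT '2011-01-01'
        UNION ALL
        SELECT date(d, '+1 day') FROM days WHERE d < '2011-01-08'
    )
    INSERT INTO calendar (cal_date)
    SELECT d FROM days
    """
)
conn.commit()

dates = [row[0] for row in conn.execute(
    "SELECT cal_date FROM calendar ORDER BY cal_date")]
print(dates[0], dates[-1], len(dates))  # 2011-01-01 2011-01-08 8
```

Changing the two date literals is enough to cover a different range.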

If you're certain that all the dates are in file_stats, you can do without a calendar table. But there are some cautions.

select fs.date, f.region, count(*)
from file_stats fs
left join files f on (f.id = fs.file_id)
group by fs.date, f.region;

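This will work if your data is right, but your tables don't guarantee the data will be right. file_stats.file_id is declared with REFERENCES files, but SQLite only enforces foreign keys when PRAGMA foreign_keys = ON is set (it is off by default), and nothing requires every file to have a row in file_stats. So there might be file id numbers in each table that don't have matching id numbers in the other table. Let's have some sample data.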

insert into files values (1, 'a long path', 'NYK');
insert into files values (2, 'another long path', 'NYK');
insert into files values (3, 'a shorter long path', 'LDN'); -- not in file_stats

insert into file_stats values ('2011-01-01', 1, 35);
insert into file_stats values ('2011-01-02', 1, 37);
insert into file_stats values ('2011-01-01', 2, 40);
insert into file_stats values ('2011-01-01', 4, 35); -- not in files

Running this query (same as immediately above, but add ORDER BY) . . .

select fs.date, f.region, count(*)
from file_stats fs
left join files f on (f.id = fs.file_id)
group by fs.date, f.region
order by fs.date, f.region;

. . . returns

2011-01-01||1
2011-01-01|NYK|2
2011-01-02|NYK|1

'LDN' doesn't show, because there's no row in file_stats with file id number 3. One row has a null region, because no row in files has file id number 4.

You can quickly find mismatched rows with a left join.

select f.id, fs.file_id 
from files f
left join file_stats fs on (fs.file_id = f.id)
where fs.file_id is null;

returns

3|

meaning that there's a row in files that has id 3, but no row in file_stats that has file_id 3. Flip the tables around to find the rows in file_stats that have no matching row in files.

select fs.file_id, f.id
from file_stats fs 
left join files f  on (fs.file_id = f.id)
where f.id is null;
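Once the mismatched rows are cleaned up, one way to keep new orphans out of file_stats is to let SQLite enforce the REFERENCES files clause that the question's schema already declares. SQLite leaves foreign key enforcement off unless PRAGMA foreign_keys = ON is issued on each connection. A minimal sketch using Python's sqlite3 module, assuming an SQLite build with foreign key support (the variable name rejected is just for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforcement is per connection and off by default, so switch it on first.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE files (
        id INT PRIMARY KEY,
        path VARCHAR(255) NOT NULL UNIQUE,
        region VARCHAR(4) CHECK (region IN ('NYK', 'LDN', 'CORE', 'TKY'))
    );
    CREATE TABLE file_stats (
        date DATE NOT NULL,
        file_id INT NOT NULL REFERENCES files,
        num_lines INT NOT NULL,
        CONSTRAINT file_stats__pk PRIMARY KEY (date, file_id)
    );
    INSERT INTO files VALUES (1, 'a long path', 'NYK');
    """
)

# File id 4 does not exist in files, so this insert should be refused.
try:
    conn.execute("INSERT INTO file_stats VALUES ('2011-01-01', 4, 35)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

This only guards one direction: a file with no rows in file_stats (like id 3 above) is still allowed.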

One way of doing what you want (slower, because the second half of the query adds its own cost) is a UNION of the pairs that have a count with a manufactured list of the pairs that have a zero count:

-- Include the counts for date/region pairs that HAVE files
SELECT date, region, COUNT(*) as COUNT1
FROM file_stats fs, files f 
WHERE fs.file_id = f.id
GROUP BY date, region

UNION

SELECT DISTINCT date, region, 0 as COUNT1
FROM file_stats fs0, files f0
WHERE NOT EXISTS (
    SELECT 1
    FROM   file_stats fs, files f 
    WHERE  fs.file_id = f.id
    AND    fs.date=fs0.date
    AND    f.region=f0.region
)

I'm not entirely sure why you're opposed to the use of temp tables. For example, in SQLite syntax (the table size should be minimal, just number of days * number of regions):

CREATE TEMPORARY TABLE combinations (date DATE, region VARCHAR(4));

INSERT INTO combinations (date, region)
SELECT DISTINCT fs.date, f.region FROM file_stats fs CROSS JOIN files f;

SELECT c.date, c.region, SUM(CASE WHEN fs.file_id IS NULL THEN 0 ELSE 1 END)
FROM combinations c
LEFT JOIN files f ON f.region = c.region
LEFT OUTER JOIN file_stats fs ON fs.date = c.date AND fs.file_id = f.id
GROUP BY c.date, c.region;

I suspect that the correlated subquery has to scan file_stats and files for every single row of the output. The following version might be substantially faster, and it won't require creating new tables.

SELECT d.date
  , r.region
  , COUNT(f.id) AS num_files
FROM (SELECT DISTINCT date FROM file_stats) AS d
  CROSS JOIN (SELECT DISTINCT region FROM files) AS r
  LEFT JOIN file_stats AS fs
    ON fs.date = d.date
  LEFT JOIN files AS f
    ON f.id = fs.file_id
      AND f.region = r.region
GROUP BY d.date, r.region;
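To check that a join-based query of this shape returns the same rows as the correlated subquery from the question, including the zero counts, here is a small sketch using Python's sqlite3 module and an in-memory database. The sample rows are borrowed from the earlier answer (without the orphan file id 4), and the column and table names are the ones from the question's schema; it says nothing about speed on real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (
        id INT PRIMARY KEY,
        path VARCHAR(255) NOT NULL UNIQUE,
        region VARCHAR(4) CHECK (region IN ('NYK', 'LDN', 'CORE', 'TKY'))
    );
    CREATE TABLE file_stats (
        date DATE NOT NULL,
        file_id INT NOT NULL REFERENCES files,
        num_lines INT NOT NULL,
        CONSTRAINT file_stats__pk PRIMARY KEY (date, file_id)
    );
    INSERT INTO files VALUES (1, 'a long path', 'NYK');
    INSERT INTO files VALUES (2, 'another long path', 'NYK');
    INSERT INTO files VALUES (3, 'a shorter long path', 'LDN');
    INSERT INTO file_stats VALUES ('2011-01-01', 1, 35);
    INSERT INTO file_stats VALUES ('2011-01-02', 1, 37);
    INSERT INTO file_stats VALUES ('2011-01-01', 2, 40);
    """
)

# Join version: every date x region pair, then outer joins so that
# pairs without files survive; COUNT(f.id) ignores the NULL rows.
join_sql = """
    SELECT d.date, r.region, COUNT(f.id) AS num_files
    FROM (SELECT DISTINCT date FROM file_stats) AS d
    CROSS JOIN (SELECT DISTINCT region FROM files) AS r
    LEFT JOIN file_stats AS fs ON fs.date = d.date
    LEFT JOIN files AS f ON f.id = fs.file_id AND f.region = r.region
    GROUP BY d.date, r.region
    ORDER BY d.date, r.region
"""

# Correlated subquery version from the question, for comparison.
subquery_sql = """
    SELECT d.date, r.region,
        (SELECT COUNT(*) FROM file_stats fs, files f
         WHERE fs.file_id = f.id AND fs.date = d.date
           AND f.region = r.region) AS num_files
    FROM (SELECT DISTINCT date FROM file_stats) AS d,
         (SELECT DISTINCT region FROM files) AS r
    ORDER BY d.date, r.region
"""

rows = conn.execute(join_sql).fetchall()
rows_subquery = conn.execute(subquery_sql).fetchall()

for row in rows:
    print(row)
# ('2011-01-01', 'LDN', 0)
# ('2011-01-01', 'NYK', 2)
# ('2011-01-02', 'LDN', 0)
# ('2011-01-02', 'NYK', 1)
print(rows == rows_subquery)  # True
```

Prefixing either statement with EXPLAIN QUERY PLAN on the real database should show whether the join version actually avoids the repeated scans.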
