SQL query help with GROUPing and nested queries

2023-01-28 08:44 问答作者：

Given are two tables,

Table A containing customerid, lastchange, internallink
Table B containing internallink, turnover

(I'm simplyfying here and changing it to a generic example, the actual structure is more complex. SQL dialect for now is mySQL.)

The only unique thing (per table) is the internallink. There are several records in A with the same customerID, different dates in lastchange, and different internallink values. There are other items linked with this; I cannot change the tables.

I need the recordIDs from A which are the most recent for a customer (highest lastchange value of all with the same customerID) and to which the entries in B which match a certain value condition are linked.

I think the

SELECT `internallink` FROM `B` WHERE (`turnover` > 10000)

part is not the issue.

I got this far:

SELECT `customerID`, MAX(`lastchange`)
  FROM `A` 
 WHERE `lastchange` IN (SELECT `internallink` FROM `B` 
                         WHERE `turnover` > 10000)
 GROUP BY `customerID`;

Alas, that statement gives wrong results, because above will return me customerIDs for which the most rec开发者_JAVA技巧ent value does not fulfill the criteria, but some older did - it selects the oldest that did, and returns this. But in case the most recent entry is below threshold, the customerID should not turn up at all.

Where did I go wrong, and what is the correct approach to this?

Sample Data Table A

customerid   lastchange   internallink
         3   2010-02-11   11
         3   2010-09-04   12
         3   2010-10-22   13
         3   2010-11-23   14
         4   2010-05-05   15
         4   2010-12-01   16
         5   2010-11-28   17
         5   2010-11-29   18

Table B

internallink  turnover
          11     47000
          12     11000
          13      8000
          14     15000
          15     17000
          16     23000
          17     50000
          18     10000

The actual threshold in my tests is 12000. You can see customerID should not be in the result set, since the most recent entry is below the threshold.

The result set should be (3,2010-11-23)(4,2010-12-01) - but currently it also contains (5,2010-11-28), which is wrong.

Getting a bit closer (with your help, thanks!), these two statements both work:

SELECT customerID, MAX(lastchange), internallink FROM A GROUP BY customerID; SELECT internallink FROM B WHERE (turnover > 12000);

Now all I need is the intersection of both... with the correct logic!

The following query should do what you want. It is not the most performant way to write this kind of query. But it's using standard SQL and it executes in any database.

The works like this: The inner subquery finds all customerids along with the latest changedate. For each such pair (customerid, lastchange), we find the original row in table A. Having found a row in table A, we use the internallink to find a matching record in B, but only if the associated turnover is greater than 10000.

drop table a;
drop table b;

create table a(
   customerid   int  not null
  ,lastchange   date not null
  ,internallink int  not null
);

create table b(
   internallink int not null
  ,turnover     int not null
);

insert into a values(3, date '2010-02-11', 11);
insert into a values(3, date '2010-09-04', 12);
insert into a values(3, date '2010-10-22', 13);
insert into a values(3, date '2010-11-23', 14);
insert into a values(4, date '2010-05-05', 15);
insert into a values(4, date '2010-12-01', 16);
insert into a values(5, date '2010-11-28', 17);
insert into a values(5, date '2010-11-29', 18);

insert into b values(11, 47000);
insert into b values(12, 11000);
insert into b values(13,  8000);
insert into b values(14, 15000);
insert into b values(15, 17000);
insert into b values(16, 23000);
insert into b values(17, 50000);
insert into b values(18, 10000);

select a.customerid
      ,a.lastchange
      ,a.internallink
      ,b.turnover
  from a
  join b on (a.internallink = b.internallink)
 where b.turnover > 10000
   and (a.customerid, a.lastchange) in(select customerid,max(lastchange)
                                         from a
                                     group by customerid);

This works in sql server - I'm not sure if mySql has a similar ranking functions.

select a.id, a.lastchange, b.turnover, a.rownumber from B b inner join 
(SELECT id, lastchange, internallink, ROW_NUMBER() OVER(PARTITION BY id ORDER BY lastchange DESC) AS 'rownumber'
FROM A) a on b.internallink = a.internallink
where a.rownumber = 1 and b.turnover > 5000

The "ROW_NUMBER() OVER(PARTITION BY id ORDER BY lastchange DESC) AS 'rownumber'" means...

I want to group all the same ids together and order them by lastchange by desc after that count each row. Oh and name that column rownumber.

id  lastchange    internallink  rownumber
1   2010-01-03    2           1
1   2010-01-02    1           2
1   2010-01-01    1           3
2   2010-01-04    2           1

Selecting any record with a rownumber of 1 will return the last modified record of an id.

After lots of testing and some research, I found this solution, and I post this in case anybody else should face a similar problem.

An additional table "cache" keeps copies of the most recent entries in table A, reducing the complexity a lot. It is kept current by using triggers like this:

CREATE TRIGGER sync_a_insert AFTER INSERT ON a FOR EACH ROW 
    INSERT INTO cache (`customerID`, `internallink`) VALUES (NEW.`customerID`,NEW.`internallink`);
CREATE TRIGGER sync_a_update AFTER UPDATE ON a FOR EACH ROW 
    UPDATE cache SET `internallink` = NEW.`internallink` WHERE (`customerID` = NEW.`customerID`);
CREATE TRIGGER sync_a_delete BEFORE DELETE ON a FOR EACH ROW 
    DELETE FROM cache WHERE `customerID` = OLD.`customerID`;

For INSERT and UPDATE, those triggers go off after the fact, so the entries in Table a are complete before the cache updates. For DELETE, the cache needs to be updated before the original entry vanishes.

Once this is in place, everything else becomes simple:

SELECT `customerID` FROM cache WHERE `internallink` IN 
    (SELECT `internallink` FROM b WHERE (`turnover` > 10000));

For me, this is a viable solution, and it even speeds up the lookups. Of course there is a cost in DB size, but I think the performance is much better overall - as long as there is at least one more read access than write access, there is an improvement.

The answers you gave were, however, very helpful to me. I learned a bunch of things from them and from trying to follow your advice (even putting some of it to use in other places already). Thanks to all who replied to my question!

继续阅读：group-by grouping nested query-help sql

SQL query help with GROUPing and nested queries

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？