SQL: eliminate the selection of redundant records that occur within X minutes
DB in use is Firebird 2.1,in case not familiar with here is select statement sql ref:
http://ibexpert.net/ibe/index.php?n=Doc.DataRetrieval functions ref: http://www.firebirdsql.org/file/documentation/reference_manuals/reference_material/html/langrefupd21.htmli will be happy with any sql slang [i'll convert it].
table schema:
CREATE TABLE EVENT_MASTER (
EVENT_ID BIGINT NOT NULL,
EVENT_TIME BIGINT NOT NULL,
DATA_F1 VARCHAR(40),
DATA_F2 VARCHAR(40),
PRIMARY KEY (EVENT_ID)
);
the bad news is EVENT_TIME stored as seconds elapsed since the Epoch.
data samples:
"EVENT_ID","EVENT_TIME","DATA_F1","DATA_F2"
25327,1297824698,"8604","A"
25328,1297824770,"8604","I"
25329,1297824773,"8604","A"
25330,1297824793,"8604","A"
25331,1297824809,"8604","1"
25332,1297824811,"8604","GREY"
25333,1297824812,"8604","A"
25334,1297824825,"8604","GREY"
25335,1297824831,"8604","A"
25336,1297824833,"8604","GREY"
25337,1297824838,"8604","A"
25338,1297824840,"8604","1"
25339,1297824850,"8604","A"
25340,1297824864,"8604","A"
25341,1297824875,"8804","GREY" //notice DATA_F1 is different
25342,1297824876,"8604","G"
25343,1297824877,"8604","A"
25344,1297824880,"8604","GREY"
25345,1297824895,"8604","1"
25346,1297824899,"8604","A"
25347,1297824918,"8604","GREY"
25348,1297824930,"8604","YELLOW"
25349,1297824939,"8604","GREY"
25350,1297824940,"8604",""
25351,1297824944,"8604","A"
25352,1297824945,"8604","1"
25353,1297824954,"8604","B"
25354,1297824958,"8604",""
25355,1297824964,"8604","1"
25356,1297824966,"8604","GREY"
25357,1297824974,"8604","1"
25358,1297824981,"8604","GREY"
25359,1297824983,"8604",""
25360,1297824998,"8604","GREY"
25361,1297825003,"8604","2"
25362,1297825009,"8604","G"
25363,1297825018,"8604","GREY"
25364,1297825026,"8604","F"
25365,1297825045,"8604","GREY"
25366,1297825046,"8604","1"
expected output:
distinct "DATA_F1","DATA_F2" rows within X minutes according to EVENT_TIME: like:25341,1297824875,"8804","GREY"
25327,1297824698,"8604","A"
25328,1297824770,"8604","I"
25332,1297824811,"8604","GREY"
25348,1297824930,"8604","YELLOW"
..etc
Requirements: eliminate the selection of redundant records that occur within 5 minutes [range calcs according to EVENT_TIME column].
last i'm trying follow this pattern:
SELECT * FROM EVENT_MASTER inner join (
SELECT distinct DATA_F1, DATA_F2 FROM EVENT_MASTER where /*the hard stuff that i need help with: (EVENT_TIME difference within X minutes)*/
) as RemovedDup ON /*EVENT_MASTER.EVENT_ID = problem is i cant select RemovedDup ID otherwise distinct becomes useless!!*/
please help ASAP.
thanks,
EDIT
adding output based on Andrei K. answer:
25331,1297824809,"8604","1"
25327,1297824698,"8604","A"
25342,1297824876,"8604","G"
25332,1297824811,"8604","GREY"
25328,1297824770,"8604","I"
25341,1297824875,"8804","GREY"
25350,1297824940,"8604",""
25352,1297824945,"8604","1" /*bug: time still within 300 seconds, this same as first record*/
25361,1297825003,"8604","2"
25351,1297824944,"8604","A"
25353,1297824954,"8604","B"
25364,1297825026,"8604","F"
25362,1297825009,"8604","G"
25347,1297824918,"8604","GREY"
25372,1297825087,"8604","ORANGE"
25348,1297824930,"8604","YELLOW"
25382,1297825216,"8604","1"
25387,1297825270,"8604","B"
25394,1297825355,"8604","BLUE"
25381,1297825211,"8604","GREY"
EDIT 2: Russell query output: not good output and its very very slow.
1297824698,"8604","A"
1297824770,"8604","I"
1297824809,"8604","1"
1297824811,"8604","GREY"
1297824825,"8604","GREY"
1297824840,"8604","1"
1297824875,"8804","GREY"
1297824876,"8604","G"
1297824880,"8604","GREY"
1297824918,"8604","GREY"
1297824930,"8604","YELLOW"
1297824939,"8604","GREY"
1297824940,"8604",""
1297824945,"8604","1"
1297824954,"8604","B"
1297824964,"8604","1"
1297824998,"8604","GREY"
1297825003,"8604","2"
1297825018,"8604","GREY"
1297825026,"8604","F"
1297825045,"8604","GREY"
1297825046,"8604","1"
1297825063,"8604","1"
1297825079,"8604","GREY"
1297825087,"8604","ORANGE"
1297825094,"8604","GREY"
1297825100,"8604","1"
1297825133,"8604","GREY"
1297825176,"8604","GREY"
1297825216,"8604","1"
EDIT 3:
based on Russell request here is: all rows WHERE DATA_F1 = '8604' AND DATA_F2 = 'GREY'
25332,1297824811,"8604","GREY"
25334,1297824825,"8604","GREY"
25336,1297824833,"8604","GREY"
25344,1297824880,"8604","GREY"
25347,1297824918,"8604","GREY"
25349,1297824939,"8604","GREY"
25356,1297824966,"8604","GREY"
25358,1297824981,"8604","GREY"
25360,1297824998,"8604","GREY"
25363,1297825018,"8604","GREY"
25365,1297825045,"8604","GREY"
25367,1297825059,"8604","GREY"
25371,1297825079,"8604","GREY"
25373,1297825094,开发者_如何转开发"8604","GREY"
25376,1297825116,"8604","GREY"
25378,1297825133,"8604","GREY"
25380,1297825176,"8604","GREY"
25381,1297825211,"8604","GREY"
25384,1297825234,"8604","GREY"
25389,1297825286,"8604","GREY"
25390,1297825314,"8604","GREY"
25391,1297825323,"8604","GREY"
25393,1297825343,"8604","GREY"
25396,1297825370,"8604","GREY"
25397,1297825387,"8604","GREY"
25399,1297825416,"8604","GREY"
25401,1297825436,"8604","GREY"
25402,1297825445,"8604","GREY"
25404,1297825454,"8604","GREY"
50282,1299137344,"8604","GREY"
380151,1309849420,"8604","GREY"
As of this moment [Oct 11,2011 5:00 AM GMT] no absolute correct answer posted, and Andrei K. answer still the best try among others. so sql experts please help me find the solution otherwise i'll start to think that sql is not capable of handling the question requirements! Is it??
remark: event_time is not unique so multiple events can occur at the same second.
If by redundant rows you mean rows registered within 5 minutes and having the same data_f1, data_f2 then try something like this:
SELECT
e2.event_id,
e2.event_time,
e2.data_f1,
e2.data_f2
FROM
(SELECT trunc(event_time / 300), data_f1, data_f2, min(event_id) as e_id
FROM event_master
GROUP BY 1, 2, 3) e1
JOIN
event_master e2 ON e1.e_id = e2.event_id
U can try this :::
SELECT * FROM EVENT_MASTER group by (DATAF1, DATAF2) where
event_time >(SELECT TIME_TO_SEC(now())-300)
Hope this will help you..
I am not familiar with Firebird but I am using the documentation so if that is correct then this should work.
SELECT DISTINCT MIN(A.EVENT_TIME) as MINEVENT_TIME, B.DATA_F1, B.DATA_F2
FROM EVENT_MASTER as A
JOIN EVENT_MASTER as B ON A.EVENT_TIME BETWEEN B.EVENT_TIME-299 AND B.EVENT_TIME
AND B.DATA_F1 = A.DATA_F1 AND B.DATA_F2 = A.DATA_F2
GROUP BY B.DATA_F1, B.DATA_F2, B.EVENT_TIME
This is syntax checked but untested.
This assumes all records have different values in event_time (or they will exclude each other).
SELECT
*
FROM
event_master AS data
WHERE
NOT EXISTS (
SELECT * FROM event_master
WHERE event_time > data.event_time - 300
AND event_time <= data.event_time
)
If mutliple events CAN happen with the same value in event_time
, can we assume that an event with a higher event_id
CAN NOT happen before an event with a lower event_id
? If so, you can modify the above as follows
SELECT
*
FROM
event_master AS data
WHERE
NOT EXISTS (
SELECT * FROM event_master
WHERE event_time > data.event_time - 300
AND event_time <= data.event_time
AND event_id < data.event_id
)
In the cases of multiple events happening at the same time, the one with the lowest event_id will be chosen.
In terms of performance, ensure that the data has an index where event_time
is the first indexed field.
As I understand this, you want to get distinct values for DATA_F1 and DATA_F2 but only for a 'Window' of 5 minutes; after that the values may occur again, right? (Sorry if I missunderstood the question, it's been a long day...) I don't know much about Firebird, but here's how you would do that in MS SQL server:
SELECT a.EVENT_ID, a.DATA_F1, a.DATA_F2, a.EVENT_TIME FROM
EVENT_MASTER AS a LEFT JOIN EVENT_MASTER AS b
ON a.DATA_F1=b.DATA_F1 AND
a.DATA_F2=b.DATA_F2 AND
a.EVENT_TIME<b.EVENT_TIME AND
b.EVENT_TIME-a.EVENT_TIME<=5*60
WHERE
b.EVENT_ID IS NULL
Also, while you test this, also try the modified version below: Hope this helps!
SELECT a.EVENT_ID, a.DATA_F1, a.DATA_F2, a.EVENT_TIME FROM
EVENT_MASTER AS a LEFT JOIN EVENT_MASTER AS b
ON a.DATA_F1=b.DATA_F1 AND
a.DATA_F2=b.DATA_F2 AND
a.EVENT_ID<b.EVENT_ID AND
a.EVENT_TIME<=b.EVENT_TIME AND
b.EVENT_TIME-a.EVENT_TIME<=5*60
WHERE
b.EVENT_ID IS NULL
ADDED: Ok, It seems like we have the correct results. Here's my suggestion to optimize this baby (as I've seen that Firebird support the EXISTS keyword, I've rewritten the query below):
SELECT a.EVENT_ID, a.DATA_F1, a.DATA_F2, a.EVENT_TIME FROM EVENT_MASTER AS a
WHERE NOT EXISTS (SELECT * FROM EVENT_MASTER AS b
WHERE a.DATA_F1=b.DATA_F1 AND
a.DATA_F2=b.DATA_F2 AND
a.EVENT_ID<b.EVENT_ID AND
a.EVENT_TIME<=b.EVENT_TIME AND
b.EVENT_TIME-a.EVENT_TIME<=5*60)
Also, please add the following index:
CREATE INDEX IX_SPEED ON EVENT_MASTER (EVENT_ID DESC, EVENT_TIME ASC, DATA_F1 ASC, DATA_F2 ASC)
Hope this helps!
You'd need a very nasty recursive query to accomplish this in purely "functional" way. I don't profess to be clever enough to construct such a query, let alone make it performant.
On the other hand, allowing side-effects (i.e. temporary table) significantly simplifies things. You should even be able to make it fairly quick by adding appropriate indexes on the temporary table (not shown here). Here is the actual SQL:
CREATE GLOBAL TEMPORARY TABLE EVENT_MASTER_TMP (
EVENT_ID BIGINT NOT NULL,
EVENT_TIME BIGINT NOT NULL,
DATA_F1 VARCHAR(40),
DATA_F2 VARCHAR(40),
PRIMARY KEY (EVENT_ID)
);
INSERT INTO EVENT_MASTER_TMP
SELECT * FROM
(SELECT * FROM EVENT_MASTER ORDER BY EVENT_TIME) E
WHERE
NOT EXISTS (
SELECT *
FROM EVENT_MASTER_TMP T
WHERE
E.DATA_F1 = T.DATA_F1
AND E.DATA_F2 = T.DATA_F2
AND E.EVENT_TIME - T.EVENT_TIME <= 5*60
);
SELECT * FROM EVENT_MASTER_TMP;
In plain English:
- Go through events from older to newer,
- for each event, check if it is redundant relative to some row that is already in the temporary table
- and if not, insert it in the temporary table, so it can be used as a criteria for the remaining events.
Executing this on your test data yields:
25327 1297824698 8604 A
25328 1297824770 8604 I
25331 1297824809 8604 1
25332 1297824811 8604 GREY
25341 1297824875 8804 GREY
25342 1297824876 8604 G
25348 1297824930 8604 YELLOW
25350 1297824940 8604
25353 1297824954 8604 B
25361 1297825003 8604 2
25364 1297825026 8604 F
Lowering the time threshold from 5*60
to, say, 233
, yield this:
25327 1297824698 8604 A
25328 1297824770 8604 I
25331 1297824809 8604 1
25332 1297824811 8604 GREY
25341 1297824875 8804 GREY
25342 1297824876 8604 G
25348 1297824930 8604 YELLOW
25350 1297824940 8604
25351 1297824944 8604 A <-- 246s difference
25353 1297824954 8604 B
25361 1297825003 8604 2
25364 1297825026 8604 F
25365 1297825045 8604 GREY <-- 234s difference
25366 1297825046 8604 1 <-- 237s difference
try:
SELECT T1.* FROM EVENT_MASTER T1 WHERE EXISTS (
SELECT * FROM EVENT_MASTER T2
WHERE T2.DATA_F1=T1.DATA_F1
AND T2.DATA_F2=T1.DATA_F2
AND (T2.EVENT_TIME-T1.EVENT_TIME)<300
)
精彩评论