开发者

SQL to find the first occurrence of sets of data in a table

Say if I have a table:

CREATE TABLE T
(
    TableDTM  TIMESTAMP  NOT NULL,
    Code      INT        NOT NULL
);

And I insert some rows:

INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:00:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:10:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:20:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:30:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:40:00', 0);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:50:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:00:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:10:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:20:00', 0);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:30:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:40:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:50:00', 3);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:00:00', 3);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:10:00', 3);

So I end up with a table similar to:

2011-01-13 10:00:00开发者_StackOverflow中文版, 5
2011-01-13 10:10:00, 5
2011-01-13 10:20:00, 5
2011-01-13 10:30:00, 5
2011-01-13 10:40:00, 0
2011-01-13 10:50:00, 1
2011-01-13 11:00:00, 1
2011-01-13 11:10:00, 1
2011-01-13 11:20:00, 0
2011-01-13 11:30:00, 5
2011-01-13 11:40:00, 5
2011-01-13 11:50:00, 3
2011-01-13 12:00:00, 3
2011-01-13 12:10:00, 3

How can I select the first date of each set of identical numbers, so I end up with this:

2011-01-13 10:00:00, 5
2011-01-13 10:40:00, 0
2011-01-13 10:50:00, 1
2011-01-13 11:20:00, 0
2011-01-13 11:30:00, 5
2011-01-13 11:50:00, 3

I've been messing about with sub queries and the like for most of the day and for some reason I can't seem to crack it. I'm sure there's a simple way somewhere!

I would probably want to exclude the 0's from the results, but that's not important for now..


Revised 15 Jan 11

I'm sure there's a simple way somewhere!

Yes, there is. But first, two Issues.

  1. The table is not a Relational Database table. It does not have an unique key, which is demanded by the RM and Normalisation (specifically that each row must have an unique identifier; not necessarily a PK). Therefore SQL, a standard language, for operating on Relational Database tables, cannot perform basic operations on it.

    • it is a Heap (data structure, inserted and deleted in chronological order), with records not rows.
    • any and all operations using SQL will be horribly slow, and will not be correct
    • SET ROWCOUNT to 1, perform row processing, and SQL will work on the Heap just fine
    • your best bet is use any unix utiliy to operate on it (awk, cut, chop). They are blindingly fast. The awk script required to answer your requirement would take 3 mins to write and it will run in seconds for millions of records (I wrote a few last week).
      .

    So the question really is SQL to find the first occurrence of sets of data in a non-relational Heap.

    Now if your question was SQL to find the first occurrence of sets of data in a Relational table, implying of course some unique row identifier, that would be (a) easy in SQL, and (b) fast in any flavour of SQL ...

    • except Oracle, which is known to handle subqueries badly (specifically Tony Andrews' comments, he is a well-known authority on Oracle). In which case, use Materialised Views.
      .
  2. The question is very generic (no complaint). But many of these specific needs are usually applied within a larger context, and the context has requirements which are absent from the specification here. Generally the need is for a simple Subquery (but in Oracle use a Materialised View to avoid the subquery). And the subquery, too, depends on the outer context, the outer query. Therefore the answer to the small generic question will not contain the answer to the actual specific need.


Anyway, I do not wish to avoid the question. Why don't we use a real world example, rather than a simple generic one; and find the first or last occurrence, or minimum or maximum value, of a set of data, within another set of data, in a Relational table ?

Main Query

Let's use the ▶Data Model◀ from your previous question.

Report all Alerts since a certain date, with the peak Value for the duration, that are not Acknowledged

Since you will be using exactly the same technique (with different table and column names) for all your temporal and History requirements, you need to fully understand the basic construct of a Subquery, and its different applications.

Introduction

Note that you have, not only a pure 5NF Database, with Relational Identifiers (composite keys), you have full Temporal capability throughout, and the temporal requirement is rendered without breaking 5NF (No Update Anomalies), which means the ValidToDateTime for periods and durations is derived, and not duplicated in data. Point is, that complicates things, hence this is not the best example for a tutorial on Subqueries.

  • Remember the SQL engine is a set-processor, so we approach the problem with a set-oriented mindset
    • do not dumb the engine down to row-processing; that is very slow
    • and more important, unnecessary
  • Subqueries are normal SQL. The syntax I am using is straight ISO/IEC/ANSI SQL.
    • if you cannot code subqueries in SQL, you will be very limited; and then need to introduce data duplication or use large result sets as Materialised Views or temporary tables or all manner of additional data and additional processing, which will be s.l.o.w to v.e.r.y s.l.o.w, not to mention completely unnecessary
    • if there is anything you cannot do in a truly Relational Database (and my Data Models always are) without switching to row-processing or inline views or temp tables, ask for help, which is what you have done here.
  • You need to fully understand the first Subquery (simpler) before attempting to understand the second; etc.

Method

First build the Outer query using minimum joins, etc, based on the structure of the result set that you need, and nothing more. It is very important that the structure of the outer query is resolved first; otherwise you will go back and forth trying to make the subquery fit the outer query, and vice versa.

  • That happens to require a Subquery as well. So leave that part out for now, and pick that up later. For now, the Outer query gets all (not un-acknowledged) Alerts after a certain date

The ▶SQL code◀ required is on page 1 (sorry, the SO edit features are horrible, it destroys the formatting, and the code is already formatted).

Then build the Subquery to fill each cell.

Subquery (1) Derive Alert.Value

That is a simple derived datum, select the Value from the Reading that generated the Alert. The tables are related, the cardinality is 1::1, so it is a straight join on the PK.

  • The type of Subquery required here is a Correlated Subquery, we need to correlate a table in the Outer query to a table in the (inner) Subquery.
    • in order to do that, we need an Alias for the table in the Outer query, to correlate it to a table in the Subquery.
    • to make the distinction, I have used aliases only for such required correlation, and fully qualified names for plain joins
  • Subqueries are very fast in any engine (except Oracle)
  • SQL is a cumbersome language. But that's all we have. So get used to it.

The ▶SQL code◀ required is on page 2.

I have purposely given you a mix of joins in the Outer Query vs obtaining data via Subquery, so that you can learn (you could alternately obtain Alert.Value via a join, but that would be even more cumbersome).

The next Subquery we need derives Alert.PeakValue. For that we need to determine the Temporal Duration of the Alert. We have the beginning of the Alert Duration; we need to determine the end of the Duration, which is the next (temporally) Reading.Value that is within range. That requires a Subquery as well, which we better handle first.

  • Work the logic from the inside, outward. Good old BODMAS.

Subquery (2) Derive Alert.EndDtm

A slightly more complex Suquery to select the first Reading.ReadingDtm, that is greater than or equal to the Alert.ReadingDtm, that has a Reading.Value which is less than or equal to its Sensor.UpperLimit.

Handling 5NF Temporal Data

For handling temporal requirements in a 5NF Database (in which EndDateTime is not stored, as is duplicate data), we work on a StartDateTime only, and the EndDateTime is derived: it is the next StartDateTime. This is the Temporal notion of Duration.

  • Technically, it is one millisec (whatever the resolution for the Datatype used) less.
  • However, in order to be reasonable, we can speak of, and report, EndDateTime as simply the Next.StartDateTime, and ignore the one millisecond issue.
  • The code should always use >= This.StartDateTime and < Next.StartDateTime.
    • That eliminates a slew of avoidable bugs
    • Note that these comparison operators, which bracket the Temporal Duration, and should be used in a conventional manner throughout as per above, are quite independent of similar comparison operators related to business logic, eg. Sensor.UpperLimit (ie. watch for it, because both are often located in one WHERE clause, and it is easy to mix them up or get confused).

The ▶SQL code◀ required, along with test data used, is on page 3.

Subquery (3) Derive Alert.PeakValue

Now it is easy. Select the MAX(Value) from Readings between Alert.ReadingDtm and Alert.EndDtm, the duration of the Alert.

The ▶SQL code◀ required is on page 4.

Scalar Subquery

In addition to being Correlated Subqueries, the above are all Scalar Subqueries, as they return a single value; each cell in the grid can be filled with only one value. (Non-Scalar Subqueries, that return multiple values, are quite legal, but not for the above.)

Subquery (4) Acknowledged Alerts

Ok, now that you have a handle on the above Correlated Scalar Subqueries, those that fill cells in a set, a set that is defined by the Outer query, let's look at a Subquery that can be used to constrain the Outer query. We do not really want all Alerts (above), we want Un-Acknowledged Alerts: the Identifiers that exist in Alert, that do not exist in Acknowledgement. That is not filling cells, that is changing the content of the Outer set. Of course, that means changing the WHERE clause.

  • We are not changing the structure of the Outer set, so there is no change to the FROM and existing WHERE clauses.

Simply add a WHERE condition to exclude the set of Acknowledged Alerts. 1::1 cardinality, straight Correlated join.

The ▶SQL code◀ required is on page 5.

The difference is, this is a non-Scalar Subquery, producing a set of rows (one column). We have an entire set of Alerts (the Outer set) matched against an entire set of Acknowledgements.

  • The matching is processed because we have told the engine that the Subquery is Correlated, by using an alias (no need for cumbersome joins to be identified)
  • Use 1, because we are performing an existence check. Visualise it as a column added onto the Alert set defined by the Outer query.
  • Never use * because we do not need the entire set of columns, and that will be slower
  • Likewise, failing to use a correlation, means a WHERE NOT IN () is required, but again, that constructs the defined column set, then compares the two sets. Much slower.

Subquery (5) Actioned Alerts

As an alternative constraint on the Outer query, for un-actioned Alerts, instead of (4), exclude the set of Actioned Alerts. Straight Correlated join.

The ▶SQL code◀ required is on page 5.

This code has been tested on Sybase ASE 15.0.3 using 1000 Alerts and 200 Acknowledgements, of different combinations; and the Readings and Alerts identified in the document. Zero milliseconds execution time (0.003 second resolution) for all executions.

If you need it, here is the ▶SQL Code in Text Format◀.

Response to Comments

(6) ▶Register Alert from Reading◀
This code executes in a loop (provided), selecting new Readings which are out-of-range, and creating Alerts, except where applicable Alerts already exist.

(7) ▶Load Alert From Reading◀
Given that you have a full set of test data for Reading, this code uses a modified form of (6) to load the applicable Alerts.

Common Problem

It is "simple" when you know how. I repeat, writing SQL without the ability to write Subqueries is very limiting; it is essential for handling Relational Databases, which is what SQL was designed for.

  • Half the reason developers implement unnormalised data heaps (massive data duplication) is because they cannot write the subqueries required for Normalised structures
    • it is not that they have "denormalised for performance"; it is that they cannot code for Normalised. I have seen it a hundred times.
    • Case in point here: you have a fully Normalised Relational Database, and the difficulty is coding for it, and you were contemplating duplicating tables for processing purposes.
  • And that is not counting the added complexity of a temporal database; or a 5NF temporal database.
  • Normalisation means Never Duplicate Anything, more recently known as Don't Repeat Yourself
  • Master Suqueries and you will be in the 98th percentile: Normalised, true Relational Databases; zero data duplication; very high performance.

I think you can figure out the remaining queries you have.

Relational Identifier

Note, this example also happens to demonstrate the power of using Relational Identifiers, in that several tables in-between the ones we want do not have to be joined (yes! the truth is Relational Identifiers means less, not more, joins, than Id keys). Simply follow the solid lines.

  • Your temporal requirement demands keys containing DateTime. Imagine trying to code the above with Id PKs, there would be two levels of processing: one for the joins (and there would be far more of them), and another for the data processing.

Label

I try to stay away from colloquial labels ("nested", "inner", etc) because they are not specific, and stick to specific technical terms. For completeness and understanding:

  • a Subquery after the FROM clause, is a Materialised View, a result set derived in one query and then fed into the FROM clause of another query, as a "table".
    • The Oracle types call this Inline View.
    • In most cases, you can write Correlated Subqueries as Materialised Views, but that is massively more I/O and processing (since Oracles handling of subqueries is abyssmal, for Oracle only, Materialised Views are "faster").
      .
  • A Subquery in the WHERE clause is a Predicate Subquery, because it changes the content of the result set (that which it is predicated upon). It can return either a Scalar (one value) or non-Scalar (many values).

    • for Scalars, use WHERE column =, or any scalar operator

    • for non-Scalars, use WHERE [NOT] EXISTS, or WHERE column [NOT] IN

  • A Suquery in the WHERE clause does not need to be Correlated; the following works just fine. Identify all superfluous appendages:

    SELECT  [Never] = FirstName,
            [Acted] = LastName 
        FROM User 
        WHERE UserId NOT IN ( SELECT DISTINCT UserId
            FROM Action
            )


Try this:

SELECT MIN(TableDTM) TableDTM, Code
FROM
(
    SELECT T1.TableDTM, T1.Code, MIN(T2.TableDTM) XTableDTM
    FROM T T1
    LEFT JOIN T T2
    ON T1.TableDTM <= T2.TableDTM
    AND T1.Code <> T2.Code
    GROUP BY T1.TableDTM, T1.Code
) X
GROUP BY XTableDTM, Code
ORDER BY 1;


PostgreSQL supports window functions, have a look at this

[EDIT] Try the following:

SELECT TableDTM, Code FROM
(
    SELECT TableDTM,
           Code,
           LAG(Code, 1, NULL) OVER (ORDER BY TableDTM) AS PrevCode
    FROM   T
)
WHERE PrevCode<>Code OR PrevCode IS NULL;


could you try something like

"SELECT DISTINCT Code, (SELECT MIN(TableDTM) FROM T AS Q WHERE Q.Code = T.Code) As TableDTM FROM T;"

and if you need to exclude the 0, change it in:

 SELECT DISTINCT Code, (SELECT MIN(TableDTM) FROM T AS Q WHERE Q.Code = T.Code) As TableDTM FROM T WHERE Code <> 0;


Maybe I don't understand the question. But I don't see any mention of Common Table Expression or Analytic Functions. These are my weapons of choice for most problems, and when they can't handle it I start resorting to temporary tables.

I think, I recently solve a similar problem where I want to get the data of the first occurrence of an error when processing a daily interface file. Records on the interface that cause a problem are removed to a set of holding table so the rest of the records can be processed.

-- EE with errors removed from most recent batch
with current_batch as (
      select employee_number, PVL.ADDITIONAL_INFORMATION
      from PERSONNEL_VALIDATION_LOG_X PVL
      where PVL.PERSONNEL_BATCH_ID = EMPSRV.CURRENTPERSONNELBATCH(6,900)
)
, hist as (
  select 
    row_number() over (
      partition by X.EMPLOYEE_NUMBER, X.ADDITIONAL_INFORMATION
      order by B.BATCH_STATUS_DATE
    ) as RN,
    B.PERSONNEL_BATCH_ID BatchId,
    B.SUBMITTAL_DATE,
    X.EMPLOYEE_NUMBER EMPNUM,
    MX.LAST_NAME,
    MX.FIRST_NAME,
    X.ADDITIONAL_INFORMATION
  from PERSONNEL_VALIDATION_LOG_X X
  join current_batch C on
    X.Employee_number = C.EMPLOYEE_NUMBER
    and X.additional_information = C.ADDITIONAL_INFORMATION
  join empsrv.personnel_batch B 
    on B.PERSONNEL_BATCH_ID = X.PERSONNEL_BATCH_ID
  join EMPSRV.PERSONNEL_MEMBER_DATA_X MX
    on X.PERSONNEL_BATCH_ID = MX.PERSONNEL_BATCH_ID
      and X.EMPLOYEE_NUMBER = MX.EMPLOYEE_NUMBER
)
select 
  batchId, 
  to_char(submittal_date, 'mm/dd/yyyy') First_Reported,
  EmpNum, 
  Last_name, 
  first_name, 
  additional_information
from hist where rn = 1
order by submittal_date desc;

The first CTE just limits the population to current errors. The hist CTE goes through the logs and picks up the first occurrence of that error (ie. ame EE and messge) This isn't perfect because maybe the error went away and came back, I would get the oldest occurrence and not the start of the most recent sequence. But this is good enough and not likely due to the shape of the error message itself. The finally query just picks off the top row of each group which will be the first occurrence.

The query takes a few seconds to run, but my logs are not especially large, so performance is almost never an issue for me ever. I also don't pay much attent to the dates on the questions.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜