doing a pivot-table-ish JOIN in SQL
My employer has a batch compute cluster that processes jobs submitted by开发者_如何学Go users. Each batch job consists of three steps:
- job started
- job finished
- results reported to the user
The batch job management software logs when each of these steps happen, and the logfile consists of a tuple with an ID code of the employee who submitted the job, what step occurred, and a timestamp of when it occurred. In CSV, it looks like:
ID step timestamp
-- ------ ---------
A start 3533
B start 3538
B finish 3549
C start 3551
A finish 3557
B report 3559
C finish 3602
A report 3603
B start 3611
C report 3623
B finish 3643
B report 3657
And so forth.
One additional characteristic of the dataset is that there is concurrence between employees, but no concurrence within employees; ie, each employee has to wait until their current job has reported before their next job starts. So when I sort by date and limit the results to a single employee, the records always come out in the order "start","finish","report".
I want to create a pivot table that groups each job into a single row. So the above data becomes:
employee-ID started finished reported
----------- ------- -------- --------
A 3533 3557 3603
B 3538 3549 3559
B 3611 3643 3657
C 3551 3602 3623
So, on to the SQL:
SELECT
log.ID AS employee-ID,
start.timestamp AS started,
finish.timestamp AS finished,
report.timestamp AS reported
FROM
log
LEFT OUTER JOIN log start ON
log.ID = start.ID
AND start.step = 'start'
LEFT OUTER JOIN log finish ON
log.ID = finish.ID
AND finish.step = 'finish'
AND start.timestamp < finish.timestamp
LEFT OUTER JOIN log report ON
log.ID = report.ID
AND report.step = 'report'
AND finish.timestamp < report.timestamp
ORDER BY employee-ID,started,finished,reported;
I do need LEFT OUTER JOIN, because I also need to identify the jobs that were started but were not finished or reported.
This works pretty well. It does give me the rows that I need. But it gives me a lot of spurious rows, because the JOINs match finish
and report
entries for future jobs of the same employee in addition to the current job. So the report comes out looking like:
employee-ID started finished reported
----------- ------- -------- --------
A 3533 3557 3603
B 3538 3549 3559
B 3538 3549 3657 <-- spurious
B 3538 3643 3657 <-- spurious
B 3611 3643 3657
C 3551 3602 3623
It's easy to recognize the spurious rows: each job gets started only once, so given the sorting, the correct row is the first row with a unique "started" value. I'm working around the spurious rows problem right now at the application level by just skipping over the spurious rows, but that just seems, well, inelegant. And it's costly: some of these employees have dozens of jobs submitted, so currently, the results of my queries are about 15% legitimate entries and 85% spurious. That's a lot of wasted time skipping over bogus entries. It would be nice if each job had a unique ID, but I just don't have that data.
I need to somehow limit the JOIN so that it picks off only one "finished" and "reported" entry for each "started" entry: the single entry that has the minimum timestamp greater than the timestamp of the preceding step. I tried doing this by using a subquery as the table to which I was JOINing, but I couldn't figure out how to do it without a correlated subquery. I also tried doing it by using "GROUP BY employee-ID,started", but this didn't necessarily pick the "correct" row. Most of the rows the "GROUP BY" reported were the wrong ones.
So, SQL gurus, is it possible to report just the rows I need to? And if so, how? I am using sqlite3 right now, but could transfer the database to MySQL if needed.
The problem is how you're joining to finish
and report
You don't want the start.timestamp < finish.timestamp
you really want start.timestamp < MIN(finish.timestamp)
Of course that doesn't work so you'll have to do it after the join.
e.g.
SELECT
log.ID AS employee_ID,
start.timestamp AS started,
MIN(finish.timestamp) AS finished,
MIN(report.timestamp) AS reported
FROM
log
LEFT OUTER JOIN log start ON
log.ID = start.ID
AND start.step = 'start'
LEFT OUTER JOIN log finish ON
log.ID = finish.ID
AND finish.step = 'finish'
AND start.timestamp < finish.timestamp
LEFT OUTER JOIN log report ON
log.ID = report.ID
AND report.step = 'report'
AND finish.timestamp < report.timestamp
GROUP BY log.ID,
start.timestamp
ORDER BY
employee_ID,started,finished,reported
Also you could probably convert the start to an inner join as it doesn't make a whole lot of sense to have a finish without a start
精彩评论