Transitive SQL query on same table
Hey. consider d following table and data...
in_timestamp | out_timestamp | name | in_id | out_id | in_server | out_s开发者_如何学JAVAerver | status
timestamp1 | timestamp2 | data1 |id1 | id2 | others-server1 | my-server1 | success
timestamp2 | timestamp3 | data1 | id2 | id3 | my-server1 | my-server2 | success
timestamp3 | timestamp4 | data1 | id3 | id4 | my-server2 | my-server3 | success
timestamp4 | timestamp5 | data1 | id4 | id5 | my-server3 | others-server2 | success
- the above data represent log of a execution flow of some data across servers.
- e.g. some data has flowed from some 'outside-server1' to bunch of 'my-servers' and finally to destined 'others-server2'.
Question :
1) I need to give this log in representable form to client where he doesn't need to know anything about the bunch of 'my-servers'. All i am supposed to give is timestamp of the data entered my infrastructure and when it left; drilling down to following info.
in_timestamp (of 'others_server1' to 'my-server1')
out_timestamp (of 'my-server3' to 'others-server2')
name
status
I want to write sql for the same! Can someone help? NOTE : there might not be 3 'my-servers' all the time. It differs from situation to situation. e.g. there might be 4 'my-server' involved for, say, data2!
2) Are there any other alternatives to SQL? I mean stored procs/etc?
3) Optimizations? (The records are huge in number! As of now, it is around 5 million a day. And we are supposed to show records that are upto a week old.)
In advance, THANKS FOR THE HELP! :)
WITH RECURSIVE foo AS
(
SELECT *, in_timestamp AS timestamp1, 1 AS hop, ARRAY[in_id] AS hops
FROM log_parsing.log_of_sent_mails
WHERE in_server = 'other-server1'
UNION ALL
SELECT t_alias2.*, foo.timestamp1, foo.hop + 1, hops || in_id
FROM foo
JOIN log_parsing.log_of_sent_mails t_alias2
ON t_alias2.in_id = (foo.t_alias1).out_id
)
SELECT *
FROM foo
ORDER BY
hop DESC
LIMIT 1
Your table has a heirarchical structure (adjacency lists). This can be queried efficiently in PostgreSQL v8.4 and later using recursive CTEs. Quassnoi has written a blog post about how to implement it. It is a quite complex query that you need to write but he explains it well with examples very similar to what you need. Especially if you look at his last example, he demonstrates a query than gets the complete path from the first node to the last by using an array.
One way of doing it - if the data is STABLE (e.g. never changes onc inserted) is to compute the transitive relationships ON THE FLY (e.g. via a trigger or by the app which does the insertion) at the insert time.
E.g. you have a new column "start_ts" in your table; when you insert a record:
in_timestamp | out_timestamp | name | in_id | out_id | in_server | out_server | status
timestamp3 | timestamp4 | data1 | id3 | id4 | my-server2 | my-server3 | success
... then your logic automatically finds the record with name=data1
and out_id=id3
and clones its start_ts
into the newly inserted record. You may need some special logic around propagating last status as well depending on how you compute those transitive values.
BTW, you need not necessarily have to look up the previous (name=data1
and out_id=id3
) record - you can persist the start_ts
value in the data record's meta data itself while processing.
Then the final report is simply select start_ts, out_ts from T where out_server=others_server2
(of course more complicated as far as out_server and status, but still a single simple select)
A second option is of course the more straightforward loop computing the resulting report - google or "stack" (is that an accepted verb now?) for SQL BFS implementations if you're not sure how.
@Other Readers :
Refer to 1st answer posted by Mark Byers first. I used 'answering' rather than 'commenting' his post since i needed to use tables/links etc. which is not available while commenting on answers. :)
@Mark Byers :
Thanks for the link... It really helped me and i was able to figure out the way to generate the path between the servers... Have a look @ what i was able to do.
in_id | in_timestamp | out_timestmap | name | hops_count | path |
id1 | timestamp1 | timestamp2 | data1 | 1 | {id1} |
id2 | timestamp2 | timestamp3 | data1 | 2 | {id1,id2} |
id3 | timestamp3 | timestamp4 | data1 | 3 | {id1,id2,id3} |
id4 | timestamp4 | timestamp2 | data1 | 4 | {id1,id2,id3,id4} |
* path is generated using 'in_id'
I used the following query...
WITH RECURSIVE foo AS
(
SELECT t_alias1, 1 AS hops_count, ARRAY[in_id] AS hops
FROM log_parsing.log_of_sent_mails t_alias1
WHERE in_server = 'other-server1'
UNION ALL
SELECT t_alias2, foo.hops_count + 1 AS hops_count, hops || in_id
FROM foo
JOIN log_parsing.log_of_sent_mails t_alias2
ON t_alias2.in_id = (foo.t_alias1).out_id
)
SELECT (foo.t_alias1).in_id,
(foo.t_alias1).name,
(foo.t_alias1).in_timestamp,
hops_count,
hops::VARCHAR AS path
FROM foo ORDER BY hops
But i could not reach the ultimate stage yet. Here is what i wish to get ultimately...
in_id | in_timestamp | out_timestmap | name | hops_count | path |
id4 | timestamp1 | timestamp5 | data1 | 4 | {id1,id2,id3,id4}|
* observe the timestamp. Its required since i do not wish the client to know about the internal infrastructure. So for him the time-lag between timestamp1 and timestamp5 is what matters.
Any clue how possibly i could achieve it!?
p.s. I would try contacting Quassnoi too. :)
精彩评论