How to identify clusters of nodes in a network
I have a table describing several sets of connected nodes:
node
origin_node REFERENCES node
start_time
end_time
and I want to find out how many clusters the datas开发者_运维知识库et contains, e.g. if the records were:
A, B, 10:00, 11:00
B, C, 9:00, 9:15
D, E, 10:00, 10:15
B, A, 13:00, 13:30
E, B, 12:00, 13:20
F, G, 9:00, 9:15
...then I'd have 2 clusters {A,B,C,D,E} and {F,G}
(the times are pretty much irrelevant - it's just there to demonstrate that node+origin_node is not necessarily unique / ordered).
But I'm a bit stuck in working out an algorithm which identifies the clusters from a few thousand rows.
I'm working with MySQL 5.0.22 - so no 'CONNECT BY', and have access to PHP and awk - although it'd be easier for me to understand an algorithm rather than a coded solution. And as long as it takes less than a couple of hours to analyse the data, I'd lean to simplicity over order.
BTW: its a real-world problem - not homework (I stopped being a student a long time ago - perhaps too early ;)
TIA
it'd be easier for me to understand an algorithm rather than a coded solution
Tried these links?
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/Category:Data_clustering_algorithms
Also, though not MySQL, there also is stuff on Microsoft's site:
http://msdn.microsoft.com/en-us/library/ms174879.aspx
Edit, per your comment:
In your particular case, something akin to creating a closure table might work.
Using a temporary table...
Start with an arbitrary node. Assign it to a new cluster.
Next node. Is there a link to a node from a currently identified cluster?
If no, assign it to a new cluster.
If yes, assign it to that cluster. Then, for each link, verify that already processed node is in the same cluster. If not, reassign them to that cluster.
Went with walking the network and flagging visited nodes (similar to garbage collection algorithms). Its reasonably efficient but needed quite a bit of code.
精彩评论