Pruning large graphs of stray nodes
I have a graph consisting of about 35,000 nodes represented in plain text:
node1 -> node35000
node29420 -> node35000
node2334 -> node4116
...
I'd like to trim it down by removing nodes that are not part of a chain at least three long. So if I had only
1 -> 2;
2 -> 3;
3 -> 4;
0 -> 4;
I'd like to keep 1, 2, 3, and 4 (since 1 -> 2 -> 3 -> 4
is four nodes long) but discard 0, that is, remove 0 -> 4
.
Any idea of a good way to do this? I tried a combination of Perl and shell functions but I think I need a better approach. Unless maybe there are tools to do this already? The data is in graphviz format but I didn'开发者_运维技巧t see any tools in that suite relevant to the task at hand.
Oh, and if there's an easy way to do something like this I'm open to suggestions -- it doesn't need to be exactly the task I suggested. I'm just looking for a way to remove most of the noise surrounding the big clumps (which are rare and mostly just a few intersecting chains).
The tool gvpr which is part of the graphviz tools allows to apply rules to a graph and output the modified graph.
From the description:
It copies input graphs to its output, possibly transforming their structure and attributes, creating new graphs, ...
It looks like you want to remove all nodes having an indegree of 0 and having only linked nodes (successors) with an outdegree of 0.
Here's my version of a gvpr
script nostraynodes.gv
:
BEGIN {node_t n; int candidates[]; int keepers[];}
E{
if (tail.indegree == 0 && head.outdegree == 0)
{
candidates[tail] = 1;
candidates[head] = 1;
}
else if (tail.indegree == 0)
{
keepers[tail] = 1;
}
else if (head.outdegree == 0)
{
keepers[head] = 1;
}
}
END_G {
for (candidates[n]){
if (n in keepers == 0)
{
delete(NULL, n);
}
}
}
Here's what the script does:
Loop through all edges one time and populate two lists:
- candidates - a list of nodes which may have to be removed, and
- keepers - a list of nodes which may end up in candidates but should not be removed.
So what gets added to which list?
- Any two nodes connected to each other, where the tail node does not have any incoming edges and the head node does not have any outgoing edges, form a chain of only 2 nodes and are therefore candidates to be deleted; that is, unless the same nodes are part of an other chain longer than 2 nodes:
- A tail node without any incoming edges, but connected to a head node which itself has outgoing edges, is a keeper; and
- A head node without any outgoing edges, but connected to a tail node which itself has incoming edges, is also a keeper.
- Delete all candidates not in keepers
This solution is not generic and only works for the problem stated in the question, that is keeping only chains at least 3 nodes long. It also won't delete short loops (two nodes connected to each other).
You can call this using the following line:
gvpr -c -f .\nostraynodes.gv .\graph.dot
The output using your sample graph is:
digraph g {
1 -> 2;
2 -> 3;
3 -> 4;
}
Please note that this is my first gvpr
script - there are probably better ways to write this, and I'm not sure how this handles 35000 nodes, though I'm confident this should not be a big deal.
See also Graphviz/Dot - how to mark all leaves in a tree with a distinctive color? for a simpler example of graph transformation.
Gephi is an excellent open-source GUI tool for visualizing and manipulating graphs, and you will probably be able to find some kind of filter in there for this sort of thing... Maybe a degree filter would do: it would remove nodes which only have one edge. You can also filter on in-degree, out-degree, you can compute PageRank etc. It's also got some really nice size/label/colour options and is easy to zoom in/out of.
Supposing that any given node can have arbitrarily many predecessors or successors, then in-degree and out-degree of nodes is irrelevant to solving the problem.
Following is a simple O(N+E) algorithm for all graphs of N nodes and E edges, under the path-length-3 criterion. This algorithm can be easily implemented in Perl or C. The method is based on a definition and an assertion: Define a "made node" as any node that has a parent and child (predecessor and successor). Every node that will be kept is a made node or is a parent or child of a made node.
Initialize a status array S[Nmax] to all zeroes. Nmax is the maximum node number. If Nmax is not known at outset, read all the data and find it out.
Read in the given list of edges. Each input item specifies a directed edge (p, q) from node p to node q. For each (p, q) item that is read in: Set S[p] to S[p] | 1 to denote that p has a child, and set S[q] to S[q] | 2 to denote that q has a parent. (After this step, every made node n has S[n] == 3.)
Read the list of edges again. For each (p, q) item that is read in: If (S[p]==3) or (S[q] == 3) output edge (p,q).
To extend this method to path length K other than 3, keep the edge list in memory, maintain Sp[] and Sc[] with lengths of parent chains and child chains, and perform K/2 extra passes. Might be possible to do in time O(N+K*E). The problem does not specify whether the graph is a DAG (directed acyclic graph) but the example given is a DAG. For K>3, it may make a difference.
Update 1 Here's a more precise statement of a K>3 algorithm, with H[i].p and H[i].q being endpoints of edge #i, and pc[j], cc[j] being lengths of predecessor and successor chains about node j. Also, let E = # of edges; N = # of nodes; and K = desired min chain length for keeping an edge.
Read E edge data entries into H[ ] array. Set all pc[j], cc[j] entries to 0.
For i = 1 to E, set cc[H[i].p]=1 and pc[H[i].q]=1.
For j = 1 to K+1, { for i = 1 to E, { Let p=H[i].p and q=H[i].q. Set cc[p] = max(cc[p], 1+cc[q]) and pc[q] = max(pc[q], 1+pc[p]). } }
For i = 1 to E, { Let p=H[i].p and q=H[i].q. Output edge (p,q) if pc[p]+cc[p]+1 >= K and pc[q]+cc[q]+1 >= K.}
This method can make mistakes if graph is not a DAG and contains short looped paths. For example, if graph edges include (1,2) and (2,1) and no other nodes link to nodes 1 or 2, then neither of those edges should be output; but we end up with K+2 for cc[] and pc[] of those nodes, so they get output anyway.
精彩评论