开发者

Counting the occurrence of a sub-graph in a graph

I have a 3 dimensional dataset that describes the gene interactions which can be formulated as a graph. The sample of dataset is:

a + b  
b + c  
c - f  
b - d  
a + c  
f + g  
g + h  
f + h  

'+' indicates that a gene on the left side positively regulates the g开发者_Go百科ene on the right. In this data I want to count the sub-graph where a gene (say, x) positively regulates another gene (say, y), y in turn positively regulates yet another gene (say, z). Furthermore, z is also positively regulated by x. There are two such cases in above graph. I want to perform this search preferably using awk but any scripting language is fine. My apologies for being a too specific question and thanks in advance for the help.


Note: See the information regarding Graphviz below.

This should give you a starting point:

Edit: This version handles genes that are described by more than one character.

awk '
    BEGIN { regdelim = "|" }
    {
        delim=""
        if ($2 == "+") {
            if (plus[$1]) delim=regdelim
            plus[$1]=plus[$1] delim $3
        }
        else
            if ($2 == "-") {
            if (minus[$1]) delim=regdelim
                minus[$1]=minus[$1] delim $3
            }
    }
    END {
        for (root in plus) {
            split(plus[root],regs,regdelim)
            for (reg in regs) {
                if (plus[regs[reg]] && plus[root] ~ plus[regs[reg]]) {
                    print "Match: ", root, "+", regs[reg], "+", plus[regs[reg]]
                }
            }
        }
    }
' inputfile

In the BEGIN clause, set regdelim to a character that doesn't appear in your data.

I've omitted the processing code for the minus data.

Output:

Match:  a + b + c
Match:  f + g + h

Edit 2:

The version below allows you to search for arbitrary combinations. It generalizes the technique used in the original version so no code needs to be duplicated. It also fixes a couple of other bugslimitations.

#!/bin/bash
# written by Dennis Williamson - 2010-11-12
# for http://stackoverflow.com/questions/4161001/counting-the-occurrence-of-a-sub-graph-in-a-graph
# A (AB) B, A (AC) C, B (BC) C - where "(XY)" represents a + or a - 
# provided by the positional parameters $1, $2 and $3
# $4 carries the data file name and is referenced at the end of the script
awk -v AB=$1 -v AC=$2 -v BC=$3 '
    BEGIN { regdelim = "|" }
    {
        if ($2 == AB) {
            if (regAB[$1]) delim=regdelim; else delim=""
            regAB[$1]=regAB[$1] delim $3
        }
        if ($2 == AC) {
            if (regAC[$1]) delim=regdelim; else delim=""
            regAC[$1]=regAC[$1] delim $3
        }
        if ($2 == BC) {
            if (regBC[$1]) delim=regdelim; else delim=""
            regBC[$1]=regBC[$1] delim $3
        }
    }
    END {
        for (root in regAB) {
            split(regAB[root],ABarray,regdelim)
            for (ABindex in ABarray) {
                split(regAC[root],ACarray,regdelim)
                for (ACindex in ACarray) {
                    split(regBC[ABarray[ABindex]],BCarray,regdelim)
                    for (BCindex in BCarray) {
                        if (ACarray[ACindex] == BCarray[BCindex]) {
                            print "    Match:", root, AB, ABarray[ABindex] ",", root, AC, ACarray[ACindex] ",", ABarray[ABindex], BC, BCarray[BCindex]
                        }
                    }
                }
            }
        }
    }
' "$4"

This can be called like this to do an exhaustive search:

for ab in + -; do for ac in + -; do for bc in + -; do echo "Searching: $ab$ac$bc"; ./searchgraph $ab $ac $bc inputfile; done; done; done

For this data:

a - e
a + b
b + c
c - f
m - n
b - d
a + c
b - e
l - n
f + g
b + i
g + h
l + m
f + h
a + i
a - j
k - j
a - k

The output of the shell loop calling the new version of the script would look like this:

Searching: +++
    Match: a + b, a + c, b + c
    Match: a + b, a + i, b + i
    Match: f + g, f + h, g + h
Searching: ++-
Searching: +-+
Searching: +--
    Match: l + m, l - n, m - n
    Match: a + b, a - e, b - e
Searching: -++
Searching: -+-
Searching: --+
Searching: ---
    Match: a - k, a - j, k - j

Edit 3:

Graphviz

Another approach would be to use Graphviz. The DOT language can describe the graph and gvpr, which is an "AWK-like"1 programming language, can analyze and manipulate DOT files.

Counting the occurrence of a sub-graph in a graph

Given the input data in the format as shown in the question, you can use the following AWK program to convert it to DOT:

#!/usr/bin/awk -f
BEGIN {
    print "digraph G {"
    print "    size=\"5,5\""
    print "    ratio=.85"
    print "    node [fontsize=24 color=blue penwidth=3]"
    print "    edge [fontsize=18 labeldistance=5 labelangle=-8 minlen=2 penwidth=3]"
    print "    {rank=same; f l}"
    m  = "-"    # ASCII minus/hyphen as in the source data
    um = "−"    # u2212 minus: − which looks better on the output graphic
    p  = "+"
}

{
    if ($2 == m) { $2 = um; c = lbf = "red"; arr=" arrowhead = empty" }
    if ($2 == p) { c = lbf = "green3"; arr="" }
    print "    " $1, "->", $3, "[taillabel = \"" $2 "\" color = \"" c "\" labelfontcolor = \"" lbf "\"" arr "]"
}
END {
    print "}"
}

The command to run would be something like this:

$ ./dat2dot data.dat > data.dot

You can then create the graphic above using:

$ dot -Tpng -o data.png data.dot

I used the extended data as given above in this answer.

To do an exhaustive search for the type of subgraphs you specified, you can use the following gvpr program:

BEGIN {
    edge_t AB, BC, AC;
}

E {
    AB = $;
    BC = fstedge(AB.head);
    while (BC && BC.head.name != AB.head.name) {
        AC = isEdge(AB.tail,BC.head,"");
        if (AC) {
            printf("%s %s %s, ", AB.tail.name, AB.taillabel, AB.head.name);
            printf("%s %s %s, ", AC.tail.name, AC.taillabel, AC.head.name);
            printf("%s %s %s\n", BC.tail.name, BC.taillabel, BC.head.name);
        }
        BC = nxtedge(BC, AB.head);
    }
}

To run it, you could use:

$ gvpr -f groups.g data.dot | sort -k 2,2 -k 5,5 -k 8,8

The output would be similar to that from the AWK/shell combination above (under "Edit 2"):

a + b, a + c, b + c
a + b, a + i, b + i
f + g, f + h, g + h
a + b, a − e, b − e
l + m, l − n, m − n
a − k, a − j, k − j

1 Loosely speaking.


An unconventional approach using Perl is below.

#! /usr/bin/perl

use warnings;
use strict;

my $graph = q{
  a + c
  b + c
  c - f
  b - d
  a + b
  f + g
  g + h
  f + h
};

my $nodes = join ",", sort keys %{ { map +($_ => 1), $graph =~ /(\w+)/g } };
my $search = "$nodes:$nodes:$nodes:$graph";

my $subgraph = qr/
  \A  .*?  (?<x>\w+)  .*:
      .*?  (?<y>\w+)  .*:
      .*?  (?<z>\w+)  .*:
  (?= .*^\s*  \k<x>  \s*  \+  \s*  \k<y>  \s*$)
  (?= .*^\s*  \k<y>  \s*  \+  \s*  \k<z>  \s*$)
  (?= .*^\s*  \k<x>  \s*  \+  \s*  \k<z>  \s*$)
  (?{ print "x=$+{x}, y=$+{y}, z=$+{z}\n" })
  (?!)
/smx;

$search =~ /$subgraph/;

The regex engine is a powerful tool. For your problem, we describe the structure of a transitive subgraph and then allow the resulting machine to go find all of them.

Output:

x=a, y=b, z=c
x=f, y=g, z=h

A more general tool using this same technique is possible. For example, let's say you want to be able to specify gene patterns such as a+b+c;a+c or g1+g2-g3;g1+g3. I hope the meanings of these patterns are obvious.

In the front matter, I specify a minimum version of 5.10.0 because the code uses // and lexical $_. The code constructs dynamic regexes that will evaluate code, which the use re 'eval' pragma enables.

#! /usr/bin/perl

use warnings;
use strict;

use 5.10.0;
use re 'eval';

An identifier is a sequence of one or more “word characters,” i.e., letters, digits, or underscores.

my $ID = qr/\w+/;

Given a regex that accepts variable names, unique_vars searches some specification for all variable names and returns them without repetition.

sub unique_vars {
  my($_,$pattern) = @_;
  keys %{ { map +($_ => undef), /($pattern)/g } };
}

Compiling a gene pattern into a regex is a little hairy. It dynamically generates a search target and regex with the same form as the static one above.

The first part with multiple occurrences of comma-separated variables lets the regex engine try each possible value for each gene. Then the lookaheads, (?=...), scan the graph looking for edges with the desired properties. If all the lookaheads succeed, we record the hit.

The strange regex (?!) at the end is an unconditional failure that forces the matcher to backtrack and attempt the match with different genes. Because it's unconditional, the engine will evaluate all possibilities.

Calling the same closure from multiple threads concurrently will likely produce strange results.

sub compile_gene_pattern {
  my($dataset,$pattern) = @_;
  my @vars   = sort +unique_vars $pattern, qr/[a-z]\d*/;  # / for SO hilite
  my $nodes  = join ",", sort +unique_vars $dataset, $ID;
  my $search = join("", map "$_:", ($nodes) x @vars) . "\n"
             . $dataset;

  my $spec = '\A' . "\n" . join("", map ".*?  (?<$_>$ID)  .*:\n", @vars);
  for (split /;/, $pattern) {
    while (s/^($ID)([-+])($ID)/$3/) {
      $spec .= '(?= .*^\s*  ' .
               ' \b\k<' .           $1  . '>\b ' .
               ' \s*'   . quotemeta($2) . '\s* ' .
               ' \b\k<' .           $3  . '>\b ' .
               ' \s*$)' . "\n";
    }
  }
  my %hits;
  $spec .= '(?{ ++$hits{join "-", @+{@vars}} })' . "\n" .
           '(?!) # backtrack' . "\n";

  my $nfa = eval { qr/$spec/smx } || die "$0: INTERNAL: bad regex:\n$@";
  sub {
    %hits = ();  # thread-safety? :-(
    (my $_ = $search) =~ /$nfa/;
    map [split /-/], sort keys %hits;
  }
}

Read the dataset and let the user know about any problems.

sub read_dataset {
  my($path) = @_;

  open my $fh, "<", $path or die "$0: open $path: $!";

  local $/ = "\n";
  local $_;
  my $graph;

  my @errors;
  while (<$fh>) {
    next if /^\s*#/ || /^\s*$/;

    if (/^ \s* $ID \s* [-+] \s* $ID \s* $/x) {
      $graph .= $_;
    }
    else {
      push @errors, "$.: syntax error";
    }
  }

  return $graph unless @errors;

  die map "$0: $path:$_\n", @errors;
}

Now we set it all into motion:

my $graphs = shift // "graphs.txt";
my $dataset = read_dataset $graphs;

my $ppp = compile_gene_pattern $dataset, "a+b+c;a+c";
print "@$_\n" for $ppp->();

my $pmp = compile_gene_pattern $dataset, "g1+g2-g3;g1+g3";
print "@$_\n" for $pmp->();

Given graphs.txt with contents

a + b  
b + c  
c - f  
b - d  
a + c  
f + g  
g + h  
f + h

foo + bar
bar - baz
foo + baz

and then running the program, we get the following output:

a b c
f g h
foo bar baz


I assume that by "count the sub-graph" you mean counting the nodes in a sub-graph. If that's what you need, you can use any scripting language and will have to store the graph, first of all, by creating a structure or class where you store your graph, the node structure/class should look like this (this is not conforming the syntax of any language, this is only a plan for your application):

Node {color = 0; title = ""; minusNodeSet = null; plusNodeSet = null}

Where color = 0 (the defaul value of color means you haven't visited this node before), title will be 'a', 'b', 'c', and so on. minusNodeSet is a Set of Nodes where those nodes are stored, where a minus vertice points from our Node, plusNodeSet is a Set of Nodes where those nodes are stored, where a plus vertice points from our Node.

Now, we have an architecture and should use it in a depth-first algoritm:

int depth_first(Node actualNode)
{
    if (actualNode.color == 1)
    return;

    number = 1;
    actualNode.color = 1;
    foreach actualNode.nodeSet as node do
        if (node.color == 0)
            number = number + depth_first(node);
    return number;
}

If I misunderstood your question, please, tell me, to be able to edit my answer to be a more useful one.


The structure of the regex in my other answer resembles list-monad processing. Given that inspiration, a search for the transitive subgraphs is below as a literate Haskell. Copy-and-paste this answer to a file with the extension .lhs to get a working program. Be sure to surround the code sections, marked by leading >, with empty lines.

Thanks for the fun problem!

A bit of front matter:

> {-# LANGUAGE ViewPatterns #-}

> module Main where

> import Control.Monad (guard)
> import Data.List (nub)
> import Data.Map (findWithDefault,fromListWith,toList)

The name of a gene can be any string, and for a given Gene g, a function of type PosReg should give us all the genes that g positively regulates.

> type Gene = String
> type PosReg = Gene -> [Gene]

From a graph specified as in your question, we want triples of genes such that the is-positively-regulated-by relation is transitive, and subgraphs describes the desired properties. First, pick an arbitrary gene x from the graph. Next, choose one of the genes y that x positively regulates. For the transitive property to hold, z must be a gene that both x and y positively regulate.

> subgraphs :: String -> [(Gene,Gene,Gene)]
> subgraphs g = do
>   x <- choose
>   y <- posRegBy x
>   z <- posRegBy y
>   guard $ z `elem` posRegBy x
>   return (x,y,z)
>   where (choose,posRegBy) = decode g

With the simple parser in decode, we distill the list of genes in the graph and a PosReg function that gives all genes positively regulated by some other gene.

> decode :: String -> ([Gene], PosReg)
> decode g =
>   let pr = fromListWith (++) $ go (lines g)
>       gs = nub $ concatMap (\(a,b) -> a : b) $ toList pr
>   in (gs, (\x -> findWithDefault [] x pr))
>   where
>     go ((words -> [a, op, b]):ls)
>       | op == "+" = (a,[b]) : go ls
>       | otherwise = go ls
>     go _ = []

Finally, the main program glues it all together. For each subgraph found, print it to the standard output.

> main :: IO ()
> main = mapM_ (putStrLn . show) $ subgraphs graph
>   where graph = "a + b\n\
>                 \b + c\n\
>                 \c - f\n\
>                 \b - d\n\
>                 \a + c\n\
>                 \f + g\n\
>                 \g + h\n\
>                 \f + h\n"

Output:

("a","b","c")
("f","g","h")
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜