Perl program to mimic restriction enzymes using references, hash tables and subs

2023-01-27 11:20 Q&A author:

I'm a student in an intro Perl class. I'm looking for suggestions on how to approach an assignment. My professor encourages forums. The assignment is:

Write a Perl program that will take two files from the command line, an enzyme file and a DNA file. Read the file with restriction enzymes and apply the restriction enzymes to the DNA file.

The output will be fragments of DNA arranged in the order they occur in the DNA file. The name of the output files should be constructed by appending the name of the restriction enzyme to the name of the DNA file, with an underscore between them.

For example, if the enzyme is EcoRI and the DNA file is named BC161026, the output file should be named BC161026_EcoRI.

My approach is to create a main program and two subs as follows:

Main: Not sure how to tie my subs together?

Sub program $DNA: Take a DNA file and remove any newlines to make a single string.

Sub program Enzymes: Read and store the lines from the enzyme file, which is given on the command line. Parse the file so that the enzyme acronym is separated from the position of the cut. Store the position of the cut as a regular expression in a hash table. Store the name of the acronym in a hash table.

Note on enzyme file format: The enzyme file follows a format known as Staden. Examples:

AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//

The enzyme acronym consists of the characters before the first slash (AatI, in the first example). The recognition sequence is everything between the first and second slash (AGG'CCT, again, in the first example). The cut point is denoted by an apostrophe in the recognition sequence. There are standard abbreviations for DNA within enzymes as follows:

R = G or A
B = not A (C or G or T)
etc...

Along with a recommendation for a main chunk, do you see any missing pieces that I've omitted? Can you recommend specific tools that you think would be useful in patching this program together?

Example input enzyme: TryII/RRR'TTT//

Example string to read: CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC

The output should be:

CCCCCCGGG

TTTCCCCCCCCCCCCAAA

TTTCCCCCCCCCCCCAGA

TTTCCCCCCCCCCGAG

TTTCCCCC

OK, I know I shouldn't just do your homework, but there were some fun tricks to this one, so I played with it. Learn from this; don't just copy it. I didn't comment very well, so if there is something you don't understand, please ask. There is some slight magic in this, and if you haven't covered it in your class, your prof will know, so be sure you understand it.

#!/usr/bin/env perl

use strict;
use warnings;

use Getopt::Long;

my ($enzyme_file, $dna_file);
my $write_output = 0;
my $verbose = 0;
my $help = 0;
GetOptions(
  'enzyme=s' => \$enzyme_file,
  'dna=s' => \$dna_file,
  'output' => \$write_output,
  'verbose' => \$verbose,
  'help' => \$help
);

$help = 1 unless ($dna_file && $enzyme_file);
help() if $help; # exits

# 'Main'
my $dna = getDNA($dna_file);
my %enzymes = %{ getEnzymes($enzyme_file) }; # getEnzymes returns a hashref, so dereference it and store the referenced hash
foreach my $enzyme (keys %enzymes) {
  print "Applying enzyme " . $enzyme . " gives:\n";
  my $dna_holder = $dna;
  my ($precut, $postcut) = ($enzymes{$enzyme}{'precut'}, $enzymes{$enzyme}{'postcut'});

  my $R = qr/[GA]/;
  my $B = qr/[CGT]/;

  $precut =~ s/R/${R}/g;
  $precut =~ s/B/${B}/g;
  $postcut =~ s/R/${R}/g;
  $postcut =~ s/B/${B}/g;
  print "\tPre-Cut pattern: " . $precut . "\n" if $verbose;
  print "\tPost-Cut pattern: " . $postcut . "\n" if $verbose;

  #while(1){
  #  if ($dna_holder =~ s/(.*${precut})(${postcut}.*)/$1/ ) {
  #    print "\tFound section:" . $2 . "\n" if $verbose;
  #    print "\tRemaining DNA: " . $1 . "\n" if $verbose;
  #    unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $2;
  #  } else {
  #    unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $dna_holder;
  #    print "\tNo more cuts.\n" if $verbose;
  #    print "\t" . $_ . "\n" for @{ $enzymes{$enzyme}{'cut_dna'} };
  #    last;
  #  }
  #}
  unless ($dna_holder =~ s/(${precut})(${postcut})/$1'$2/g) {
    print "\tHas no effect on given strand\n" if $verbose;
  }
  @{ $enzymes{$enzyme}{'cut_dna'} } = split(/'/, $dna_holder);
  print "\t$_\n" for @{ $enzymes{$enzyme}{'cut_dna'} };

  writeOutput($dna_file, $enzyme, $enzymes{$enzyme}{'cut_dna'}) if $write_output; #Note that $enzymes{$enzyme}{'cut_dna'} is an arrayref already
  print "\n";
}

sub getDNA {
  my ($dna_file) = @_;

  open(my $dna_handle, '<', $dna_file) or die "Cannot open file $dna_file: $!";
  my @dna_array = <$dna_handle>;
  chomp(@dna_array);

  my $dna = join('', @dna_array);

  print "Using DNA:\n" . $dna . "\n\n" if $verbose;
  return $dna;
}

sub getEnzymes {
  my ($enzyme_file) = @_;
  my %enzymes;

  open(my $enzyme_handle, '<', $enzyme_file) or die "Cannot open file $enzyme_file: $!";
  while(<$enzyme_handle>) {
    chomp;
    if(m{([^/]*)/([^']*)'([^/]*)//}) {
      print "Found Enzyme " . $1 . ":\n\tPre-cut: " . $2 . "\n\tPost-cut: " . $3 . "\n" if $verbose;
      $enzymes{$1} = {
        precut => $2,
        postcut => $3,
        cut_dna => [] #Added to show the empty array that will hold the cut DNA sections
      };
    }
  }

  print "\n" if $verbose;
  return \%enzymes;
}

sub writeOutput {

  my ($dna_file, $enzyme, $cut_dna_ref) = @_;

  my $outfile = $dna_file . '_' . $enzyme;
  print "\tSaving data to $outfile\n" if $verbose; 
  open(my $outfile_handle, '>', $outfile) or die "Cannot open $outfile for writing: $!";

  print $outfile_handle $_ . "\n" for @{ $cut_dna_ref };
}

sub help {

  my $filename = (split('/', $0))[-1];

  my $enzyme_text = <<'END';
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
TryII/RRR'TTT//
Test/AAA'TTT//
END

  my $dna_text = <<'END';
CCCCCCGGGTTTCCCCCCC
CCCCCAAATTTCCCCCCCCCCCCAGATTTC
CCCCCCCCCGAGTTTCCCCC
END

  print <<END;
Usage: 
    $filename --enzyme (-e) <enzyme-filename> --dna (-d) <dna-filename> [options] (files may come in either order)
    $filename -h    (shows this help)

Options: 
    --verbose (-v)  print additional (debugging) information
    --output (-o)   output final data to files


Files:
The DNA file contains one DNA string which may be broken over many lines. E.G.:

$dna_text

The enzymes file contains enzyme definitions, one per line. E.G.:

$enzyme_text
END

exit;
}

Edit: Added cut_dna initialization explicitly because this is the final result holder for each enzyme, so I thought it would be good to see it more clearly.

Edit 2: Added output routine, call, flag and corresponding help.

Edit 3: Changed the main routine to incorporate the best of canavanin's method while removing loops. Now it's a global replace that adds a temporary cut mark (') followed by a split on that cut mark into an array. Left the old method as a comment; the new method is the 5 lines following it.
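
As a minimal standalone sketch of that mark-and-split idea, here it is applied to the TryII example from the question. The sub name cut_dna is hypothetical (it is not part of the program above), and only the R and B codes are expanded. If it works as intended, it should print the five fragments listed in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the program above): mark every cut point
# with an apostrophe, then split on that mark.
sub cut_dna {
    my ($dna, $precut, $postcut) = @_;

    # Expand only the two ambiguity codes given in the question.
    for my $half ($precut, $postcut) {
        $half =~ s/R/[GA]/g;
        $half =~ s/B/[CGT]/g;
    }

    # The global replace inserts the temporary cut mark between the two halves.
    $dna =~ s/($precut)($postcut)/${1}'${2}/g;

    # Splitting on the mark yields the fragments in their original order.
    return split /'/, $dna;
}

my $dna = 'CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC';
my @fragments = cut_dna($dna, 'RRR', 'TTT');
print "$_\n" for @fragments;
```

Because the apostrophe never occurs in a DNA string, it is safe to use as the temporary mark; any other character outside the DNA alphabet would do as well.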

Edit 4: Additional test case for writing to multiple files. (Below)

my @names = ('cat','dog','sheep'); 
foreach my $name (@names) { #$name is lexical, i.e. it goes away after each iteration
  open(my $handle, '>', $name) or die "Cannot open $name: $!"; #open a lexical handle for the file, which also goes away after each iteration
  print $handle $name; #write to the handle
  #$handle closes automatically when it "goes out of scope"
}

Note that in Enzymes, when you store an enzyme in the hash the name of the enzyme should be the key and the site should be the value (since in principle two enzymes could have identical sites).
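
A minimal sketch of that layout, assuming the Staden lines shown in the question. The sub name parse_enzymes is made up for illustration, and it takes the lines as a list rather than reading a file:

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash of enzyme name => recognition site
# (apostrophe included) from Staden-style lines such as "AatI/AGG'CCT//".
sub parse_enzymes {
    my @lines = @_;
    my %site_for;
    for my $line (@lines) {
        chomp $line;
        # Format is name/site//; lines that do not fit are skipped.
        next unless $line =~ m{^([^/]+)/([^/]+)//};
        $site_for{$1} = $2;
    }
    return \%site_for;
}

my $sites = parse_enzymes("AatI/AGG'CCT//", "AatII/GACGT'C//", "AbsI/CC'TCGAGG//");
print "$_ => $sites->{$_}\n" for sort keys %$sites;
```

Keying on the name means two enzymes with the same site each keep their own entry, and so each would get its own output file.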

In the Main routine, you can iterate through the hash; for each enzyme produce one output file. The most direct way is to translate the site to a regex (by means of other regexes) and apply it to the DNA sequence, but there are other ways. (It is probably worth splitting this off into at least one other sub.)
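
One way that translation sub could look, as a sketch only. The names site_to_regex and %class_for are illustrative, and the table holds just the two codes given in the question; the remaining abbreviations would have to be added by hand.

```perl
use strict;
use warnings;

# Only the two ambiguity codes given in the question; extend as needed.
my %class_for = (R => '[GA]', B => '[CGT]');

# Hypothetical helper: turn a site such as "RRR'TTT" into a compiled regex
# with two capture groups, one on each side of the cut point.
sub site_to_regex {
    my ($site) = @_;
    my ($pre, $post) = split /'/, $site, 2;
    die "No cut point in site $site\n" unless defined $post;
    for my $half ($pre, $post) {
        # Replace each letter that has a table entry; A, C, G and T stay as they are.
        $half =~ s/([A-Z])/exists $class_for{$1} ? $class_for{$1} : $1/ge;
    }
    return qr/($pre)($post)/;
}

my $re = site_to_regex("RRR'TTT");
(my $marked = 'CCGGGTTTCC') =~ s/$re/$1\n$2/g;
print "$marked\n";   # CCGGG and TTTCC on separate lines
```

Doing the translation in a single pass over the letters means the text inserted for one code is never rescanned when another code is expanded.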

Here is how I have gone about trying to solve the problem (code below).
1) The file names are picked up from the arguments and respective filehandles are created.
2) A new file handle is created for the output file, which is named in the specified format
3) The "cut points" are extracted from the first file
4) The DNA Sequences in the second file are looped over the cut points obtained in step 3.

#!/usr/bin/perl
use strict;
use warnings;
my $file_enzyme=$ARGV[0];
my $file_dna=$ARGV[1];

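open DNASEQ, '>', "${file_dna}_${file_enzyme}" or die "Cannot open output file: $!";
open ENZYME, '<', $file_enzyme or die "Cannot open $file_enzyme: $!";
open DNA, '<', $file_dna or die "Cannot open $file_dna: $!";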
while (<ENZYME>) {
 chomp;
  if( /'(.*)\/\//) { # Extracts the cut point between ' & // in the enzyme file
    my $pattern=$1;
    while (<DNA>) {
     chomp;
     #print $pattern;
     my @output = split /$pattern/, $_;
     print DNASEQ shift @output,"\n"; #first recognized sequence being output
     foreach (@output) {
        print DNASEQ "$pattern$_\n"; #prefixing the remaining sequences with the cut point pattern
     }
   }
 }
}
close DNA;
close ENZYME;
close DNASEQ;

I know there have been several answers already, but hey... I just felt like trying my luck, so here's my suggestion:

#!/usr/bin/perl

use warnings;
use strict;
use Getopt::Long;

my ($enz_file, $dna_file);

GetOptions( "e=s" => \$enz_file,
            "d=s" => \$dna_file,
          );

if (! $enz_file || ! $dna_file) {
   # some help text 
   print STDERR<<EOF; 

   Usage: restriction.pl -e enzyme_file -d DNA_file

   The enzyme_file should contain one enzyme entry per line.
   The DNA_file may contain the sequence on one single or on
   several lines; all lines will be concatenated to yield a
   single string.
EOF
   exit();
}

my %enz_and_patterns; # stores enzyme name and corresponding pattern

open ENZ, "<$enz_file" or die "Could not open file $enz_file: $!";
while (<ENZ>) {
   if (m#^(\w+)/([\w']+)//$#) {
      my $enzyme  = $1; # could also remove those two lines and use 
      my $pattern = $2; # the match variables directly, but this is clearer

      $enz_and_patterns{$enzyme} = $pattern;
   }
}
close ENZ;

my $dna_sequence;

open DNA, "<$dna_file" or die "Could not open file $dna_file: $!";
while (my $line = <DNA>) {
   chomp $line;
   $dna_sequence .= $line; # append the current bit to the sequence
                           # that has been read so far
}
close DNA;

foreach my $enzyme (keys %enz_and_patterns) {
   my $dna_seq_processed = $dna_sequence; # local copy so that we retain the original

   # now translate the restriction pattern to a regular expression pattern:
   my $pattern = $enz_and_patterns{$enzyme};
   $pattern    =~ s/R/[GA]/g; # use character classes
   $pattern    =~ s/B/[^A]/g;
   $pattern    =~ s/(.+)'(.+)/($1)($2)/; # remove the ', but due to the grouping
                                         # we "remember" its position

   $dna_seq_processed =~ s/$pattern/$1\n$2/g; # in effect we are simply replacing
                                              # each ' with a newline character
   my $outfile = "${dna_file}_$enzyme";
   open OUT, ">$outfile" or die "Could not open file $outfile: $!";
   print OUT $dna_seq_processed , "\n";
   close OUT;
}

I've tested my code with your TryII example, which worked fine.

As this is a task which can be handled by writing just a few lines of non-repetitive code I didn't feel creating separate subroutines would have been justified. I hope I will be forgiven... :)
