How do I find the count and location of two substrings in two different files?

2023-02-08 10:12 问答作者：

From two seque开发者_StackOverflow社区nces given I need to check for every three codons and if the changes are same as in the following list, then I have to check out the location of changes and the codons which are changed and count their number of occurrences.

For example:

sequence 1 - TTCAUUUCCCAU
sequence 2 - TTTAUAUCGCAC

The output which I need to get is

TTC->TTT considered/location-1/count-1
AUU->AUA considered/location-2/count-1
UCC->UCG considered/location-3/count-1

NOTE: CAU->CAC not considered because it is not there in the following list. LIST:-> The direction of changes should also be considered.

first sequence->second sequence
TTC->TTT
CTG->UUA
AUU->AUA
GUG->GUA
UCC->UCG
CCC->CCG
ACC->ACG
GCC->GCG
UAC->UAU
UGA->UAG
CAC->CAU
CAG->CAA
AAC->AAU
AAG->AAA
GAC->GAU
GAG->GAA
UGC->UGU
CGG->CGU
AGC->AGU
AGG->CGU
AGA->CGU
UAA->UAG
GGC->GGU

The code which I have written until now is:

print "Enter the sequence:";
$a = <>;

print "Enter the mutated sequence:";
$b = <>;

chomp($a);
chomp($b);

my @codon = split(/(\w{3})/, $a);
my @codon1 = split(/(\w{3})/, $b);

open(OUT, ">output.txt") or die;
$count = 0;
@new = ();
@new1 = ();
for ($i = 0; $i <= $#codon; $i++) {
    for ($j = 0; $j <= $#codon1; $j++) {
        if ($codon[$i] = {TTC}) || ($codon1[$j] = {TTT}) {
            $count++;
        }
    }
}
print OUT " @new";
close OUT;

#!/usr/bin/env perl
use strict;

my %seq_map = (
    "TTC"=>"TTT",
    "CTG"=>"UUA",
    "AUU"=>"AUA",
    "GUG"=>"GUA",
    "UCC"=>"UCG",
    "CCC"=>"CCG",
    "ACC"=>"ACG",
    "GCC"=>"GCG",
    "UAC"=>"UAU",
    "UGA"=>"UAG",
    "CAC"=>"CAU",
    "CAG"=>"CAA",
    "AAC"=>"AAU",
    "AAG"=>"AAA",
    "GAC"=>"GAU",
    "GAG"=>"GAA",
    "UGC"=>"UGU",
    "CGG"=>"CGU",
    "AGC"=>"AGU",
    "AGG"=>"CGU",
    "AGA"=>"CGU",
    "UAA"=>"UAG",
    "GGC"=>"GGU"
);

my %seq_count = ();

my $seq1 = "TTCAUUUCCCAU";
my $seq2 = "TTTAUAUCGCAC";

my $max = int(length($seq1) / 3);
for(my $i=0;$i<$max;$i++) {
    my $c1 = substr($seq1, $i*3, 3);
    my $c2 = substr($seq2, $i*3, 3);
    my $found = $seq_map{$c1};
    if ($found && ($found eq $c2)) {
        $seq_count{$c1} ||= 0;
        my $count = ++$seq_count{$c1};
        my $loc = $i+1;
        print "${c1}->${c2} considered / location ${loc} / count ${count}\n";
    }
}

There are many ways to accomplish this, as is the case typically in Perl.

If the file is not large, you can read in the file line by line into an array (or if it is already one entry per line, then just slurp the whole file into an array). Then use a while loop (and the second file's file handle) to compare the position of the dinucleotides.

Because this is a bioinformatics problem, and the files are typically large, I would be smart and look into reading from each file handle, line by line, and doing comparasons.

For the 3 character split you are trying to do, I would use a for loop, going until the length of the string you are checking divided by 3 -1. Then create a regex as you go on to grab the first three letters, then the next, and so on…

Something like /\d{$count}(\w{3})/

The while loop could look something like this:

#!/usr/bin/perl -w
use strict;

open FILE1, "file1.txt" or die "Cannot open file1.txt: $!\n";
open FILE2, "file2.txt" or die "Cannot open file2.txt: $!\n";

my $count = 0;
while (<FILE1>) {
    chomp(my $lineF1 = $_);
    chomp(my $lineF2 = <FILE2>);

    # some changes may need to be made to this if statement
    if ($lineF1 eq $lineF2) {
        # do something important here
        print "$lineF1\n";
    } else {
        print "Line $count mismatch\n";
    }
    $count++;
}

close(FILE1);
close(FILE2);

Can you consider that the codons in the two files are "aligned"? If that's the case, the problem is simple: you load the list of valid transitions in a 2-level hash:

 # of course, you load this from a file...
 $transitions{TTC}{TTT} = 1;
 $transitions{CTG}{UUA} = 1;
 ...

Then, reading both files, line by line (or are they just one string?):

# of course, I'm leaving out all the file manipulation...
my $line1 = <FILE1>;
my $line2 = <FILE2>;

my $maxlen1 = length($line1);
my $maxlen2 = length($line2);
my $i = 0;

while($i < $maxlen1 && $i < $maxlen2){
  my $codon1 = substr($line1, $i, $i+3);
  if(exists($transitions{$codon1}){
    my $codon2 = substr($line2, $i, $i+3);
    if(exists($transitions{$codon1}{$codon2}){
      print "we have a match $codon1 -> $codon2 at index $i\n";
    }
  }
  $i += 3;
}

NOTE use 'exists() instead of defined() as it will save you some extra computation. If you don't want to have nexted if(), you can compute $codon1 and $codon2 and then check for if(exists($transitions{$codon1}{$codon2})) {} Using 'exists' avoids the autovivification problem...

继续阅读：bioinformatics perl

How do I find the count and location of two substrings in two different files?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？