Perl -- Regex including all hash keys (sorted) + deleting empty fields from $_ when reading a file

I'm working on a program and I have a couple of questions, hope you can help:

First I need to access a file and retrieve specific information according to an index obtained from a previous step, in which the indexes to retrieve are found and stored in a hash.

I've been looking for a way to include all array elements in a regex that I can use in the file search, but I haven't been able to make it work. Eventually I found a way that works:

my @atoms = ();
my $natoms=0;

foreach my $atomi (keys %{$atome}){
push (@atoms,$atomi);
$natoms++;
}
@atoms = sort {$b cmp $a} @atoms;

and then I use it as a regex this way:

while (<IN_LIG>){
if (!$natoms) {last;}
......
if ($_ =~ m/^\s*$atoms[$natoms-1]\s+/){
    $natoms--;  
    .....
}

Is there any way to create a regex that would include all hash keys? They are numeric and must be sorted. The keys refer to the line index in IN_LIG, whose content is something like this:

8 C5          9.9153    2.3814   -8.6988 C.ar      1 MLK        -0.1500 

The key is to be found in column 0 (8). I have added ^ and \s+ to make sure it refers only to the first column.

My second problem is that the input files are not always identical and may contain whitespace before the index, so when I create an array from $_ I get column0 = " " instead of column0 = 8.

I don't understand why this "empty column" is not eliminated by the split command, and I'm having trouble removing it. This is what I have done:

@info = split (/[\s]+/,$_);

if ($info[0] eq " ") {splice (@info, 0,1);} # also tried $info[0] =~ m/\s+/

and when I print the array @info I get this:

Array: 

Array: 8

Array: C5

Array: 9.9153

Array: 2.3814

.....

How can I get rid of the empty column?

Many thanks for your help. Merche


There is a special form of split that skips leading whitespace (trailing empty fields are dropped by default in any case). It looks like this, try it:

my $line = '  begins  with    spaces  and ends   with   spaces    ';
my @tokens = split ' ', $line;
# This prints |begins:with:spaces:and:ends:with:spaces|
print "|", join(':', @tokens), "|\n";

See the documentation for split at http://p3rl.org/split (or run perldoc -f split).
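
To see why the original split /[\s]+/ left an empty first field, here is a minimal sketch comparing the two forms. The sample line is the one from the question, with leading spaces added for illustration; the variable names are made up.

```perl
use strict;
use warnings;

# Sample line from the question, with leading spaces added for illustration.
my $line = "   8 C5          9.9153    2.3814   -8.6988 C.ar      1 MLK        -0.1500 \n";

# A regex separator also matches the leading whitespace, so the (empty)
# text in front of that first match becomes the first field.
my @with_regex = split /\s+/, $line;

# The special ' ' separator skips leading whitespace before splitting.
my @with_space = split ' ', $line;

printf "regex: %d fields, first is '%s'\n", scalar @with_regex, $with_regex[0];
printf "space: %d fields, first is '%s'\n", scalar @with_space, $with_space[0];
```

The first field of the regex version is the empty string '', not a space, which is why a comparison against " " does not catch it.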

Also, the first part of your program might be simpler as:

my @atoms = sort {$b cmp $a} keys %$atome;
my $natoms = @atoms;

But, what is your ultimate goal with the atoms? If you simply want to verify that the atoms you're given are indeed in the file, then you don't need to sort them, nor to count them:

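my @atoms = keys %$atome;
while (<IN_LIG>){
    # The atom ID on this line (first whitespace-separated field)
    my ($atom_id) = split ' ';
    next unless defined $atom_id;   # skip blank lines
    # Is this atom ID one of the IDs we are looking for? Compare exactly:
    # a regex match such as /$atom_id/ would let 8 match 18.
    if (grep { $_ eq $atom_id } @atoms) {
        # This line of the file has an atom that was in the array: $atom_id
    }
}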
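
Since the wanted indexes are already hash keys, a direct key lookup with exists may be simpler than scanning an array with grep. The following is only a sketch: the contents of $atome and the sample lines in @lines are invented stand-ins for the real hash and for reading from IN_LIG.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the question's data: $atome is the hash
# reference of wanted atom indexes, @lines replaces reading from IN_LIG.
my $atome = { 8 => 1, 12 => 1 };
my @lines = (
    "  8 C5   9.9153  2.3814  -8.6988 C.ar 1 MLK -0.1500\n",
    " 18 C6   1.0000  2.0000   3.0000 C.ar 1 MLK  0.0000\n",
    "\n",
    " 12 N1   4.0000  5.0000   6.0000 N.3  1 MLK -0.2000\n",
);

my @found;
for (@lines) {
    my ($atom_id) = split ' ';
    next unless defined $atom_id;    # blank line, nothing to look up
    # exists is an exact key lookup, so 18 does not match the wanted 8.
    push @found, $atom_id if exists $atome->{$atom_id};
}
print "found: @found\n";
```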


Let's warm up by refining and correcting some of your code:

# If these are all numbers, do a numerical sort: <=> not cmp
my @atoms = ( sort { $b <=> $a } keys %{$atome} ); 
my $natoms = scalar @atoms;

No need to loop through the keys, you can insert them into the array right away. You can also sort them right away, and if they are numbers, the sort must be numerical, otherwise you will get a sort like: 1, 11, 111, 2, 22, 222, ...

$natoms can be assigned directly by the count of values in @atoms.
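
The difference between the two sort orders can be checked with a small sketch like this one (the index values are made up):

```perl
use strict;
use warnings;

my @ids = (2, 111, 1, 22, 11, 222);    # made-up sample indexes

my @as_strings = sort { $a cmp $b } @ids;    # compares character by character
my @as_numbers = sort { $a <=> $b } @ids;    # compares numeric values

print "cmp: @as_strings\n";    # 1 11 111 2 22 222
print "<=>: @as_numbers\n";    # 1 2 11 22 111 222
```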


while(<IN_LIG>) {
    last unless $natoms;
    my $key = (split)[0]; # split splits on whitespace and $_ by default
    $natoms-- if ($key == $atoms[$natoms - 1]);
}

I'm not quite sure what you are doing here, or whether it is the best way, but this code should work, whereas your regex is fragile: inside a regex, [] are metacharacters, so Perl has to guess whether $atoms[$natoms-1] is an array element or a scalar followed by a character class. Called without arguments, split splits $_ on whitespace, so you need not be explicit about that. This form also discards leading whitespace. Your empty field is most likely an empty string, '', and not a space ' ', which is why your comparison against " " never matched.

The best way to compare two numbers is not by a regex, but with the equality operator ==.

Your empty field should be gone by splitting on whitespace. The default for split is split ' '.

Also, if you are not already doing it, you should use:

use strict;
use warnings;

It will save you a lot of headaches.


For your second question, you could use this line:

@info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;

in order to capture 9 items from each line (assuming they do not contain whitespace).

The first question I do not understand.

Update: I would read all the lines of the file and store them in a hash with $info[0] as the key and [@info[1..8]] as the value. Then you can look up the entries by your index.

my %details;
while (<IN_LIG>) {
    # Capture the nine whitespace-separated fields of the line
    my @info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
    next unless @info;    # skip lines that do not have at least nine fields
    $details{ $info[0] } = [ @info[1..$#info] ];
}

Later you can look up details for the indices you are interested in and process them as needed. This assumes the index is unique (has the property of keys).
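
A minimal sketch of that later lookup step, under some assumptions: the sample lines and the %wanted hash are invented, and split ' ' is used here in place of the nine-capture regex to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines of IN_LIG.
my @lines = (
    "  8 C5  9.9153  2.3814  -8.6988 C.ar 1 MLK -0.1500\n",
    "  9 C6  8.1000  1.2000  -7.3000 C.ar 1 MLK  0.0500\n",
);

# Build the index => [remaining fields] table.
my %details;
for my $line (@lines) {
    my @info = split ' ', $line;
    next unless @info;    # skip blank lines
    $details{ $info[0] } = [ @info[ 1 .. $#info ] ];
}

# Look up only the wanted indexes (made up here), in numeric order.
my %wanted = ( 8 => 1, 42 => 1 );
for my $index ( sort { $a <=> $b } keys %wanted ) {
    if ( my $fields = $details{$index} ) {
        print "$index: name=$fields->[0] type=$fields->[4]\n";
    }
    else {
        print "$index: not in file\n";
    }
}
```

Reading the whole file into a hash costs memory proportional to the file size, but it removes any dependence on the order of the lines.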


Thanks for all your replies. I tried the split form with ' ' and it saved me several lines of code. Thanks!

As for the regex, I found something that could combine all the keys into a single pattern string with join and quotemeta, but I couldn't make it work. Nevertheless, I found an alternative that works, although I liked the join/quotemeta solution better.

The atom indexes are obtained from a text file according to some energy threshold. Later, in the IN_LIG loop, I need to access the molecule file to obtain more information about the atoms selected, thus I use the atom "index" in the molecule to identify which lines of the file I have to read and process. This is a subroutine to which I send a hash with the atom index and some other information.

I tried this for the regex:

 my $strings = join "|" map quotemeta,
 sort { $hash->{$b} <=> $hash->{$a}} keys  %($hash);

but I did something wrong, because it wouldn't take all the keys.
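
As posted, the snippet has no comma after "|", writes %($hash) where %{$hash} (or %$hash) is the dereference syntax, and sorts the keys by their hash values rather than by the keys themselves. A hedged sketch of how the join/quotemeta approach could look follows; the contents of $hash and the sample lines are invented, and the grouping and anchoring of the pattern are one possible choice, not the only one.

```perl
use strict;
use warnings;

# Hypothetical hash reference; keys are atom indexes, values are other data.
my $hash = { 8 => 'a', 18 => 'b', 3 => 'c' };

# Sort the keys themselves numerically, escape each one, join with '|'.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { $a <=> $b } keys %{$hash};

# Group the alternation so that ^\s* and the trailing \s apply to every
# alternative; the trailing \s stops 8 from matching the start of 81.
my $re = qr/^\s*($alternation)\s/;

my @matched;
for my $line ( "  8 C5 9.9153\n", " 81 C9 1.0\n", " 18 C6 2.0\n" ) {
    push @matched, $1 if $line =~ $re;
}
print "matched indexes: @matched\n";
```

Sorting is not needed for the alternation to match correctly; it only makes the generated pattern easier to read. For a large number of keys, an exact hash lookup on the first field is likely to be simpler than a long alternation.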
