Need help speeding up my perl program
OK, so I am working on an exploit finder to run against change roots. My issue is that when searching for a large number of strings in a large number of files (i.e. htdocs), it is taking longer than I would like. I'm positive some advanced Perl writers out there can help me speed things up a bit. Here is the part of my program I would like to improve.
sub sStringFind {
    if (-B $_) {
    } else {
        open FH, '<', $_;
        my @lines = <FH>;
        foreach $fstring (@lines) {
            if ($fstring =~ /sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder/) {
                push(@huhFiles, "$_");
            }
        }
    }
}
#End suspicious string find.
find(\&sStringFind, "$cDir/www/htdocs");
for (@huhFiles) {
    print "$_\n";
}
Perhaps some hashing? Not sure, I'm not great with Perl at the moment. Any help is appreciated, thanks guys.
So by "hashing" I presume you mean doing a checksum at the file or line level so you don't have to check it again?
The basic problem is, checksum or not, you still have to read every line of every file either to scan it or to hash it. So this doesn't fundamentally change your algorithm, it just pushes around the constants.
If you have a lot of duplicate files, checksumming at the file level might save you a lot of time. If you don't, it will waste a lot of time.
cost = (checksum_cost * num_files) + (regex_cost * lines_per(unique_files))
Checksumming at the line level is a toss-up between the cost of the regex and the cost of the checksum. If there aren't many duplicate lines, you lose. If your checksum is too expensive, you lose. You can write it out like so:
cost = (checksum_cost * total_lines) + (regex_cost * (total_lines - duplicate_lines))
I'd start by figuring out what percentage of the files and lines are duplicates. That's as simple as:
$line_frequency{ checksum($line) }++
and then looking at the percentage where the frequency is >= 2. That percentage is the maximum performance increase you will see by checksumming. If it's 50%, you will only ever see an increase of 50%. That assumes the checksum cost is 0, which it isn't, so you're going to see less. If the checksum costs half what the regex costs, then you'll only see 25%.
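For example, a quick way to measure that duplicate percentage is a short sketch along these lines (using Digest::MD5; the @files_to_check list is just a placeholder for whatever files you want to sample):

use Digest::MD5 'md5';

my (%line_frequency, $total_lines);
for my $file (@files_to_check) {        # hypothetical list of files to sample
    open my $fh, '<', $file or next;
    while (my $line = <$fh>) {
        $line_frequency{ md5($line) }++;
        $total_lines++;
    }
}

# Every checksum seen N times accounts for N-1 duplicate lines.
my $duplicates = 0;
$duplicates += $_ - 1 for grep { $_ >= 2 } values %line_frequency;
printf "%.1f%% of lines are duplicates\n", 100 * $duplicates / $total_lines;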
This is why I recommend grep. It will iterate through files and lines faster than Perl can, attacking the fundamental problem: you have to read every file and every line.
What you can do is not look at every file every time. A simple thing to do is remember the last time you scanned and look at the modification time of each file. If it hasn't changed, and your regex hasn't changed, don't check it again. A more robust version would be to store a checksum of each file, in case the file was changed but the modification time wasn't updated. If most of your files aren't changing very often, that will be a big win.
# Write a timestamp file at the top of the directory you're scanning
sub set_last_scan_time {
    my $dir  = shift;
    my $file = "$dir/.last_scan";

    open my $fh, ">", $file or die "Can't open $file for writing: $!";
    print $fh time;

    return;
}

# Read the timestamp file
sub get_last_scan_time {
    my $dir  = shift;
    my $file = "$dir/.last_scan";

    return 0 unless -e $file;

    open my $fh, "<", $file or die "Can't open $file: $!";
    my $time = <$fh>;
    chomp $time;

    return $time;
}
use File::Find;
use File::Slurp 'read_file';
use File::stat;

my $last_scan_time = get_last_scan_time($dir);

# Place the regex outside the routine just to make things tidier.
my $regex = qr{this|that|blah|...};

my @huhFiles;
sub scan_file {
    # Only scan text files
    return unless -T $_;

    # Don't bother scanning if it hasn't changed
    return if stat($_)->mtime < $last_scan_time;

    push(@huhFiles, $_) if read_file($_) =~ $regex;
}

# Set the scan time to before you start so if anything is edited
# while you're scanning you'll catch it next time.
set_last_scan_time($dir);

find(\&scan_file, $dir);
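If the modification times can't be trusted, a minimal sketch of the more robust per-file checksum cache mentioned above could look like the following. It reuses $regex, @huhFiles and read_file() from the snippet just above; the cache file name and the helper's name are my own inventions, and error handling is kept to a minimum:

use Digest::MD5 'md5_hex';
use Storable qw(retrieve nstore);

my $cache_file = "$dir/.scan_checksums";    # hypothetical cache location
my $checksums  = -e $cache_file ? retrieve($cache_file) : {};

sub scan_file_checksummed {
    return unless -T $_;

    my $contents = read_file($_);
    my $digest   = md5_hex($contents);
    my $path     = $File::Find::name;

    # Skip files whose contents haven't changed since the last scan.
    return if defined $checksums->{$path} && $checksums->{$path} eq $digest;
    $checksums->{$path} = $digest;

    push(@huhFiles, $path) if $contents =~ $regex;
}

find(\&scan_file_checksummed, $dir);
nstore($checksums, $cache_file);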
You're not doing anything that will cause an obvious performance problem, so you will have to look outside Perl. Use grep. It should be much faster.
open my $grep, "-|", "grep", "-l", "-P", "-I", "-r", $regex, $dir;
my @files = <$grep>;
chomp @files;
-l will return just filenames that match. -P will use Perl compatible regular expressions. -r will make it recurse through files. -I will ignore binary files. Make sure your system's grep has all those options.
Contrary to the other answers, I would suggest performing the regex once on each entire file, not once per line.
use File::Slurp 'read_file';
...
if (-B $_) {
} else {
    if ( read_file("$_") =~ /sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder/ ) {
        push(@huhFiles, $_);
    }
}
Make sure you are using at least perl5.10.1.
There are a number of things I would do to improve performance.
First, you should be precompiling your regex. In general, I do it like this:
my @items = qw(foo bar baz);                  # usually I pull this from a config file
my $regex = '^' . join("|", @items) . '$';    # as an example; I do a lot of capturing, too
$regex = qr($regex)i;
Second, as mentioned, you should be reading the files a line at a time. Most performance problems I've seen come from running out of RAM, not CPU.
Third, if you are maxing out one CPU and have a lot of files to work through, split the app into a caller and receivers using fork(), so that you can process multiple files at one time on more than one CPU. You could write to a common file and, when done, parse that (see the sketch at the end of this answer).
Finally, watch your memory usage: a lot of times, appending results to a file lets you keep what is held in memory a lot smaller.
I have to process large data dumps using 5.8 and 5.10, and this works for me.
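A minimal sketch of that fork() approach, assuming a @files list has already been gathered and the pattern is precompiled into $regex (the worker count, chunking scheme, and output file names are placeholders; each child writes to its own file to keep the output simple):

my $workers = 4;                                  # assumed number of CPUs
my @chunks;
push @{ $chunks[ $_ % $workers ] }, $files[$_] for 0 .. $#files;

my @pids;
for my $chunk (@chunks) {
    next unless $chunk;                           # skip empty slots
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid) {                                   # parent: remember the child, keep forking
        push @pids, $pid;
        next;
    }

    # Child: scan its share of the files, write matches to its own temp file.
    open my $out, '>', "matches.$$.txt" or die $!;
    for my $file (@$chunk) {
        open my $fh, '<', $file or next;
        while (my $line = <$fh>) {
            if ($line =~ $regex) {
                print {$out} "$file\n";
                last;                             # first hit is enough for this file
            }
        }
    }
    exit 0;
}

# Parent: wait for all children, then merge/parse the matches.*.txt files.
waitpid($_, 0) for @pids;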
I'm not sure if this will help, but when you read <FH> into @lines you're reading the entire file into a Perl array all at once. You might get better performance by opening the file and reading it line by line, rather than loading the entire file into memory before processing it. However, if your files are small, your current method might actually be faster...
See this page for an example: http://www.perlfect.com/articles/perlfile.shtml
It might look something like this (note the scalar $line variable - not an array):
open FH, '<', $_;
while (my $line = <FH>)
{
    # do something with line
}
close FH;
As written, your script reads the entire contents of each file into @lines, then scans every line. That suggests two improvements: Reading a line at a time, and stopping as soon as a line matches.
Some additional improvements: The if (-B $_) {} else { ... } is odd - if you only want to process text files, use the -T test. You should always check the return value of open(). And there's a useless use of quotes in your push(). Taken all together:
sub sStringFind {
    if (-T $_) {
        # Always - yes, ALWAYS check for failure on open()
        open(my $fh, '<', $_) or die "Could not open $_: $!";
        while (my $fstring = <$fh>) {
            if ($fstring =~ /sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder/) {
                push(@huhFiles, $_);
                last; # No need to keep checking once this file's been flagged
            }
        }
    }
}
Just to add something else.
If you're assembling your regexp from a list of search terms, then Regexp::Assemble::Compressed can be used to fold your search terms into a shorter regular expression:
use Regexp::Assemble::Compressed;
my @terms = qw(sendraw portscan stunshell Bruteforce fakeproc sub google sub alltheweb sub uol sub bing sub altavista sub ask sub yahoo virgillio filestealth IO::Socket::INET /usr/sbin/bjork /usr/local/apache/bin/httpd /sbin/syslogd /sbin/klogd /usr/sbin/acpid /usr/sbin/cron /usr/sbin/httpd irc.byroe.net milw0rm tcpflooder);
my $ra = Regexp::Assemble::Compressed->new;
$ra->add("\Q${_}\E") for @terms;
my $re = $ra->re;
print $re."\n";
print "matched" if 'blah blah yahoo' =~ m{$re};
This produces:
(?-xism:(?:\/(?:usr\/(?:sbin\/(?:(?:acpi|http)d|bjork|cron)|local\/apache\/bin\/httpd)|sbin\/(?:sys|k)logd)|a(?:l(?:ltheweb|tavista)|sk)|f(?:ilestealth|akeproc)|s(?:tunshell|endraw|ub)|(?:Bruteforc|googl)e|(?:virgilli|yaho)o|IO::Socket::INET|irc\.byroe\.net|tcpflooder|portscan|milw0rm|bing|uol))
matched
This may be of benefit for very long lists of search terms, particularly for Perl pre 5.10.
Just working from your code:
#!/usr/bin/perl
# it looks awesome to use strict
use strict;
# using warnings is beyond awesome
use warnings;
use File::Find;
my $keywords = qr[sendraw|portscan|stunshell|Bruteforce|fakeproc|sub google|sub alltheweb|sub uol|sub bing|sub altavista|sub ask|sub yahoo|virgillio|filestealth|IO::Socket::INET|\/usr\/sbin\/bjork|\/usr\/local\/apache\/bin\/httpd|\/sbin\/syslogd|\/sbin\/klogd|\/usr\/sbin\/acpid|\/usr\/sbin\/cron|\/usr\/sbin\/httpd|irc\.byroe\.net|milw0rm|tcpflooder];
my @huhfiles;
find sub {
return unless -f;
my $file = $File::Find::name;
open my $fh, '<', $file or die "$!\n";
local $/ = undef;
my $contents = <$fh>;
# modern Perl handles this but it's a good practice
# to close the file handle after usage
close $fh;
if ($contents =~ $keywords) {
push @huhfiles, $file;
}
}, "$cDir/www/htdocs";
if (@huhfiles) {
print join "\n", @huhfiles;
} else {
print "No vulnerable files found\n";
}
Don't read all of the lines at once. Read one line at a time, and then when you find a match in the file, break out of the loop and stop reading from that file.
Also, don't interpolate when you don't need to. Instead of
push(@huhFiles, "$_");
do
push(@huhFiles, $_);
This won't be a speed issue, but it's better coding style.
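Putting both points together, a minimal sketch of how the loop inside sStringFind might look (assuming the pattern has been precompiled into a $regex variable):

open my $fh, '<', $_ or return;
while (my $line = <$fh>) {
    if ($line =~ $regex) {
        push(@huhFiles, $_);
        last;    # stop reading this file after the first match
    }
}
close $fh;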