
Why does my Perl script die with an "out of memory" exception?

I need to read a 200 MB space-separated file line by line and collect its contents into an array.

Every time I run the script, Perl throws an "out of memory" exception, but I don't understand why!

Some advice please?

#!/usr/bin/perl -w
use strict;
use warnings;

open my $fh, "<", "../cnai_all.csd" or die "Cannot open file: $!";
my @parse = ();

while (<$fh>) {
     my @words = split(/\s/,$_);
     push (@parse, \@words);
}

print scalar @parse;

The cnai file looks like this: it contains 11000 rows with 4200 values per line, separated by spaces.

VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 
VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 
VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 
VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 

The code above is just a stripped down sample.

The final script will store all values in a hash and write them to a database later.

But first, I have to solve that memory problem!


That would be because... you're running out of memory!

You're not merely storing 200MB of data. You're creating a new list data structure for each line, with all of its associated overhead, and also creating a bunch of separate string objects for each word, with all of their associated overhead.

Edit: As an example of the kind of overhead we're talking about here, each and every value (and that includes strings) has the following overhead:

/* start with 2 sv-head building blocks */
#define _SV_HEAD(ptrtype) \
    ptrtype sv_any;     /* pointer to body */   \
    U32     sv_refcnt;  /* how many references to us */ \
    U32     sv_flags    /* what we are */

#define _SV_HEAD_UNION \
    union {             \
    char*   svu_pv;     /* pointer to malloced string */    \
    IV      svu_iv;         \
    UV      svu_uv;         \
    SV*     svu_rv;     /* pointer to another SV */     \
    SV**    svu_array;      \
    HE**    svu_hash;       \
    GP* svu_gp;         \
    }   sv_u


struct STRUCT_SV {      /* struct sv { */
    _SV_HEAD(void*);
    _SV_HEAD_UNION;
};

So that's at least 4 32-bit values per Perl object.
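
To get a feel for how large that overhead is in practice, the CPAN module Devel::Size can report how much memory a Perl structure really occupies. A minimal sketch, assuming Devel::Size is installed (the sample line below is made up to match the data format):

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Size qw(total_size);

my $line  = "VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C";
my @words = split ' ', $line;          # the same shape the loop above builds per row

printf "raw characters:        %d bytes\n", length $line;
printf "array of Perl strings: %d bytes\n", total_size(\@words);

Multiply that per-row difference by 11000 rows of 4200 fields each and the 200 MB file easily grows to several times its size once it lives in Perl data structures.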


Generally this means you are running out of memory available to Perl, but it's possible you aren't running out of system memory. First, there are ways to get more information on Perl's memory usage in the perldebguts documentation, though you may find yourself recompiling perl to use them. (Also note the warning in that document about perl's hunger for memory...)

However, on many operating systems memory limits can be set per-process or per-user. If, for example, you're using Linux (or another POSIX system), you might need to alter your ulimits. Type 'ulimit -a' and look at the memory sizes; it's possible your 'max memory size' is below the memory in your machine, or that you have a limited data seg size. You can then raise it with the appropriate option, e.g. ulimit -d 1048576 for a 1 GB data seg size limit.

Of course, there is another option: process the file line-by-line, if your situation allows it. (The example code above can be rewritten in such a way.)


Rather than reading all 46,200,000 values into memory at once, process the file one line at a time: your description of the data in cnai_all.csd as having many rows suggests that each line can be handled independently. If so, use

while (<$fh>) {
  my @words = split /\s/, $_;
  insert_row \@words;
}

where insert_row is a sub you'd define to insert that row into your database.
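
For illustration only, a minimal sketch of what insert_row could look like with DBI; the DSN, table name, and column names here are placeholders, not anything from the original question:

use DBI;

# Hypothetical connection -- substitute your real driver, database, and credentials.
my $dbh = DBI->connect("dbi:SQLite:dbname=cnai.db", "", "", { RaiseError => 1 });

# Prepare the statement once so the loop only pays for execute().
my $ins = $dbh->prepare("INSERT INTO cnai (col_a, col_b, col_c) VALUES (?, ?, ?)");

sub insert_row {
    my ($words) = @_;                  # array reference holding one line's fields
    $ins->execute(@{$words}[0 .. 2]);  # only three columns in this sketch
}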

Note that split /\s/ is often a mistake. The perlfunc documentation on split explains:

As a special case, specifying a PATTERN of space (' ') will split on white space just as split with no arguments does. Thus, split(' ') can be used to emulate awk's default behavior, whereas split(/ /) will give you as many null initial fields as there are leading spaces. A split on /\s+/ is like a split(' ') except that any leading whitespace produces a null first field. A split with no arguments really does a split(' ', $_) internally.

In the nominal case, everything is fine:

  DB<1> x split /\s/, "foo bar baz"
0  'foo'
1  'bar'
2  'baz'

But what if there are multiple spaces between fields? Does that mean an empty field or just a "wide" separator?

  DB<2> x split /\s/, "foo  bar baz"
0  'foo'
1  ''
2  'bar'
3  'baz'

What about leading whitespace?

  DB<3> x split /\s/, " foo bar baz"
0  ''
1  'foo'
2  'bar'
3  'baz'
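
For comparison, the special case of a literal single-space pattern (split ' ') avoids both surprises, discarding leading whitespace and collapsing runs of spaces:

  DB<4> x split ' ', "  foo  bar baz"
0  'foo'
1  'bar'
2  'baz'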

The default behavior of split isn't arbitrary. Let the tool work for you!

while (<$fh>) {
  insert_row [ split ];
}


Your while loop does not read from the file. You should have

<$fh>

or something inside the parentheses.


At last I have found a more suitable solution for my problem:

After some research for other parsers I've had to develop, I learned about the module DBD::CSV.

DBD::CSV allows me to select only the needed columns (out of more than 4000) of the whitespace-separated fields. This reduces memory usage and performs quite well.

More at the DBD::CSV distribution on CPAN.

Thanks to gbacon I've changed my strategy from reading the whole file in one go to reading it part by part. DBD::CSV makes this possible without much coding.

#!/usr/bin/perl  -w

use strict;
use warnings;
use DBI;
## -------------------------------------------------------------------------##

## -------------------------------------------------------------------------##
## SET GLOBAL CONFIG #############
my $globalConfig = {
                _DIR => qq{../Data},
                _FILES => {
                    'cnai_all.csd'   => '_TEST'
                    }               
                };
## -------------------------------------------------------------------------##


## -------------------------------------------------------------------------##
my $sTime = time();

my $sepChar = " ";
my $csv_dbh = DBI->connect("DBI:CSV:f_dir=".$globalConfig->{_DIR}.";");

$csv_dbh->{csv_eol} ="\n";
#$csv_dbh->{csv_quote_char} ="'";
#$csv_dbh->{csv_escape_char} ="\\";
$csv_dbh->{csv_null} = 1;
$csv_dbh->{csv_quote_char} = '"';
$csv_dbh->{csv_escape_char} = '"';
$csv_dbh->{csv_sep_char} = "$sepChar";
$csv_dbh->{csv_always_quote} = 0;
$csv_dbh->{csv_quote_space} = 0;
$csv_dbh->{csv_binary} = 0;
$csv_dbh->{csv_keep_meta_info} = 0;
$csv_dbh->{csv_allow_loose_quotes} = 0;
$csv_dbh->{csv_allow_loose_escapes} = 0;
$csv_dbh->{csv_allow_whitespace} = 0;
$csv_dbh->{csv_blank_is_undef} = 0;
$csv_dbh->{csv_empty_is_undef} = 0;
$csv_dbh->{csv_verbatim} = 0;
$csv_dbh->{csv_auto_diag} = 0;


my @list = $csv_dbh->func('list_tables');
my $sth = $csv_dbh->prepare("SELECT CELL,NW,BSC,n_cell_0 FROM cnai_all.tmp");


#print join ("\n",@list);

print "\n-------------------\n";

$sth->execute();
while (my $row = $sth->fetchrow_hashref) {
    # just print the hash reference (its stringified address)
    print "$row\n";
}
$sth->finish();

print "\n finish after ".(time()-$sTime)." sec ";

On my machine this runs in roughly 20 seconds and uses no more than 10 MB of memory.


The database you are using probably has a bulk import function. I would try that first.

If you need to do something with each row before putting it into the database (assuming the operations do not require referencing other rows), you should insert the row into the database as soon as processing is complete (turn AutoCommit off) rather than trying to store all the data in memory.
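
A rough sketch of that pattern with DBI; the DSN, table, and column names are placeholders, and the batch size is arbitrary. The point is that each processed line goes straight to the database instead of into a growing array:

use strict;
use warnings;
use DBI;

# Placeholder connection details -- substitute your real database here.
my $dbh = DBI->connect("dbi:mysql:dbname=cnai", "user", "password",
                       { RaiseError => 1, AutoCommit => 0 });
my $ins = $dbh->prepare("INSERT INTO cnai_rows (c1, c2, c3) VALUES (?, ?, ?)");

open my $fh, "<", "../cnai_all.csd" or die "Cannot open file: $!";
my $count = 0;
while (<$fh>) {
    my @words = split ' ', $_;               # process one line...
    $ins->execute(@words[0 .. 2]);           # ...and ship it off immediately
    $dbh->commit unless ++$count % 10_000;   # commit in batches, not per row
}
$dbh->commit;                                # flush the final partial batch
$dbh->disconnect;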

If the processing of each row depends on information in other rows, then you can use Tie::File to treat the input file as an array of lines. Again, don't try to store the contents of each line in memory. When processing is complete, ship it off to the database.
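
And a small illustration of the Tie::File approach; Tie::File ships with Perl and presents the file as an array whose elements are fetched from disk on demand (the cross-row lookup below is placeholder logic):

use strict;
use warnings;
use Tie::File;

tie my @lines, 'Tie::File', '../cnai_all.csd'
    or die "Cannot tie file: $!";

for my $i (0 .. $#lines) {
    my @words = split ' ', $lines[$i];
    # Rows can consult other rows by index without the whole file in memory,
    # e.g. look back at the previous line (placeholder logic):
    my @prev = $i ? split(' ', $lines[$i - 1]) : ();
    # ... process @words (and @prev), then ship the result off to the database ...
}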
