开发者

is there a way to designate the line token delimiter in Perl's file reader?

I'm reading a text file via CGI in, in perl, and noticing that when the file is saved in mac's textEdit the line separator is recognized, but when I upload a CSV that is exported straight from excel, they are not. I'm g开发者_开发百科uessing it's a \n vs. \r issue, but it got me thinking that I don't know how to specify what I would like the line terminator token to be, if I didn't want the one it's looking for by default.


Yes. You'll want to overwrite the value of $/. From perlvar

$/

The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)

local $/;           # enable "slurp" mode
local $_ = <FH>;    # whole file now here
s/\n[ \t]+/ /g;

Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)

Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:

local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;

will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.

On VMS, record reads are done with the equivalent of sysread, so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.

See also "Newlines" in perlport. Also see $..


The variable has multiple names:

  • $/
  • $RS
  • $INPUT_RECORD_SEPARATOR

For the longer names, you need:

use English;

Remember to localize carefully:

{
local($/) = "\r\n";
...code to read...
}


If you are reading in a file with CRLF line terminators, you can open it with the CRLF discipline, or set the binmode of the handle to do automatic translation.

open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";

This will transparently convert \r\n sequences into \n sequences.

You can also apply this translation to an existing handle by doing:

binmode( $fh, ':crlf' );

:crlf mode is typically default in Win32 Perl environments and works very well in practice.


For reading a CSV file, follow Robert-P's advice in his comment, and use a CSV module.

But for the general case of reading lines from a file with different line-endings, what I generally do is slurp the file whole and split it on \R. If it's not a multi-gigabytes file, that should be the safest and easiest way.

So:

perl -ln -0777 -e 'my @lines = split /\R/;
    print length($_), " bytes split into ", scalar(@lines), " lines."' $YOUR_FILE

or in your script:

{
  local $/ = undef;
  open F, $YOUR_FILE or die;
  @lines = split /\R/, <F>;
  close F;
}

\R works with Unix LF (\x0A), Windows/Internet CRLF, and also with CR (\x0D) which was used by Macs in the nineties, but is in fact still used by some Mac programs.

From the perldoc :

\R matches a generic newline; that is, anything considered a linebreak sequence by Unicode. This includes all characters matched by \v (vertical whitespace), and the multi character sequence "\x0D\x0A" (carriage return followed by a line feed, sometimes called the network newline; it's the end of line sequence used in Microsoft text files opened in binary mode)

Or see this much nicer and exhaustive explanation about \R in Brian D Foy's article : The \R generic line ending which even has a couple of fun videos.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜