Why do I get extra line breaks in the web page I download with Perl?
I'm writing a simple Perl script (on Windows) to download the response of a GET request to a URL to a file. Pretty straightforward. Except when it writes to the output file, I get extra line breaks. So like instead of:
<head>
<title>title</title>
<link .../>
</head>
I get
<head>

<title>title</title>

<link .../>

</head>
Here's the Perl script:
use LWP::Simple;
my $url = $ARGV[0];
my $content = get($url);
open(outputFile, '+>', $ARGV[1]);
print outputFile $content;
close(outputFile);
I suppose I could just get wget for Windows, but now this is bothering me. How do I get rid of those extra line breaks?!
- There's no sane reason for the +> mode in your example code. Just saying. LWP::Simple has a getstore function. If you're using LWP::Simple, why not use it?
- By default, open is going to push the :crlf I/O layer when running on Win32, which turns \n into \r\n. But the data you're writing already has \r\n, so you're ending up with too many newlines. If you want data to be written verbatim, you should use binmode, or open the handle with :raw to begin with. LWP already does this correctly.
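To make the comment above concrete, here is a minimal sketch that writes the same CRLF data once through an explicit :crlf layer (mimicking the Windows default) and once through :raw, then reads the bytes back. The helper name bytes_written and the sample string are made up for illustration; only core modules are used.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $data to a fresh temp file using the given
# open mode, then return the exact bytes that ended up on disk.
sub bytes_written {
    my ($mode, $data) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close($tmp);

    open(my $out, $mode, $path) or die "Cannot open $path: $!";
    print {$out} $data;
    close($out) or die "Cannot close $path: $!";

    # Read back without any translation so we see what is really there.
    open(my $in, '<:raw', $path) or die "Cannot reopen $path: $!";
    local $/;                      # slurp mode
    my $bytes = <$in>;
    close($in);
    return $bytes;
}

# Stand-in for a response body that already uses CRLF line endings.
my $content = "<head>\r\n<title>title</title>\r\n";

my $translated = bytes_written('>:crlf', $content);  # "\n" becomes "\r\n" again
my $verbatim   = bytes_written('>:raw',  $content);  # bytes written as-is

printf "with :crlf: %d bytes, with :raw: %d bytes\n",
    length($translated), length($verbatim);
```

With :crlf each line ends up terminated by \r\r\n, which many editors display as an extra blank line; with :raw the file matches $content exactly.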
I'm guessing that $content already includes CRLF newlines and Perl's IO layer is doing LF -> CRLF conversion. (Internally, "\n" is a single character in Perl, usually LF.) I'd add
binmode(outputFile);
after the open to disable that conversion and write the contents of $content directly.
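As a rough sketch of that suggestion, the writing part of the script could be wrapped like this. The helper name write_verbatim is hypothetical, it uses a lexical filehandle and plain > instead of the bareword handle and +> from the question, and it assumes $content contains only byte-sized characters.

```perl
use strict;
use warnings;

# Hypothetical helper: write $content to $path byte-for-byte.
sub write_verbatim {
    my ($path, $content) = @_;
    open(my $out, '>', $path) or die "Cannot open $path: $!";
    binmode($out);               # turn off LF -> CRLF translation on this handle
    print {$out} $content;
    close($out) or die "Cannot close $path: $!";
    return;
}
```

In the script from the question this would be called as write_verbatim($ARGV[1], $content); after the get.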
chomp($content) would be my guess, as it looks like there is already a set of \n's in it.
EDIT: Sorry, I just realized that chomp won't work unless you split the content into lines and chomp each line, since chomp only removes the newline at the end of the input string. So my solution wouldn't help in this case. You could split on \n\n and then join instead. I do like the solution in another answer that uses a regex on the returned string, though I'd modify it slightly so that it still separates lines but matches any run of \n's and \r's, in any combination, and replaces the run with a single \n. That way there is only one newline per line (hopefully):
$content =~ s/[\n\r]+/\n/g;
EDITED the above again; I accidentally put a ! in there for some reason... not sure why.
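For comparison, here is a small sketch of what that regex does next to a narrower one. Both function names and the sample string are made up for illustration. Note that collapsing every run of \r and \n also removes blank lines that were in the page on purpose, whereas replacing each CRLF (or lone CR) with \n keeps them.

```perl
use strict;
use warnings;

# The regex from the answer above: every run of CR/LF characters becomes
# a single "\n". Intentional blank lines are removed as a side effect.
sub collapse_newlines {
    my ($text) = @_;
    $text =~ s/[\n\r]+/\n/g;
    return $text;
}

# Narrower alternative (an assumption, not from the thread): turn each
# CRLF or lone CR into "\n" and leave blank lines in place.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

# Sample with CRLF endings and one intentional blank line before "c".
my $sample = "a\r\nb\r\n\r\nc\n";

my $collapsed  = collapse_newlines($sample);   # blank line is gone
my $normalized = normalize_newlines($sample);  # blank line survives
```

Either way the string then contains only \n, so writing it through the default text-mode handle on Windows produces ordinary \r\n line endings.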