Perl: splitting a text string (from an HTML page, text document, etc.) by line into an array?
This is kind of a weird question, at least for me, as I don't exactly understand what is fully involved in this. Basically, I have been doing this process where I save a scraped document (such as a web page) to a .txt file. Then I can easily use Perl to read this file and put each line into an array. However, it is not doing this based on any visible thing in the document (i.e., it is not going by HTML line breaks); it just knows where a new line is, based on the .txt format.
However, I would like to cut this process out and just do the same thing from within a variable, so instead I would have what would have been the contents of the .txt file in a string, and then I want to parse it, in the same way, line by line. The problem is that I don't know much about how this would work, since I don't really understand how Perl is able to tell where a new line is. I'm not going by HTML line breaks, as often what I'm scraping is just a web-based .txt file (which presents to my scraper, WWW::Mechanize, as a web page), so there is no HTML to go by. I figure I could do this using other delimiters, such as blank spaces, but I'm interested to know if there is a way to do it by line. Any info is appreciated.
I'd like to cut out the actual saving of a file to reduce issues related to permissions on the servers I use, and I'm also curious whether I can make the process more efficient.
Here's an idea that might help you: you can open from strings as well as from files.
So if you used to do this:
open( my $io, '<', 'blah.txt' ) or die "Could not open blah.txt! - $!";
my @list = <$io>;
You can just do this:
open( my $io, '<', \$text_I_captured ) or die "Could not open string! - $!";
my @list = <$io>;
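For instance, here's a minimal sketch of the whole round trip, assuming you're fetching the page with WWW::Mechanize as mentioned in the question (the URL is just a placeholder):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://example.com/some.txt');    # placeholder URL

# Treat the in-memory content exactly like a file on disk.
my $content = $mech->content();
open( my $io, '<', \$content ) or die "Could not open string! - $!";
my @lines = <$io>;
close $io;

print scalar(@lines), " lines captured\n";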
It's hard to tell what your code is doing since we don't have it in front of us; it would be easier to help if you posted what you had. However, I'll give it a shot. If you scrape the text into a variable, you will have a string which may have embedded line breaks. These will be either \n (the traditional Unix newline) or \r\n (the traditional Windows newline sequence). Just as you can split on a space to get (a first approximation of) the words in a sentence, you can instead split on the newline sequence to get the lines in it. Thus, the single line you need should be:
my @lines = split(/\r?\n/, $scraped_text);
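As a quick, self-contained illustration (the sample string here is made up), you can then loop over the resulting array:

my $scraped_text = "first line\r\nsecond line\nthird line";

# split handles both Unix (\n) and Windows (\r\n) line endings.
my @lines = split(/\r?\n/, $scraped_text);
for my $line (@lines) {
    print "line='$line'\n";
}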
Use the $/ variable; it determines what to break lines on. So:
local $/ = " ";
while(<FILE>)...
would give you chunks separated by spaces. Just set it back to "\n" to get back to the way it was - or better yet, go out of the local $/ scope and let the global one come back, just in case it was something other than "\n" to begin with.
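To illustrate the scoping point, here's a small sketch (the filename is a placeholder) that confines the change to a block, so $/ reverts automatically when the block ends:

{
    local $/ = " ";    # read space-separated chunks inside this block only
    open( my $fh, '<', 'input.txt' ) or die "Could not open input.txt! - $!";
    while ( my $chunk = <$fh> ) {
        print "chunk='$chunk'\n";
    }
    close $fh;
}
# Back here, $/ is whatever it was before (usually "\n").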
You can eliminate it altogether:
local $/ = undef;
to read whole files in one slurp, and then iterate through them however you like. Just be aware that if you do a split or a splice, you may end up copying the string over and over, using lots of CPU and lots of memory. One way to do it with less copying is:
# perl -de 0
> $_="foo\nbar\nbaz\n";
> while( /\G([^\n]*)\n/go ) { print "line='$1'\n"; }
line='foo'
line='bar'
line='baz'
If you're breaking apart things by newlines, for example. \G matches either the beginning of the string or the end of the last match, within a /g-tagged regex.
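The same idea outside the debugger, as a plain script (a minimal sketch; the sample string is made up):

use strict;
use warnings;

my $text = "foo\nbar\nbaz\n";

# \G picks up where the previous /g match left off, so each line is
# pulled out in place without building a separate list of lines.
while ( $text =~ /\G([^\n]*)\n/g ) {
    print "line='$1'\n";
}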
Another weird tidbit is $/ = \10... if you give it a scalar reference to an integer (here 10), you can get record-length chunks:
# cat fff
eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun
# perl -de 0
$/ = \10;
open FILE, "<fff";
while(<FILE>){ print "chunk='$_'\n"; }
chunk='eurgpuwerg'
chunk='piuewrngpi'
chunk='euwngipuen'
chunk='rgpiunergp'
chunk='iunerpigun'
chunk='
'
More info: http://www.perl.com/pub/a/2004/06/18/variables.html
If you combine this with FM's answer of using:
$data = "eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun";
open STRING, "<", \$data;
while(<STRING>){ print "chunk='$_'\n"; }
I think you can get every combination of what you need...
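For example, here's a minimal sketch that combines the two: record-length reads taken straight from an in-memory string (reusing the sample data from above):

use strict;
use warnings;

my $data = "eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun";

# Read 10-byte records directly from the string, with no file involved.
local $/ = \10;
open( my $fh, '<', \$data ) or die "Could not open string! - $!";
while ( my $chunk = <$fh> ) {
    print "chunk='$chunk'\n";
}
close $fh;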