Extract contents of paragraph tag using a Perl one liner
I would like to match the contents of a paragraph tag using a Perl regex one-liner. The paragraph is something like this:
<p style="font-family: Calibri,Helvetica,serif;">Text I want to extract</p>
So far I have been using something like this:
perl -nle 'm/<p>($.)<\/p>/ig; print $1' file.html
Any ideas appreciated
thanks
Mandatory link to what happens when you try to parse HTML with regular expressions.
David Dorward's comment, to use HTML::TreeBuilder, is a good one. Another good way to do this is to use HTML::DOM:
perl -MHTML::DOM -e 'my $dom = HTML::DOM->new(); $dom->parse_file("file.html"); my @p = $dom->getElementsByTagName("p"); print $p[0]->innerText();'
The $ in the matching part of your pattern means "end of the string", so ($.) does not capture the text. You also need to allow for attributes inside the opening p tag, matching them non-greedily:
perl -nle 'print $1 while m/<p\b.*?>(.+?)<\/p>/ig' test.html