RegEx to remove carriage returns between <p> tags
I've stumped myself trying to figure out how to remove carriage returns that occur between <p>
tags. (Technically I need to replace them with spaces, not remove them.)
Here's an example. I've used a dollar sign $
as a ca开发者_如何学JAVArriage return marker.
<p>
Ac nec <strong>
suspendisse est, dapibus.</strong>
Nulla taciti curabitur enim hendrerit.$
$
Nisl nullam sodales, tincidunt dictum dui eget, gravida anno. Montes convallis$
adipiscing, aenean hac litora. Ridiculus, ut consequat curae, amet. Nostra$
phasellus ridiculus class interdum justo. <em>
Pharetra urna est hac</em>
laoreet, magna.$
Porttitor purus purus, quis rutrum turpis. Montes netus nibh ornare potenti quam$
class. Natoque nec proin sapien augue curae, elementum.</p>
As the example shows, there can be other tags inbetween the <p>
tags. So I'm looking for a regex to replace all these carriage returns with spaces but not touch any carriage returns outside the <p>
tags.
Any help is greatly appreciated. Thanks!
A single-regex solution is basically impossible here. If you absolutely insist on not using an HTML parser, and you can count on your input being well-formed and predictable then you can write a simple lexer that will do the job (and I can provide sample code) -- but it's still not a very good idea :)
For reference:
- Why shouldn't I parse XML or XHTML with a regex?
- How can I parse HTML in my language of choice?
The standard answer is: don't try to process HTML (or SGML or XML) with a regex. Use a proper parser.
Regular expressions are singularly unsuitable to deal with "balanced parentheses" kinds of problems, even though people persist in trying to shoehorn them there (and some implementations -- I'm thinking of very recent perl releases, for example -- try to cooperate with this widespread misconception by extending and stretching "regular expressions" well beyond the CS definition thereof;-).
If you don't have to deal with nesting, it's comfortably doable in a two-pass approach -- grab each paragraph with e.g. <p>.*?</p>
(possibly with parentheses for grouping), then perform the substitution within each paragraph thus identified.
[\r\n]+(?=(?:[^<]+|<(?!/?p\b))*</p>)
The first part matches one or more of any kind of line separator (\n
, \r\n
, or \r
). The rest is a lookahead that attempts to match everything up to the next closing </p>
tag, but if it finds an opening <p>
tag first, the match fails.
Note that this regex can be fooled very easily, for example by SGML comments, <script>
elements, or plain old malformed HTML. Also, I'm assuming your regex flavor supports positive and negative lookaheads. That's a pretty safe assumption these days, but if the regex doesn't work for you, we'll need to know exactly which language or tool you're using.
Just use '\n
' but ensure that you enable multiple line regex.
I think it should work like this:
- get whole paragraph (text between <p> and </p>) from teh body
- create copy of this paragraph
- in copy replace \n with space
- in the body repace paragraph with modified copy
You can do it using regex, but I think simple character scanning can be used.
Some code in Python:
rx = re.compile(r'(<p>.*?</p>)', re.IGNORECASE | re.MULTILINE | re.DOTALL)
def get_paragraphs(body):
paragraphs = []
body_copy = body
rxx = rx.search(body_copy)
while rxx:
paragraphs.append(rxx.group(1))
body_copy = body_copy[rxx.end(1):]
rxx = rx.search(body_copy)
return paragraphs
def replace_paragraphs(body):
paragraphs = get_paragraphs(body)
for par in paragraphs:
par_new = par.replace('\n', ' ')
body = body.replace(par, par_new)
return body
def main():
new_body = replace_paragraphs(BODY)
print(new_body)
main()
This is the "almost good enough" lexing solution promised in my other answer, to sketch how it can be done. It makes a half-hearted attempt at coping with attributes, but not seriously. It also doesn't attempt to cope with unencoded "<" in attributes. These are relatively minor failings, and it does handle nested P tags, but as described in the comments it's totally unable to handle the case where someone doesn't close a P, because we can't do that without a thorough understanding of HTML. Considering how prevalent that practice still is, it's safe to declare this code "nearly useless". :)
#!/usr/bin/perl
use strict;
use warnings;
while ($html !~ /\G\Z/cg) {
if ($html =~ /\G(<p[^>]*>)/cg) {
$output .= $1;
$in_p ++;
} elsif ($html =~ m[\G(</p>)]cg) {
$output .= $1;
$in_p --; # Woe unto anyone who doesn't provide a closing tag.
# Tag soup parsers are good for this because they can generate an
# "artificial" end to the P when they find an element that can't contain
# a P, or the end of the enclosing element. We're not smart enough for that.
} elsif ($html =~ /\G([^<]+)/cg) {
my $text = $1;
$text =~ s/\s*\n\s*/ /g if $in_p;
$output .= $text;
} elsif ($html =~ /\G(<)/cg) {
$output .= $1;
} else {
die "Can't happen, but not having an else is scary!";
}
}
精彩评论