
RegEx to remove carriage returns between <p> tags

I've stumped myself trying to figure out how to remove carriage returns that occur between <p> tags. (Technically I need to replace them with spaces, not remove them.)

Here's an example. I've used a dollar sign $ as a carriage return marker.

<p>Ac nec <strong>suspendisse est, dapibus.</strong> Nulla taciti curabitur enim hendrerit.$

Ante ornare phasellus tellus vivamus dictumst dolor aliquam imperdiet lectus.$

Nisl nullam sodales, tincidunt dictum dui eget, gravida anno. Montes convallis$

adipiscing, aenean hac litora. Ridiculus, ut consequat curae, amet. Nostra$

phasellus ridiculus class interdum justo. <em>Pharetra urna est hac</em> laoreet, magna.$

Porttitor purus purus, quis rutrum turpis. Montes netus nibh ornare potenti quam$

class. Natoque nec proin sapien augue curae, elementum.</p>

As the example shows, there can be other tags in between the <p> tags. So I'm looking for a regex that replaces all these carriage returns with spaces but doesn't touch any carriage returns outside the <p> tags.

Any help is greatly appreciated. Thanks!


A single-regex solution is basically impossible here. If you absolutely insist on not using an HTML parser, and you can count on your input being well-formed and predictable then you can write a simple lexer that will do the job (and I can provide sample code) -- but it's still not a very good idea :)

For reference:

  • Why shouldn't I parse XML or XHTML with a regex?
  • How can I parse HTML in my language of choice?


The standard answer is: don't try to process HTML (or SGML or XML) with a regex. Use a proper parser.
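To sketch what the parser route could look like, here is a minimal example using Python's standard-library html.parser. The class name ParagraphJoiner and the helper join_paragraph_lines are made up for illustration. Note that html.parser is only a tokenizer: it does not build a tree or infer missing </p> tags, so this assumes every <p> is explicitly closed. For messy real-world HTML a tree-building parser is probably the safer choice.

```python
import re
from html.parser import HTMLParser


class ParagraphJoiner(HTMLParser):
    """Re-emit the input, replacing line breaks in text inside <p> elements."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # number of currently open <p> elements

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
        # Emit the start tag exactly as it appeared in the source.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
        # End tags are re-emitted in normalized (lowercase) form.
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            data = re.sub(r"[\r\n]+", " ", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def join_paragraph_lines(html):
    """Hypothetical helper: run the tokenizer and stitch the output together."""
    parser = ParagraphJoiner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<div>\n<p class="x">one\ntwo &amp; <b>three\nfour</b></p>\n</div>\n'
print(join_paragraph_lines(sample))
```

Line breaks outside the paragraph (after <div> and after </p>) are left alone; only the two inside the <p> become spaces.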


Regular expressions are singularly unsuitable to deal with "balanced parentheses" kinds of problems, even though people persist in trying to shoehorn them there (and some implementations -- I'm thinking of very recent perl releases, for example -- try to cooperate with this widespread misconception by extending and stretching "regular expressions" well beyond the CS definition thereof;-).

If you don't have to deal with nesting, it's comfortably doable in a two-pass approach -- grab each paragraph with e.g. <p>.*?</p> (possibly with parentheses for grouping), then perform the substitution within each paragraph thus identified.
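The two-pass idea might look roughly like this in Python. This is a sketch that assumes paragraphs are not nested and are always closed; the names PARAGRAPH, LINE_BREAKS and join_paragraph_lines are illustrative only.

```python
import re

# Pass 1 pattern: one non-nested <p>...</p> element, attributes allowed.
# DOTALL lets '.' cross line breaks; the non-greedy '.*?' stops at the first </p>.
PARAGRAPH = re.compile(r"<p\b[^>]*>.*?</p>", re.IGNORECASE | re.DOTALL)

# Pass 2 pattern: any run of line separators (\r\n, \n or \r).
LINE_BREAKS = re.compile(r"[\r\n]+")


def join_paragraph_lines(html):
    """Replace line breaks with spaces inside each <p>...</p> only."""
    def fix(match):
        # Second pass: substitute within the paragraph found by the first pass.
        return LINE_BREAKS.sub(" ", match.group(0))
    return PARAGRAPH.sub(fix, html)


sample = "line one\n<p>Ac nec\n<em>est</em>\ndapibus.</p>\nline two\n"
print(join_paragraph_lines(sample))
```

Passing a function to re.sub means each paragraph is rewritten in place, so there is no need to collect the paragraphs first and search for them again.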


[\r\n]+(?=(?:[^<]+|<(?!/?p\b))*</p>)

The first part matches one or more of any kind of line separator (\n, \r\n, or \r). The rest is a lookahead that attempts to match everything up to the next closing </p> tag, but if it finds an opening <p> tag first, the match fails.

Note that this regex can be fooled very easily, for example by SGML comments, <script> elements, or plain old malformed HTML. Also, I'm assuming your regex flavor supports positive and negative lookaheads. That's a pretty safe assumption these days, but if the regex doesn't work for you, we'll need to know exactly which language or tool you're using.
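For what it's worth, the pattern can be tried out in Python, whose re module supports both kinds of lookahead. This is only a small demonstration on well-formed input; the names BREAK_IN_P and join_lines_in_paragraphs are made up. The nested quantifier in the lookahead may also backtrack heavily on long stretches of text that are never followed by </p>, so I'd treat it as a sketch rather than something to run on large documents.

```python
import re

# The pattern from the answer above, unchanged.
BREAK_IN_P = re.compile(r"[\r\n]+(?=(?:[^<]+|<(?!/?p\b))*</p>)")


def join_lines_in_paragraphs(html):
    """Replace each run of line separators that sits inside a <p> element."""
    return BREAK_IN_P.sub(" ", html)


sample = "intro\r\n<p>one\r\ntwo <em>three</em>\nfour</p>\nouter\n"
print(repr(join_lines_in_paragraphs(sample)))
```

The line breaks after "intro" and around "outer" survive because the lookahead runs into an opening <p> or the end of the input before it finds a closing </p>.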


Just match '\n', but make sure multiline mode is enabled in your regex engine.


I think it should work like this:

  1. get the whole paragraph (the text between <p> and </p>) from the body
  2. create a copy of this paragraph
  3. in the copy, replace \n with a space
  4. in the body, replace the paragraph with the modified copy

You can do it using regex, but I think simple character scanning can be used.

Some code in Python:

import re

# Sample input; the original snippet assumed BODY was defined elsewhere.
BODY = "before\n<p>first\nsecond</p>\nafter\n"

# DOTALL lets '.' match line breaks, so one paragraph can span several lines.
# '<p\b[^>]*>' also accepts attributes on the opening tag.
rx = re.compile(r'(<p\b[^>]*>.*?</p>)', re.IGNORECASE | re.DOTALL)

def get_paragraphs(body):
    # findall returns every non-overlapping <p>...</p> match, in order
    return rx.findall(body)

def replace_paragraphs(body):
    for par in get_paragraphs(body):
        # handle Windows (\r\n), Unix (\n) and old Mac (\r) line endings
        par_new = par.replace('\r\n', ' ').replace('\n', ' ').replace('\r', ' ')
        body = body.replace(par, par_new)
    return body

def main():
    new_body = replace_paragraphs(BODY)
    print(new_body)

main()


This is the "almost good enough" lexing solution promised in my other answer, to sketch how it can be done. It makes a half-hearted attempt at coping with attributes, but not seriously. It also doesn't attempt to cope with unencoded "<" in attributes. These are relatively minor failings, and it does handle nested P tags, but as described in the comments it's totally unable to handle the case where someone doesn't close a P, because we can't do that without a thorough understanding of HTML. Considering how prevalent that practice still is, it's safe to declare this code "nearly useless". :)

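#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole document from STDIN (or from files named on the command line).
my $html   = do { local $/; <> } // '';
my $output = '';
my $in_p   = 0;    # nesting depth of open <p> elements

# \z (not \Z) so that a trailing newline is still copied to the output.
while ($html !~ /\G\z/cg) {
  if ($html =~ /\G(<p\b[^>]*>)/cgi) {    # \b keeps <pre>, <param> etc. out
    $output .= $1;
    $in_p ++;
  } elsif ($html =~ m[\G(</p>)]cgi) {
    $output .= $1;
    $in_p -- if $in_p > 0; # Woe unto anyone who doesn't provide a closing tag.
    # Tag soup parsers are good for this because they can generate an
    # "artificial" end to the P when they find an element that can't contain
    # a P, or the end of the enclosing element. We're not smart enough for that.
  } elsif ($html =~ /\G([^<]+)/cg) {
    my $text = $1;
    # Collapse each line break, plus the whitespace around it, into one space.
    $text =~ s/\s*[\r\n]\s*/ /g if $in_p;
    $output .= $text;
  } elsif ($html =~ /\G(<)/cg) {
    # Any other tag: copy the "<"; the rest is picked up as text next time round.
    $output .= $1;
  } else {
    die "Can't happen, but not having an else is scary!";
  }
}

print $output;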