RegEx to remove carriage returns between tags

2022-12-09 06:44 问答作者：

I've stumped myself trying to figure out how to remove carriage returns that occur between  tags. (Technically I need to replace them with spaces, not remove them.)

Here's an example. I've used a dollar sign $ as a ca开发者_如何学JAVArriage return marker.

Ac nec suspendisse est, dapibus. Nulla taciti curabitur enim hendrerit.$

Ante ornare phasellus tellus vivamus dictumst dolor aliquam imperdiet lectus.$

Nisl nullam sodales, tincidunt dictum dui eget, gravida anno. Montes convallis$

adipiscing, aenean hac litora. Ridiculus, ut consequat curae, amet. Nostra$

phasellus ridiculus class interdum justo. Pharetra urna est hac laoreet, magna.$

Porttitor purus purus, quis rutrum turpis. Montes netus nibh ornare potenti quam$

class. Natoque nec proin sapien augue curae, elementum.

As the example shows, there can be other tags inbetween the  tags. So I'm looking for a regex to replace all these carriage returns with spaces but not touch any carriage returns outside the  tags.

Any help is greatly appreciated. Thanks!

A single-regex solution is basically impossible here. If you absolutely insist on not using an HTML parser, and you can count on your input being well-formed and predictable then you can write a simple lexer that will do the job (and I can provide sample code) -- but it's still not a very good idea :)

For reference:

Why shouldn't I parse XML or XHTML with a regex?
How can I parse HTML in my language of choice?

The standard answer is: don't try to process HTML (or SGML or XML) with a regex. Use a proper parser.

Regular expressions are singularly unsuitable to deal with "balanced parentheses" kinds of problems, even though people persist in trying to shoehorn them there (and some implementations -- I'm thinking of very recent perl releases, for example -- try to cooperate with this widespread misconception by extending and stretching "regular expressions" well beyond the CS definition thereof;-).

If you don't have to deal with nesting, it's comfortably doable in a two-pass approach -- grab each paragraph with e.g. .*? (possibly with parentheses for grouping), then perform the substitution within each paragraph thus identified.

[\r\n]+(?=(?:[^<]+|<(?!/?p\b))*</p>)

The first part matches one or more of any kind of line separator (\n, \r\n, or \r). The rest is a lookahead that attempts to match everything up to the next closing  tag, but if it finds an opening  tag first, the match fails.

Note that this regex can be fooled very easily, for example by SGML comments, <script> elements, or plain old malformed HTML. Also, I'm assuming your regex flavor supports positive and negative lookaheads. That's a pretty safe assumption these days, but if the regex doesn't work for you, we'll need to know exactly which language or tool you're using.

Just use '\n' but ensure that you enable multiple line regex.

I think it should work like this:

get whole paragraph (text between and ) from teh body
create copy of this paragraph
in copy replace \n with space
in the body repace paragraph with modified copy

You can do it using regex, but I think simple character scanning can be used.

Some code in Python:

rx = re.compile(r'(<p>.*?</p>)', re.IGNORECASE | re.MULTILINE | re.DOTALL)

def get_paragraphs(body):
    paragraphs = []
    body_copy = body
    rxx = rx.search(body_copy)
    while rxx:
        paragraphs.append(rxx.group(1))
        body_copy = body_copy[rxx.end(1):]
        rxx = rx.search(body_copy)
    return paragraphs

def replace_paragraphs(body):
    paragraphs = get_paragraphs(body)
    for par in paragraphs:
        par_new = par.replace('\n', ' ')
        body = body.replace(par, par_new)
    return body

def main():
    new_body = replace_paragraphs(BODY)
    print(new_body)

main()

This is the "almost good enough" lexing solution promised in my other answer, to sketch how it can be done. It makes a half-hearted attempt at coping with attributes, but not seriously. It also doesn't attempt to cope with unencoded "<" in attributes. These are relatively minor failings, and it does handle nested P tags, but as described in the comments it's totally unable to handle the case where someone doesn't close a P, because we can't do that without a thorough understanding of HTML. Considering how prevalent that practice still is, it's safe to declare this code "nearly useless". :)

#!/usr/bin/perl
use strict;
use warnings;

while ($html !~ /\G\Z/cg) {
  if ($html =~ /\G(<p[^>]*>)/cg) {
    $output .= $1;
    $in_p ++;
  } elsif ($html =~ m[\G(</p>)]cg) {
    $output .= $1;
    $in_p --; # Woe unto anyone who doesn't provide a closing tag.
    # Tag soup parsers are good for this because they can generate an
    # "artificial" end to the P when they find an element that can't contain
    # a P, or the end of the enclosing element. We're not smart enough for that.
  } elsif ($html =~ /\G([^<]+)/cg) {
    my $text = $1;
    $text =~ s/\s*\n\s*/ /g if $in_p;
    $output .= $text;
  } elsif ($html =~ /\G(<)/cg) {
    $output .= $1;
  } else {
    die "Can't happen, but not having an else is scary!";
  }
}

继续阅读：regex

RegEx to remove carriage returns between <p> tags

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？