Replace text not contained in a tag using either Regex or XmlParser

I know that using regular expressions to parse or manipulate HTML/XML is a bad idea, and I would normally never do it. I am considering it here only for lack of alternatives.

I need to replace text inside a string that is not already part of a tag (ideally a span tag with a specific id) using C#.

For example, let's say I want to replace all instances of ABC in the following text that are not inside a span with alternate text (another span, in my case):

ABC at start of line or ABC here must be replaced but, <span id="__publishingReusableFragment" >ABC inside span must not be replaced with anything. Another ABC here </span> this ABC must also be replaced

I tried a regex with both lookahead and lookbehind assertions, in various combinations along the lines of

string regexPattern = "(?<!id=\"__publishingReusableFragment\").*?" + stringToMatch + ".*?(?!span)";

but gave up on that.

I also tried loading it into an XElement, creating a writer from there, and getting the text that is not inside a node, but I couldn't figure that out either.

XElement xel = XElement.Parse("<payload>" + inputString + @"</payload>");
XmlWriter requiredWriter = xel.CreateWriter();

I am hoping to somehow use the writer to get the strings that are not part of a node and replace them.

Basically I am open to any suggestions/solutions to solve this problem.

Thanks in advance for the help.


resultString = Regex.Replace(subjectString, 
    @"(?<!              # assert that we can't match the following 
                        # before the current position: 
                        # An opening span tag with specified id
     <\s*span\s*id=""__publishingReusableFragment""\s*>
     (?:                # if it is not followed by...
      (?!<\s*/\s*span)  # a closing span tag
      .                 # at any position between the opening tag
     )*                 # and our text
    )                   # End of lookbehind assertion
    ABC                 # Match ABC", 
    "XYZ", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

will work with all the caveats about HTML parsing (that you seem to know, so I won't repeat them here) still valid.

The regex matches ABC unless it is preceded by an opening <span id="__publishingReusableFragment"> tag with no closing </span> tag between the two. It will obviously fail if there can be nested <span> tags.
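To make the behaviour described above easy to check, here is a minimal, self-contained sketch that runs the same idea against a shortened version of the sample text. The class and method names (RegexOutsideSpan, Replace) are made up for illustration. Two small changes from the pattern above are my own assumptions: whitespace between span and id is required (\s+), and the replacement goes through a MatchEvaluator so that a $ in the replacement text is not treated as a substitution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOutsideSpan
{
    // Negative lookbehind: fail if an opening span with the given id precedes
    // the match and no closing span tag starts in between.
    private static readonly Regex Pattern = new Regex(
        @"(?<!
            <\s*span\s+id=""__publishingReusableFragment""\s*>
            (?:(?!<\s*/\s*span).)*
          )
          ABC",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: replaces every ABC that is outside the span.
    public static string Replace(string input, string replacement)
    {
        // The lambda returns the replacement literally (no $1-style expansion).
        return Pattern.Replace(input, m => replacement);
    }

    public static void Main()
    {
        string input =
            "ABC <span id=\"__publishingReusableFragment\" >ABC</span> ABC";
        Console.WriteLine(Replace(input, "XYZ"));
        // prints: XYZ <span id="__publishingReusableFragment" >ABC</span> XYZ
    }
}
```

The variable-length lookbehind is what makes this work in .NET; most other regex engines would reject the pattern.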


I know it's slightly ugly, but this will work:

// Requires: using System.Linq;
var s =
    @"ABC at start of line or ABC here must be replaced but, <span id=""__publishingReusableFragment"" >ABC inside span must not be replaced with anything. Another ABC here </span> this ABC must also be replaced";
// Split on "</span>" so that, assuming spans are not nested, each piece holds
// at most one "<span"; the text before it (bits[0]) lies outside any span.
var newS = string.Join("</span>", s.Split(new[] {"</span>"}, StringSplitOptions.None)
    .Select(t =>
        {
            var bits = t.Split(new[] {"<span"}, StringSplitOptions.None);
            bits[0] = bits[0].Replace("ABC", "DEF");  // only touch text outside the span
            return string.Join("<span", bits);
        }));
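For what it's worth, the XElement idea from the question can probably be made to work too, without a writer. The sketch below is only that, a sketch: it assumes the input is well-formed XML (plain HTML such as an unclosed <br> will make XElement.Parse throw), and the names XmlOutsideSpan and ReplaceOutsideSpan are made up. It walks all text nodes and skips those that have an ancestor span with the given id, so nested spans are handled as well.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlOutsideSpan
{
    private const string FragmentId = "__publishingReusableFragment";

    // Hypothetical helper: replaces oldValue in every text node that is not
    // inside a <span> carrying the fragment id.
    public static string ReplaceOutsideSpan(string input, string oldValue, string newValue)
    {
        // Wrap the fragment in a <payload> root so it parses as one element.
        XElement root = XElement.Parse(
            "<payload>" + input + "</payload>", LoadOptions.PreserveWhitespace);

        var outside = root.DescendantNodes()
            .OfType<XText>()
            .Where(t => !t.Ancestors("span")
                          .Any(a => (string)a.Attribute("id") == FragmentId))
            .ToList();

        foreach (XText text in outside)
        {
            // newValue is inserted as text, so markup in it would be escaped.
            text.Value = text.Value.Replace(oldValue, newValue);
        }

        // Serialize the children only, which drops the wrapper element again.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        string input =
            "ABC <span id=\"__publishingReusableFragment\" >ABC</span> ABC";
        Console.WriteLine(ReplaceOutsideSpan(input, "ABC", "DEF"));
    }
}
```

Note that the output is re-serialized XML, so insignificant details such as the space before > in the opening tag are normalized away. If the replacement has to be another span rather than plain text, the text node would need to be split and an XElement inserted instead of assigning Value.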