Replace text not contained in a tag using either Regex or XmlParser
I know that using regular expressions to parse or manipulate HTML/XML is a bad idea, and I usually would never do it. But I am considering it here for lack of alternatives.
I need to replace text inside a string that is not already part of a tag (ideally a span tag with specific id) using C#.
For example, let's say I want to replace all instances of ABC in the following text that are not inside a span with alternate text (another span in my case):
ABC at start of line or ABC here must be replaced but, <span id="__publishingReusableFragment" >ABC inside span must not be replaced with anything. Another ABC here </span> this ABC must also be replaced
I tried using a regex with both lookahead and lookbehind assertions, in various combinations along the lines of
string regexPattern = "(?<!id=\"__publishingReusableFragment\").*?" + stringToMatch + ".*?(?!span)";
but gave up on that.
I also tried loading it into an XElement, creating a writer from there, and getting the text that is not inside a node, but couldn't figure that out either.
XElement xel = XElement.Parse("<payload>" + inputString + @"</payload>");
XmlWriter requiredWriter = xel.CreateWriter();
I am hoping to somehow use the writer to get the strings that are not part of a node and replace them.
Basically I am open to any suggestions/solutions to solve this problem.
Thanks in advance for the help.
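The XElement idea from the question can be sketched without a writer, by walking the text nodes that are direct children of the wrapper element. This is only a minimal sketch: it assumes the input is well-formed XML once wrapped in <payload>, and the class and method names (SpanAwareReplace, ReplaceOutsideElements) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SpanAwareReplace
{
    // Hypothetical helper: replaces oldText with newText only in text that is
    // not nested inside any element of the fragment.
    public static string ReplaceOutsideElements(string fragment, string oldText, string newText)
    {
        // Wrap the fragment so it has a single root; Parse throws if the
        // fragment is not well-formed XML.
        XElement root = XElement.Parse("<payload>" + fragment + "</payload>",
                                       LoadOptions.PreserveWhitespace);

        // Nodes() yields only direct children, so text inside <span> is skipped.
        foreach (XText text in root.Nodes().OfType<XText>().ToList())
        {
            text.Value = text.Value.Replace(oldText, newText);
        }

        // Serialize the children again, dropping the temporary wrapper.
        return string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Calling SpanAwareReplace.ReplaceOutsideElements(inputString, "ABC", "XYZ") leaves the text inside the span alone. Two things to keep in mind: the serializer may normalize whitespace inside tags, and newText is inserted as plain text, so markup in it gets escaped. To insert another span, one would probably build XElement nodes and swap them in with XNode.ReplaceWith instead of assigning Value.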
resultString = Regex.Replace(subjectString,
    @"(?<!                 # assert that we can't match the following
                           # before the current position:
                           # An opening span tag with specified id
      <\s*span\s*id=""__publishingReusableFragment""\s*>
      (?:                  # if it is not followed by...
        (?!<\s*/\s*span)   # a closing span tag
        .                  # at any position between the opening tag
      )*                   # and our text
    )                      # End of lookbehind assertion
    ABC                    # Match ABC",
    "XYZ", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
will work with all the caveats about HTML parsing (that you seem to know, so I won't repeat them here) still valid.
The regex matches ABC if it's not preceded by an opening <span id="__publishingReusableFragment"> tag, or if there is a closing </span> tag between that opening tag and the ABC. It will obviously fail if there can be nested <span> tags.
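A different way to get the same effect is to let the regex consume each protected span as a whole and only replace the matches that are a bare ABC. This is a common alternative to the lookbehind, not the pattern above, and the class name SkipSpanReplace is made up for this sketch. It shares the same limitation with nested <span> tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkipSpanReplace
{
    // Alternative 1 consumes a whole protected span (up to the first </span>);
    // alternative 2 captures the target text in group 1.
    private static readonly Regex Pattern = new Regex(
        @"<span\b[^>]*\bid=""__publishingReusableFragment""[^>]*>.*?</span>|(ABC)",
        RegexOptions.Singleline);

    public static string Replace(string input, string replacement)
    {
        // Group 1 only participates when the bare ABC alternative matched;
        // otherwise the match is a protected span and is returned unchanged.
        return Pattern.Replace(input, m => m.Groups[1].Success ? replacement : m.Value);
    }
}
```

Because the replacement comes from a MatchEvaluator, it is inserted literally, so a replacement that is itself a span does not need any escaping of $ sequences.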
I know it's slightly ugly, but this will work:
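// Requires: using System; using System.Linq;
var s =
    @"ABC at start of line or ABC here must be replaced but, <span id=""__publishingReusableFragment"" >ABC inside span must not be replaced with anything. Another ABC here </span> this ABC must also be replaced";
// Split on the closing tag; assuming spans are not nested, each chunk then
// holds at most one opening "<span".
var newS = string.Join("</span>", s.Split(new[] { "</span>" }, StringSplitOptions.None)
    .Select(t =>
    {
        // bits[0] is the text before any "<span" in this chunk, i.e. outside the span.
        var bits = t.Split(new[] { "<span" }, StringSplitOptions.None);
        bits[0] = bits[0].Replace("ABC", "DEF");
        return string.Join("<span", bits);
    }));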