Regex - remove the last <p> segment of an HTML string

I have an HTML structure that is being pulled from an RSS feed, and I need to remove part of it, but it is not a standalone part of the stream.

So I have

<p>Some Html... </p>
<br />
<p>The p section I want to remove</p>

Is there a regex pattern that can do this? I need to find the last <p> segment of a given string and chop it out. I am using C# for the Regex.


Are you sure you want to use a regex for this? I think you should reach for regular expressions only when you really need them.

Why don't you consider something like (assuming the HTML is well formed and that there are not nested paragraphs):

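string html = GetRSS(); // GetRSS() stands in for however you fetch the feed
int pStartIndex = html.LastIndexOf("<p>", StringComparison.OrdinalIgnoreCase);
int pEndIndex = html.LastIndexOf("</p>", StringComparison.OrdinalIgnoreCase);
// Remove only if both tags were found and appear in the expected order;
// otherwise Remove() would throw on a negative index or length.
string result = (pStartIndex >= 0 && pEndIndex > pStartIndex)
    ? html.Remove(pStartIndex, pEndIndex - pStartIndex + "</p>".Length)
    : html;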

Alternatively, you could use something more robust (and probably more appropriate) such as HTML Agility Pack, or the built-in .NET XML parser, which is the worse choice if you are working with badly formed HTML (EDIT: as svicks says, if you choose the XML parser, please make sure that you are working with HTML that is also valid XML).
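As a rough illustration of the XML parser option, here is a minimal sketch using LINQ to XML (System.Xml.Linq). It assumes the fragment is valid XML: every tag is closed and there are no HTML-only entities such as &nbsp;, otherwise XElement.Parse throws an XmlException. The class name, method name, and the temporary <root> wrapper are mine, purely for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class LastParagraphRemover
{
    // Hypothetical helper: removes the last <p> element from an XML-valid HTML fragment.
    public static string RemoveLastParagraph(string html)
    {
        // A fragment may have several top-level elements, so wrap it in a
        // temporary root. PreserveWhitespace keeps the line breaks between tags.
        XElement root = XElement.Parse("<root>" + html + "</root>", LoadOptions.PreserveWhitespace);

        // Descendants() walks in document order, so the last <p> it yields
        // is the last one to open in the fragment.
        XElement lastParagraph = root.Descendants("p").LastOrDefault();
        if (lastParagraph == null)
        {
            return html; // no <p> element, nothing to remove
        }
        lastParagraph.Remove();

        // Serialize the root's children, leaving out the temporary wrapper.
        return string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Calling LastParagraphRemover.RemoveLastParagraph(html) on the sample from the question should leave the first paragraph and the <br /> in place. Unlike the LastIndexOf version, this also copes with <p> tags that carry attributes.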


You can use this regular expression to replace the last occurrence of the <p> tag.

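// Match '<p>', then as few characters as possible, then '</p>'.
// The lazy '.*?' keeps each match to one paragraph; Singleline lets '.' span newlines.
var pattern = @"<p>.*?</p>";
var regex = new Regex(pattern, RegexOptions.Singleline);

// A regular (non-verbatim) string, so that \n is a real newline.
var sourceString = "<p>Some Html... </p>\n<br />\n<p>The p section I want to remove</p>";

var matchCollection = regex.Matches(sourceString);
if (matchCollection.Count > 0)
{
    // Strings are immutable, so the result must be assigned back.
    // Removing by position cuts only the last match, even if the same text occurs earlier.
    var lastMatch = matchCollection[matchCollection.Count - 1];
    sourceString = sourceString.Remove(lastMatch.Index, lastMatch.Length);
}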
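
A possible variation on the regex approach, sketched here under the same assumptions (plain <p> tags without attributes, no nested paragraphs): .NET's RegexOptions.RightToLeft makes the engine search from the end of the input, so a single Replace call limited to one replacement removes only the last paragraph. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastParagraphRegex
{
    // RightToLeft makes the engine start scanning at the end of the input,
    // so the first match it finds is the last paragraph in the string.
    // Singleline lets '.' match newlines inside the paragraph.
    private static readonly Regex LastParagraph = new Regex(
        @"<p>.*?</p>",
        RegexOptions.Singleline | RegexOptions.RightToLeft);

    public static string RemoveLastParagraph(string html)
    {
        // The count of 1 limits the replacement to that first (rightmost) match.
        return LastParagraph.Replace(html, string.Empty, 1);
    }
}
```

This avoids collecting every match just to use the last one, at the cost of relying on a .NET-specific option.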