Parsing sections of HTML in C#
I need to parse sections from a string of HTML. For example:
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>[section=quote]</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>[/section]</p>
Parsing the quote section should return:
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
Currently I'm using a regular expression to grab the content inside [section=quote]...[/section], but since the sections are entered using a WYSIWYG editor, the section tags themselves get wrapped in a paragraph tag, so the parsed result is:
</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>
The Regular Expression I'm using currently is:
\[section=(.+?)\](.+?)\[/section\]
And I'm also doing some additional cleanup prior to parsing the sections:
protected string CleanHtml(string input) {
// remove whitespace
input = Regex.Replace(input, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
// remove empty p elements
input = Regex.Replace(input, @"<p\s*/>|<p>\s*</p>", string.Empty);
return input;
}
Can anyone provide a regular expression that would achieve what I am looking for or am I wasting my time trying to do this with Regex? I've seen references to the Html Agility Pack - would this be better for something like this?
[Update]
Thanks to Oscar, I now use a combination of the Html Agility Pack and regex to parse the sections. It still needs some refining, but it's nearly there.
public void ParseSections(string content)
{
this.SourceContent = content;
this.NonSectionedContent = content;
content = CleanHtml(content);
if (!sectionRegex.IsMatch(content))
return;
var doc = new HtmlDocument();
doc.LoadHtml(content);
bool flag = false;
string sectionName = string.Empty;
var sectionContent = new StringBuilder();
var unsectioned = new StringBuilder();
foreach (var n in doc.DocumentNode.SelectNodes("//p")) {
if (startSectionRegex.IsMatch(n.InnerText)) {
flag = true;
sectionName = startSectionRegex.Match(n.InnerText).Groups[1].Value.ToLowerInvariant();
continue;
}
if (endSectionRegex.IsMatch(n.InnerText)) {
flag = false;
this.Sections.Add(sectionName, sectionContent.ToString());
sectionContent.Clear();
continue;
}
if (flag)
sectionContent.Append(n.OuterHtml);
else
unsectioned.Append(n.OuterHtml);
}
this.NonSectionedContent = unsectioned.ToString();
}
The following works, using the HtmlAgilityPack library:
using System;
using System.Text;
using HtmlAgilityPack;
...
HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\file.html");
bool flag = false;
var sb = new StringBuilder();
foreach (var n in doc.DocumentNode.SelectNodes("//p"))
{
switch (n.InnerText)
{
case "[section=quote]":
flag = true;
continue;
case "[/section]":
flag = false;
break;
}
if (flag)
{
sb.AppendLine(n.OuterHtml);
}
}
Console.Write(sb);
Console.ReadLine();
If you just want to print Mauris at turpis nec dolor bibendum sollicitudin ac quis neque. without <p>...</p>, you can replace n.OuterHtml with n.InnerHtml.
Of course, you should check whether doc.DocumentNode.SelectNodes("//p") is null.
If you want to load the HTML from an online source instead of a file, you can do:
var htmlWeb = new HtmlWeb();
var doc = htmlWeb.Load("http://..../page.html");
Edit:
If [section=quote] and [/section] could be inside any tag (not always <p>), you can replace doc.DocumentNode.SelectNodes("//p") with doc.DocumentNode.SelectNodes("//*").
How about replacing
<p>[section=quote]</p>
with
[section=quote]
and
<p>[/section]</p>
with
[/section]
as part of your cleanup. Then you can use your existing regular expression.
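A minimal sketch of that cleanup step, assuming each marker sits alone in a plain <p> with no attributes. The class and method names (SectionMarkers, Unwrap, Extract) are made up for illustration and are not from the question; the section regex is the one from the question, with RegexOptions.Singleline added so the content may span line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class SectionMarkers
{
    // A paragraph whose only content is a section marker,
    // e.g. <p>[section=quote]</p> or <p>[/section]</p>.
    // Assumption: the editor emits a bare <p> without attributes.
    private static readonly Regex WrappedMarker = new Regex(
        @"<p>\s*(\[section=[^\]]+\]|\[/section\])\s*</p>",
        RegexOptions.IgnoreCase);

    // The regular expression from the question; Singleline lets "." match newlines.
    private static readonly Regex Section = new Regex(
        @"\[section=(.+?)\](.+?)\[/section\]",
        RegexOptions.Singleline);

    // Replaces each wrapped marker with the bare marker.
    public static string Unwrap(string html)
    {
        return WrappedMarker.Replace(html, "$1");
    }

    // Returns the content of the first section with the given name,
    // or an empty string if there is no such section.
    public static string Extract(string html, string name)
    {
        foreach (Match m in Section.Matches(Unwrap(html)))
        {
            if (string.Equals(m.Groups[1].Value, name, StringComparison.OrdinalIgnoreCase))
                return m.Groups[2].Value.Trim();
        }
        return string.Empty;
    }
}
```

Calling SectionMarkers.Extract(html, "quote") on the sample from the question should then give back only the Mauris paragraph, without the stray </p> and <p>. If the editor ever wraps the markers in something other than a bare <p>, the WrappedMarker pattern would need to be loosened.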