How to parse this piece of HTML?
Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of HTML via regex:
<h1>My caption</h1>
<p>Here will be some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
I need the following output:
- group 1: content of the h1
- group 2: content of the text following the h1
- group 3-n: content of the subcaptions + text
What I have at the moment:
<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>
This gives me only every odd subcaption + content (e.g. 1, 3, ...), because the trailing <hr.*?/> consumes the separator that the next match would need to start with. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content - I'm fine with that at the moment.
Does anybody have a hint/solution for me, or any alternative logic (e.g. parsing the HTML via a reader and assigning it this way)?
Edit:
As some brought in the HTML Agility Pack, I was curious about this nice tool. I managed to get the content of the <h1> tag.
But my problem is parsing the rest, because the tags for the content may vary: from <p> to <div> and <ul> ...
At the moment this seems to mean more or less iterating over the whole document and parsing it tag by tag ...?
Any hints?

You will really need an HTML parser for this.
Don't use regex to parse HTML. Consider using the HTML Agility Pack.
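For the follow-up in the question's edit (content tags that vary between <p>, <div> and <ul>), one possible approach is to walk the siblings after the <h1> and start a new section at every heading. The following is only a minimal sketch: the `Section` and `SectionParser` names are made up for illustration, and it assumes that headings, content elements and <hr> separators are all siblings under the same parent, as in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

// Hypothetical container for one caption and the text that follows it.
class Section
{
    public string Caption;
    public StringBuilder Body = new StringBuilder();
}

static class SectionParser
{
    // Starts at the first <h1> and walks its following siblings.
    // Every <h1>/<h2> opens a new section, <hr> is skipped, and any
    // other element (<p>, <div>, <ul>, ...) is appended to the current section.
    public static List<Section> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<Section>();
        HtmlNode first = doc.DocumentNode.SelectSingleNode("//h1");
        if (first == null) return sections; // no <h1> found

        Section current = null;
        for (HtmlNode node = first; node != null; node = node.NextSibling)
        {
            // Skip whitespace text nodes and comments between the elements.
            if (node.NodeType != HtmlNodeType.Element) continue;

            if (node.Name == "h1" || node.Name == "h2")
            {
                current = new Section { Caption = node.InnerText.Trim() };
                sections.Add(current);
            }
            else if (node.Name != "hr" && current != null)
            {
                current.Body.AppendLine(node.InnerText.Trim());
            }
        }
        return sections;
    }
}
```

The first entry of the returned list then corresponds to the h1 caption and its text, and the remaining entries to the subcaptions, without having to know in advance which tags carry the content.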
There are some possibilities:
REGEX - Fast but not reliable; it can't deal with malformed HTML.
HtmlAgilityPack - Good, but it has many memory leaks. If you only need to process a few files, that is no problem.
SGMLReader - Really good, but there is a problem: sometimes it can't find the default namespace needed to get the other nodes, and then it is impossible to parse the HTML.
http://developer.mindtouch.com/SgmlReader
Majestic-12 - Good, but not as fast as SGMLReader.
http://www.majestic12.co.uk/projects/html_parser.php
Example for SGMLReader (VB.NET)
Imports System.Xml.Linq

' vSource holds the HTML source as a String
Dim sgmlReader As New Sgml.SgmlReader()
Dim htmldoc As XDocument
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
htmldoc = XDocument.Load(sgmlReader)
Dim XNS As XNamespace
' This part can go wrong: sometimes the default namespace cannot be determined
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If
' Use it with the LINQ queries, e.g. to collect the text of all <script> elements
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next
Majestic-12 is different: you have to walk from tag to tag with a "Next" command. You can find example code with the DLL.
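Once SgmlReader has produced an XDocument, the caption/text grouping asked for in the question can be done with plain LINQ to XML. The sketch below is in C# (the question's language) and is only an illustration: `XhtmlSections` is a made-up name, the container element is assumed to hold the headings, content elements and <hr> separators as direct children, and elements are matched on their local name so that the default-namespace question does not matter here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XhtmlSections
{
    // Returns (caption, text) pairs for the direct children of 'container'.
    // A new pair starts at every <h1>/<h2>; <hr> is ignored; the text of any
    // other element (<p>, <div>, <ul>, ...) is collected under the last caption.
    public static List<KeyValuePair<string, string>> Parse(XElement container)
    {
        var result = new List<KeyValuePair<string, string>>();
        string caption = null;
        var body = new List<string>();

        foreach (XElement e in container.Elements())
        {
            // LocalName ignores whatever namespace the element is in.
            string name = e.Name.LocalName.ToLowerInvariant();
            if (name == "h1" || name == "h2")
            {
                // Flush the section collected so far before starting a new one.
                if (caption != null)
                    result.Add(new KeyValuePair<string, string>(
                        caption, string.Join("\n", body.ToArray())));
                caption = e.Value.Trim();
                body.Clear();
            }
            else if (name != "hr" && caption != null)
            {
                body.Add(e.Value.Trim());
            }
        }

        // Flush the last section.
        if (caption != null)
            result.Add(new KeyValuePair<string, string>(
                caption, string.Join("\n", body.ToArray())));
        return result;
    }
}
```

It could be called as XhtmlSections.Parse(htmldoc.Root.Descendants().First(e => e.Name.LocalName == "body")), assuming the loaded document has a body element.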
As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found a library that builds on the HtmlAgilityPack, called Fizzler:
http://code.google.com/p/fizzler/
Using this you could find all <p> tags using:
var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();
Or find a specific div like <div id="myDiv"></div>:
var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");
It can't get any easier than that!