How to parse this piece of HTML?

2022-12-16 16:01 问答作者：

good morning! i am using c# (framework 3.5sp1) and want to parse following piece of html via regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

i need following output:

group 1: content of h1
group 2: content of h1-following text
group 3-n: content of subcaptions + text

what i have atm:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

this will give me every odd subcaption + content (eg. 1, 3, ...) due to the trailing <hr/>. for parsing the h1-caption i have another pattern (<h1.*?>(.*?)</h1>), which only gives me开发者_开发百科 the caption but not the content - i'm fine with that atm.

does anybody have a hint/solution for me or any alternative logics (eg. parsing the html via reader and assigning it this way?)?

edit:

as some brought in HTMLAgilityPack, i was curious about this nice tool. i accomplished getting content of the <h1>-tag.

but ... myproblem is parsing the rest. this is caused by: the tags for the content may vary - from <p> to <div> and <ul>... atm this seems more or less iterate over the whole document and parsing tag for tag ...? any hints?

You will really need HTML parser for this

Don't use regex to parse HTML. Consider using the HTML Agility Pack.

There are some possibilities:

REGEX - Fast but not reliable, it cant deal with malformed html.

HtmlAgilityPack - Good, but have many memory leaks. If you want to deal with a few files, there is no problem.

SGMLReader - Really good, but there are a problem. Sometimes it cant find the default namespace to get others nodes, then it is impossible to parse html.

http://developer.mindtouch.com/SgmlReader

Majestic-12 - Good but not so fast as SGMLReader.

http://www.majestic12.co.uk/projects/html_parser.php

Example for SGMLreader (VB.net)

Dim sgmlReader As New Sgml.SgmlReader()
Public htmldoc As New System.Xml.Linq.XDocument
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
sgmlReader.CaseFolding = CaseFolding.ToLower
htmldoc = XDocument.Load(sgmlReader)    
Dim XNS As XNamespace 

' In this part you can have a bug, sometimes it cant get the Default Namespace*********
Try
      XNS = htmldoc.Root.GetDefaultNamespace
Catch
        XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim = "" Then
        XNS = "http://www.w3.org/1999/xhtml"
End If

'use it with the linq commands
For Each link In htmldoc.Descendants(XNS + "script")
        Scripts &= link.Value
Next

In Majestic-12 is different, you have to walk to every tag with a "Next" command. You can find a example code with the dll.

As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found a fork of the HtmlAgilityPack called Fizzler: http://code.google.com/p/fizzler/ Using this you could find all <p> tags using:

var pTags = doc.DocumentNode.QuerySelectorAll('p').ToList();

Or find a specific div like <div id="myDiv"></div>:

var myDiv = doc.DocumentNode.QuerySelectorAll('#myDiv');

It can't get any easier than that!

继续阅读：html-agility-pack

How to parse this piece of HTML?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？