Extracting Inner text from HTML BODY node with Html Agility Pack
Need a bit of help with HTML Agility Pack!
Basically I want to grab plain-text withing the body node of the HTML. So far I have tried this in vb.net and it fails to return the innertext meaning no change is seen, well atleast from what I can see.
Dim htmldoc As HtmlDocument = New HtmlDocument
htmldoc.LoadHtml(html)
Dim paragraph As HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//body")
If Not htmldoc Is Nothing Then
For Each node In paragraph
node.ParentNode.RemoveChild(node, True)
Next
End If
Return htmldoc.DocumentNode.WriteContentTo
I have tried this:
Return htmldoc.DocumentNode.InnerT开发者_如何学Pythonext
But still no luck!
Any advice???
How about:
Return htmldoc.DocumentNode.SelectSingleNode("//body").InnerText
Jeff's solution is ok if you haven't tables, because text located in the table is sticking like cell1cell2cell3. To prevent this issue use this code (C# example):
var words = doc.DocumentNode?.SelectNodes("//body//text()")?.Select(x => x.InnerText);
return words != null ? string.Join(" ", words) : String.Empty;
精彩评论