Get only the text from HTML with Html Agility Pack
I'm trying to strip all the HTML markup from a document with Html Agility Pack while keeping the text. For example, from this fragment:
<TR><TD>
<B><A HREF="survival/index.html">Survival</A></B><BR>
<I>Be Suspicious, Be Worried, Be Prepared</I><BR>
<TD>
I want to keep only "Be suspicious..."
I have this method, but it doesn't work very well:
private static HtmlDocument RemoveHTML(HtmlDocument document)
{
    HtmlDocument textOfDoc = new HtmlDocument();
    foreach (var node in document.DocumentNode.SelectNodes(".//p|.//title|.//body"))
    {
        var newNode = HtmlNode.CreateNode(node.InnerText+" ");
        textOfDoc.DocumentNode.AppendChild(newNode);
    }
    return textOfDoc;
}
THANKS!
It looks like you're only extracting P, TITLE and BODY tags. If you want I tags as well, you need to do this:
document.DocumentNode.SelectNodes(".//p|.//title|.//body|.//i")
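To show how that XPath could be used end to end, here is a minimal sketch that pulls out the text of whatever nodes an XPath matches. The class name TextExtractor and the helper ExtractText are made up for illustration and are not part of Html Agility Pack. The sketch assumes the input is available as a string and that only the text inside the matched elements is wanted.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TextExtractor
{
    // Hypothetical helper: returns the decoded, trimmed text of every node matched by the XPath.
    public static List<string> ExtractText(string html, string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var results = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results;
        }

        foreach (var node in nodes)
        {
            // DeEntitize converts entities such as &amp; back to plain characters.
            var text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                results.Add(text);
            }
        }

        return results;
    }

    public static void Main()
    {
        // The fragment from the question, as a string literal.
        const string html =
            "<TR><TD><B><A HREF=\"survival/index.html\">Survival</A></B><BR>" +
            "<I>Be Suspicious, Be Worried, Be Prepared</I><BR><TD>";

        foreach (var line in ExtractText(html, "//i"))
        {
            Console.WriteLine(line);
        }
    }
}
```

Note that InnerText on an element includes the text of all its descendants, so selecting .//body together with .//p or .//i will repeat the same text more than once. If only the italic line is wanted, selecting //i on its own is probably enough.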