C# parsing HTML for general use?

2022-12-31 18:25

What is the best way to take a string of HTML and turn it into something useful?

Essentially, if I take a URL and fetch the HTML from it in .NET, I get a response, but it comes in the form of a file, a stream, or a string.

What if I want an actual document, or something I can crawl like an XmlDocument object?

I have some thoughts and an already-implemented solution for this, but I am interested to see what the community thinks.

HTML pages are rarely valid XML, even if written in XHTML, so they cannot be loaded into a standard XML object.
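To illustrate the point, here is a minimal sketch that hands a small but typical HTML fragment to XmlDocument. The class name, the helper name, and the sample markup are invented for this example. An unclosed <br> is perfectly legal HTML, but in XML it is a mismatched tag.

```csharp
using System;
using System.Xml;

public static class XmlLoadDemo
{
    // Hypothetical helper: returns true if the markup parses as well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // LoadXml throws XmlException on any well-formedness error.
            return false;
        }
    }

    public static void Main()
    {
        // Unclosed <br>: fine for a browser, rejected by the XML parser.
        Console.WriteLine(IsWellFormedXml("<p>Hello<br>world</p>"));  // False
        // The same content with a self-closed <br/> loads without error.
        Console.WriteLine(IsWellFormedXml("<p>Hello<br/>world</p>")); // True
    }
}
```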

Take a look at the HTML Agility Pack. This .NET component will allow you to traverse the DOM even if the markup is not valid.
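As a rough sketch of what that looks like, the following pulls every link out of a fragment with an unclosed <p> and <br>. It assumes the HtmlAgilityPack NuGet package is referenced; the class name, the ExtractLinks helper, and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class AgilityPackDemo
{
    // Hypothetical helper: collects the href of every <a> element in the markup.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerates unclosed or mismatched tags

        var links = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode anchor in anchors)
            {
                links.Add(anchor.GetAttributeValue("href", string.Empty));
            }
        }
        return links;
    }

    public static void Main()
    {
        // The <p> and <br> are never closed, as on many real pages.
        string html = "<p>See <a href='/docs'>the docs</a><br>or <a href='/faq'>the FAQ</a>";
        foreach (string href in ExtractLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The null check matters in practice: iterating the result of SelectNodes directly is a common source of NullReferenceException when a page has no matching nodes.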

I use the mshtml API.

Simply reference the mshtml assembly, then include the namespace.

From there you can declare an HTMLDocument object, which is queryable. It's a bit of a headache in places because the API design forces you to do a lot of casting, but it does get the job done, and it can always be put into a utility class of its own so you don't have to keep the oddities in your main application code.

You can use Tidy.NET to clean up the HTML you get in your response and convert it to XHTML. You will then be able to load that into an XmlDocument and traverse the nodes to get what you want.

using System.IO;
using System.Text;
using System.Xml;
using TidyNet;

Tidy document = new Tidy();
TidyMessageCollection messageCollection = new TidyMessageCollection();

// Emit XHTML without a DOCTYPE so the output can be loaded as XML.
document.Options.DocType = DocType.Omit;
document.Options.Xhtml = true;
document.Options.CharEncoding = CharEncoding.UTF8;
document.Options.LogicalEmphasis = true;

// Leave the content and layout alone; only repair the markup.
document.Options.MakeClean = false;
document.Options.QuoteNbsp = false;
document.Options.SmartIndent = false;
document.Options.IndentContent = false;
document.Options.TidyMark = false;

document.Options.DropFontTags = false;
document.Options.QuoteAmpersand = true;
document.Options.DropEmptyParas = true;

// xmlResult is the HTML string taken from the response.
MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();
byte[] array = Encoding.UTF8.GetBytes(xmlResult);
input.Write(array, 0, array.Length);
input.Position = 0;

// Warnings and errors found while tidying end up in messageCollection.
document.Parse(input, output, messageCollection);

string tidyXhtml = Encoding.UTF8.GetString(output.ToArray());

XmlDocument outputXml = new XmlDocument();
outputXml.LoadXml(tidyXhtml);

The easiest way is to load it into the System.Windows.Forms.HtmlDocument class. You can then access the DOM from there.

Of course, you would want to look at the Content-Type header in the HTTP response to determine whether this is actually HTML (which the question referred to) or perhaps binary data such as an image.

HTTP basically just spits out a raw document, which is either binary data or markup text, and the browser generally does the rest, using the hints it is given in the response headers. This is of course all nicely wrapped in the HttpWebResponse class, ready to use.
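A minimal sketch of that Content-Type check might look like the following. The class and helper names are invented for the example, treating only text/html and application/xhtml+xml as HTML is an assumption, and so is decoding the body as UTF-8 (a real client would honour the charset parameter).

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ContentTypeDemo
{
    // Hypothetical helper: true for values such as "text/html; charset=utf-8".
    public static bool IsHtmlContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return false;
        // Drop parameters such as "; charset=utf-8" before comparing.
        string mediaType = contentType.Split(';')[0].Trim();
        return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase)
            || mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase);
    }

    // Returns the response body as a string, or null if the server did not send HTML.
    public static string FetchHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            if (!IsHtmlContentType(response.ContentType)) return null;
            // Assumption: the body is UTF-8 encoded.
            using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsHtmlContentType("text/html; charset=utf-8")); // True
        Console.WriteLine(IsHtmlContentType("image/png"));                // False
    }
}
```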

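var browser = new System.Windows.Forms.WebBrowser();
// Navigate is asynchronous: read Document only once DocumentCompleted has fired.
browser.DocumentCompleted += (s, e) => { var doc = browser.Document; /* use doc here */ };
browser.Navigate(new System.Uri("http://example.com"));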

HtmlDocument has a number of useful members.

For example, doc.All is an HtmlElementCollection, which can be turned into a generic sequence with doc.All.Cast<HtmlElement>().

HtmlElement.DomElement exposes the underlying object from the mshtml namespace mentioned in another answer.

Some usage examples can be found in the source of this project.

In addition to HTML Agility Pack, I've published my HtmlMonkey (a lightweight HTML parser) on GitHub.

It doesn't rely on any third-party tools. It just examines each character of the text to extract HTML tokens, and builds a DOM that you can traverse from code.
