C# parsing HTML for general use?
What is the best way to take a string of HTML and turn it into something useful?
Essentially, if I take a URL and fetch the HTML from it in .NET, I get a response, but it comes in the form of a file, a stream, or a string.
What if I want an actual document, or something I can crawl like an XmlDocument object?
I have some thoughts and an already implemented solution on this but I am interested to see what the community thinks about this.
HTML pages are rarely valid XML, even when written as XHTML, so they usually cannot be loaded into a standard XML object.
Take a look at the HTML Agility Pack. This .NET component lets you traverse the DOM even if the markup is not valid.
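As a rough illustration of that suggestion, here is a minimal sketch that loads a string of imperfect HTML with the HTML Agility Pack and lists its link targets. It assumes the HtmlAgilityPack NuGet package is referenced; the class name HapExample, the method ExtractLinks, and the sample markup are made up for the example.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class HapExample
{
    // Hypothetical helper: returns the href value of every <a href="..."> in the markup.
    public static string[] ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses the string; the markup does not need to be well-formed

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return new string[0];

        var links = new string[anchors.Count];
        for (int i = 0; i < anchors.Count; i++)
            links[i] = anchors[i].GetAttributeValue("href", string.Empty);
        return links;
    }

    static void Main()
    {
        // The <p> and <body> elements are deliberately left unclosed.
        string html = "<html><body><p>See <a href='/one'>one</a> and <a href='/two'>two</a>";
        foreach (string link in ExtractLinks(html))
            Console.WriteLine(link);
    }
}
```

The XPath query is only one way to walk the tree; the same nodes can also be reached by iterating over doc.DocumentNode.Descendants("a").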
I use the mshtml API.
Simply reference the mshtml assembly and then import the namespace.
From there you can declare an HTMLDocument object, which is queryable. It is a bit of a headache in places because the API design forces you to do seemingly arbitrary casts, but it gets the job done, and it can always be put into a utility class of its own so the oddities stay out of your main application code.
You can use Tidy.net to clean up the HTML you get in your response and convert it to well-formed XHTML. You can then load the result into an XmlDocument and traverse the nodes to get what you want.
Tidy document = new Tidy();
TidyMessageCollection messageCollection = new TidyMessageCollection();
document.Options.DocType = DocType.Omit;
document.Options.Xhtml = true;
document.Options.CharEncoding = CharEncoding.UTF8;
document.Options.LogicalEmphasis = true;
document.Options.MakeClean = false;
document.Options.QuoteNbsp = false;
document.Options.SmartIndent = false;
document.Options.IndentContent = false;
document.Options.TidyMark = false;
document.Options.DropFontTags = false;
document.Options.QuoteAmpersand = true;
document.Options.DropEmptyParas = true;
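// xmlResult holds the raw HTML string from the response; copy it into a stream for Tidy.
MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();
byte[] array = Encoding.UTF8.GetBytes(xmlResult);
input.Write(array, 0, array.Length);
input.Position = 0; // rewind so Parse reads from the start
document.Parse(input, output, messageCollection);
string tidyXhtml = Encoding.UTF8.GetString(output.ToArray());
// The tidied output is XHTML, so XmlDocument can load it.
XmlDocument outputXml = new XmlDocument();
outputXml.LoadXml(tidyXhtml);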
The easiest way is to load it into the System.Windows.Forms.HtmlDocument class. You can then access the DOM from there.
Of course, you would want to look at the Content-Type header of the HTTP response to determine whether this is actually HTML (which the question referred to) or binary data such as an image.
HTTP basically just returns a raw document, which is either binary data or markup text, and the browser generally does the rest, using the hints it is given in the response headers. All of this is nicely wrapped in the HttpWebResponse class, ready to use.
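The Content-Type check described above could look something like the following sketch. The class name FetchExample and the helpers IsHtml and DownloadHtml are hypothetical names for illustration; only HttpWebRequest, HttpWebResponse, and StreamReader come from the framework. The sketch ignores the charset parameter and lets StreamReader fall back to its default encoding, which is a simplification.

```csharp
using System;
using System.IO;
using System.Net;

static class FetchExample
{
    // Hypothetical helper: true when a Content-Type value such as
    // "text/html; charset=utf-8" names an HTML media type.
    public static bool IsHtml(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return false;
        string mediaType = contentType.Split(';')[0].Trim(); // drop "; charset=..." and similar
        return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase)
            || mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase);
    }

    // Hypothetical helper: returns the response body as text,
    // or null if the server did not declare it as HTML.
    public static string DownloadHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            if (!IsHtml(response.ContentType)) return null;
            // Simplification: the declared charset is not consulted here.
            using (var reader = new StreamReader(response.GetResponseStream()))
                return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        string html = DownloadHtml("http://example.com");
        Console.WriteLine(html == null ? "Not an HTML response." : "Received " + html.Length + " characters of HTML.");
    }
}
```

The returned string can then be handed to whichever parser you prefer.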
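var browser = new System.Windows.Forms.WebBrowser();
browser.Navigate(new System.Uri("http://example.com"));
// Navigate is asynchronous: read browser.Document only after the DocumentCompleted event has fired.
var doc = browser.Document;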
HtmlDocument has a number of useful members. For example, doc.All is an HtmlElementCollection, which can be turned into a generic sequence with Cast<HtmlElement>(). HtmlElement.DomElement exposes the underlying object from the mshtml namespace mentioned in another answer.
Some usage examples can be found in the source of this project.
In addition to HTML Agility Pack, I've published my HtmlMonkey (a lightweight HTML parser) on GitHub.
It doesn't rely on any third-party tools. It just examines each character of the text to extract HTML tokens and builds a DOM that you can traverse from code.