How to read HTML as XML?

2023-02-20 07:27

I want to extract a couple of links from an HTML page downloaded from the internet, and I think LINQ to XML would be a good fit for my case.

My problem is that I can't create an XmlDocument from the HTML. Using Load(string url) didn't work, so I downloaded the HTML to a string using:

public static string readHTML(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    // Dispose the response and the reader even if reading throws.
    using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
    using (StreamReader sr = new StreamReader(res.GetResponseStream()))
    {
        return sr.ReadToEnd();
    }
}

When I try to load that string using LoadXml(string xml), I get the exception:

'--' is an unexpected token. The expected token is '>'

How should I read the HTML file into parsable XML?

HTML simply isn't the same as XML (unless the HTML actually happens to be conforming XHTML or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you can transform the result to LINQ to XML, or process it directly.

I haven't used it myself, but I suggest you take a look at SgmlReader. Here's a sample from their home page:

// setup SgmlReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
{
    DocType = "HTML",
    WhitespaceHandling = WhitespaceHandling.All,
    CaseFolding = Sgml.CaseFolding.ToLower,
    InputStream = reader
};

// create document
XmlDocument doc = new XmlDocument()
{
    PreserveWhitespace = true,
    XmlResolver = null
};
doc.Load(sgmlReader);
return doc;

If you want to extract some links from a page, as you mentioned, try using HTML Agility Pack.

This code gets a page from the web and extracts all links:

HtmlWeb web = new HtmlWeb();  
HtmlDocument document = web.Load("http://www.stackoverflow.com");  
HtmlNode[] links = document.DocumentNode.SelectNodes("//a").ToArray();

Open an HTML file from disk and get the URL of a specific link:

HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\page.html");
HtmlNode link = document2.DocumentNode.SelectSingleNode("//a[@id='myLink']");
Console.WriteLine(link.Attributes["href"].Value);

HTML is not XML. HTML is based on SGML, and as such does not ensure that the markup is well-formed XML (XML is itself a subset of SGML). You can only parse XHTML, i.e. XML-compatible HTML, as XML. But of course that is not the case for most websites.

To work with HTML, you need to use an HTML parser.
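To illustrate the difference, here is a minimal sketch that uses only System.Xml.Linq. The class name XhtmlLinks, the method TryExtractLinks and the two sample strings are made up for this example. It assumes the input has no DOCTYPE and no HTML-only entities such as &nbsp;. Well-formed XHTML can be queried with LINQ to XML, while ordinary HTML makes the XML parser throw:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlLinks
{
    // Hypothetical helper: returns the href values of all <a> elements,
    // or null when the markup is not well-formed XML.
    public static List<string> TryExtractLinks(string markup)
    {
        try
        {
            XDocument doc = XDocument.Parse(markup);
            return doc.Descendants()
                      // Compare local names so the XHTML namespace does not matter.
                      .Where(e => e.Name.LocalName == "a")
                      .Select(e => (string)e.Attribute("href"))
                      .Where(href => href != null)
                      .ToList();
        }
        catch (XmlException)
        {
            // Typical for real-world HTML: unquoted attributes, unclosed tags, etc.
            return null;
        }
    }

    public static void Main()
    {
        // Well-formed XHTML: every tag is closed and every attribute is quoted.
        string xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                     + "<a href=\"/a\">A</a><br /><a href=\"/b\">B</a></body></html>";
        // Ordinary HTML: unquoted attribute value and an unclosed <br>.
        string html = "<html><body><a href=/a>A</a><br><a href=\"/b\">B</a></body></html>";

        foreach (string markup in new[] { xhtml, html })
        {
            List<string> links = TryExtractLinks(markup);
            Console.WriteLine(links == null
                ? "not well-formed XML"
                : string.Join(", ", links));
        }
    }
}
```

For pages that fail this way, a real HTML parser is the way to go.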

If you know the nodes you're interested in, I would use regex to extract the links from the string.
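A rough sketch of that regex approach is below. The class name RegexLinks, the method ExtractLinks and the pattern itself are assumptions made for this example. The pattern only handles quoted href values on <a> tags and will miss or mis-match unusual markup, so treat it as a quick hack rather than a general HTML parser:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexLinks
{
    // Matches <a ... href="..."> or <a ... href='...'>.
    // Group 1 captures the opening quote, \1 requires the same closing quote.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<url>.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns every quoted href value found on an <a> tag.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }

    public static void Main()
    {
        string html = "<p><a class='x' href=\"http://example.com/a\">A</a> "
                    + "<A HREF='/b'>B</A></p>";
        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

The string returned by readHTML in the question could be passed straight to ExtractLinks.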
