
How to copy all the text from an HTML document and save it to a string using C#

I need to create a data index of HTML pages provided to a service by essentially grabbing all text on them and putting them in a string to go into a storage system.

If this were GUI based, I would simply Ctrl+A on the HTML page, copy it, then go to Notepad and Ctrl+V. Simples. If I can do it via good old point n' click, then surely there must be a way to do it programmatically, but I'm struggling to find anything useful.

The HTML docs in question are currently loaded for rendering with the System.Windows.Controls.WebBrowser class, so I wonder if it's somehow possible to grab the data from there?

I'm going to keep hunting, but any pointers would be much appreciated.

Note: We don't want the HTML source code, and would also really rather not have to parse all the source code to get the text unless we absolutely have to.


If I understand your problem correctly, you will have to do a bit of work to get the data.

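// Assumes a System.Windows.Forms.WebBrowser and a project reference to Microsoft.mshtml.
WebBrowser browser = new WebBrowser();  // This is what you have
// Document is null until a page has finished loading (DocumentCompleted).
HtmlDocument doc = browser.Document;    // This gives you the browser contents
string content =
    ((mshtml.IHTMLDocument3)doc.DomDocument).documentElement.innerText;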

That last line returns the browser's view of the rendered content: the visible text of the page, without the markup.
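
The question mentions the WPF System.Windows.Controls.WebBrowser, whose Document property is typed as object rather than HtmlDocument, so the snippet above may not compile against it unchanged. Here is a minimal sketch of the same idea for the WPF control, assuming the page is loaded through that control and that late binding with dynamic is acceptable. BrowserTextIndexer and CaptureText are hypothetical names chosen for illustration, not part of any library.

```csharp
using System;
using System.Windows.Controls;
using System.Windows.Navigation;

public static class BrowserTextIndexer
{
    // Hypothetical helper: invokes onText with the rendered text of the
    // next page that finishes loading in the given WPF WebBrowser.
    public static void CaptureText(WebBrowser browser, Action<string> onText)
    {
        LoadCompletedEventHandler handler = null;
        handler = (sender, e) =>
        {
            // Unsubscribe so the callback fires only once.
            browser.LoadCompleted -= handler;

            // Document is the underlying COM HTML document object;
            // dynamic avoids a compile-time reference to mshtml.
            dynamic doc = browser.Document;
            if (doc == null)
            {
                onText(string.Empty);
                return;
            }

            string text = doc.documentElement.innerText;
            onText(text ?? string.Empty);
        };
        browser.LoadCompleted += handler;
    }
}
```

It would be called before navigating, for example BrowserTextIndexer.CaptureText(browser, text => Store(text)); followed by browser.Navigate(uri), where Store stands in for whatever writes to your storage system.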


This looks like it might be quite helpful.
