How to copy all data from a HTML doc and save it to a string using C#
I need to create a data index of HTML pages provided to a service by essentially grabbing all text on them and putting them in a string to go into a storage system.
If this were GUI based, I would simply Ctrl+A on the HTML page, copy it, then go to Notepad and Ctrl+V. Simples. If I can do it via good old point n' click, then surely there must be a way to do it programmatically, but I'm struggling to find anything useful.
The HTML docs in question are being loaded for rendering currently using the System.Windows.Controls.WebBrowser class, so I wonder if its somehow possible to grab the data from there?
I'm going to keep hunting, but any pointers would be very appreciated.
Note: We don't want the HTM开发者_JAVA百科L source code, and would also really rather not have to parse all the source code to get the text unless we absolutely have to.
If I understand your problem correctly, you will have to do a bit of work to get the data.
WebBrowser browser=new WebBrowser(); // This is what you have
HtmlDocument doc = browser.Document; // This gives you the browser contents
String content =
(((mshtml.HTMLDocumentClass)(doc.DomDocument)).documentElement).innerText;
That last line is the browser's view of the rendered content.
This looks like it might be quite helpful.
精彩评论