In a .NET application, is it possible to get a representation of the DOM as a web browser would see it?
I need a server-side process to be able to produce the same view of the HTML DOM for a web page as a web browser would (I am aware that the DOM representation is browser-specific, so I don't mind a solution that matches only one browser).
I need to be able to work my way back to a user's selection on a web page at a later date. Since there is no firm relationship between the raw HTML for a page and the DOM that a browser constructs from it, this is proving very difficult, to say the least!
My thinking now is that if I can produce the same view of the document in a server-side process, then I may be able to achieve this.
Does anyone have experience of this?
Thanks
OK, different angle. What about using the WebBrowser Control?
As far as I know, nothing prevents a web application from adding a reference to the System.Windows.Forms assembly and using it.
Bit of a long shot, but IMO worth trying!
OK... for what it's worth, I was able to use the WebBrowser control (yes, the one from System.Windows.Forms) to load a remote page and iterate over its DOM freely.
The obstacles I ran into, and how I got past them, are listed below.
Full code, which for the sake of example shows all the images in the remote page:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Threading;
using System.Windows.Forms;
using System.Text;

namespace TestZone
{
    public partial class _Default : System.Web.UI.Page
    {
        // Written by the browser thread, polled by the request thread.
        private volatile bool waiting = false;
        private WebBrowser browser = null;

        protected void Page_Load(object sender, EventArgs e)
        {
            // WebBrowser wraps an ActiveX control, so it must live on an STA thread.
            Thread thread = new Thread(new ParameterizedThreadStart(LoadRemotePage));
            thread.SetApartmentState(ApartmentState.STA);
            waiting = true;
            thread.Start(this);

            // Block the request until the browser thread has filled in litDebug.
            while (waiting)
            {
                Thread.Sleep(10);
            }
        }

        private void LoadRemotePage(object sender)
        {
            try
            {
                browser = new WebBrowser();
                browser.Tag = sender;
                // Subscribe before navigating so the event cannot be missed.
                browser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(browser_DocumentCompleted);
                browser.Navigate("http://stackoverflow.com/questions/4082249/in-a-net-application-is-it-possible-to-get-a-representation-of-the-dom-as-a-web/4085520");

                // There is no message loop on this thread, so pump messages by hand;
                // without this, DocumentCompleted never fires.
                while (browser.ReadyState != WebBrowserReadyState.Complete)
                    System.Windows.Forms.Application.DoEvents();
            }
            catch (Exception ex)
            {
                litDebug.Text = "Error while initializing browser control: " + ex.ToString().Replace("\n", "<br />");
                (sender as _Default).waiting = false;
            }
            finally
            {
                if (browser != null)
                    browser.Dispose();
            }
        }

        void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            try
            {
                // The DOM is available here: list every <img> in the loaded page.
                HtmlElementCollection collection = browser.Document.GetElementsByTagName("img");
                StringBuilder sb = new StringBuilder();
                sb.AppendFormat("Total of {0} images:<br />", collection.Count);
                for (int i = 0; i < collection.Count; i++)
                    sb.AppendFormat("name: {0}, src: {1}<br />", collection[i].GetAttribute("name"), collection[i].GetAttribute("src"));
                litDebug.Text = sb.ToString();
            }
            catch (Exception ex)
            {
                litDebug.Text = "Error while analyzing remote page: " + ex.ToString().Replace("\n", "<br />");
            }
            finally
            {
                // Release the request thread waiting in Page_Load.
                ((sender as WebBrowser).Tag as _Default).waiting = false;
            }
        }
    }
}
Bumps along the way, if anyone is curious:
- Exception while creating the WebBrowser control: the thread was in the wrong state. Fixed by moving the code to a new thread and explicitly setting its ApartmentState to STA.
- The Document property of the WebBrowser was null. The first step of the fix was using the DocumentCompleted event instead of trying to access the Document right after navigating. Still no luck, though: DocumentCompleted never fired. To fix that, I added the loop that calls DoEvents until the ReadyState is Complete. Done and working, but..
- With all this done, changing the literal from within the new thread had no effect on the rendered page, because the request had already finished. I had to make the main thread wait until everything was done.
Hope this will come in handy someday for someone, if not for the OP here. :)
The best you can achieve is to use WebRequest to read the raw response (the HTML output) of the page and, assuming it's valid XHTML, feed it to an XmlReader. That gives you a kind of DOM, or at least the nodes, though not the tree a browser would build after error correction and scripts.
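A minimal sketch of that idea, in case it helps. The class and method names (XhtmlDomSketch, Fetch, Parse) are made up for illustration, and the sample markup is an inline string so the example runs without a network call. It assumes .NET 4 or later (for DtdProcessing) and well-formed XHTML. Because the DOCTYPE is ignored, named entities such as &nbsp; would fail to parse.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

static class XhtmlDomSketch
{
    // Downloads the raw response body. No scripts run, so this is the
    // markup as served, not the DOM a browser ends up with.
    static string Fetch(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Loads well-formed XHTML into an XmlDocument.
    // Throws XmlException if the markup is not well-formed.
    static XmlDocument Parse(string xhtml)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Ignore; // skip the DOCTYPE instead of downloading the DTD
        settings.XmlResolver = null;

        XmlDocument doc = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            doc.Load(reader);
        }
        return doc;
    }

    static void Main()
    {
        // In real use this string would come from Fetch(someUrl).
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
            "<body><img src=\"a.png\" alt=\"A\" /></body></html>";

        XmlDocument doc = Parse(xhtml);
        foreach (XmlElement img in doc.GetElementsByTagName("img"))
            Console.WriteLine(img.GetAttribute("src")); // prints: a.png
    }
}
```

Most real-world pages are not well-formed XML, so expect Parse to throw on them; that is the main limitation of this approach.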
I've previously used an HTML parsing library called SgmlReader, which worked well for getting HTML tag soup into a workable DOM. I would be surprised if it always produces a DOM identical to what a browser would produce though.