In a .NET application, is it possible to get a representation of the DOM as a web browser would see it?

Posted 2023-01-22 23:41

I need a server-side process to be able to produce the same view of the HTML DOM for a web page as a web browser does (I am aware that the DOM representation is browser-specific, so I don't mind a solution that isn't cross-browser).

I need to be able to work my way back to a user selection on a web page at a later date. Since there is no firm relationship between the raw HTML for a page and the DOM that a browser constructs, this is proving very difficult, to say the least!

My thinking now is that if I can produce the same view of the document in a server-side process, then I may be able to achieve this.

Does anyone have experience of this?

Thanks

OK, different angle. What about using the WebBrowser Control?

As far as I know, nothing prevents a web application from adding a reference to the System.Windows.Forms assembly and using it.

Bit of a long shot, but IMO worth trying!

OK... for what it's worth, I was able to successfully use the WebBrowser control (yeah, the one from System.Windows.Forms) to load a remote page and iterate over its DOM freely.

The obstacles I ran into and got past are listed after the code.

Full code, which for the sake of example lists all images in the remote page:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Threading;
using System.Reflection;
using System.Windows.Forms;
using System.Text;

namespace TestZone
{
    public partial class _Default : System.Web.UI.Page
    {
        // Written by the browser thread and polled by the request thread, hence volatile.
        private volatile bool waiting = false;
        private WebBrowser browser = null;
        protected void Page_Load(object sender, EventArgs e)
        {
            Thread thread = new Thread(new ParameterizedThreadStart(LoadRemotePage));
            thread.SetApartmentState(ApartmentState.STA);

            waiting = true;
            thread.Start(this);

            while (waiting)
            {
                Thread.Sleep(10);
            }
        }

        private void LoadRemotePage(object sender)
        {
            try
            {
                browser = new WebBrowser();
                browser.Tag = sender;
                // Subscribe before navigating so the completion event cannot be missed.
                browser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(browser_DocumentCompleted);
                browser.Navigate("http://stackoverflow.com/questions/4082249/in-a-net-application-is-it-possible-to-get-a-representation-of-the-dom-as-a-web/4085520");
                // Pump messages on this STA thread; without this loop DocumentCompleted never fires.
                while (browser.ReadyState != WebBrowserReadyState.Complete)
                    System.Windows.Forms.Application.DoEvents();
                browser.Dispose();
            }
            catch (Exception ex)
            {
                litDebug.Text = "Error while initializing browser control: " + ex.ToString().Replace("\n", "<br />");
            }
            finally
            {
                // Always release the waiting request thread, even if navigation failed.
                (sender as _Default).waiting = false;
            }
        }

        void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            try
            {
                HtmlElementCollection collection = browser.Document.GetElementsByTagName("img");
                StringBuilder sb = new StringBuilder();
                sb.AppendFormat("Total of {0} images:<br />", collection.Count);
                for (int i = 0; i < collection.Count; i++)
                    sb.AppendFormat("name: {0}, src: {1}<br />", collection[i].GetAttribute("name"), collection[i].GetAttribute("src"));
                litDebug.Text = sb.ToString();
            }
            catch (Exception ex)
            {
                litDebug.Text = "Error while analyzing remote page: " + ex.ToString().Replace("\n", "<br />");
            }
            finally
            {
                ((sender as WebBrowser).Tag as _Default).waiting = false;
            }
        }
    }
}

Bumps along the way, if anyone is curious:

Exception while creating the WebBrowser control: the thread was in the wrong apartment state. Fixed by moving the code to a new thread and explicitly setting its ApartmentState to STA.
The Document property of the WebBrowser was null. The first step of the fix was to use the DocumentCompleted event instead of trying to access Document right after calling Navigate. Still no luck though: DocumentCompleted never fired. To fix that I added the loop that calls Application.DoEvents() until ReadyState is Complete. Done and working, but...
With all this done, changing the literal from within the new thread had no effect on the rendered page; the main thread had to wait until everything was done.

Hope this will come in handy someday for someone, if not for the OP here. :)

The best you can achieve is to use WebRequest to read the raw response (the HTML output) of the page and, assuming it is valid XHTML, feed it into an XmlReader. That gives you a kind of DOM, or at least the nodes.
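
A minimal sketch of the XmlReader half of that idea, assuming the markup has already been downloaded into a string (for example via WebRequest) and is well-formed XHTML. The class name XhtmlDomSketch and the method GetImageSources are made up for illustration. Note that with the DTD ignored, named entities such as &nbsp; are undeclared and will make the reader throw, so this only works on strictly well-formed input.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class XhtmlDomSketch
{
    // Returns the src attribute of every <img> element in a well-formed XHTML string.
    public static List<string> GetImageSources(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // Skip a DOCTYPE if present instead of trying to download its DTD.
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        var sources = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                // LocalName ignores the XHTML namespace prefix, if any.
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "img")
                {
                    sources.Add(reader.GetAttribute("src")); // null when src is missing
                }
            }
        }
        return sources;
    }

    public static void Main()
    {
        // Assumed sample input; in practice this would be the downloaded response body.
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<p>Hi</p><img src=\"a.png\" /><img src=\"b.png\" />" +
            "</body></html>";

        foreach (string src in GetImageSources(xhtml))
            Console.WriteLine(src);
    }
}
```

This reads the nodes exactly as they appear in the source; it does not run scripts or apply the error recovery a browser performs, so the result can differ from the browser's DOM.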

I've previously used an HTML parsing library called SgmlReader, which worked well for getting HTML tag soup into a workable DOM. I would be surprised if it always produces a DOM identical to what a browser would produce though.
