
Access the IE DOM out of process in C#

Is there a way to access the IE DOM out of process? An example is a web page scraper that loads the currently displayed page and grabs data from it. I have seen a few ways of downloading the page and processing it, but that will not work when websites return dynamic results and require a login.

I am hoping not to have to write a BHO to access the data and share it via WCF. I have seen some examples that grab the data using C++ and an MSAA server, but I would prefer not to rely on a C++ helper, as I have not used C++ in years.

TIA.


Depending on how much stuff you need to do, you might want to consider using something simple like WatiN. It's a great tool for instantiating a browser instance and walking the tree. The DOM manipulation is quite easy and is well documented (with lots of examples on the web).


If you are only doing scraping and requests, you would probably be best off using the WebRequest object that ships with .NET to do your work.

WebRequest Class @ MSDN
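
As a rough illustration of the WebRequest route, here is a minimal sketch. The class and method names (PageFetcher, CreateRequest, Download) and the 15-second timeout are made up for the example. Note that it only returns the HTML as served to an anonymous client: no script runs and no browser session is shared, so it does not cover the dynamic, logged-in pages described in the question.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of .NET: wraps System.Net.WebRequest.
static class PageFetcher
{
    // Builds the request object; nothing is sent over the network yet.
    public static WebRequest CreateRequest(string url)
    {
        WebRequest request = WebRequest.Create(url);
        request.Method = "GET";
        request.Timeout = 15000; // milliseconds, arbitrary choice for this sketch
        return request;
    }

    // Sends the request and returns the response body as text.
    public static string Download(string url)
    {
        WebRequest request = CreateRequest(url);
        using (WebResponse response = request.GetResponse())
        using (Stream body = response.GetResponseStream())
        using (var reader = new StreamReader(body))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling PageFetcher.Download with a page address returns the markup as a string, which you would then hand to whatever parser you prefer.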

However, if you must have exact access to what is represented in the IE DOM, you should use Microsoft Active Accessibility to gain access. Provided you can identify the window handle or reliable location for the target IE window, and it is visible in a user session, Active Accessibility is the best way to access the target IE window and dig into the DOM. It isn't absolutely necessary to use C++, but it will probably be easier to do most of this in C++.

Active Accessibility User Interface Services @ MSDN

You'll want to use EnumChildWindows to locate (or brute-force query) the DOM window, starting either from the desktop or from a frame window's handle retrieved by enumerating processes. In .NET, process enumeration is available from the System.Diagnostics.Process class.

EnumChildWindows @ MSDN

EnumWindows signature @ pinvoke.net
EnumChildWindows signature @ pinvoke.net

Process.GetProcesses() @ MSDN
Process.MainWindowHandle @ MSDN
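
The window search described above could look roughly like this. Treat it as a sketch under assumptions: the class name "Internet Explorer_Server" for the window that hosts the rendered document and the process name "iexplore" are not stated in this answer, and IeWindowFinder and its methods are names invented for the example.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper: finds the child window that hosts the IE document.
static class IeWindowFinder
{
    private delegate bool EnumWindowsProc(IntPtr hWnd, IntPtr lParam);

    [DllImport("user32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool EnumChildWindows(
        IntPtr hWndParent, EnumWindowsProc lpEnumFunc, IntPtr lParam);

    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern int GetClassName(
        IntPtr hWnd, StringBuilder lpClassName, int nMaxCount);

    // Assumption: IE's document window uses this window class name.
    private const string DocumentWindowClass = "Internet Explorer_Server";

    public static bool IsDocumentWindowClass(string className)
    {
        return string.Equals(className, DocumentWindowClass, StringComparison.Ordinal);
    }

    // Walks all descendants of 'frame' and returns the first document
    // window found, or IntPtr.Zero if there is none.
    public static IntPtr FindDocumentWindow(IntPtr frame)
    {
        IntPtr found = IntPtr.Zero;
        EnumWindowsProc callback = (hWnd, lParam) =>
        {
            var name = new StringBuilder(256);
            GetClassName(hWnd, name, name.Capacity);
            if (IsDocumentWindowClass(name.ToString()))
            {
                found = hWnd;
                return false; // stop enumerating
            }
            return true; // keep going
        };
        EnumChildWindows(frame, callback, IntPtr.Zero);
        GC.KeepAlive(callback); // keep the delegate alive during the native call
        return found;
    }

    // Prints frame and document window handles for running IE processes.
    public static void PrintDocumentWindows()
    {
        foreach (Process p in Process.GetProcessesByName("iexplore"))
        {
            if (p.MainWindowHandle == IntPtr.Zero) continue;
            IntPtr doc = FindDocumentWindow(p.MainWindowHandle);
            Console.WriteLine("pid {0}: frame=0x{1:X} document=0x{2:X}",
                p.Id, p.MainWindowHandle.ToInt64(), doc.ToInt64());
        }
    }
}
```

A handle of zero in the output means no matching child window was found under that frame, for example because the class name assumption does not hold for your IE version.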

To get the type declarations needed to walk the DOM from C#, add a COM reference to 'Microsoft HTML Object Library' to your project. To talk to MSAA, add P/Invoke signatures for the MSAA functions you call.

AccessibleObjectFromWindow Signature @ pinvoke.net

Once you can call MSAA, retrieve an IDispatch through Active Accessibility from the window handle. You will want to send in OBJID_NATIVEOM, which will get you an IDispatch you can interrogate.

Retrieving an IAccessible Object @ MSDN
AccessibleObjectFromWindow() @ MSDN
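
A minimal sketch of that call from C# follows. The P/Invoke declaration is one common way to declare AccessibleObjectFromWindow, the OBJID_NATIVEOM value is the one defined in the Windows SDK headers, and NativeObjectModel and GetDispatch are names made up for this example. Whether a given IE window actually answers an OBJID_NATIVEOM request has not been verified here, so check the returned HRESULT rather than assuming success.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper around oleacc.dll's AccessibleObjectFromWindow.
static class NativeObjectModel
{
    // Object identifier requesting the window's native object model.
    public const uint OBJID_NATIVEOM = 0xFFFFFFF0;

    // Interface identifier of IDispatch.
    public static readonly Guid IID_IDispatch =
        new Guid("00020400-0000-0000-C000-000000000046");

    [DllImport("oleacc.dll")]
    private static extern int AccessibleObjectFromWindow(
        IntPtr hwnd,
        uint dwObjectID,
        ref Guid riid,
        [MarshalAs(UnmanagedType.IDispatch)] out object ppvObject);

    // Asks the given window for an IDispatch. Throws a COMException
    // (or a mapped .NET exception) if the call returns a failure HRESULT.
    public static object GetDispatch(IntPtr documentWindow)
    {
        Guid iid = IID_IDispatch;
        object dispatch;
        int hr = AccessibleObjectFromWindow(
            documentWindow, OBJID_NATIVEOM, ref iid, out dispatch);
        if (hr < 0)
        {
            Marshal.ThrowExceptionForHR(hr);
        }
        return dispatch;
    }
}
```

The window handle passed to GetDispatch is expected to be the document window located in the previous step, not the top-level frame.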

From here, IDispatch may be cast to IHTMLWindow2 or IHTMLDocument2 (and derivatives), which has all of the DOM script model methods and more. Unfortunately I can't remember which one is returned via this method, but in any case, IHTMLWindow2 has the document property (same as window.document in script). Either can be resolved to provide access to the DOM, which is represented by IHTMLDocument2 and all derived interfaces.
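
Since it is unclear which interface comes back, one way to handle both cases is to try each cast in turn. This is only a sketch: it assumes the project has the 'Microsoft HTML Object Library' COM reference (which exposes the mshtml namespace), and DomResolver and its methods are hypothetical names.

```csharp
using System;
using mshtml; // provided by the 'Microsoft HTML Object Library' COM reference

// Hypothetical helper: turns the retrieved object into an IHTMLDocument2.
static class DomResolver
{
    // Returns the document whether 'dispatch' is a document or a window,
    // or null if it is neither.
    public static IHTMLDocument2 ResolveDocument(object dispatch)
    {
        var document = dispatch as IHTMLDocument2;
        if (document != null)
        {
            return document;
        }

        var window = dispatch as IHTMLWindow2;
        return window != null ? window.document : null; // same as window.document in script
    }

    // Prints a few properties to show that the DOM is reachable.
    public static void PrintSummary(object dispatch)
    {
        IHTMLDocument2 document = ResolveDocument(dispatch);
        if (document == null)
        {
            Console.WriteLine("Object is neither IHTMLDocument2 nor IHTMLWindow2.");
            return;
        }

        Console.WriteLine("Title: " + document.title);
        Console.WriteLine("URL:   " + document.url);

        IHTMLElement body = document.body;
        string html = body != null ? body.innerHTML : null;
        Console.WriteLine("Body length: " + (html != null ? html.Length : 0));
    }
}
```

From the IHTMLDocument2 you can continue to the other document interfaces (for example by casting to IHTMLDocument3) to reach methods such as getElementById.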
