Getting the HTML source from a WPF-WebBrowser-Control using IPersistStreamInit
I am trying to get the HTML source of a webpage that has been loaded into a WPF WebBrowser control. The only way to do this seems to be casting WebBrowser.Document to IPersistStreamInit (which I have to define myself, as it is a COM interface) and calling IPersistStreamInit.Save with an implementation of IStream (again, a COM interface), which persists the document to the stream. Well, sort of: I always get only the first 4 kilobytes of the document, not the whole thing, and I don't know why.
Here's the code of IPersistStreamInit:
using System;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;
using System.Security;
namespace PayPal.SkyNet.BpiTool.Interop
{
[ComImport, InterfaceType(ComInterfaceType.InterfaceIsIUnknown),
SuppressUnmanagedCodeSecurity,
Guid("7FD52380-4E07-101B-AE2D-08002B2EC713")]
public interface IPersistStreamInit
{
void GetClassID(out Guid pClassID);
[PreserveSig]
int IsDirty();
void Load([In, MarshalAs(UnmanagedType.Interface)] IStream pstm);
void Save([In, MarshalAs(UnmanagedType.Interface)] IStream pstm, [In, MarshalAs(UnmanagedType.Bool)] bool fClearDirty);
// The native signature is GetSizeMax(ULARGE_INTEGER* pcbSize), i.e. a single 64-bit out value.
void GetSizeMax(out long pcbSize);
void InitNew();
}
}
Here's the code of the IStream-Implementation:
using System;
using System.IO;
using System.Runtime.InteropServices.ComTypes;
namespace PayPal.SkyNet.BpiTool.Interop
{
public class ComStream : IStream
{
private Stream _stream;
public ComStream(Stream stream)
{
this._stream = stream;
}
public void Commit(int grfCommitFlags)
{
}
public void CopyTo(IStream pstm, long cb, IntPtr pcbRead, IntPtr pcbWritten)
{
}
public void LockRegion(long libOffset, long cb, int dwLockType)
{
}
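public void Read(byte[] pv, int cb, IntPtr pcbRead)
{
// The second argument of Stream.Read is the offset into pv, not the stream position.
int read = this._stream.Read(pv, 0, cb);
// Report the number of bytes actually read to the COM caller, if it asked for it.
if (pcbRead != IntPtr.Zero)
System.Runtime.InteropServices.Marshal.WriteInt32(pcbRead, read);
}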
public void Revert()
{
}
public void SetSize(long libNewSize)
{
this._stream.SetLength(libNewSize);
}
public void Stat(out System.Runtime.InteropServices.ComTypes.STATSTG pstatstg, int grfStatFlag)
{
pstatstg = new System.Runtime.InteropServices.ComTypes.STATSTG();
}
public void UnlockRegion(long libOffset, long cb, int dwLockType)
{
}
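public void Write(byte[] pv, int cb, IntPtr pcbWritten)
{
this._stream.Write(pv, 0, cb);
// Report the number of bytes written to the COM caller, if it asked for it.
if (pcbWritten != IntPtr.Zero)
System.Runtime.InteropServices.Marshal.WriteInt32(pcbWritten, cb);
}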
public void Clone(out IStream outputStream)
{
outputStream = null;
}
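public void Seek(long dlibMove, int dwOrigin, IntPtr plibNewPosition)
{
// STREAM_SEEK_SET/CUR/END (0/1/2) map directly onto SeekOrigin.Begin/Current/End.
long position = this._stream.Seek(dlibMove, (SeekOrigin)dwOrigin);
// Report the new position to the COM caller, if it asked for it.
if (plibNewPosition != IntPtr.Zero)
System.Runtime.InteropServices.Marshal.WriteInt64(plibNewPosition, position);
}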
}
}
Now I have a class to wrap it all up. As I don't want to redistribute the mshtml-interop-assembly I chose late-binding - and as late binding is easier in VB I did it in VB. Here's the code:
Option Strict Off
Option Explicit Off
Imports System.IO
Public Class HtmlDocumentWrapper : Implements IDisposable
Private htmlDoc As Object
Public Sub New(ByVal htmlDoc As Object)
Me.htmlDoc = htmlDoc
End Sub
Public Property Document As Object
Get
Return Me.htmlDoc
End Get
Set(value As Object)
Me.htmlDoc = Nothing
Me.htmlDoc = value
End Set
End Property
Public ReadOnly Property DocumentStream As Stream
Get
Dim str As Stream = Nothing
Dim psi As IPersistStreamInit = CType(Me.htmlDoc, IPersistStreamInit)
If psi IsNot Nothing Then
str = New MemoryStream
Dim cStream As New ComStream(str)
psi.Save(cStream, False)
str.Position = 0
End If
Return str
End Get
End Property
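' The class declares IDisposable, so it needs a Dispose method in order to compile.
Public Sub Dispose() Implements IDisposable.Dispose
Me.htmlDoc = Nothing
End Sub
End Class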
Now I should be able to use all this:
private void Browser_Navigated(object sender, NavigationEventArgs e)
{
// HtmlDocumentWrapper has no parameterless constructor, so pass the document directly.
HtmlDocumentWrapper doc = new HtmlDocumentWrapper(Browser.Document);
using (StreamReader sr = new StreamReader(doc.DocumentStream))
{
using (StreamWriter sw = new StreamWriter("test.txt"))
{
//BOOM! Only 4kb of HTML source
sw.WriteLine(sr.ReadToEnd());
sw.Flush();
}
}
}
Does anybody know why I don't get the entire HTML source? Any help is greatly appreciated.
Regards
Arne
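Move your code from Browser.Navigated to Browser.LoadCompleted, as Sheng Jiang correctly notes above, and it works. Navigated fires as soon as the document has been located and has started downloading, so at that point only part of it may be available; LoadCompleted fires once the download has finished.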
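A minimal sketch of that change, assuming a WPF project. The helper class BrowserSourceHook and its Attach method are hypothetical names made up for illustration; only WebBrowser.LoadCompleted, WebBrowser.Document and NavigationEventArgs come from the framework.

```csharp
using System;
using System.Windows.Controls;
using System.Windows.Navigation;

// Hypothetical helper (not part of WPF): defers work on the document
// until the WebBrowser control reports that loading has finished.
public static class BrowserSourceHook
{
    public static void Attach(WebBrowser browser, Action<object> onLoaded)
    {
        if (browser == null) throw new ArgumentNullException("browser");
        if (onLoaded == null) throw new ArgumentNullException("onLoaded");

        // LoadCompleted is raised after the document has finished downloading,
        // whereas Navigated is raised when the download has only just started.
        browser.LoadCompleted += (object sender, NavigationEventArgs e) =>
        {
            object document = browser.Document;
            if (document != null)
            {
                onLoaded(document);
            }
        };
    }
}
```

With the classes from the question, this could be used as BrowserSourceHook.Attach(Browser, d => { /* wrap d in HtmlDocumentWrapper and read DocumentStream */ }); the body of the original Browser_Navigated handler would move into that callback unchanged.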
This is just a guess:
The stream does not have a known length, since it may still be downloading. You'll need to read it until it says EOF.
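If that guess applies, reading until EOF could look roughly like the sketch below. StreamDrain and ReadUntilEof are hypothetical names, and the 4096-byte buffer size is an arbitrary choice; the loop relies only on Stream.Read returning 0 once no more data is available.

```csharp
using System;
using System.IO;

// Hypothetical helper: copies a stream of unknown length into a byte array
// by reading repeatedly until Read reports end of stream.
public static class StreamDrain
{
    public static byte[] ReadUntilEof(Stream source)
    {
        if (source == null) throw new ArgumentNullException("source");

        var buffer = new byte[4096];
        using (var collected = new MemoryStream())
        {
            int read;
            // A single Read call may return fewer bytes than requested,
            // so keep going until it returns 0 (end of stream).
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                collected.Write(buffer, 0, read);
            }
            return collected.ToArray();
        }
    }
}
```

Note that this only helps when the data eventually arrives in the stream; it does not help if Save was called before the document had finished loading.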