Recursive HTTP calls exhibit differing behavior in IDE versus deployed executable
The code makes HTTP calls to an exposed representation of an SVN tree. It then parses the HTML and records the files it finds so they can later be pulled down and pushed to the user. This is being done within a WPF application. Below is the code along with an image showing the directory structure.
private readonly String _baseScriptURL = @"https://xxxxxxxxxx/svn/repos/xxxxxxxxxx/trunk/scripts/vbs/web/";

private void FindScripts(String url, ref ICollection<String> files)
{
    //MyFauxMethod();
    StringBuilder output = new StringBuilder();
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Credentials = new Credentials().GetCredentialCache(url);
    _logger.Log("Initiating request [" + url + "]", EventType.Debug);
    try
    {
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            _logger.Log("Response received for request [" + url + "]", EventType.Debug);
            int count = 0;
            byte[] buffer = new byte[256];
            while ((count = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                if (count < 256)
                {
                    List<byte> trimmedBuffer = buffer.ToList();
                    trimmedBuffer.RemoveRange(count, 256 - count);
                    String data = Encoding.ASCII.GetString(trimmedBuffer.ToArray());
                    output.Append(data);
                }
                else
                {
                    String data = Encoding.ASCII.GetString(buffer);
                    output.Append(data);
                }
            }
        }
        String html = output.ToString();
        HTMLDocument doc = new HTMLDocumentClass();
        IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
        doc2.write(new object[] { html });
        IHTMLElementCollection ul = doc.getElementsByTagName("li");
        doc2.close();
        doc.close();
        foreach (IHTMLElement item in ul)
        {
            if (item != null &&
                item.innerText != null)
            {
                String element = item.innerText.Trim().Replace(" ", "%20");
                //nothing to do with going up a dir
                if (element == "..")
                    continue;
                _logger.Log("Interrogating [" + element + "]", EventType.Debug);
                String filename = System.IO.Path.GetFileName(element);
                if (String.IsNullOrEmpty(filename))
                {
                    //must be a directory; recursively search if honored dir
                    if (!_ignoredDirectories.Contains(element))
                    {
                        _logger.Log("Searching directory [" + element + "]", EventType.Debug);
                        FindScripts(url + System.IO.Path.GetDirectoryName(element) + "/", ref files);
                    }
                    else
                        _logger.Log("Ignoring directory [" + element + "]", EventType.Debug);
                }
                else
                {
                    //add honored files to list for parsing meta data later
                    if (_honoredExtensions.Contains(System.IO.Path.GetExtension(filename)))
                    {
                        files.Add(url + filename);
                        _logger.Log("Added file [" + (url + filename) + "]", EventType.Debug);
                    }
                }
                //MyFauxMethod();
            }
            //MyFauxMethod();
        }
    }
    catch (Exception e)
    {
        _logger.Log(e);
    }
    //MyFauxMethod();
}

private void MyFauxMethod()
{
    int one = 1;
    int two = 2;
    int three = one + two;
}
First off, apologies for the lengthy code block; I wanted to make certain the full method was understood. The problem only occurs when the generated Release executable is run outside of the IDE. If the Release build is run within the IDE, it functions without any problems.
In addition the problem does not exist when executing the generated Debug build outside of the IDE or within the IDE; it functions appropriately in both scenarios.
The problem is that the recursive calls stop and the code continues on past the recursive method. No exception is thrown within the thread; everything simply stops before moving into each directory as it does in the other builds.
The log lines of the Release build look like this...
Initiating request [https://xxxxxxxxx/svn/repos/xxxxxxxxx/trunk/scripts/vbs/web/]
Response received for request [https://xxxxxxxxx/svn/repos/xxxxxxxxx/trunk/scripts/vbs/web/]
Interrogating [beq/]
Searching directory [beq/]
Initiating request [https://xxxxxxxxx/svn/repos/xxxxxxxxx/trunk/scripts/vbs/web/beq/]
Response received for request [https://xxxxxxxxx/svn/repos/xxxxxxxxx/trunk/scripts/vbs/web/beq/]
Interrogating [core/]
Searching directory [core/]
Initiating request [https://xxxxxxxxx/svn/repos/xxxxxxxxx/trunk/scripts/vbs/web/beq/core/]
Response received for request [https://xxxxxxxxx/svn/repos/xxxxxxxxx/trunk/scripts/vbs/web/beq/core/]
Interrogating [BEQ-Core%20Library.vbs]
Added file [https://xxxxxxxxx/svn/repos/xxxxxxxxx/trunk/scripts/vbs/web/beq/core/BEQ-Core%20Library.vbs]
Interrogating [one-offs/]
Searching directory [one-offs/]
Initiating request [https://xxxxxxxxx/svn/repos/xxxxxxxxx/trunk/scripts/vbs/web/beq/one-offs/]
Response received for request [https://xxxxxxxxx/svn/repos/xxxxxxxxx/trunk/scripts/vbs/web/beq/one-offs/]
Recursively finding scripts took [6]s [140]ms for [1 ]
Parsing metadata took [0]m [0]s [906]ms for [1 ]
Total time took [0]m [7]s [46]ms
UPDATE:
After adding in approximately 3 additional log lines during debugging, it is now functioning as it should. The outstanding question is why? Attempting to isolate the problem code in a separate application produces no negative results.
Any ideas on why this would be happening?
UPDATE:
Changing the log lines to call a faux method produced the same results. I have added the calls to the faux method and the faux method in the above source, 1 at the entry of the method and 3 near the bottom. The calls themselves are commented to make it easier to locate; they are NOT commented in the actual code.
If I comment out any one of the 4 added faux method calls, it will revert to not functioning. Again this is only in Release via CTRL+F5 or outside of the IDE in its entirety.
UPDATE:
Added .close() on the HtmlDocument instances per fubaar; same behavior is still being exhibited.
UPDATE:
Added explicit calls to the GC per fubaar; same behavior is still being exhibited.
I noticed that you aren't calling .close() on the HtmlDocument instances you are creating - that would be the first thing I would try, to ensure that mshtml is cleaning up correctly after the write().
You could also try building your list of filenames from the Html and releasing the HtmlDocument before you recurse into the filename list - so that you aren't creating an ever growing number of HtmlDocument instances - in pure .NET world that would just be a memory issue, but when you involve mshtml and therefore COM interop it's our experience that it's often worth treading much more carefully.
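The "build the list first, then recurse" idea can be sketched roughly as below. This is only an illustration under assumptions: `ScriptFinder` and the `listEntries` delegate are hypothetical names, with `listEntries` standing in for "download the page, parse it with mshtml, copy each li innerText into a plain string, and release the document". Directory detection is simplified to a trailing slash, and the extension and ignore-list filtering from the question is left out.

```csharp
using System;
using System.Collections.Generic;

public static class ScriptFinder
{
    // listEntries is a hypothetical stand-in for the download + mshtml parsing step.
    // It must return plain strings only, so no COM object outlives the call.
    public static void FindScripts(
        string url,
        Func<string, IList<string>> listEntries,
        ICollection<string> files)
    {
        // Phase 1: snapshot the entry names into managed memory.
        List<string> entries = new List<string>(listEntries(url));

        // Phase 2: recurse over the snapshot; the parsed document is no longer needed.
        foreach (string entry in entries)
        {
            // Nothing to do with going up a directory.
            if (entry == "..")
                continue;

            // Assumption: directory entries end with a slash, as in the logged listing.
            if (entry.EndsWith("/"))
                FindScripts(url + entry, listEntries, files);
            else
                files.Add(url + entry);
        }
    }
}
```

With this shape, at most one HtmlDocument exists at any time regardless of how deep the tree is, because each level finishes with its document before descending.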
Added as an edit as it was too long for a comment:
In our WPF app we hit an issue with processing large numbers of HtmlDocument instances where WPF doesn't routinely pump the initialization / termination messages from COM. In our implementation that resulted in a memory leak and eventual COM errors. That's clearly not the behavior you are seeing, but I wonder if there could be an issue with the COM interop not being able to clean up correctly. What might be worth trying is adding these lines once you are done with (and have fully released) your COM objects:
GC.Collect();
GC.WaitForPendingFinalizers();
That will add any COM interop objects to the finalizer queue (the .Collect call) and will cause .NET to pump any COM messages (a side effect of the WaitForPendingFinalizers call).
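For what "fully released" could look like in practice, here is a minimal sketch. `ComCleanup.ReleaseAndCollect` is a hypothetical helper name, not something from the question; it uses `Marshal.IsComObject` and `Marshal.ReleaseComObject` from System.Runtime.InteropServices, and it assumes nothing else still needs the wrappers, since a released wrapper can no longer be used.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ComCleanup
{
    // Releases every argument that is a COM wrapper, then forces a collection.
    // Returns how many of the arguments were actually COM objects.
    public static int ReleaseAndCollect(params object[] candidates)
    {
        int released = 0;
        foreach (object candidate in candidates)
        {
            // Skip nulls and ordinary managed objects.
            if (candidate != null && Marshal.IsComObject(candidate))
            {
                Marshal.ReleaseComObject(candidate);
                released++;
            }
        }

        // The two calls suggested above: collect, then wait for finalizers to run.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        return released;
    }
}
```

In the question's method this would be called after the loop over the li elements, for example as ComCleanup.ReleaseAndCollect(ul, doc2, doc), once nothing reads from them any more.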
It is a bit of a stab in the dark I know, but this stuff is essentially COM interop (even though it's covered with .NET objects) and this might well be the issue.
Not sure if this is right, as it is purely speculation. The COM API in mshtml.dll doesn't appear to be documented anywhere and the .NET wrappers are vague regarding allowed behaviors.
The .NET documentation assumes you are getting an instance of HtmlDocument from a hosted WebBrowser control. There's nothing in the documentation that suggests you should be creating your own instances. Always be leery around adapter patterns, as their underlying data management may be compromised when they are used in an unexpected manner. You would hope that you would be prevented from doing this, but it may be an omission or an artifact of how .NET wraps COM.
The IHTMLElementCollection points directly back into the HtmlDocument. If that class is using a shared instance of a WebControl (you never created your own web control, so implementation is up to the framework), the collection would become invalidated when the document changes.
Why the different behavior between Debug and Release, and why do the log line changes matter? No idea. There could be all sorts of interesting memory/cache management going on in the Internet Explorer code. Your GC calls only affect the wrapper, not the native IE code.
As a test, make a copy of the relevant list data, before walking it, allowing you to be free from any reuse under HtmlDocument.
I still think this usage of the API is potentially wrong and that working from the WebControl down may be the only correct usage. However, I can't find any documentation that describes how the COM interfaces in mshtml should behave (I found several web pages indicating it was an undocumented SDK).
Just a thought, but could your code be negatively affected by a compiler optimization? Do you still get the error if you turn optimizations off?
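One way to test the optimization theory without changing the project settings is to switch optimization off for the suspect method only, using the MethodImpl attribute. The sketch below is an assumption-laden illustration, not the question's code: `ScriptScanner` and `CountEntries` are hypothetical, and the `document` parameter merely stands in for the mshtml document. It also shows GC.KeepAlive, which is relevant because optimized code may treat a local as dead after its last use, allowing the GC to collect it before the method ends; whether that is what happens here is speculation.

```csharp
using System;
using System.Runtime.CompilerServices;

public class ScriptScanner
{
    // NoOptimization asks the JIT not to optimize this one method, even in a Release build;
    // NoInlining keeps it from being folded into its caller.
    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    public int CountEntries(object document, string[] entries)
    {
        int count = 0;
        foreach (string entry in entries)
        {
            // Same filtering idea as the question: skip blanks and the parent link.
            if (!string.IsNullOrEmpty(entry) && entry != "..")
                count++;
        }

        // Keeps 'document' reachable until this line, so the GC cannot
        // finalize it while the loop above is still running.
        GC.KeepAlive(document);
        return count;
    }
}
```

If the Release build starts working with the attribute applied to FindScripts, that points at JIT optimization (or object lifetime under optimization) rather than at the HTTP or parsing logic.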