HTTP Protocol violation when downloading webpage using HtmlAgilityPack

I'm trying to parse download pages from www.mediafire.com, but I very often get a System.Net.WebException with the following message when I try to load a page into an HtmlDocument:

The server committed a protocol violation. Section=ResponseStatusLine

This is my code:

HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();

HtmlAgilityPack.HtmlDocument doc = null;

string url = "http://www.mediafire.com/?abcdefghijkl"; // There are many different links

try
{
    doc = web.Load(url); //From 30 links, usually only 10 load properly
}

catch (WebException)
{

}

Any ideas why only 10 of 30 links work (the links change every time, because my program is a "search engine"), and how I can resolve the problem?

When I load those pages in my browser, everything works fine.


I've tried adding the following lines to my app.config, but that doesn't help either:

<system.net>
    <settings>
        <httpWebRequest useUnsafeHeaderParsing="true" />
    </settings>
</system.net>


This is not related to the Html Agility Pack directly, but rather to the underlying HTTP/socket layer. This error means the server is not sending back a correct HTTP status line.

The status line is defined in the HTTP RFC (RFC 2616, section 6), available here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html

I quote:

The first line of a Response message is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase, with each element separated by SP characters. No CR or LF is allowed except in the final CRLF sequence.

   Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF

You can add socket traces with full hex report to check this:

<configuration>
    <system.diagnostics>
        <sources>
            <source name="System.Net.Sockets" tracemode="includehex">
                <listeners>
                    <add name="System.Net.Sockets" type="System.Diagnostics.TextWriterTraceListener" initializeData="SocketTrace.log" />
                </listeners>
            </source>
        </sources>
        <switches>
            <add name="System.Net.Sockets" value="Verbose"/>
        </switches>
        <trace autoflush="true" />
    </system.diagnostics>
</configuration>

This will create a SocketTrace.log file in the current executing directory. Have a look in there, the protocol violation should be visible. You can post it here if it's not too big :-)
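
Once you have the first response line out of the trace, you can compare it against the grammar quoted above. Here is a minimal sketch of such a check; the class name StatusLineCheck is made up for illustration, and the pattern is a strict reading of the RFC grammar (it expects the line without its trailing CRLF), so it may reject lines that lenient clients such as browsers still accept:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET or the Html Agility Pack.
public static class StatusLineCheck
{
    // Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase
    // HTTP-Version = "HTTP/" digits "." digits, Status-Code = exactly three digits,
    // Reason-Phrase = any text without CR or LF (may be empty).
    private static readonly Regex Pattern =
        new Regex(@"^HTTP/[0-9]+\.[0-9]+ [0-9]{3} [^\r\n]*\z");

    // Pass the first line of the response WITHOUT its terminating CRLF.
    public static bool IsWellFormed(string line)
    {
        return line != null && Pattern.IsMatch(line);
    }
}
```

For example, StatusLineCheck.IsWellFormed("HTTP/1.1 200 OK") returns true, while a line with a doubled space or an embedded line break returns false.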

Unfortunately, if you don't own the server, there is not much you can do beyond failing gracefully in these cases (you already added the useUnsafeHeaderParsing setting, which is good).
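
One way to "fail gracefully" is to catch only the protocol-violation case instead of swallowing every WebException. The following is a rough sketch, not a tested fix: GracefulLoad and TryLoad are hypothetical names, and the retry count is an assumption (retrying may or may not help with this particular server). It requires C# 6 or later for the exception filter.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of .NET or the Html Agility Pack.
public static class GracefulLoad
{
    // Calls 'load' up to maxAttempts times.
    // Returns true and sets 'result' as soon as one call succeeds.
    // Returns false if every attempt failed with a protocol violation.
    // Any other exception (timeouts, DNS failures, ...) is not caught here.
    public static bool TryLoad<T>(Func<T> load, int maxAttempts, out T result)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                result = load();
                return true;
            }
            catch (WebException ex) when (ex.Status == WebExceptionStatus.ServerProtocolViolation)
            {
                // The server sent a malformed response; fall through and try again.
            }
        }

        result = default(T);
        return false;
    }
}
```

With the code from the question, this could be called as GracefulLoad.TryLoad(() => web.Load(url), 3, out doc), skipping the link when it returns false.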


Setting the KeepAlive property to false fixed this issue for me. I am not sure whether HtmlAgilityPack exposes this property, so using WebClient is a workable alternative.

Instead of loading the URL directly with web.Load, download the HTML of the desired URL with a custom WebClient. In that custom WebClient, override the GetWebRequest method to set HttpWebRequest.KeepAlive = false. Then load the downloaded file with web.Load().

CustomWebClient client = new CustomWebClient(); // the class defined below
client.DownloadFile(searchURL, @"C:\index.html");
var doc = web.Load(@"C:\index.html");

Overriding GetWebRequest

using System;
using System.Net;

namespace MyProject
{
    internal class CustomWebClient : WebClient
    {
        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest request = base.GetWebRequest(address);
            if (request is HttpWebRequest)
            {
                (request as HttpWebRequest).KeepAlive = false;
            }
            return request;
        }
    }
}