Optimizing download of multiple web pages (C#)
I am developing an app where I need to download a bunch of web pages, preferably as fast as possible. The way I do it right now is that I have multiple threads (hundreds) that each have their own System.Net.HttpWebRequest. This sort of works, but I am not getting the performance I would like. I currently have a beefy 600+ Mbit/s connection to work with, and it is utilized at most 10% (at peaks). I guess my strategy is flawed, but I am unable to find any other good way of doing this.
Also: if the use of HttpWebRequest is not a good way to download web pages, please say so :)
The code has been semi-auto-converted from Java.
Thanks :)
Update:
public String getPage(String link)
{
    // myURL, myHttpConn and myStreamReader are class fields
    myURL = new System.Uri(link);
    myHttpConn = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(myURL);
    myStreamReader = new System.IO.StreamReader(
        new System.IO.StreamReader(myHttpConn.GetResponse().GetResponseStream(),
            System.Text.Encoding.Default).BaseStream,
        new System.IO.StreamReader(myHttpConn.GetResponse().GetResponseStream(),
            System.Text.Encoding.Default).CurrentEncoding);

    System.Text.StringBuilder buffer = new System.Text.StringBuilder();
    //myLineBuff is a String
    while ((myLineBuff = myStreamReader.ReadLine()) != null)
    {
        buffer.Append(myLineBuff);
    }
    return buffer.ToString();
}
One problem is that it appears you're issuing each request twice:
myStreamReader = new System.IO.StreamReader(
    new System.IO.StreamReader(
        myHttpConn.GetResponse().GetResponseStream(),
        System.Text.Encoding.Default).BaseStream,
    new System.IO.StreamReader(
        myHttpConn.GetResponse().GetResponseStream(),
        System.Text.Encoding.Default).CurrentEncoding);
It makes two calls to GetResponse. For reasons I fail to understand, you're also creating two stream readers. You can split that up and simplify it, and also do a better job of error handling:
var response = (HttpWebResponse)myHttpConn.GetResponse();
myStreamReader = new StreamReader(response.GetResponseStream(), Encoding.Default);
That should double your effective throughput.
Also, you probably want to make sure to dispose of the objects you're using. When you're downloading a lot of pages, you can quickly run out of resources if you don't clean up after yourself. In this case, you should call response.Close(). See http://msdn.microsoft.com/en-us/library/system.net.httpwebresponse.close.aspx
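Putting both points together, a minimal sketch of the corrected method (assuming Encoding.Default is acceptable; note that ReadToEnd preserves line breaks, unlike the original ReadLine loop):

public string GetPage(string link)
{
    var request = (HttpWebRequest)WebRequest.Create(new Uri(link));
    // One GetResponse call, one reader; the using-blocks dispose both
    // objects, which also closes the response.
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream(), Encoding.Default))
    {
        return reader.ReadToEnd();
    }
}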
I am adding this answer as another possibility that people may encounter when:
- downloading from multiple servers using multi-threaded apps
- using Windows XP or Vista as the operating system
The tcpip.sys driver in these operating systems limits the number of concurrent half-open (outbound, not yet established) TCP connections to 10. This is effectively a rate limit rather than a connection limit: you can have hundreds of open connections, but you cannot initiate new ones faster than roughly 10 per second. Microsoft imposed the limit to curtail the spread of certain types of viruses and worms; whether such methods are effective is outside the scope of this answer.
In a multi-threaded application that downloads from a multitude of servers, this limitation can manifest itself as a series of timeouts. Once the limit is reached, Windows puts all additional "half-open" (newly opened but not yet established) connections into a queue. In my application, for example, I had 20 threads ready to process connections, but I found that sometimes I would get timeouts from servers I knew were operating and reachable.
To verify that this is happening, check the operating system's event log, under System. The error is:
EventID 4226: TCP/IP has reached the security limit imposed on the number of concurrent TCP connect attempts.
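Purely as an illustration (not part of the original workflow), you can also scan for these entries programmatically with System.Diagnostics.EventLog; this sketch assumes the account running it is allowed to read the System log:

using System;
using System.Diagnostics;

class Event4226Check
{
    static void Main()
    {
        using (EventLog systemLog = new EventLog("System"))
        {
            foreach (EventLogEntry entry in systemLog.Entries)
            {
                // 4226 is written by the "Tcpip" source when the limit is hit
                if (entry.InstanceId == 4226)
                    Console.WriteLine("{0}: {1}", entry.TimeGenerated, entry.Source);
            }
        }
    }
}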
There are many references to this error, and plenty of patches and fixes to apply to remove the limit. However, because this problem is frequently encountered by P2P (torrent) users, a great deal of malware is disguised as this patch.
I have a requirement to collect data from over 1200 servers (which are actually data sensors) at 5-minute intervals. I initially developed the application (on WinXP) to reuse 20 threads repeatedly to crawl the list of servers and aggregate the data into a SQL database. Because the connections were initiated by a timer tick event, this error happened often: at the moment of invocation none of the connections are established yet, so 10 are immediately queued.
Note that this isn't necessarily a problem, because as connections are established, those queued are then processed. However, if the non-queued connections are slow to establish, that time can eat into the timeout budget of the queued connections (in my experience). Looking at my application log file, I would see a batch of connections that timed out, followed by a majority of connections that succeeded. Opening a web browser to test "timed out" connections was confusing, because the servers were available and quick to respond.
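The fix I ultimately used is described below. Purely as a hypothetical code-side mitigation (not what this answer actually did), you could also pace connection starts so the limit is never hit, for example:

static readonly object gate = new object();
static DateTime lastStart = DateTime.MinValue;

// Call just before HttpWebRequest.GetResponse(): spaces connection
// attempts ~100 ms apart, i.e. at most about 10 new connections per second.
static void WaitForConnectSlot()
{
    lock (gate)
    {
        TimeSpan minGap = TimeSpan.FromMilliseconds(100);
        TimeSpan since = DateTime.UtcNow - lastStart;
        if (since < minGap)
            System.Threading.Thread.Sleep(minGap - since);
        lastStart = DateTime.UtcNow;
    }
}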
I decided to try hex-editing the tcpip.sys file, as suggested in a guide at speedguide.net. The checksum of my file differed from the guide's (I had SP3, not SP2), and the comments in the guide weren't particularly helpful. However, I did find a patch that worked for SP3 and noticed an immediate difference after applying it.
From what I can find, Windows 7 does not have this limitation, and since moving the application to a Windows 7 machine, the timeout problem has not recurred.
I do this very same thing, but with thousands of sensors that provide XML and text content. Performance is affected not only by your own bandwidth and hardware, but also by the bandwidth and response time of each server you contact, the timeout delays, the size of each download, and the reliability of the remote internet connections.
As comments indicate, hundreds of threads is not necessarily a good idea. Currently I've found that running between 20 and 50 threads at a time seems optimal. In my technique, as each thread completes a download, it is given the next item from a queue.
I run a custom ThreaderEngine Class on a separate thread that is responsible for maintaining the queue of work items and assigning threads as needed. Essentially it is a while loop that iterates through an array of threads. As the threads finish, it grabs the next item from the queue and starts the thread again.
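My ThreaderEngine class is custom, but the pattern is roughly equivalent to the following sketch using BlockingCollection from .NET 4.0 (the class name and the Download placeholder are made up):

using System;
using System.Collections.Concurrent;
using System.Threading;

class DownloadPool
{
    private readonly BlockingCollection<string> queue = new BlockingCollection<string>();

    public void Run(string[] urls, int workerCount)
    {
        foreach (string url in urls)
            queue.Add(url);
        queue.CompleteAdding();                      // signal: no more work will arrive

        Thread[] workers = new Thread[workerCount];  // e.g. 20-50 threads
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(() =>
            {
                // Blocks until an item is available; the loop ends once
                // the queue is drained and marked complete.
                foreach (string url in queue.GetConsumingEnumerable())
                    Download(url);
            });
            workers[i].Start();
        }
        foreach (Thread t in workers)
            t.Join();                                // wait for all downloads to finish
    }

    private void Download(string url)
    {
        // placeholder: fetch and store the page, e.g. via FileDownload below
    }
}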
Each of my threads is actually downloading several separate items, but the method call is the same (.NET 4.0):
public static string FileDownload(string _ip, int _port, string _file,
    int Timeout, int ReadWriteTimeout, NetworkCredential _cred = null)
{
    string uri = String.Format("http://{0}:{1}/{2}", _ip, _port, _file);
    string Data = String.Empty;
    try
    {
        HttpWebRequest Request = (HttpWebRequest)WebRequest.Create(uri);
        if (_cred != null) Request.Credentials = _cred;
        Request.Timeout = Timeout;                   // applies to .GetResponse()
        Request.ReadWriteTimeout = ReadWriteTimeout; // applies to .GetResponseStream()
        Request.Proxy = null;                        // skip automatic proxy detection
        Request.CachePolicy = new System.Net.Cache.RequestCachePolicy(
            System.Net.Cache.RequestCacheLevel.NoCacheNoStore);
        using (HttpWebResponse Response = (HttpWebResponse)Request.GetResponse())
        using (Stream dataStream = Response.GetResponseStream())
        {
            if (dataStream != null)
            {
                using (BufferedStream buffer = new BufferedStream(dataStream))
                using (StreamReader reader = new StreamReader(buffer))
                {
                    Data = reader.ReadToEnd();
                }
            }
        }
    }
    catch (AccessViolationException ave)
    {
        // ...
    }
    catch (Exception exc)
    {
        // ...
    }
    return Data; // empty string if the download failed
}
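A worker thread then calls it once per work item; for example (the address, port, file name, and timeout values here are made up):

string xml = FileDownload("192.168.1.50", 8080, "data/current.xml",
                          5000, 10000);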
Using this I am able to download about 60KB each from 1200+ remote machines (72MB) in less than 5 minutes. The machine is a Core 2 Quad with 2GB RAM and utilizes four bonded T1 connections (~6Mbps).
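For scale: 72MB in just under 300 seconds works out to roughly 240KB/s, or about 2Mbps, i.e. around a third of the bonded ~6Mbps link.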