Grabbing content from a website in C#
New to C# here, but I've used Java for years. I tried googling this and got a couple of answers that were not quite what I need. I'd like to grab the (X)HTML from a website and then use DOM (actually, CSS selectors are preferable, but whatever works) to grab a particular element. How exactly is this done in C#?
To get the HTML you can use the WebClient object.
To parse the HTML you can use the HtmlAgilityPack library.
using System.IO;
using System.Net;
using System.Text;

// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
    WebRequest.Create("http://www.stackoverflow.com");

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

byte[] buf = new byte[8192];            // read buffer
StringBuilder sb = new StringBuilder(); // accumulates the page text
string tempString = null;
int count = 0;

do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);

    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);

        // continue building the string
        sb.Append(tempString);
    }
}
while (count > 0); // any more data to read?
Then use XPath expressions or Regex to grab the element you need.
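For small, predictable snippets, the Regex route can be sketched like this with `System.Text.RegularExpressions` (the sample markup and the choice of the `<title>` element are just illustrations; regex is fragile against messy real-world HTML, which is why a real parser is recommended below):

```csharp
using System;
using System.Text.RegularExpressions;

class RegexGrabExample
{
    static void Main()
    {
        // Sample markup standing in for a downloaded page.
        string html = "<html><head><title>Hello World</title></head><body></body></html>";

        // Grab the text inside the <title> element.
        Match m = Regex.Match(html, @"<title>\s*(.*?)\s*</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (m.Success)
        {
            Console.WriteLine(m.Groups[1].Value); // prints "Hello World"
        }
    }
}
```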
You could use System.Net.WebClient
or System.Net.HttpWebRequest
to fetch the page, but parsing for the elements is not supported by those classes.
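The WebClient route is much shorter than the HttpWebRequest/stream loop. A minimal sketch (the URL is just a placeholder; `DownloadString` blocks until the whole page has arrived):

```csharp
using System;
using System.Net;
using System.Text;

class WebClientExample
{
    static void Main()
    {
        using (WebClient client = new WebClient())
        {
            client.Encoding = Encoding.UTF8; // decode the response as UTF-8
            // DownloadString fetches the entire page as one string.
            string html = client.DownloadString("http://www.example.com/");
            Console.WriteLine(html.Length);
        }
    }
}
```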
Use HtmlAgilityPack (http://html-agility-pack.net/)
HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;
HtmlDocument htmlDocument = htmlWeb.Load(url);
// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
// item matches your requirement
// take the item
}
I hear you want to use the HtmlAgilityPack for working with HTML files. This will give you LINQ access, which is A Good Thing (tm). You can download the file with System.Net.WebClient.
You can use Html Agility Pack to load html and find the element you need.
To get you started, you can fairly easily use HttpWebRequest to get the contents of a URL. From there, you have to do something to parse out the HTML, and that is where it starts to get tricky. You can't use a normal XML parser, because many (most?) web site HTML pages aren't 100% valid XML. Web browsers have specially implemented parsers to work around the invalid portions. In Ruby, I would use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or another parser specifically designed to read HTML.
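For the narrower case where the page really is strict XHTML (valid XML), the standard `System.Xml.Linq` types can select elements with no third-party library at all. A sketch under that assumption; the markup and the `id="content"` attribute are illustrative only:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class XhtmlParseExample
{
    static void Main()
    {
        // Strict XHTML parses as ordinary XML; malformed HTML would throw here.
        string xhtml =
            "<html><body>" +
            "<div id='content'><p>First</p><p>Second</p></div>" +
            "</body></html>";

        XDocument doc = XDocument.Parse(xhtml);

        // Find the div with id="content", then list its <p> children.
        XElement content = doc.Descendants("div")
                              .First(d => (string)d.Attribute("id") == "content");
        foreach (XElement p in content.Elements("p"))
        {
            Console.WriteLine(p.Value); // prints "First" then "Second"
        }
    }
}
```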
Edit:
Since the topic is likely to come up: WebClient vs. HttpWebRequest/HttpWebResponse
Also, thanks to the others that answered for noting HtmlAgility. I didn't know it existed.
Look into using the html agility pack, which is one of the more common libraries for parsing html.
http://htmlagilitypack.codeplex.com/