Grabbing content from a website in C#

New to C# here, but I've used Java for years. I tried googling this and got a couple of answers that were not quite what I need. I'd like to grab the (X)HTML from a website and then use DOM (actually, CSS selectors are preferable, but whatever works) to grab a particular element. How exactly is this done in C#?


To get the HTML you can use the WebClient object.

To parse the HTML you can use the Html Agility Pack library.


    // requires: using System.IO; using System.Net; using System.Text;

    // prepare the web page we will be asking for
    HttpWebRequest request = (HttpWebRequest)
        WebRequest.Create("http://www.stackoverflow.com");

    // execute the request and make sure everything gets disposed
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream resStream = response.GetResponseStream())
    {
        byte[] buf = new byte[8192];
        StringBuilder sb = new StringBuilder();
        int count;
        do
        {
            // fill the buffer with data
            count = resStream.Read(buf, 0, buf.Length);

            // make sure we read some data
            if (count != 0)
            {
                // translate from bytes to text and keep building the string
                // (ASCII here; use the encoding the site actually serves)
                sb.Append(Encoding.ASCII.GetString(buf, 0, count));
            }
        }
        while (count > 0); // any more data to read?

        string html = sb.ToString();
    }

Then use XPath expressions or a regex to grab the element you need.
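For the selection step, here is a minimal sketch using Html Agility Pack's XPath support; the inline HTML and the `//h1` expression are just placeholders for whatever you fetched and whatever element you need:

    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main()
        {
            // stand-in for the string built up by the download loop above
            string html = "<html><body><h1>Hello</h1></body></html>";

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // XPath: grab the first <h1>; adjust to the element you need
            HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h1");
            if (heading != null)
                Console.WriteLine(heading.InnerText); // prints "Hello"
        }
    }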


You could use System.Net.WebClient or System.Net.HttpWebRequest to fetch the page, but neither class supports parsing out the elements.

Use HtmlAgilityPack (http://html-agility-pack.net/)

HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;

HtmlDocument htmlDocument = htmlWeb.Load(url);

// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
    // check whether item matches your requirements,
    // then take the item
}


I hear you want to use the HtmlAgilityPack for working with HTML files. This will give you Linq access, which is A Good Thing (tm). You can download the file with System.Net.WebClient.
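A minimal sketch of that combination, assuming the HtmlAgilityPack package is referenced; the URL and the "collect all link targets" query are just examples:

    using System;
    using System.Linq;
    using System.Net;
    using HtmlAgilityPack;

    class Program
    {
        static void Main()
        {
            // download the page with WebClient
            string html;
            using (var client = new WebClient())
            {
                html = client.DownloadString("http://www.stackoverflow.com");
            }

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Linq query over the parsed nodes: all links with an href attribute
            var links = doc.DocumentNode
                           .Descendants("a")
                           .Where(a => a.Attributes["href"] != null)
                           .Select(a => a.Attributes["href"].Value);

            foreach (string href in links)
                Console.WriteLine(href);
        }
    }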


You can use Html Agility Pack to load html and find the element you need.
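Since the question prefers CSS selectors over XPath: the Fizzler add-on extends Html Agility Pack with a QuerySelector-style API. A sketch, with a made-up document and selector:

    using System;
    using HtmlAgilityPack;
    using Fizzler.Systems.HtmlAgilityPack; // separate add-on package

    class Program
    {
        static void Main()
        {
            var doc = new HtmlDocument();
            doc.LoadHtml("<div id='main'><p class='intro'>Hi</p></div>");

            // CSS selector instead of XPath
            HtmlNode intro = doc.DocumentNode.QuerySelector("div#main p.intro");
            if (intro != null)
                Console.WriteLine(intro.InnerText);
        }
    }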


To get you started, you can fairly easily use HttpWebRequest to get the contents of a URL. From there, you will have to do something to parse out the HTML. That is where it starts to get tricky. You can't use a normal XML parser, because many (most?) web site HTML pages aren't 100% valid XML. Web browsers have specially implemented parsers to work around the invalid portions. In Ruby, I would use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or another parser specifically designed to read HTML.
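To illustrate why a tolerant parser matters, here is a sketch using HtmlAgilityPack (one .NET library that fills the Nokogiri-like role); the sloppy markup is deliberately not well-formed XML, so a strict XML parser would reject it, while an HTML parser repairs it:

    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main()
        {
            // the <b> tag is never closed: invalid XML, but common in real pages
            string sloppy = "<html><body><b>bold text</body></html>";

            var doc = new HtmlDocument();
            doc.LoadHtml(sloppy); // does not throw; the tree is fixed up

            HtmlNode bold = doc.DocumentNode.SelectSingleNode("//b");
            if (bold != null)
                Console.WriteLine(bold.InnerText);
        }
    }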


Edit:

Since the topic is likely to come up: WebClient vs. HttpWebRequest/HttpWebResponse

Also, thanks to the others that answered for noting HtmlAgility. I didn't know it existed.


Look into using the Html Agility Pack, which is one of the more common libraries for parsing HTML.

http://htmlagilitypack.codeplex.com/
