Grabbing content from a website in C#
New to C# here, but I've used Java for years. I tried googling this and got a couple of answers that were not quite what I need. I'd like to grab the (X)HTML from a website and then use DOM (actually, CSS selectors are preferable, but whatever works) to grab a particular element. How exactly is this done in C#?
To get the HTML you can use the WebClient object.
To parse the HTML you can use the HtmlAgilityPack library.
using System.IO;
using System.Net;
using System.Text;

// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
    WebRequest.Create("http://www.stackoverflow.com");

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

byte[] buf = new byte[8192];            // read buffer
StringBuilder sb = new StringBuilder(); // accumulates the page text
string tempString = null;
int count = 0;

do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);

    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);

        // continue building the string
        sb.Append(tempString);
    }
}
while (count > 0); // any more data to read?
Then use XPath expressions or Regex to grab the element you need.
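For small, predictable snippets, the Regex route can be sketched like this with `System.Text.RegularExpressions` (the sample markup and the choice of the `<title>` element are just illustrations; regex is fragile against messy real-world HTML, which is why a real parser is recommended below):

```csharp
using System;
using System.Text.RegularExpressions;

class RegexGrabExample
{
    static void Main()
    {
        // Sample markup standing in for a downloaded page.
        string html = "<html><head><title>Hello World</title></head><body></body></html>";

        // Grab the text inside the <title> element.
        Match m = Regex.Match(html, @"<title>\s*(.*?)\s*</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (m.Success)
        {
            Console.WriteLine(m.Groups[1].Value); // prints "Hello World"
        }
    }
}
```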
You could use System.Net.WebClient
or System.Net.HttpWebRequest
to fetch the page, but parsing for the elements is not supported by those classes.
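The WebClient route is much shorter than the HttpWebRequest/stream loop. A minimal sketch (the URL is just a placeholder; `DownloadString` blocks until the whole page has arrived):

```csharp
using System;
using System.Net;
using System.Text;

class WebClientExample
{
    static void Main()
    {
        using (WebClient client = new WebClient())
        {
            client.Encoding = Encoding.UTF8; // decode the response as UTF-8
            // DownloadString fetches the entire page as one string.
            string html = client.DownloadString("http://www.example.com/");
            Console.WriteLine(html.Length);
        }
    }
}
```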
Use HtmlAgilityPack (http://html-agility-pack.net/)
HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;
HtmlDocument htmlDocument = htmlWeb.Load(url);
// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
// item matches your requirement
// take the item
}
I hear you want to use the HtmlAgilityPack for working with HTML files. This will give you LINQ access, which is A Good Thing (tm). You can download the file with System.Net.WebClient.
You can use Html Agility Pack to load html and find the element you need.
To get you started, you can fairly easily use HttpWebRequest to get the contents of a URL. From there, you have to do something to parse out the HTML, and that is where it starts to get tricky. You can't use a normal XML parser, because many (most?) web site HTML pages aren't 100% valid XML. Web browsers have specially implemented parsers to work around the invalid portions. In Ruby, I would use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or another parser specifically designed to read HTML.
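For the narrower case where the page really is strict XHTML (valid XML), the standard `System.Xml.Linq` types can select elements with no third-party library at all. A sketch under that assumption; the markup and the `id="content"` attribute are illustrative only:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class XhtmlParseExample
{
    static void Main()
    {
        // Strict XHTML parses as ordinary XML; malformed HTML would throw here.
        string xhtml =
            "<html><body>" +
            "<div id='content'><p>First</p><p>Second</p></div>" +
            "</body></html>";

        XDocument doc = XDocument.Parse(xhtml);

        // Find the div with id="content", then list its <p> children.
        XElement content = doc.Descendants("div")
                              .First(d => (string)d.Attribute("id") == "content");
        foreach (XElement p in content.Elements("p"))
        {
            Console.WriteLine(p.Value); // prints "First" then "Second"
        }
    }
}
```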
Edit:
Since the topic is likely to come up: WebClient vs. HttpWebRequest/HttpWebResponse
Also, thanks to the others that answered for noting HtmlAgility. I didn't know it existed.
Look into using the html agility pack, which is one of the more common libraries for parsing html.
http://htmlagilitypack.codeplex.com/