Getting displayed text only from HTML

Is there a simple way, using C#, to open an arbitrary URL, read in the text, and reduce it to what would be displayed in a web page? I suppose I could get the <body> content and iterate over it character by character, ripping out anything between < and > (inclusive). I looked briefly at HTML Agility Pack, and that may be a solution, but it seemed very heavy for what I am trying to do.

Again, all I want is a string of text that represents the text that would be displayed on screen for an arbitrary URL.


I would still opt for the HTML Agility Pack. It is a bit more work at the beginning, but it is more flexible and the better design in the end, since it offers a lot more, e.g. XPath-style queries.
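As a rough illustration of that route, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is installed. The class and method names (AgilityPackSketch, GetVisibleText) are made up for this example. It parses the markup, removes script, style, and comment nodes, and returns the remaining text with entities decoded and whitespace collapsed.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class AgilityPackSketch
{
    // Hypothetical helper: reduces an HTML string to roughly the text a browser would show.
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect nodes whose content is never rendered as page text.
        // ToList() takes a snapshot so nodes can be removed safely afterwards.
        var hidden = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToList();
        foreach (var node in hidden)
        {
            node.Remove();
        }

        // InnerText concatenates the remaining text nodes; DeEntitize decodes &amp; and friends.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

        // Collapse runs of whitespace left behind by the markup layout.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

This does not account for elements hidden via CSS or content generated by JavaScript, so treat the result as an approximation of what is displayed.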


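If you just need the page's HTML source as a string, this will fetch it. Note that it returns the raw markup, tags included, so you still have to strip those to get the displayed text: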

using System.IO;
using System.Net;
using System.Text;
...

public string GetSiteStringContents(string url)
{
    HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);

    // Dispose the response and its stream when done so the connection is released.
    using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
    using (Stream resStream = response.GetResponseStream())
    // UTF-8 is assumed here; decoding as ASCII would mangle any non-ASCII characters.
    using (StreamReader reader = new StreamReader(resStream, Encoding.UTF8))
    {
        return reader.ReadToEnd();
    }
}
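
To get from that markup to something closer to the displayed text without any extra library, the tag-stripping idea from the question can be sketched with regular expressions. This is a crude approximation, not a real HTML parser: the class and method names (HtmlTextSketch, StripTags) are made up for this example, and malformed markup or a stray > inside an attribute value can trip it up.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper (not part of the answer above): crude regex-based tag stripping.
    public static string StripTags(string html)
    {
        // Drop script and style elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Drop every remaining tag, i.e. anything between < and > inclusive.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode entities such as &amp; and &lt;.
        text = WebUtility.HtmlDecode(text);

        // Collapse the whitespace runs left behind by the removed markup.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Calling HtmlTextSketch.StripTags(GetSiteStringContents(url)) would then give one string for the page. For example, passing "<p>Hello &amp; <b>world</b></p>" returns "Hello & world".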