Reading only HTML Content from a Web site page

I'm using C#, and I'd like to scrape all the content on a site (but not the images, scripts, or files that may be attached to the page). How do I do that with C# and ASP.NET?


Hi, you can use the following code snippet to do that:

using System;
using System.IO;
using System.Net;
using System.Text;

// Request the page. Only the HTML markup is downloaded,
// not the images or script files it references.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.your-url.com");

string html;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream resStream = response.GetResponseStream())
// Decode through a StreamReader so multi-byte characters are never split
// across buffer reads. UTF-8 is assumed here; Encoding.ASCII would turn
// every non-ASCII character into '?'.
using (StreamReader reader = new StreamReader(resStream, Encoding.UTF8))
{
    html = reader.ReadToEnd();
}

Console.WriteLine(html);
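
Note that the request above only downloads the page's markup, so images and external script files are never fetched. Inline <script> and <style> blocks are still part of that markup, though. If you want to drop those too, here is a minimal sketch. HtmlCleaner and StripScriptsAndStyles are hypothetical names made up for this example, and a regular expression is not a robust HTML parser, so treat it as a starting point rather than a complete solution:

```csharp
using System;
using System.Text.RegularExpressions;

static class HtmlCleaner
{
    // Hypothetical helper (not from the original answer): removes inline
    // <script>...</script> and <style>...</style> blocks from an HTML string.
    public static string StripScriptsAndStyles(string html)
    {
        return Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            string.Empty,
            // Singleline lets '.' match newlines inside a block;
            // IgnoreCase handles <SCRIPT> as well as <script>.
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}

class Program
{
    static void Main()
    {
        string html = "<html><head><script>alert(1);</script></head>"
                    + "<body><p>Hello</p></body></html>";

        // Prints: <html><head></head><body><p>Hello</p></body></html>
        Console.WriteLine(HtmlCleaner.StripScriptsAndStyles(html));
    }
}
```

For anything beyond simple pages, a real HTML parser is likely a safer choice than a regular expression.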


If the page is your own ASP.NET page, you can also capture its HTML by overriding the Page's Render method, as follows:

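// Requires: using System.IO; using System.Text; using System.Web.UI;
protected override void Render(System.Web.UI.HtmlTextWriter writer)
{
    StringBuilder sb = new StringBuilder();
    using (StringWriter sw = new StringWriter(sb))
    // Use a distinct name here: redeclaring "writer" would clash with the
    // method parameter and fail to compile.
    using (HtmlTextWriter htmlWriter = new HtmlTextWriter(sw))
    {
        // Render the page into the in-memory writer instead of the response.
        base.Render(htmlWriter);
    }

    string markupText = sb.ToString();
    // markupText now contains the HTML of the page; send it to the real output.
    writer.Write(markupText);
}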