Reading only HTML Content from a Web site page
I'm using C#, and I'd like to scrape all the content on a site (but not the images, scripts, or files that may be attached to the page). How do I do that with C# and ASP.NET?
Hi, you can use the following code snippet to do that:
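// Requires: using System; using System.IO; using System.Net; using System.Text;

// Request the page. Only the HTML document itself is downloaded,
// not the images, scripts, or other files it links to.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.your-url.com");

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream resStream = response.GetResponseStream())
// UTF-8 is assumed here; Encoding.ASCII would turn non-ASCII characters into '?'.
using (StreamReader reader = new StreamReader(resStream, Encoding.UTF8))
{
    // StreamReader decodes across buffer boundaries, so multi-byte
    // characters are not split the way chunk-by-chunk decoding can split them.
    string html = reader.ReadToEnd();
    Console.WriteLine(html);
}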
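The snippet above gives you the raw markup, which still contains inline script and style blocks. If what you want is only the readable text, one rough way is to strip those out of the downloaded HTML string afterwards. This is just a sketch: HtmlTextExtractor and StripToText are names I made up for illustration, and regular expressions are not a real HTML parser, so for messy pages a proper parsing library will probably be more reliable.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class HtmlTextExtractor
{
    public static string StripToText(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string noScripts = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop HTML comments.
        string noComments = Regex.Replace(
            noScripts, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Replace every remaining tag with a space.
        string noTags = Regex.Replace(noComments, @"<[^>]+>", " ");

        // Decode entities such as &amp; and collapse runs of whitespace.
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html =
            "<html><head><script>var x = 1;</script></head>" +
            "<body><h1>Title</h1><p>Fish &amp; chips</p></body></html>";

        // prints "Title Fish & chips"
        Console.WriteLine(StripToText(html));
    }
}
```

You would call StripToText on the string you read from the response instead of printing it directly.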
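You can also get the HTML in the Render method of the Page, as follows. Note that this captures the markup of your own ASP.NET page as it is rendered, not the HTML of a remote site.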
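protected override void Render(System.Web.UI.HtmlTextWriter writer)
{
    // Render the page into an in-memory writer instead of the response.
    StringBuilder sb = new StringBuilder();
    StringWriter sw = new StringWriter(sb);
    // Use a distinct name: redeclaring "writer" would clash with the parameter.
    HtmlTextWriter htmlWriter = new HtmlTextWriter(sw);
    base.Render(htmlWriter);
    string markupText = sb.ToString();
    // markupText now contains the HTML of the Page
    // Pass the captured markup on to the real response.
    writer.Write(markupText);
}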