Read Text From Web Page

Please note: I do not want to read the HTML content of a page; rather, I want to read the text displayed on a web page. Consider the following example:

A PHP script echoes back "Hello User X" onto the current page, so that the user is now looking at a page (mainly blank) with the words "Hello User X" printed in the top left corner. From my C# application, I would like to read that text into a string.

String strPageData = functionToReadPageData("http://www.myURL.com/file.php");

Console.WriteLine(strPageData); // Outputs "Hello User X" to the Console.

In VB6 I was able to do this by using the following API:

  1. InternetOpen
  2. InternetOpenUrl
  3. InternetReadFile
  4. InternetCloseHandle

I attempted to port my VB6 code to C# but I am having no luck - so I would very much appreciate a C# method for completing the above task.


I am not aware of any part of the .NET Framework that lets you automatically extract all the text from an HTML file, and I very much doubt one exists.

You can try the HtmlAgilityPack (third party) for accessing text elements and other nodes in an HTML document.

You will still need to write logic to find the correct HTML element, though. Take an HTML page like this:

<html>
     <body>Some text</body>
</html>

You would then need to locate the body tag with an XPath expression and read its content.

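HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html is a string holding the page source
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body"); // SelectNodes would return a collection, not a single node
string bodyContent = body.InnerText;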

Following that pattern, you can read every element on the page. You may need some post-processing to remove line breaks, comments, and so on.
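As a rough sketch of that post-processing step, the snippet below decodes any HTML entities left in the extracted text and collapses runs of whitespace, using only the base class library. `TextCleanup` and `Normalize` are names made up for this illustration, and it assumes the text has already been pulled out of the node (for example via `InnerText`).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TextCleanup
{
    // Hypothetical helper: tidies text that was extracted from an HTML node.
    public static string Normalize(string innerText)
    {
        // Turn entities such as &amp; back into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(innerText);

        // Replace every run of whitespace (spaces, tabs, line breaks) with one space,
        // then drop leading and trailing whitespace.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Calling `TextCleanup.Normalize(bodyContent)` on the string from the example above would be one way to apply it.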

http://htmlagilitypack.codeplex.com/wikipage?title=Examples


I know this is an older post, but I'm surprised no one has mentioned Microsoft.mshtml, which works rather well for this sort of thing. You'll need to add a reference to Microsoft.mshtml.

[Right-click References in your project in Solution Explorer, then click Add Reference.... Under Assemblies, type 'HTML' in the search box and you'll see Microsoft.mshtml.]

then:

using System.Net;
using mshtml;

using (var client = new WebClient())
{
    var s = client.DownloadString(@"https://stackoverflow.com/questions/7264659/read-text-from-web-page");
    var htmldoc2 = (IHTMLDocument2)new HTMLDocument();
    htmldoc2.write(s);
    var plainText = htmldoc2.body.outerText;
    Console.WriteLine(plainText);
}

This returns the outerText of the page body, which is essentially the text displayed when you visit the page in a web browser. Hope this helps.


You should use the WebClient class to do this.
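A minimal sketch of that approach follows. Since the PHP script in the question echoes plain text with no markup, the response body should already be the text you want. `PageReader` and `ReadPageData` are hypothetical names modelled on the placeholder function in the question, and the UTF-8 encoding is an assumption about how the page is served.

```csharp
using System;
using System.Net;
using System.Text;

public static class PageReader
{
    // Hypothetical helper: downloads the response body of a URL as a string.
    public static string ReadPageData(string url)
    {
        using (var client = new WebClient())
        {
            // Assumption: the page is served as UTF-8; change this if it is not.
            client.Encoding = Encoding.UTF8;
            return client.DownloadString(url);
        }
    }
}
```

Usage would look like `string strPageData = PageReader.ReadPageData("http://www.myURL.com/file.php");`. If the page does contain markup, the returned string will include the tags, so you would still need one of the other approaches to strip them.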


The code below may help. It reads a saved HTML file from disk and keeps whatever sits between the body tags.

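// Requires: using System; using System.Collections.Generic; using System.IO;
string result = "";
try
{
    using (StreamReader sr = new StreamReader(IOParams.ConfigPath + "SUCCESSEMPTMP.HTML"))
    {
        result = sr.ReadToEnd();
        // Turn every body tag variant into the same "<body>" marker so one split finds both ends.
        result = result.Replace("<body/>", "<body>");
        result = result.Replace("</body>", "<body>");
        List<string> body = new List<string>(result.Split(new string[] { "<body>" }, StringSplitOptions.None));
        // Three or more pieces means both an opening and a closing tag were found; piece 1 is the body content.
        if (body.Count > 2)
        {
            result = body[1];
        }
    }
}
catch (Exception)
{
    throw; // rethrow without resetting the stack trace ("throw e;" would reset it)
}

return result;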
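The split above only matches a lowercase `<body>` tag with no attributes. A slightly more tolerant sketch, assuming a regular expression is acceptable for simple pages like the one in the question, could look like this. `BodyExtractor` and `ExtractBody` are hypothetical names; for real-world markup a parser such as HtmlAgilityPack is the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Matches <body ...> ... </body> in any letter case, across line breaks.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the markup between the body tags,
    // or the input unchanged when no body element is found.
    public static string ExtractBody(string html)
    {
        Match match = BodyPattern.Match(html);
        return match.Success ? match.Groups[1].Value : html;
    }
}
```

Falling back to the unchanged input keeps the helper usable for pages that are plain text with no markup at all.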