Read Text From Web Page
Please note: I do not want to read the HTML content of a page; rather, I am looking to read the text from a web page. Imagine the following example, if you will:
A PHP script echoes back "Hello User X" onto the current page, so that the user is now looking at a page (mainly blank) with the words "Hello User X" printed in the top left corner. From my C# application, I would like to read that text into a string.
String strPageData = functionToReadPageData("http://www.myURL.com/file.php");
Console.WriteLine(strPageData); // Outputs "Hello User X" to the Console.
In VB6 I was able to do this by using the following API:
- InternetOpen
- InternetOpenURL
- InternetReadFile
- InternetCloseHandle
I attempted to port my VB6 code to C# but I am having no luck - so I would very much appreciate a C# method for completing the above task.
I am not aware of any part of the .NET Framework that lets you automagically extract all the text from an HTML file. I very much doubt it exists.
You can try the HtmlAgilityPack (third party) for accessing text elements etc. in an HTML document.
You will still need to write logic to find the correct HTML element though. Take an HTML page like this:
<html>
<body>Some text</body>
</html>
You would then need to locate the body tag with an XPath expression and read its content.
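// Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
HtmlDocument doc = new HtmlWeb().Load("http://www.myURL.com/file.php");
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body"); // null if the page has no body tag
string bodyContent = body.InnerText;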
Following that pattern you can read every element on the page. You might need to do some post-processing to remove line breaks, comments, etc.
http://htmlagilitypack.codeplex.com/wikipage?title=Examples
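The post-processing mentioned above could look something like the following. This is only a minimal sketch: TextCleanup and Normalize are made-up names, it uses nothing but the standard library, and it only handles encoded entities and stray whitespace (it does not strip comments).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TextCleanup
{
    // Hypothetical helper (not part of HtmlAgilityPack): tidies a string
    // such as the one read from InnerText.
    public static string Normalize(string innerText)
    {
        // Decode entities such as &amp; that may still be present in the text.
        string decoded = WebUtility.HtmlDecode(innerText);

        // Collapse every run of whitespace, line breaks included, into one space.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

You would then call it on the value read earlier, for example string clean = TextCleanup.Normalize(bodyContent);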
I know this is an older post, but I'm surprised no one has mentioned using microsoft.mshtml, which works rather well for this sort of thing. You'll need to add a reference to microsoft.mshtml.
[Right-click on References in your project in Solution Explorer. Then click Add Reference... In Assemblies, type 'HTML' in the search box and you'll see Microsoft.mshtml.]
then:
using System.Net;
using mshtml;

using (var client = new WebClient())
{
    // Download the raw HTML of the page.
    var s = client.DownloadString(@"https://stackoverflow.com/questions/7264659/read-text-from-web-page");
    // Load the HTML into an in-memory document and read the rendered text.
    var htmldoc2 = (IHTMLDocument2)new HTMLDocument();
    htmldoc2.write(s);
    var plainText = htmldoc2.body.outerText;
    Console.WriteLine(plainText);
}
This returns the "outerText" of the web page, which is basically the text displayed when you visit it with a web browser. Hope this helps.
You should use the WebClient class to download the page.
The code below, which reads the HTML from a local file, may help you pull out the body content.
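// Requires: using System; using System.Collections.Generic; using System.IO;
// IOParams.ConfigPath comes from the answerer's own project and is not defined here.
string result = "";
try
{
    using (StreamReader sr = new StreamReader(IOParams.ConfigPath + "SUCCESSEMPTMP.HTML"))
    {
        result = sr.ReadToEnd();
        // Turn every body tag variant into "<body>" so the markup can be split on one delimiter.
        result = result.Replace("<body/>", "<body>");
        result = result.Replace("</body>", "<body>");
        List<string> body = new List<string>(result.Split(new string[] { "<body>" }, StringSplitOptions.None));
        // An opening and a closing tag give three parts; the middle one is the body content.
        if (body.Count > 2)
        {
            result = body[1];
        }
    }
}
catch (Exception)
{
    // Rethrow without resetting the stack trace.
    throw;
}
return result;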
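For the case in the question, where the PHP script echoes only plain text, the WebClient suggestion might be all that is needed. Here is a rough sketch, assuming the response body is the text itself; PageReader and ReadPageData are made-up names standing in for the question's functionToReadPageData.

```csharp
using System;
using System.Net;

static class PageReader
{
    // Hypothetical stand-in for functionToReadPageData from the question.
    // Returns the response body exactly as the server sent it, so any HTML
    // markup in the response is returned too.
    public static string ReadPageData(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }
}
```

Usage would be along the lines of string strPageData = PageReader.ReadPageData("http://www.myURL.com/file.php"); and if the page does contain markup, the body-splitting approach above could be applied to the returned string instead of a file.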