How to scrape text from an HTML page using C#?
I have a web page that, when navigated to, returns only a simple text value, such as the number 100. I need to grab that value from the page so I can use it in my application. The application is a simple Windows Forms app with a WebBrowser control on it.
I have tried numerous things, but it's not grabbing the text, as if it doesn't exist. Yet if I right-click and view source, it's there.
This can't be that difficult... It's just some text.
Just to clarify: the document contains NO HTML, only a number. When I use WebClient or WebRequest, it doesn't return the value.
private void RegisterWindow_Load(object sender, EventArgs e)
{
    // Subscribe before navigating so the completion event is not missed.
    webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
    webBrowser1.Navigate("MYURL");
}
void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Check and see if we have navigated to the final page.
    string registeredUrl = "MYURL";
    string currentPage = webBrowser1.Url.ToString();
    string response = string.Empty;
    if (currentPage == registeredUrl)
    {
        // Now parse the authkey from the URL
        response = GetWebRequest(currentPage);
        MessageBox.Show(response);
    }
}
/// <summary>
/// Send a Web Request and get a Web Response back.
/// This response can be a valid URL, a simple text response, or
/// an HTML response.
/// </summary>
/// <param name="url">The URL to request.</param>
/// <returns>The response body as a string.</returns>
public string GetWebRequest(string url)
{
    // WebClient is IDisposable, so release it once the download finishes.
    using (var client = new WebClient())
    {
        return client.DownloadString(url);
    }
}
If the document contains only a number and no HTML, this should work:
public string GetWebRequest()
{
    return webBrowser1.Document.Body.InnerText;
}
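To show how that fits into the form from the question, here is a minimal sketch, not a definitive implementation. It assumes the control has finished loading by the time DocumentCompleted fires, keeps the question's "MYURL" placeholder, and introduces a hypothetical helper, TryReadNumber, that is not part of the original code.

```csharp
using System;
using System.Globalization;
using System.Windows.Forms;

public class RegisterWindow : Form
{
    private readonly WebBrowser webBrowser1 = new WebBrowser { Dock = DockStyle.Fill };

    public RegisterWindow()
    {
        Controls.Add(webBrowser1);
        // Subscribe first, then navigate once the form has loaded.
        webBrowser1.DocumentCompleted += OnDocumentCompleted;
        Load += (s, e) => webBrowser1.Navigate("MYURL"); // placeholder URL from the question
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Document or Body can be null if nothing has been rendered yet.
        HtmlDocument doc = webBrowser1.Document;
        if (doc == null || doc.Body == null) return;

        int value;
        if (TryReadNumber(doc.Body.InnerText, out value))
            MessageBox.Show("Value: " + value);
        else
            MessageBox.Show("Unexpected content: " + doc.Body.InnerText);
    }

    // Hypothetical helper: trims whitespace and line breaks before parsing.
    public static bool TryReadNumber(string bodyText, out int value)
    {
        string trimmed = (bodyText ?? string.Empty).Trim();
        return int.TryParse(trimmed, NumberStyles.Integer, CultureInfo.InvariantCulture, out value);
    }
}
```

Reading the text from the control reuses whatever session the embedded browser already has, so no second request is made.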
You should be able to do something as straightforward as:
var client = new WebClient();
var content = client.DownloadString("<YOUR URL>");
var number = Int32.Parse(content);
MSDN documentation for DownloadString(string).
I wrote a blog post on Web scraping in .NET several years ago. You could try the techniques there. Hopefully they're not obsolete.
For example:
// url, _UserAgent, and cookies are assumed to be defined elsewhere.
string html;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = _UserAgent;
request.CookieContainer = cookies; // optional
using (WebResponse response = request.GetResponse())
{
    using (Stream responseStream = response.GetResponseStream())
    {
        using (StreamReader reader = new StreamReader(responseStream))
        {
            html = reader.ReadToEnd();
        }
    }
}
Remember that your browser is sending a User-Agent header, may be sending cookies, may be going through a configured proxy server, etc. Particularly for secured or intranet sites, a simple WebClient call may be insufficient. You may need to do some checking with Fiddler as @SLaks suggested.
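If you would rather stay with WebClient, a rough sketch along these lines may help it look more like a browser request. The class name ScrapingClient and the user-agent value you pass in are hypothetical, and sending default credentials is only an assumption that makes sense for a trusted intranet site. WebClient has no built-in cookie support, so cookies still need the HttpWebRequest approach.

```csharp
using System;
using System.Net;

public static class ScrapingClient
{
    // Hypothetical helper: builds a WebClient with browser-like settings.
    public static WebClient Create(string userAgent)
    {
        var client = new WebClient();
        // Some servers return nothing useful when no User-Agent is sent.
        client.Headers[HttpRequestHeader.UserAgent] = userAgent;
        // Use the proxy configured for the system, as the browser usually does.
        client.Proxy = WebRequest.GetSystemWebProxy();
        // Assumption: the site is a trusted intranet site using Windows authentication.
        client.UseDefaultCredentials = true;
        return client;
    }

    // Downloads the page and trims surrounding whitespace from the body.
    public static string Download(string url, string userAgent)
    {
        using (WebClient client = Create(userAgent))
        {
            return client.DownloadString(url).Trim();
        }
    }
}
```

If the result still differs from what the browser shows, comparing the two requests in Fiddler should reveal which header or setting is missing.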
You can load the page's HTML or text content into a string, then use a string function to extract the number.
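As a small sketch of that extraction step, assuming the content has already been downloaded into a string: the class and method names below are hypothetical, and the pattern simply takes the first run of ASCII digits, optionally preceded by a minus sign.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Returns true and sets value when the content contains an integer.
    public static bool TryExtractNumber(string content, out int value)
    {
        value = 0;
        if (string.IsNullOrWhiteSpace(content)) return false;

        // First run of digits, with an optional leading minus sign.
        Match match = Regex.Match(content, @"-?[0-9]+");
        return match.Success
            && int.TryParse(match.Value, NumberStyles.Integer, CultureInfo.InvariantCulture, out value);
    }
}
```

Using TryParse rather than Int32.Parse means an empty or unexpected response is reported as a failure instead of throwing.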