Fetching an external page and parsing meta-tags without Regex in C#?
Consider the following code:
public ActionResult Index(String URLQuery = "http://www.google.com")
{
    HttpWebRequest webRequest;
    HttpWebResponse webResponse;
    int bufCount = 0;
    byte[] byteBuf = new byte[1024];
    String queryContent = "";

    webRequest = (HttpWebRequest) WebRequest.Create(URLQuery);
    webRequest.Timeout = 10*1000;
    webRequest.KeepAlive = false;
    webRequest.ContentType = "text/html";

    webResponse = (HttpWebResponse) webRequest.GetResponse();
    StreamReader responseStream = new StreamReader(webResponse.GetResponseStream(), System.Text.Encoding.UTF8);
    queryContent = responseStream.ReadToEnd();

    ViewData["StreamResult"] = queryContent;
    return View();
}
Essentially, this simply grabs a web page and spits it out as-is. What I'd like to do is take the fetched data and parse it, much like PHP allows you to do using some sort of built-in DOM object/framework. I have seen many examples that use Regex to accomplish this task, but I feel that approach is inefficient and prone to weird edge cases that might result in corrupt data on my end.
Is this even possible? Am I doomed to use Regex for this?
You should use a parser for this - it looks like the HTML Agility Pack will do what you want.
Using the HTML Agility Pack you can do this very easily. Below is a sample using XPath; the newer version supports LINQ syntax as well, but I haven't tried that yet personally.
StreamReader responseStream = new StreamReader(webResponse.GetResponseStream(),
                                               System.Text.Encoding.UTF8);
queryContent = responseStream.ReadToEnd();

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(queryContent);
HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("//body | //BODY");
/* do processing here */
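Since the question asks about meta tags specifically, here is a minimal sketch of how the same approach might be pointed at them. It assumes the HtmlAgilityPack NuGet package is referenced. The class name MetaTagReader and the method ExtractMetaTags are hypothetical names made up for this illustration, and keying the result by the name attribute (falling back to property, as Open Graph tags use) is just one reasonable choice, not the only one.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class MetaTagReader
{
    // Hypothetical helper (not part of the original answer): maps each meta
    // tag's name (or property, e.g. og:title) to the value of its content attribute.
    public static Dictionary<string, string> ExtractMetaTags(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection metaNodes = doc.DocumentNode.SelectNodes("//meta[@content]");
        if (metaNodes == null)
        {
            return result;
        }

        foreach (HtmlNode meta in metaNodes)
        {
            // GetAttributeValue returns the supplied default when the attribute is absent.
            string key = meta.GetAttributeValue("name", "");
            if (key.Length == 0)
            {
                key = meta.GetAttributeValue("property", "");
            }
            if (key.Length == 0)
            {
                continue; // e.g. http-equiv tags, which carry neither name nor property
            }

            // If a key appears more than once, the last occurrence wins.
            result[key] = meta.GetAttributeValue("content", "");
        }

        return result;
    }
}
```

In the controller from the question, this could be called as ViewData["MetaTags"] = MetaTagReader.ExtractMetaTags(queryContent); once the response has been read. If you prefer the LINQ style mentioned above, doc.DocumentNode.Descendants("meta") should give you the same nodes to filter without XPath.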