Any suggestions on how to parse headers and links from blog pages using C#?

I'm currently self-studying C# in my free time and thought of a "little" project to get me going (and one that I or others will actually find useful). It ended up being more complicated than I thought. Or maybe I'm just thinking it is?

Anyway, this project would parse the homepages of the blogs I frequent (most of them are WordPress blogs), take the post headers and the links within those posts, and notify me via a balloon tip in the task bar. I can handle the rest, except for getting C# to parse the HTML pages for the items I need. C# doesn't seem to have any built-in way to do this. Could anyone point me in the right direction? I just looked into the HTML Agility Pack but I'm still trying to figure it out. Some example code would help a lot too. Thanks in advance!


You are doing the right thing if you are using the HTML Agility Pack.

Here is an example that selects all of the links on a page and rewrites each one:

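// Requires the HtmlAgilityPack library: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
// Note: SelectNodes returns null when nothing matches.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // FixLink is a placeholder for your own method that returns the rewritten URL.
    att.Value = FixLink(att);
}
doc.Save("file.htm");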

You may want to brush up on your XPath, since that is how you query the HtmlDocument.
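For the blog case specifically, here is a rough sketch of pulling post titles and their links out of a page with an XPath query. It assumes each post title is a link inside an element with the class "entry-title", which is common in WordPress themes but by no means guaranteed, so check the markup of the blogs you follow and adjust the XPath. BlogParser, ExtractPosts and the sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class BlogParser
{
    // Assumption: post titles are links inside an element whose class list
    // contains "entry-title". Adjust this to match the real markup.
    private const string TitleLinkXPath =
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' entry-title ')]//a[@href]";

    // Returns (title, url) pairs for every post title link found in the HTML.
    public static List<KeyValuePair<string, string>> ExtractPosts(string html)
    {
        var posts = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(TitleLinkXPath);
        if (links == null)
        {
            return posts;
        }

        foreach (HtmlNode link in links)
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            string title = HtmlEntity.DeEntitize(link.InnerText).Trim();
            string url = link.GetAttributeValue("href", string.Empty);
            posts.Add(new KeyValuePair<string, string>(title, url));
        }
        return posts;
    }

    public static void Main()
    {
        // Hypothetical sample markup standing in for a downloaded homepage.
        string html = "<html><body>" +
            "<h2 class='entry-title'><a href='http://example.com/first'>First post</a></h2>" +
            "<h2 class='entry-title'><a href='http://example.com/second'>Second &amp; third</a></h2>" +
            "<p><a href='http://example.com/about'>About</a></p>" +
            "</body></html>";

        foreach (KeyValuePair<string, string> post in ExtractPosts(html))
        {
            Console.WriteLine(post.Key + " -> " + post.Value);
        }
    }
}
```

To run this against a live page instead of a string, the library's HtmlWeb class can download and parse in one step (new HtmlWeb().Load(url) returns an HtmlDocument), and the same XPath query should then work on its DocumentNode.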
