
C# and regex to pull the href part of links on HTML pages

I have this code in C# to pull links from a web page, and I want to make it smarter so that I can easily add small exclusions in the future based on two criteria.

First, I want to exclude links to certain file extensions found on pages, such as PDF or PPT files.

Next, I want to be able to exclude links based on the first part of the URL, such as ftp, images.google..., maps.google..., and mailto.

This is my current code, which needs help:

MatchCollection m1 = Regex.Matches(file, @"(?i)(<A[^>]*href\s*=\s*['""](?!mailto|[^'""]*\.(?:pdf|doc|ppt))[^>]*>.*?</A>)", RegexOptions.Singleline);


Have you considered the Html Agility Pack?
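
However the href values end up being extracted (Html Agility Pack or a regex), the exclusion rules may be easier to extend if they live in plain lists checked against System.Uri rather than inside the pattern itself. What follows is only a rough sketch under some assumptions: the class name LinkFilter, the method ExtractLinks, and the three exclusion lists are made up for illustration, and the regex only handles quoted href values on anchor tags.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinkFilter
{
    // Hypothetical exclusion lists; add entries here as new cases come up.
    static readonly HashSet<string> ExcludedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pdf", ".doc", ".ppt" };

    static readonly HashSet<string> ExcludedSchemes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "mailto", "ftp" };

    static readonly string[] ExcludedHostPrefixes = { "images.google.", "maps.google." };

    // Captures only the href value of each <a> tag (quoted with ' or ").
    static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(['""])(?<url>.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri uri;
            // Resolve relative links against the page address; skip anything unparsable.
            if (!Uri.TryCreate(baseUri, m.Groups["url"].Value, out uri)) continue;

            // Criterion 2: the first part of the URL (scheme or host).
            if (ExcludedSchemes.Contains(uri.Scheme)) continue;
            if (ExcludedHostPrefixes.Any(p =>
                    uri.Host.StartsWith(p, StringComparison.OrdinalIgnoreCase))) continue;

            // Criterion 1: the file extension, taken from the path only,
            // so a trailing query string does not hide it.
            if (ExcludedExtensions.Contains(Path.GetExtension(uri.AbsolutePath))) continue;

            links.Add(uri.AbsoluteUri);
        }
        return links;
    }
}
```

With the page source in the file variable from the question, the call would look something like LinkFilter.ExtractLinks(file, new Uri("http://example.com/")), where the second argument is a placeholder for the address the page was downloaded from. Adding a new exclusion is then a one-line change to one of the lists instead of an edit to the regex.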
