C# and regex to pull the href part of links on HTML pages
I have this code in C# to pull links from a web page, and I want to make it smarter so that in the future I can make small additions to exclude links based on two criteria.
First, I want to exclude certain file extensions found on pages, such as links to PDF or PPT files.
Next, I want to be able to exclude links based on the first part of the URL, such as ftp, images.google..., maps.google..., and mailto.
This is my current code that needs help:
MatchCollection m1 = Regex.Matches(file, @"(?i)(<A[^>]*href\s*=\s*['""](?!mailto|[^'""]*\.(?:pdf|doc|ppt))[^>]*>.*?</A>)", RegexOptions.Singleline);
Have you considered the Html Agility Pack?
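Whichever way the href values are extracted (a regex or a parser such as the Html Agility Pack), the exclusions may be easier to extend if they are kept out of the pattern and checked in ordinary code. Below is a minimal sketch using only the base class library. The names `LinkFilter`, `ExtractLinks`, `IsAllowed`, and the three exclusion lists are hypothetical, and the regex only handles quoted href values, so treat it as a starting point rather than a complete HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinkFilter
{
    // Hypothetical exclusion lists: extend these instead of editing the regex.
    static readonly HashSet<string> ExcludedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pdf", ".doc", ".ppt" };
    static readonly HashSet<string> ExcludedSchemes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "mailto", "ftp" };
    static readonly string[] ExcludedHostPrefixes = { "images.google.", "maps.google." };

    // Captures the quoted href value of each <a> tag (unquoted values are not handled).
    static readonly Regex HrefPattern = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href\s*=\s*(?<q>['""])(?<url>.*?)\k<q>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns every href in the HTML that passes IsAllowed.
    public static List<string> ExtractLinks(string html)
    {
        return HrefPattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups["url"].Value)
            .Where(IsAllowed)
            .ToList();
    }

    public static bool IsAllowed(string href)
    {
        string path;
        Uri uri;
        if (Uri.TryCreate(href, UriKind.Absolute, out uri))
        {
            // Absolute URL: check the scheme and the start of the host name.
            if (ExcludedSchemes.Contains(uri.Scheme)) return false;
            if (ExcludedHostPrefixes.Any(p =>
                    uri.Host.StartsWith(p, StringComparison.OrdinalIgnoreCase)))
                return false;
            path = uri.AbsolutePath;
        }
        else
        {
            // Relative link: drop any query string or fragment before the extension check.
            path = href.Split('?', '#')[0];
        }
        return !ExcludedExtensions.Contains(GetExtension(path));
    }

    // Returns the extension of the last path segment, including the dot, or "".
    static string GetExtension(string path)
    {
        int slash = path.LastIndexOf('/');
        int dot = path.LastIndexOf('.');
        return dot > slash ? path.Substring(dot) : "";
    }
}
```

With this layout, `LinkFilter.ExtractLinks(file)` would take the place of the `Regex.Matches` call, and a new exclusion becomes a one-line addition to the relevant list.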