Help with Regex. Need to extract `<A HREF`
I have <A HREF="f110111.ZIP">, where f110111 is an arbitrary character sequence.
I need a C# regex match expression to extract all of the above.
E.g., the input is
<A HREF="f110111.ZIP"><A HREF="qqq.ZIP"><A HREF="gygu.ZIP">
I want the list:
- f110111.ZIP
- qqq.ZIP
- gygu.ZIP
What you need is the Html Agility Pack! It lets you read HTML easily and provides a straightforward way to retrieve links.
If the filename can contain multiple dots:
<A HREF="([^"]+?)\.zip
If the filename has no dots (just the one before zip), you can use a faster one:
<A HREF="([^".]+)
Match case-insensitively, since the sample input uses .ZIP in upper case.
Java example:
Pattern pattern = Pattern.compile("<A HREF=\"([^\"]+?)\\.zip", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(buffer);
while (matcher.find()) {
    // do something with: matcher.group(1)
}
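With .NET's Regex class, the negated-character-class pattern might be used roughly as follows. This is only a sketch: the class and method names (HrefExtractor, ExtractZipNames) are made up for illustration, and the capture group is widened to include the extension because the list in the question keeps ".ZIP".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper: returns every quoted HREF value that ends in ".zip".
    public static List<string> ExtractZipNames(string html)
    {
        var names = new List<string>();
        // [^"]+? lazily consumes non-quote characters up to ".zip" followed by the closing quote.
        var pattern = new Regex("<A HREF=\"([^\"]+?\\.zip)\"", RegexOptions.IgnoreCase);
        foreach (Match m in pattern.Matches(html))
        {
            names.Add(m.Groups[1].Value);
        }
        return names;
    }

    public static void Main()
    {
        // Sample input taken from the question.
        string input = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";
        foreach (string name in ExtractZipNames(input))
        {
            Console.WriteLine(name);
        }
    }
}
```

RegexOptions.IgnoreCase is what lets the lower-case ".zip" in the pattern match the upper-case ".ZIP" in the input.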
NO NO! Do not use Regex to parse HTML!
Try an XML Parser. Or XPath perhaps.
Try this one:
/<a href="([^">]+\.ZIP)/gi
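C# has no /pattern/flags literal, so the two flags have to be expressed differently. A minimal sketch, assuming the dot before ZIP is meant literally (it is escaped here): "i" maps to RegexOptions.IgnoreCase and "g" is covered by Regex.Matches, which returns every match. ZipLinks and FindZipLinks are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ZipLinks
{
    // Hypothetical helper: collects capture group 1 of every match in the input.
    public static List<string> FindZipLinks(string html)
    {
        var result = new List<string>();
        // Regex.Matches plays the role of the "g" flag; IgnoreCase plays the role of "i".
        MatchCollection matches = Regex.Matches(
            html, "<a href=\"([^\">]+\\.ZIP)", RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        string input = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";
        Console.WriteLine(string.Join(", ", FindZipLinks(input)));
    }
}
```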
I think regular expressions are a great way to pull pieces out of a given text.
This regex gets the File, Filename and Extension from the given text.
href="(?<File>(?<Filename>.*?)(?<Ext>\.\w{1,3}))"
The regex above expects an extension of 1 to 3 word characters (a-z, A-Z, 0-9 or underscore).
C# Code sample:
string regex = "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"";
RegexOptions options = ((RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline) | RegexOptions.IgnoreCase);
Regex reg = new Regex(regex, options);
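To show how the named groups could be read back, here is a small sketch built around the same pattern. The class and method names (NamedGroupsDemo, Describe) are invented for the example, and only RegexOptions.IgnoreCase is kept, since the pattern contains no whitespace or line anchors that the other two options would affect.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupsDemo
{
    // Hypothetical helper: one "File | Filename | Ext" line per href found.
    public static List<string> Describe(string html)
    {
        string regex = "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"";
        Regex reg = new Regex(regex, RegexOptions.IgnoreCase);
        var lines = new List<string>();
        foreach (Match m in reg.Matches(html))
        {
            // Named groups are looked up by the names used in the pattern.
            lines.Add(m.Groups["File"].Value + " | "
                    + m.Groups["Filename"].Value + " | "
                    + m.Groups["Ext"].Value);
        }
        return lines;
    }

    public static void Main()
    {
        string input = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";
        foreach (string line in Describe(input))
        {
            Console.WriteLine(line);   // first line: f110111.ZIP | f110111 | .ZIP
        }
    }
}
```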