Need help modifying regular expression
One of these days I'll get good at regex but for now...
I'm parsing an HTML page looking for MP3 files using the following expression (which works):
"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"
I now want to search for both MP3 and OGG files. Seems like a simple OR modification (.mp3 || .ogg), but I'm not quite sure how I put that in there? See Trying to parse links in an HTML directory listing using Java regex for more info.
Understanding the pattern
You have the following Java string literal:
// Java string literal
"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"
The pattern represented by this string, when all escape sequences are processed, is this:
// the regex pattern
<A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>
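To check the difference between the two forms yourself, here is a minimal sketch (the class name LiteralVsPattern is just made up for illustration). It prints the string the compiler produces from the literal, which is the pattern the regex engine actually sees:

```java
public class LiteralVsPattern {
    // The Java string literal from the question; \" and \\ are Java-level escapes.
    static final String REGEX = "<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>";

    public static void main(String[] args) {
        // Prints the regex with the Java escapes already processed:
        // plain double quotes and a single backslash before the period.
        System.out.println(REGEX);
    }
}
```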
Now let's break this pattern apart:
_________       _     _        _________
<A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>
         \_____/       \______/
            1             2
So the parts of this regex are:
1. <A HREF=" matched literally
2. ([^"]+), i.e. everything but double quotes, captured in group 1
3. " matched literally
4. [^>]*, i.e. everything but >
5. > matched literally
6. ([^<]+?), i.e. everything but <, as few as possible, captured in group 2
7. \.mp3</A> matched literally (the . is escaped by a backslash)
So looking at this, we can observe that the regex makes the following assumptions:
- The href attribute value is matched by part 2; it must be enclosed in double quotes, and cannot itself contain any escaped double quotes. This match is captured into group 1.
- Any remaining attributes are matched by part 4. The href must be the first attribute, or else the regex won't match.
- Part 6 matches the filename, capturing it into group 2.
- Part 7 matches the extension and, immediately after it, the closing tag. The reluctance of part 6 is probably not necessary.
Parsing HTML with regex is a tricky business, but given numerous assumptions, the above regex seems capable of doing the job most of the time.
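To make those assumptions concrete, here is a small sketch that runs the original pattern against two made-up anchor tags. The class name Mp3LinkDemo, the helper firstLink, and the sample HTML are all hypothetical and not taken from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Mp3LinkDemo {
    // The original pattern from the question, written as a Java string literal.
    static final Pattern MP3_LINK =
        Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>");

    // Returns "url -> filename" for the first matching link, or null if there is none.
    static String firstLink(String html) {
        Matcher m = MP3_LINK.matcher(html);
        return m.find() ? m.group(1) + " -> " + m.group(2) : null;
    }

    public static void main(String[] args) {
        // HREF is the first attribute, so the pattern matches.
        System.out.println(firstLink("<A HREF=\"music/song.mp3\" class=\"x\">song.mp3</A>"));
        // HREF is not the first attribute, so the pattern finds nothing.
        System.out.println(firstLink("<A class=\"x\" HREF=\"music/song.mp3\">song.mp3</A>"));
    }
}
```

The first call prints music/song.mp3 -> song and the second prints null, which is the "href must be the first attribute" restriction in action.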
Modifying the pattern
Alternation in regex is done using the vertical bar. It's important to understand its precedence, and how grouping can be useful.
- this|that matches one of these two strings: "this", "that"
- this|that thing matches one of these two strings: "this", "that thing"
- (this|that) thing matches one of these two strings: "this thing", "that thing"
- (this|that) (thing|stuff) matches one of these four strings: "this thing", "that thing", "this stuff", "that stuff"
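The precedence rules above are easy to verify with a quick sketch. The class name AlternationDemo and the helper whole are hypothetical; the helper just wraps Pattern.matches, which requires the entire input to match:

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    // True only if the regex matches the entire input string.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Without a group, | splits the whole pattern: "this" OR "that thing".
        System.out.println(whole("this|that thing", "this"));        // true
        System.out.println(whole("this|that thing", "this thing"));  // false
        // With a group, the alternation is confined to the parentheses.
        System.out.println(whole("(this|that) thing", "this thing")); // true
        System.out.println(whole("(this|that) thing", "this"));       // false
    }
}
```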
So to allow both the mp3 and ogg extensions, we can change the mp3 in the pattern to (mp3|ogg). Note that this group will match and capture the extension into group 3.
The final pattern, therefore, is:
<A HREF="([^"]+)"[^>]*>([^<]+)\.(mp3|ogg)</A>
         \_____/       \_____/  \_______/
          1:url      2:filename   3:ext
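Here is a rough sketch of how the three groups could be read back in Java. The class name AudioLinkFinder, the method findLinks, and the sample HTML are made up for illustration, and the pattern has to be written as a Java string literal, with the quotes and the backslash escaped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AudioLinkFinder {
    // The final pattern, with " and \ escaped for the Java string literal.
    static final Pattern AUDIO_LINK =
        Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>");

    // Collects "url | filename | ext" for every matching link in the given HTML.
    static List<String> findLinks(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = AUDIO_LINK.matcher(html);
        while (m.find()) {
            found.add(m.group(1) + " | " + m.group(2) + " | " + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical directory listing: two audio links and one unrelated file.
        String html =
            "<A HREF=\"a/one.mp3\">one.mp3</A>\n" +
            "<A HREF=\"a/two.ogg\">two.ogg</A>\n" +
            "<A HREF=\"a/notes.txt\">notes.txt</A>";
        findLinks(html).forEach(System.out::println);
    }
}
```

With this sample input only the .mp3 and .ogg links are reported; the .txt link is skipped because nothing in the alternation matches its extension.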
As a Java string literal, this is:
"<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>"
Appendix
The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.
The (…) is a capturing group. It allows the string that it matched to be retrieved later.
The * and + are repetition specifiers. By default, repetition is greedy (i.e. match as much as possible). The ? in +? makes it reluctant (i.e. match as few as possible).
Note that ? may also serve as an optional repetition specifier (zero or one) in other contexts.
The . is a metacharacter that matches (almost) any character. Since we want a literal period, we escape it by preceding it with a backslash (written as a double backslash, \\, inside a Java string literal).
Note that a regex pattern is case sensitive by default. In Java, you may want to use the Pattern.CASE_INSENSITIVE flag (embeddable as (?i) in the pattern).
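A short sketch of the greedy/reluctant and case-sensitivity points, in case it helps. The class name FlagsDemo, the helper first, and the sample strings are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagsDemo {
    // Returns the first substring matched by regex in input, or null if nothing matches.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "<b>one</b><b>two</b>";
        // Greedy: .+ grabs as much as possible, so the match spans both elements.
        System.out.println(first("<b>.+</b>", s));
        // Reluctant: .+? stops at the first closing tag it can.
        System.out.println(first("<b>.+?</b>", s));

        // Case sensitive by default: no match, so this prints null.
        System.out.println(first("\\.mp3", "SONG.MP3"));
        // Embedded (?i) flag: matches ".MP3".
        System.out.println(first("(?i)\\.mp3", "SONG.MP3"));
        // Same effect using the Pattern.CASE_INSENSITIVE flag: prints true.
        System.out.println(
            Pattern.compile("\\.mp3", Pattern.CASE_INSENSITIVE).matcher("SONG.MP3").find());
    }
}
```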
Replace
\.mp3
with
\.((mp3)|(ogg))
And beware of parsing HTML with regex.
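One thing to keep in mind with this variant: the extra parentheses add capturing groups of their own. A minimal sketch, assuming the replacement is dropped into the pattern from the question (the class name NestedGroupsDemo, the helper groups, and the sample link are made up):

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedGroupsDemo {
    // The question's pattern with \.mp3 replaced by \.((mp3)|(ogg)).
    static final Pattern LINK =
        Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.((mp3)|(ogg))</A>");

    // Returns groups 1..groupCount() for the first matching link, or null if none matches.
    static String[] groups(String html) {
        Matcher m = LINK.matcher(html);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i); // null for a group that did not take part in the match
        }
        return out;
    }

    public static void main(String[] args) {
        // Group 3 holds the extension; groups 4 and 5 are the inner (mp3) and (ogg).
        System.out.println(Arrays.toString(groups("<A HREF=\"a/two.ogg\">two.ogg</A>")));
    }
}
```

For the sample link this prints [a/two.ogg, two, ogg, null, ogg]. If you only need the extension, the plain (mp3|ogg) form gives the same group 3 without the two extra groups.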