Need help modifying regular expression

2023-01-12 19:54 Q&A author:

One of these days I'll get good at regex but for now...

I'm parsing an HTML page looking for MP3 files using the following expression (which works):

"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"

I now want to search for both MP3 and OGG files. It seems like a simple OR modification (.mp3 || .ogg), but I'm not sure how to put that into the expression. See "Trying to parse links in an HTML directory listing using Java regex" for more info.

Understanding the pattern

You have the following Java string literal:

// Java string literal
"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"

The pattern represented by this string, when all escape sequences are processed, is this:

// the regex pattern
<A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>

Now let's break this pattern apart:

_________       _     _        _________
<A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>
         \_____/       \______/
            1              2

So the parts of this regex are:

1. <A HREF=" matched literally
2. ([^"]+), i.e. one or more characters other than a double quote, captured in group 1
3. " matched literally
4. [^>]*, i.e. zero or more characters other than >
5. > matched literally
6. ([^<]+?), i.e. one or more characters other than <, as few as possible, captured in group 2
7. \.mp3</A> matched literally (the . is escaped by a backslash)

So looking at this, we can observe that the regex makes the following assumptions:

The href attribute value is matched by part 2; it must be enclosed in double quotes and cannot itself contain any double quotes (not even escaped ones). This match is captured into group 1.
Any remaining attributes are matched by part 4. The href must be the first attribute, or else the regex won't match.
Part 6 matches the filename, capturing it into group 2.
Part 7 matches the extension and, immediately after it, the closing tag. The reluctance of part 6 is probably not necessary.

Parsing HTML with regex is a tricky business, but given numerous assumptions, the above regex seems capable of doing the job most of the time.
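To make those assumptions concrete, here is a minimal sketch that runs the original pattern against a few link variants. The class name AssumptionDemo, the helper matches, and the sample links are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample links are illustrative only.
class AssumptionDemo {
    // The original pattern from the question, as a Java string literal.
    static final Pattern MP3_LINK =
        Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>");

    // True if the pattern is found anywhere in the given HTML fragment.
    static boolean matches(String html) {
        return MP3_LINK.matcher(html).find();
    }

    public static void main(String[] args) {
        // href first, extra attribute afterwards: matched by part 4.
        System.out.println(matches("<A HREF=\"a.mp3\" TITLE=\"t\">a.mp3</A>")); // true
        // href is not the first attribute: no match.
        System.out.println(matches("<A TITLE=\"t\" HREF=\"a.mp3\">a.mp3</A>")); // false
        // Single-quoted attribute value: no match.
        System.out.println(matches("<A HREF='a.mp3'>a.mp3</A>"));               // false
        // Lowercase tag and attribute names: no match (case sensitive).
        System.out.println(matches("<a href=\"a.mp3\">a.mp3</a>"));             // false
    }
}
```

If the pages you parse can contain any of the failing variants, the pattern needs loosening, or a real HTML parser may be the better tool.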

Modifying the pattern

Alternation in regex is done using the vertical bar. It's important to understand its precedence, and how grouping can be useful.

this|that matches one of these two strings:
- "this"
- "that"
this|that thing matches one of these two strings:
- "this"
- "that thing"
(this|that) thing matches one of these two strings:
- "this thing"
- "that thing"
(this|that) (thing|stuff) matches one of these four strings:
- "this thing"
- "that thing"
- "this stuff"
- "that stuff"
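The precedence rules above can be checked with a short sketch. The class AlternationDemo and the helper fullMatch are hypothetical names; Pattern.matches requires the whole input to match, which makes the effect of grouping easy to see.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class AlternationDemo {
    // True only if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Without grouping, | splits the whole pattern into two alternatives.
        System.out.println(fullMatch("this|that thing", "this"));         // true
        System.out.println(fullMatch("this|that thing", "this thing"));   // false
        // With grouping, only the parenthesized part alternates.
        System.out.println(fullMatch("(this|that) thing", "this thing")); // true
        System.out.println(fullMatch("(this|that) thing", "this"));       // false
    }
}
```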

So to allow both the mp3 and ogg extensions, we can change the mp3 in the pattern to (mp3|ogg). Note that this group will also capture the extension into group 3.

The final pattern, therefore, is:

<A HREF="([^"]+)"[^>]*>([^<]+)\.(mp3|ogg)</A>
         \_____/       \_____/  \_______/
          1:url      2:filename   3:ext

As a Java string literal, this is:

"<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>"
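As a rough usage sketch, the literal above can be compiled once and applied with Matcher.find() to pull out all three groups. The class name AudioLinkDemo, the helper findAudioLinks, and the sample HTML are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample HTML are illustrative only.
class AudioLinkDemo {
    static final Pattern AUDIO_LINK =
        Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>");

    // Returns one "url | filename | ext" string per matching link.
    static List<String> findAudioLinks(String html) {
        List<String> results = new ArrayList<>();
        Matcher m = AUDIO_LINK.matcher(html);
        while (m.find()) {
            results.add(m.group(1) + " | " + m.group(2) + " | " + m.group(3));
        }
        return results;
    }

    public static void main(String[] args) {
        String html =
              "<A HREF=\"music/song.mp3\">song.mp3</A>\n"
            + "<A HREF=\"music/tune.ogg\">tune.ogg</A>\n"
            + "<A HREF=\"notes.txt\">notes.txt</A>\n";
        // Prints:
        // music/song.mp3 | song | mp3
        // music/tune.ogg | tune | ogg
        for (String link : findAudioLinks(html)) {
            System.out.println(link);
        }
    }
}
```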

Appendix

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character that is not a lowercase vowel.

The (…) is a capturing group. It allows the string that it matched to be retrieved later.

The * and + are repetition specifiers. By default, repetition is greedy (i.e. match as much as possible). The ? in +? makes it reluctant (i.e. match as few as possible).

Note that ? may also serve as the optional (zero-or-one) repetition specifier in other contexts.
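A small sketch of greedy versus reluctant repetition, and of ? as an optional specifier. The class GreedyVsReluctantDemo, the helper firstGroup, and the inputs are invented for illustration. The last two calls suggest why the reluctant +? in the original pattern makes no difference there: the negated character class already stops at <.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and inputs are illustrative only.
class GreedyVsReluctantDemo {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";
        // Greedy: .+ grabs as much as possible, then backtracks to the last >.
        System.out.println(firstGroup("<(.+)>", input));   // b>bold</b
        // Reluctant: .+? stops at the first > it can.
        System.out.println(firstGroup("<(.+?)>", input));  // b

        // ? on its own means "zero or one" of the preceding item.
        System.out.println(Pattern.matches("colou?r", "color"));  // true
        System.out.println(Pattern.matches("colou?r", "colour")); // true

        // With [^<] both variants capture the same text here.
        System.out.println(firstGroup(">([^<]+)\\.mp3<", ">song.mp3<"));  // song
        System.out.println(firstGroup(">([^<]+?)\\.mp3<", ">song.mp3<")); // song
    }
}
```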

The . is a metacharacter that matches (almost) any character. Since we want a literal period, we escape it by preceding it with a backslash (written as a double backslash, \\, in a Java string literal).

Note that a regex pattern is case sensitive by default. In Java, you may want to use the Pattern.CASE_INSENSITIVE flag (embeddable as (?i) in the pattern).
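The following sketch shows both ways of turning on case-insensitive matching for the final pattern. The class name CaseFlagDemo, the helper found, and the sample link are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and the sample link are illustrative only.
class CaseFlagDemo {
    static final String REGEX =
        "<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>";

    // True if the pattern is found anywhere in the input.
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String link = "<a href=\"x/song.MP3\">song.MP3</a>";

        Pattern strict = Pattern.compile(REGEX);
        Pattern flagged = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);
        Pattern embedded = Pattern.compile("(?i)" + REGEX);

        System.out.println(found(strict, link));   // false
        System.out.println(found(flagged, link));  // true
        System.out.println(found(embedded, link)); // true
    }
}
```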

Replace 
    \.mp3
with
    \.((mp3)|(ogg))

And beware of parsing HTML with regex.
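One thing to keep in mind with this variant: each pair of parentheses is a capturing group, so the nested form yields five groups instead of three. A minimal sketch, assuming the replacement is applied to the original pattern from the question; the class NestedGroupDemo and helper groups are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and the sample link are illustrative only.
class NestedGroupDemo {
    // Original pattern with \.mp3 replaced by \.((mp3)|(ogg)).
    static final Pattern NESTED = Pattern.compile(
        "<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.((mp3)|(ogg))</A>");

    // Returns groups 0..groupCount() of the first match, or null if none.
    static String[] groups(String html) {
        Matcher m = NESTED.matcher(html);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i); // null for a group that did not participate
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("<A HREF=\"t.ogg\">t.ogg</A>");
        System.out.println(g[1]); // t.ogg
        System.out.println(g[2]); // t
        System.out.println(g[3]); // ogg  (the outer group: the extension)
        System.out.println(g[4]); // null (the mp3 alternative did not match)
        System.out.println(g[5]); // ogg
    }
}
```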
