开发者

Regex in Python to find words that follow pattern: vowel, consonant, vowel, consonant

Trying to learn Regex in Python to find words that have consecutive vowel-开发者_运维技巧consonant or consonant-vowel combinations. How would I do this in regex? If it can't be done in Regex, is there an efficient way of doing this in Python?


I believe you should be able to use a regular expression like this:

r"([aeiou][bcdfghjklmnpqrstvwxz])+"

for matching vowel followed by consonant and:

r"([bcdfghjklmnpqrstvwxz][aeiou])+"

for matching consonant followed by vowel. For reference, the + means it will match the largest repetition of this pattern that it can find. For example, applying the first pattern to "ababab" would return the whole string, rather than single occurences of "ab".

If you want to match one or more vowels followed by one or more consonants it might look like this:

r"([aeiou]+[bcdfghjklmnpqrstvwxz]+)+"

Hope this helps.


^(([aeiou][^aeiou])+|([^aeiou][aeiou])+)$

>>> import re
>>> consec_re = re.compile(r'^(([aeiou][^aeiou])+|([^aeiou][aeiou])+)$')
>>> consec_re.match('bale')
<_sre.SRE_Match object at 0x01DBD1D0>
>>> consec_re.match('bail')
>>>


If you map consonantal digraphs into single consonants, the longest such word is anatomicopathological as a 10*VC string.

If you correctly map y, then you get complete strings like acetylacetonates as 8*VC and hypocotyledonary as 8*CV.

If you don’t need the string to be whole, you get a 9*CV pattern in chemicomineralogical and a 9*VC pattern in overimaginativeness.

There are many 10* words if runs of consecutive consonants or vowels are allowed to alternate, as in (C+V+)+. These include laparocolpohysterotomy and ureterocystanastomosis.

The main trick is to first map all consonants to C and all vowels to V, then do a VC or CV match. For Y, you have to do lookaheads and/or lookbehinds to determine whether it maps to C or V in that position.

I could show you the patterns I used, but you probably won’t be pleased with me. :) For example:

 (?<= \p{IsVowel} )     [yY] (?= \p{IsVowel} )  # counts as a C
 (?<= \p{IsConsonant} ) [yY]                    # counts as a V
                        [yY] (?= \p{IsVowel} )  # counts as a C

The main trick then becomes one of looking for overlapping matches of VC or CV alternations via

 (?= ( (?:  \p{IsVowel}       \p{IsConsonant} )  )+ ) )

and

 (?= ( (?:  \p{IsConsonant}   \p{IsVowel}     )  )+ ) )

Then you count all those up and see which ones are the longest.

However, since Python support doesn’t (by default/directly) support properties in regexes the way I just used them for my own program, this makes it even more important to first preprocess the string into nothing but C’s and V’s. Otherwise your patterns look really ugly.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜