Searching and capturing a character using regular expressions Python
While going through one of the problems in Python Challenge, I am trying to solve it as follows:
Read the input in a text file with characters as follows:
DQheAbsaMLjTmAOKmNsLziVMenFxQdATQIjItwtyCHyeMwQTNxbbLXWZnGmDqHhXnLHfEyvzxMhSXzd
BEBaxeaPgQPttvqRvxHPEOUtIsttPDeeuGFgmDkKQcEYjuSuiGROGfYpzkQgvcCDBKrcYwHFlvPzDMEk
MyuPxvGtgSvWgrybKOnbEGhqHUXHhnyjFwSfTfaiWtAOMBZEScsOSumwPssjCPlLbLsPIGffDLpZzMKz
jarrjufhgxdrzywWosrblPRasvRUpZLaUbtDHGZQtvZOvHeVSTBHpitDllUljVvWrwvhpnVzeWVYhMPs
kMVcdeHzFZxTWocGvaKhhcnozRSbWsIEhpeNfJaRjLwWCvKfTLhuVsJczIYFPCyrOJxOPkXhVuCqCUgE
luwLBCmqPwDvUPuBRrJZhfEXHXSBvljqJVVfEGRUWRSHPeKUJCpMpIsrV.......
What I need is to go through this text file and pick out every lower-case letter that is enclosed by exactly three upper-case letters on each side.
The python script that I wrote to do the above is as follows:
import re
pattern = re.compile("[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
f = open('/Users/Dev/Sometext.txt','r')
for line in f:
    result = pattern.search(line)
    if result:
        print result.groups()
f.close()
Instead of returning the captures (a list of lower-case characters), the script above returns all the text blocks that meet the regular expression criteria, like
aXCSdFGHj
vCDFeTYHa
nHJUiKJHo
.........
.........
Can somebody tell me what exactly I am doing wrong here? And instead of looping through the entire file, is there an alternate way to run the regular expression search on the entire file?
Thanks
Change result.groups() to result.group(1) and you will get just the single letter match.
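To make the difference concrete, here is a minimal sketch comparing what the match object's methods return. It uses Python 3 print syntax, and the sample string is made up for illustration rather than taken from the challenge data:

```python
import re

pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

# Hypothetical sample line, not taken from the challenge data.
sample = "xxaXCSdFGHjxx"

result = pattern.search(sample)
whole_match = result.group(0)  # the entire text matched by the pattern
all_groups = result.groups()   # a tuple holding every captured group
captured = result.group(1)     # only the first captured group

print(whole_match)  # aXCSdFGHj
print(all_groups)   # ('d',)
print(captured)     # d
```

Note that groups() always returns a tuple, even when the pattern has a single group, so group(1) is the call that yields the bare letter.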
A second problem with your code is that it will not find multiple results on one line. So instead of using re.search you'll need re.findall or re.finditer. findall will return strings or tuples of strings, whereas finditer returns match objects.
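A small sketch of how the three calls behave on a line with two matches. The sample line is hypothetical, and the code is written for Python 3:

```python
import re

pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

# Hypothetical line containing two separate matches.
line = "aXCSdFGHj--vCDFeTYHa"

# search() stops at the first match on the line.
first_only = pattern.search(line).group(1)

# findall() returns the captured group of every match as plain strings.
letters = pattern.findall(line)

# finditer() yields match objects, so positions are available as well.
positions = [(m.start(1), m.group(1)) for m in pattern.finditer(line)]

print(first_only)  # d
print(letters)     # ['d', 'e']
print(positions)   # [(4, 'd'), (15, 'e')]
```

findall is enough when only the letters matter; finditer is the one to reach for when you also need to know where each match sits.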
Here's how I approached the same problem:
import urllib
import re
pat = re.compile('[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]')
print ''.join(pat.findall(urllib.urlopen(
    "http://www.pythonchallenge.com/pc/def/equality.html").read()))
Note that re.findall and re.finditer return non-overlapping results. So when using the above pattern with re.findall against the string 'aBBBcDDDeFFFg', your only match will be 'c', but not 'e', because the 'e' is consumed as the trailing lower-case letter of the first match. Fortunately, this Python Challenge problem contains no such examples.
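The non-overlapping behaviour can be checked directly. This is just a sketch in Python 3 syntax, using the example string from the paragraph above:

```python
import re

pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
text = "aBBBcDDDeFFFg"

# Record the span of each whole match together with its captured letter.
matches = [(m.span(), m.group(1)) for m in pattern.finditer(text)]

print(matches)  # [((0, 9), 'c')]
```

The single match covers indices 0 through 8, which includes the 'e'. The scan resumes at index 9, so 'e' can no longer serve as the centre of a second match.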
I'd suggest using lookaround:
(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])
This will have no problem with overlapping matches.
Explanation:
(?<=[A-Z]{3}) # assert that there are 3 uppercase letters before the current position
(?<![A-Z].{3}) # assert that there is no uppercase letter 4 characters before the current position
([a-z]) # match a lowercase character (all characters in the example are ASCII)
(?=[A-Z]{3}) # assert that there are 3 uppercase letters after the current position
(?!.{3}[A-Z]) # assert that there is no uppercase letter 4 characters after the current position
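As a rough check of the lookaround pattern, the sketch below runs it side by side with the consuming pattern from the question. It is written for Python 3, and the test strings are made up for illustration:

```python
import re

# The pattern from the question: the neighbouring lower-case letters are consumed.
consuming = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

# The lookaround pattern: only the centre letter is consumed.
lookaround = re.compile(
    r"(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])"
)

text = "aBBBcDDDeFFFg"
consuming_hits = consuming.findall(text)
lookaround_hits = lookaround.findall(text)

# Four upper-case letters on the left must still be rejected.
rejected = lookaround.findall("ABBBcDDDe")

# No lower-case neighbour is required, so a match at the very edges works too.
edge_hits = lookaround.findall("BBBcDDD")

print(consuming_hits)   # ['c']
print(lookaround_hits)  # ['c', 'e']
print(rejected)         # []
print(edge_hits)        # ['c']
```

One difference worth keeping in mind: the lookaround version only demands that the fourth character on each side is not an upper-case letter, so it also accepts matches at the start or end of the text, which the original pattern does not.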
import re
with open('/Users/Dev/Sometext.txt','r') as f:
    tokens = re.findall(r'[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]', f.read())
for token in tokens:
    print token
What findall does:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
Maybe the most useful function in the re module.
The read() function reads the whole file into one big string. This is especially useful if you need to match a regular expression against the whole file.
Warning: Depending on the size of the file, you may prefer iterating over the file line by line as you did in your first approach.
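If memory is a concern, a line-by-line variant might look like the sketch below. It is written for Python 3 and uses io.StringIO with made-up contents as a stand-in for the real file; with actual data you would iterate over the object returned by open() instead:

```python
import io
import re

pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

# Stand-in for an open file (hypothetical contents).
fake_file = io.StringIO("aXCSdFGHj\nnothing here\nvCDFeTYHa..nHJUiKJHo\n")

letters = []
for line in fake_file:
    # findall() picks up every match on the line, not just the first one.
    letters.extend(pattern.findall(line))

answer = "".join(letters)
print(answer)  # dei
```

Because this pattern consists only of letter classes, it can never match across a newline, so processing one line at a time gives the same result as searching the whole file at once.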