
Searching and capturing a character using regular expressions in Python

While going through one of the problems in Python Challenge, I am trying to solve it as follows:

Read the input in a text file with characters as follows:

DQheAbsaMLjTmAOKmNsLziVMenFxQdATQIjItwtyCHyeMwQTNxbbLXWZnGmDqHhXnLHfEyvzxMhSXzd
BEBaxeaPgQPttvqRvxHPEOUtIsttPDeeuGFgmDkKQcEYjuSuiGROGfYpzkQgvcCDBKrcYwHFlvPzDMEk
MyuPxvGtgSvWgrybKOnbEGhqHUXHhnyjFwSfTfaiWtAOMBZEScsOSumwPssjCPlLbLsPIGffDLpZzMKz
jarrjufhgxdrzywWosrblPRasvRUpZLaUbtDHGZQtvZOvHeVSTBHpitDllUljVvWrwvhpnVzeWVYhMPs
kMVcdeHzFZxTWocGvaKhhcnozRSbWsIEhpeNfJaRjLwWCvKfTLhuVsJczIYFPCyrOJxOPkXhVuCqCUgE
luwLBCmqPwDvUPuBRrJZhfEXHXSBvljqJVVfEGRUWRSHPeKUJCpMpIsrV.......

What I need is to go through this text file and pick out every lower-case letter that is enclosed by exactly three upper-case letters on each side.

The python script that I wrote to do the above is as follows:

import re

pattern = re.compile("[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
f = open('/Users/Dev/Sometext.txt','r')
for line in f:
    result = pattern.search(line)
    if result:
       print result.groups()

f.close()

Instead of returning only the captured lower-case characters, the script above returns every text block that matches the regular expression, for example:

aXCSdFGHj
vCDFeTYHa
nHJUiKJHo
.........
.........

Can somebody tell me what exactly I am doing wrong here? And instead of looping through the file line by line, is there a way to run the regular expression search on the entire file at once?

Thanks


Change result.groups() to result.group(1) and you will get just the single letter match.

A second problem with your code is that it will not find multiple results on one line. So instead of using re.search you'll need re.findall or re.finditer. findall will return strings or tuples of strings, whereas finditer returns match objects.
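As a rough illustration of the difference between these calls, here is a minimal sketch. The sample string is made up for the demonstration, and the code uses print() as a function so that it also runs on Python 3.

```python
import re

# Same pattern as in the question; the parentheses form capture group 1.
pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

# Hypothetical sample line containing two separate matches.
line = "xxaXCSdFGHjqqvCDFeTYHazz"

# search() stops at the first match on the line.
first = pattern.search(line)
first_block = first.group(0)    # the whole matched block
first_letter = first.group(1)   # only the captured letter
first_groups = first.groups()   # a tuple holding every captured group

# findall() returns the captured group of every non-overlapping match.
all_letters = pattern.findall(line)

# finditer() yields match objects, so positions are available as well.
letters_with_positions = [(m.group(1), m.start(1)) for m in pattern.finditer(line)]

print(first_block, first_letter, first_groups)
print(all_letters)
print(letters_with_positions)
```

With a single group in the pattern, findall gives a flat list of strings, which is why it can be passed straight to ''.join().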

Here's how I approached the same problem:

import urllib
import re    

pat = re.compile('[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]')
print ''.join(pat.findall(urllib.urlopen(
    "http://www.pythonchallenge.com/pc/def/equality.html").read())) 

Note that re.findall and re.finditer return non-overlapping results. So when the above pattern is used with re.findall against the string 'aBBBcDDDeFFFg', the only match is 'c', not 'e', because the 'e' has already been consumed as the trailing lower-case letter of the first match. Fortunately, this Python Challenge problem contains no such examples.
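A small sketch of that behaviour, using the example string from the paragraph above (print() is used as a function so the snippet also runs on Python 3):

```python
import re

pat = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
sample = "aBBBcDDDeFFFg"

# Each match consumes the text it covers, so the scan resumes after it.
consumed = [(m.group(0), m.span()) for m in pat.finditer(sample)]
letters = pat.findall(sample)

print(consumed)
print(letters)
```

The first match covers 'aBBBcDDDe', so only 'FFFg' is left to scan and the block around 'e' is never tried.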


I'd suggest using lookaround:

(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])

This will have no problem with overlapping matches.

Explanation:

(?<=[A-Z]{3})  # assert that there are 3 uppercase letters before the current position
(?<![A-Z].{3}) # assert that there is no uppercase letter 4 characters before the current position
([a-z])        # match a lowercase character (all characters in the example are ASCII)
(?=[A-Z]{3})   # assert that there are 3 uppercase letters after the current position
(?!.{3}[A-Z])  # assert that there is no uppercase letter 4 characters after the current position
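To see the effect, here is a minimal sketch that runs the lookaround pattern with re.findall. The test strings are made up for illustration, and print() is used as a function so the snippet also runs on Python 3.

```python
import re

# The lookarounds only assert context; the single lower-case letter is the
# only text consumed, so neighbouring matches may share uppercase runs.
lookaround = re.compile(
    r"(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])"
)

overlapping = lookaround.findall("aBBBcDDDeFFFg")

# Four uppercase letters before 'c' are rejected by the negative lookbehind.
too_many = lookaround.findall("aBBBBcDDDe")

# Unlike the consuming pattern, no lower-case letter is required on the
# outside, so a block at the very start or end of the text also matches.
at_edges = lookaround.findall("BBBcDDD")

print(overlapping, too_many, at_edges)
```

The last case is a behavioural difference from the original pattern, which may or may not matter depending on the input.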


import re

with open('/Users/Dev/Sometext.txt','r') as f:
    # findall returns the captured group of every match in the whole file
    tokens = re.findall(r'[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]', f.read())

    for token in tokens:
        print token

What findall does:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Maybe the most useful function in the re module.

The read() function reads the whole file into one big string. This is especially useful if you need to match a regular expression against the whole file.

Warning: Depending on the size of the file, you may prefer iterating over the file line by line as you did in your first approach.
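For large files, a line-by-line variant might look like the sketch below. The helper name collect_letters and the temporary sample file are assumptions made for this demonstration; they are not part of the original code. print() is used as a function so the snippet also runs on Python 3.

```python
import os
import re
import tempfile

pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")


def collect_letters(path):
    """Return the captured letters from every line of the file at `path`."""
    letters = []
    with open(path, "r") as f:
        for line in f:  # only one line is held in memory at a time
            letters.extend(pattern.findall(line))
    return "".join(letters)


# Hypothetical two-line sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("xxaXCSdFGHjqqvCDFeTYHazz\n")
    tmp.write("nHJUiKJHo\n")
    sample_path = tmp.name

result = collect_letters(sample_path)
os.remove(sample_path)
print(result)
```

Because the pattern consists only of letter classes, a match can never span a line break, so this gives the same letters as running findall on the whole file.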
