Searching and capturing a character using regular expressions Python

2023-01-25 08:37 问答作者：

While going through one of the problems in Python Challenge, I am trying to solve it as follows:

Read the input in a text file with characters as follows:

DQheAbsaMLjTmAOKmNsLziVMenFxQdATQIjItwtyCHyeMwQTNxbbLXWZnGmDqHhXnLHfEyvzxMhSXzd
BEBaxeaPgQPttvqRvxHPEOUtIsttPDeeuGFgmDkKQcEYjuSuiGROGfYpzkQgvcCDBKrcYwHFlvPzDMEk
MyuPxvGtgSvWgrybKOnbEGhqHUXHhnyjFwSfTfaiWtAOMBZEScsOSumwPssjCPlLbLsPIGffDLpZzMKz
jarrjufhgxdrzywWosrblPRasvRUpZLaUbtDHGZQt开发者_Python百科vZOvHeVSTBHpitDllUljVvWrwvhpnVzeWVYhMPs
kMVcdeHzFZxTWocGvaKhhcnozRSbWsIEhpeNfJaRjLwWCvKfTLhuVsJczIYFPCyrOJxOPkXhVuCqCUgE
luwLBCmqPwDvUPuBRrJZhfEXHXSBvljqJVVfEGRUWRSHPeKUJCpMpIsrV.......

What I need is to go through this text file and pick all lower case letters that are enclosed by only three upper-case letters on each side.

The python script that I wrote to do the above is as follows:

import re

pattern = re.compile("[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
f = open('/Users/Dev/Sometext.txt','r')
for line in f:
    result = pattern.search(line)
    if result:
       print result.groups()

 f.close()

The above given script, instead of returning the capture(list of lower case characters), returns all the text blocks that meets the regular expression criteria, like

aXCSdFGHj
vCDFeTYHa
nHJUiKJHo
.........
.........

Can somebody tell me what exactly I am doing wrong here? And instead of looping through the entire file, is there an alternate way to run the regular expression search on the entire file?

Thanks

Change result.groups() to result.group(1) and you will get just the single letter match.

A second problem with your code is that it will not find multiple results on one line. So instead of using re.search you'll need re.findall or re.finditer. findall will return strings or tuples of strings, whereas finditer returns match objects.

Here's where I approached the same problem:

import urllib
import re    

pat = re.compile('[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]')
print ''.join(pat.findall(urllib.urlopen(
    "http://www.pythonchallenge.com/pc/def/equality.html").read()))

Note that re.findall and re.finditer return non-overlapping results. So when using the above pattern with re.findall searching against string 'aBBBcDDDeFFFg', your only match will be 'c', but not 'e'. Fortunately, this Python Challenge problem contains no such such examples.

I'd suggest using lookaround:

(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])

This will have no problem with overlapping matches.

Explanation:

(?<=[A-Z]{3})  # assert that there are 3 uppercase letters before the current position
(?<![A-Z].{3}) # assert that there is no uppercase letter 4 characters before the current position
([a-z])        # match a lowercase character (all characters in the example are ASCII)
(?=[A-Z]{3})   # assert that there are 3 uppercase letter after the current position
(?!.{3}[A-Z])  # assert that there is no uppercase letter 4 characters after the current position

import re

with open('/Users/Dev/Sometext.txt','r') as f: 
    tokens = re.findall(r'[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]', f.read())

    for token ins tokens:
        print token

What findall does:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Maybe the most useful function in the re module.

The read() function reads the whole file into on big string. This is especially useful if you need to match a regular expression against the whole file.

Warning: Depending on the size of the file, you may prefer iterating over the file line by line as you did in your first approach.

继续阅读：python regex search text-files

Searching and capturing a character using regular expressions Python

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？