Regex optional match in Python fails

tickettypepat = (r'MIS Notes:.*(//p//)?.*')
retype = re.search(tickettypepat,line)
if retype:
  print retype.group(0)
  print retype.group(1)

Given the input:

MIS Notes: //p//

Can anyone tell me why group(0) is

MIS Notes: //p// 

and group(1) is returning as None?

I was originally using a regex because, before I ran into problems, the matching was more complex than just matching //p//. Here's the full code. I'm fairly new at this, so forgive my noobness; I'm sure there are better ways of accomplishing much of this, and if anyone feels like pointing those out, that would be awesome. But aside from the regex for //[pewPEW]// being too greedy, it seems to be functional. I appreciate the help.


Takes Text and cleans up / converts some things.

filename = (r'.\4-12_4-26.txt')
import re
import sys
#Clean up output from the web to ensure that you have one category per line
f = open(filename)
w = open('cleantext.txt','w')

origdatepat = (r'(Ticket Date: )([0-9]+/[0-9]+/[0-9]+),( [0-9]+:[0-9]+ [PA]M)')
tickettypepat = (r'MIS Notes:.*(//[pewPEW]//)?.*')

print 'Beginning Blank Line Removal'
for line in f:
    redate = re.search(origdatepat,line)
    retype = re.search(tickettypepat,line)
    if line == ' \n':
        line = ''
        print 'Removing blank Line'
#remove ',' from time and date line    
    elif redate:
        line = redate.group(1) + redate.group(2)+ redate.group(3)+'\n'
        print 'Redating... ' + line

    elif retype:
        print retype.group(0)
        print retype.group(1)
        
        if retype.group(1) == '//p//':
            line = line + 'Type: Phone\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == '//e//':
            line = line + 'Type: Email\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == '//w//':
            line = line + 'Type: Walk-in\n'
            print 'Setting type for... ' + line
        # ('' or None) evaluates to None, so test for None directly
        elif retype.group(1) is None:
            line = line + 'Type: Ticket\n'
            print 'Setting type for... ' + line

    w.write(line)

print 'Closing Files'                 
f.close()
w.close()

And here's some sample input.

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: some random stuff //p// followed by more stuff
Key Words:  

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //p//
Key Words:  

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //e// stuff....
Key Words:  


Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes:
Key Words:  


The pattern MIS Notes:.*(//p//)?.* works like this, using "MIS Notes: //p//" as the target:

  1. MIS Notes: matches "MIS Notes:", no surprises here.
  2. .* immediately runs to the end of the string (match so far "MIS Notes: //p//")
  3. (//p//)? is optional. Nothing is left for it to match, so it matches zero times and forces no backtracking.
  4. .* has nothing left to match, since we are at the end of the string already. Because the star allows zero matches for the preceding atom, the regex engine stops, reporting the entire string as the match and the sub-group as unset (group(1) is None).
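
The four steps above can be checked directly. A minimal sketch, written with the print() function so that it runs on Python 3 (the code in the question uses Python 2 print statements):

```python
import re

# Greedy .* swallows the tag before the optional group gets a chance.
greedy_optional = re.compile(r'MIS Notes:.*(//p//)?.*')

m = greedy_optional.search('MIS Notes: //p//')
print(repr(m.group(0)))  # the whole line
print(m.group(1))        # None: the optional group never took part in the match
print(m.span(1))         # (-1, -1) is how re marks a group that did not match
```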

Now when you change the regex to MIS Notes:.*(//p//).*, the behavior changes:

  1. MIS Notes: matches "MIS Notes:", still no surprises here.
  2. .* immediately runs to the end of the string (match so far "MIS Notes: //p//")
  3. (//p//) is required. The engine starts to backtrack character by character in order to fulfill this requirement. (Match so far "MIS Notes: ")
  4. (//p//) can match. Sub-group one is saved and contains "//p//".
  5. .* runs to the end of the string. Hint: If you are not interested in what it matches, it is superfluous and you can remove it.
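
A short sketch of the required-group variant, again assuming Python 3. It also shows the price of dropping the ?: a line without the tag no longer matches at all.

```python
import re

required = re.compile(r'MIS Notes:.*(//p//).*')

m = required.search('MIS Notes: some random stuff //p// followed by more stuff')
print(m.group(1))  # //p//

# Without a tag on the line there is no match object at all.
no_tag = required.search('MIS Notes:')
print(no_tag)  # None
```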

Now when you change the regex to MIS Notes:.*?//(p)//, the behavior changes again:

  1. MIS Notes: matches "MIS Notes:", and still no surprises here.
  2. .*? is non-greedy and checks the following atom before it proceeds (match so far "MIS Notes: ")
  3. //(p)// can match. Sub-group one is saved and contains "p".
  4. Done. Note that the engine never runs to the end of the string and backs up again; this saves time.
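
The difference between the greedy and the non-greedy version becomes visible when a line contains the tag twice. The input below is a made-up example, not one of the lines from the question:

```python
import re

line = 'MIS Notes: first //p// then //p// again'  # hypothetical line with two tags

lazy = re.search(r'MIS Notes:.*?//(p)//', line)
greedy = re.search(r'MIS Notes:.*//(p)//', line)

# .*? stops at the first tag, .* backtracks only as far as the last one.
print(lazy.start(1), greedy.start(1))
```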

Now if you know that there can be no / before the //p//, you can use: MIS Notes:[^/]*//(p)//:

  1. MIS Notes: matches "MIS Notes:", you get the idea.
  2. [^/]* can fast-forward to the first slash (this is faster than .*?)
  3. //(p)// can match. Sub-group one is saved and contains "p".
  4. Done. Note that no backtracking occurs, this saves time. This should be faster than version #3.
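
A sketch of the negated-class version, including the case where its assumption does not hold. The second input, with a single slash before the tag, is invented for illustration:

```python
import re

negated = re.compile(r'MIS Notes:[^/]*//(p)//')

ok = negated.search('MIS Notes: some random stuff //p// followed by more stuff')
print(ok.group(1))  # p

# A stray slash before the tag stops [^/]* too early, so nothing matches.
broken = negated.search('MIS Notes: see a/b //p//')
print(broken)  # None
```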


Regex quantifiers are greedy by default, which means that .* matches as much as it can, here the entire rest of the string. So there is nothing left to match for the optional group. group(0) is always the entire matched string.
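
If the tag has to stay optional, one way around the greediness is to make the capturing group itself required and wrap it, together with a lazy .*?, in an optional non-capturing group. This is only a sketch: it assumes the tag set //[pewPEW]// from the question, and the name TAG_PAT is made up for illustration.

```python
import re

# Once the optional (?:...) is entered, the inner (//[pewPEW]//) is required,
# so the lazy .*? has to expand until it reaches a tag.
TAG_PAT = re.compile(r'MIS Notes:(?:.*?(//[pewPEW]//))?')

samples = [
    'MIS Notes: some random stuff //p// followed by more stuff',
    'MIS Notes: //e// stuff....',
    'MIS Notes:',
]
for line in samples:
    print(repr(line), '->', TAG_PAT.search(line).group(1))
```

On a line without a tag the optional group is skipped, the pattern still matches "MIS Notes:", and group(1) is None.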

From your comment, why do you even want a regex? Isn't something like this enough:

if line.startswith('MIS Notes:'): # starts with that string
    data = line[len('MIS Notes:'):] # the rest is the interesting part
    if '//p//' in data:
        stuff, sep, rest = data.partition('//p//') # or something like that
    else:
        pass # other stuff
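
Fleshed out, the regex-free idea could look like the sketch below. The function name ticket_type and the TYPES table are hypothetical; the labels are taken from the script in the question. Lower-casing the text so that //P// counts as well is an assumption about what the [pewPEW] class was meant to allow.

```python
# Tag-to-label mapping, taken from the 'Type: ...' lines in the question.
TYPES = {'//p//': 'Phone', '//e//': 'Email', '//w//': 'Walk-in'}

def ticket_type(line):
    """Return the type label for an 'MIS Notes:' line, or None for other lines."""
    if not line.startswith('MIS Notes:'):
        return None
    data = line[len('MIS Notes:'):].lower()  # assumption: //P// means the same as //p//
    for tag, label in TYPES.items():
        if tag in data:
            return label
    return 'Ticket'  # no tag on the line

print(ticket_type('MIS Notes: //e// stuff....'))
print(ticket_type('MIS Notes:'))
```

If a line carries more than one tag, this returns the first one in the order of the TYPES table, not the first one in the line.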


The pattern is ambiguous for your purposes. It would be better to group the text by prefix or suffix. In the example here, I've chosen prefix grouping. Basically, if //p// occurs in the line, then the prefix group is set (possibly to an empty string); if it doesn't, the prefix is None. The suffix will be everything after the //p// item, or everything after "MIS Notes: " if there is no //p//.

import re
lines = ['MIS Notes: //p//',
    'MIS Notes: prefix//p//suffix']

tickettypepat = (r'MIS Notes: (?:(.*)//p//)?(.*)')
for line in lines:
    m = re.search(tickettypepat,line)
    print 'line:', line
    if m: print 'groups:', m.groups()
    else: print 'groups:', m

results:

line: MIS Notes: //p//
groups: ('', '')
line: MIS Notes: prefix//p//suffix
groups: ('prefix', 'suffix')
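
The same pattern on Python 3, with one extra line that has no tag (that third line is a made-up example). It shows why the test should be "prefix is not None" instead of a truthiness check: an empty prefix still means the tag was found.

```python
import re

pattern = re.compile(r'MIS Notes: (?:(.*)//p//)?(.*)')

lines = [
    'MIS Notes: //p//',
    'MIS Notes: prefix//p//suffix',
    'MIS Notes: no tag here',  # hypothetical line without the tag
]
for line in lines:
    prefix, suffix = pattern.search(line).groups()
    has_tag = prefix is not None  # '' still means //p// was present
    print(repr(line), has_tag, repr(prefix), repr(suffix))
```

The pattern expects a space after the colon, so a bare "MIS Notes:" line would not match it at all and search() would return None.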