Python findall, group and pipe

import re

x = "type='text'"
re.findall("([A-Za-z]+)='(.*?)'", x)  # this works like a charm and produces
                                      # [('type', 'text')]

However, my problem is that I'd like to use a pipe (alternation) so that the same regex also applies to

x = 'type="text"' # see the quotes

Basically, the following regex should work but with findall it results in something strange:

([A-Za-z]+)=('(.*?)'|"(.*?)")

And I can't use ['"] instead of a pipe because it may produce bad results with a value such as:

value="hey there what's up?"

Now, how can I build such a regex that would apply to either single or double quotes? By the way, please do not suggest any HTML or XML parsers as I'm not interested in them.


shlex would do a better job here, but if you insist on re, use ([A-Za-z]+)=(?P<quote>['"])(.+?)(?P=quote)
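Both suggestions can be sketched roughly as follows. The function names parse_attrs_re and parse_attrs_shlex are made up for illustration, and the shlex variant assumes the input is a whitespace-separated list of name=value tokens with shell-style quoting, which may not hold for your data.

```python
import re
import shlex

# Pattern from the answer: the opening quote is captured as "quote" and the
# same character must close the value, so a stray ' inside "..." is harmless.
ATTR_RE = re.compile(r"""([A-Za-z]+)=(?P<quote>['"])(.+?)(?P=quote)""")

def parse_attrs_re(text):
    # findall would return 3-tuples (name, quote, value); finditer lets us
    # keep only the name (group 1) and the value (group 3).
    return [(m.group(1), m.group(3)) for m in ATTR_RE.finditer(text)]

def parse_attrs_shlex(text):
    # shlex.split removes the quotes using POSIX shell rules; each token is
    # then split on the first "=" (hypothetical helper, not part of shlex).
    pairs = []
    for token in shlex.split(text):
        name, _, value = token.partition("=")
        pairs.append((name, value))
    return pairs

print(parse_attrs_re("type='text'"))
print(parse_attrs_re('value="hey there what\'s up?"'))
print(parse_attrs_shlex('type="text" value="hey there what\'s up?"'))
```

Note that (.+?) requires at least one character, so an empty value such as type='' is skipped by the regex version; use (.*?) if empty values should match.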


The problem is that ([A-Za-z]+)=('(.*?)'|"(.*?)") has four groups while you need only two (this is probably why the results looked strange: findall returns one tuple entry per group, and a group that did not take part in the match comes back as an empty string). If you use ([A-Za-z]+)=('.*?'|".*?") it should be all right. Remember that you can turn a group into a non-capturing one with (?:...), so this would be equivalent: ([A-Za-z]+)=('(?:.*?)'|"(?:.*?)").
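To make the difference concrete, here is a small sketch comparing the two patterns on a sample string (the variable names are chosen only for this illustration):

```python
import re

x = "type='text' TYPE=\"TEXT\""

# Four capturing groups: name, quoted value, single-quoted body, double-quoted body.
four_group_result = re.findall(r"""([A-Za-z]+)=('(.*?)'|"(.*?)")""", x)
print(four_group_result)
# [('type', "'text'", 'text', ''), ('TYPE', '"TEXT"', '', 'TEXT')]

# Two capturing groups: name and quoted value (quotes still attached).
two_group_result = re.findall(r"""([A-Za-z]+)=('.*?'|".*?")""", x)
print(two_group_result)
# [('type', "'text'"), ('TYPE', '"TEXT"')]

# Drop the first and last character of each value to remove the quotes.
stripped = [(name, value[1:-1]) for name, value in two_group_result]
print(stripped)
# [('type', 'text'), ('TYPE', 'TEXT')]
```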

EDIT: I've just realised that this solution includes the surrounding quotes, which you don't want. You can easily strip them off, though. You could also use a backreference, but then you get one extra group (the quote character), which has to be removed at the end, for example:

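import re
from operator import itemgetter

x = "type='text' TYPE=\"TEXT\""
# Group 2 captures the opening quote and \2 requires the same closing quote;
# itemgetter(0, 2) keeps only the name and the value from each match.
print(list(map(itemgetter(0, 2), re.findall("([A-Za-z]+)=(['\"])(.*?)\\2", x))))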

gives [('type', 'text'), ('TYPE', 'TEXT')].
