What kind of regex would I use to match this?

I have several strings which look like the following:

<some_text> TAG[<some_text>@11.22.33.44] <some_text>

I want to get the ip_address and only the ip_address from this line. (For the sake of this example, assume that the ip address will always be in this format xx.xx.xx.xx)

Edit: I'm afraid I wasn't clear.

The strings will look something like this:

<some_text> TAG1[<some_text>@xx.xx.xx.xx] <some_text> TAG2[<some_text>@yy.yy.yy.yy] <some_text>

Note that some_text can be of variable length. I need to associate a different regex with each tag so that when r.group() is called, the IP address is returned. In the above case the regexes would not differ, but it is a bad example.

The regexes I have tried so far have been inadequate.

Ideally, I would like something like this:

r = re.search('(?<=TAG.*@)(\d\d.\d\d.\d\d.\d\d)', line)

where line is in the format specified above. However, this does not work because you need to have a fixed width look-behind assertion.
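For anyone who wants to reproduce the problem, here is a minimal sketch of the failure with the standard re module. The variable name lookbehind_error is only for illustration, and the dots are escaped here, unlike in the attempt above.

```python
import re

# The look-behind contains `.*`, whose width is not fixed, so the
# standard library `re` module refuses to compile the pattern.
try:
    re.compile(r'(?<=TAG.*@)\d\d\.\d\d\.\d\d\.\d\d')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(lookbehind_error)
```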

Additionally, I have tried non-capturing groups as such:

r = re.search('(?<=TAG\[)(?:.*@)(\d\d.\d\d.\d\d.\d\d)', line)

However, I cannot use this because r.group() will return some_text@xx.xx.xx.xx

I understand that r.group(1) will return just the IP address. Unfortunately, the script I am writing requires that every regex return the correct result from a plain r.group() call.

What kind of regex could I use for this situation?

The code is in python.

Note: All of the some_text can be variable length


Try re.search(r'(?<=@)\d\d\.\d\d\.\d\d\.\d\d(?=\])', line).
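A quick sketch of that suggestion, assuming a made-up sample line. Both the look-behind and the look-ahead are a single character wide, so r.group() contains only the address.

```python
import re

# Hypothetical sample line in the format described in the question.
line = "some text TAG[user@11.22.33.44] more text"

# (?<=@) and (?=\]) assert the surrounding characters without consuming them,
# so the overall match is just the address.
r = re.search(r'(?<=@)\d\d\.\d\d\.\d\d\.\d\d(?=\])', line)
ip = r.group() if r else None
print(ip)
```

Note that this is not tied to any particular tag: on a line with several tags it returns the first address it finds.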

In fact, re.search(r'\d\d\.\d\d\.\d\d\.\d\d', line) may get you what you need if the only occurrence of the xx.xx.xx.xx format in the strings being checked is in those IP address sections.

EDIT: As stated in my comment, to find all occurrences of the wanted pattern in a string, you just do re.findall(pattern_to_match, line). So in this case, re.findall(r'\d\d\.\d\d\.\d\d\.\d\d', line) (or more generally, re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)).
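As a rough illustration of the findall approach, with an invented two-tag line:

```python
import re

# Hypothetical sample line with two tags.
line = "a TAG1[x@11.22.33.44] b TAG2[y@192.168.0.1] c"

# With no capture groups in the pattern, findall returns the matched strings.
ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)
print(ips)
```

Keep in mind that \d{1,3} only checks the shape of the address; it does not check that each number is in the range 0 to 255.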

EDIT 2: From your comment, this should work (with tagname being the tag of the IP address you currently want).

r = re.search(tagname + r'\[.+?@(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})', line)

And then you'd just refer to it with r.group("ip") like psmears said.
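Here is a small sketch of the per-tag lookup wrapped in a function. The helper name ip_for_tag and the sample line are made up, and re.escape is my addition so that a tag name containing regex metacharacters is matched literally.

```python
import re

def ip_for_tag(tagname, line):
    """Return the IP address inside tagname[...@ip], or None if absent."""
    pattern = (re.escape(tagname)
               + r'\[.+?@(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')
    m = re.search(pattern, line)
    return m.group("ip") if m else None

# Hypothetical sample line with two tags.
line = "a TAG1[x@11.22.33.44] b TAG2[y@55.66.77.88] c"
print(ip_for_tag("TAG1", line))
print(ip_for_tag("TAG2", line))
```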

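...In fact, there's an easy way to make the regex a bit more concise. It is also a bit looser: because each dot is optional, it will accept a trailing dot, or four runs of digits with no dots between them.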

r = re.search(tagname + r'\[.+?@(?P<ip>(?:\d{1,3}\.?){4})', line)

In fact, you could even do this:

r = re.findall(r'(?P<tag>\S+)\[.+?@(?P<ip>(?:\d{1,3}\.?){4})', line)

Which would return you a list containing the tags and their associated IP addresses, and so you wouldn't have to recheck any one string once you found the matches if you wanted to refer to the IP address of a different tag from the same string.
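To show the shape of that result, a short sketch with an invented sample line. When the pattern has more than one group, findall returns one tuple per match.

```python
import re

# Hypothetical sample line with two tags.
line = "a TAG1[x@11.22.33.44] b TAG2[y@55.66.77.88] c"

# Two groups in the pattern, so each match becomes a (tag, ip) tuple.
pairs = re.findall(r'(?P<tag>\S+)\[.+?@(?P<ip>(?:\d{1,3}\.?){4})', line)
print(pairs)
```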

...In fact, going two steps further (farther?), you could do the following:

r = dict((m.group("tag"), m.group("ip")) for m in re.finditer(r'(?P<tag>\S+)\[.+?@(?P<ip>(?:\d{1,3}\.?){4})', line))

Or with a dict comprehension (Python 2.7 and later, including Python 3):

r = {m.group("tag"): m.group("ip") for m in re.finditer(r'(?P<tag>\S+)\[.+?@(?P<ip>(?:\d{1,3}\.?){4})', line)}

And then r would be a dict with the tags as keys and the IP addresses as the respective values.
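A minimal sketch of the tag-to-address mapping, again on a made-up line. The name tag_to_ip is only for illustration.

```python
import re

# Hypothetical sample line with two tags.
line = "a TAG1[x@11.22.33.44] b TAG2[y@55.66.77.88] c"

pattern = re.compile(r'(?P<tag>\S+)\[.+?@(?P<ip>(?:\d{1,3}\.?){4})')

# One pass over the line builds a mapping from each tag to its address.
tag_to_ip = {m.group("tag"): m.group("ip") for m in pattern.finditer(line)}
print(tag_to_ip["TAG2"])
```

If the same tag appears twice on a line, the later address overwrites the earlier one.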


I don't think it's possible to do that - r.group() will always return the whole string that matched, so you're forced to use lookbehind, which as you say must be fixed width.

Instead, I'd suggest modifying the script that you're writing. I'm guessing that you have a whole load of regexps that it matches, and you don't want to have to specify for each one "this one uses r.group(0)", "this one uses r.group(3)" etc.

In that case, you could use Python's named groups facility: you can name a group in a regular expression like this:

(?P<name>CONTENTS)

then retrieve what matched with r.group("name").

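What I suggest doing in your script is: match the regular expression, then check whether the pattern defines a group named usethis (r.group("usethis") raises IndexError if it does not, so test "usethis" in r.re.groupindex instead). If so, use that group; if not, use r.group() as before.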

That way you can cope with awkward situations like this by specifying the group name usethis in the regexp - but your other regexps don't have to know or care.
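One way this could look in the script, as a rough sketch. The function name extract and the sample line are assumptions, and usethis is just the group name suggested above.

```python
import re

def extract(pattern, line, preferred="usethis"):
    """Return the preferred named group if the pattern defines it,
    otherwise the whole match; None if nothing matches."""
    m = re.search(pattern, line)
    if m is None:
        return None
    # m.group(name) raises IndexError for an undefined group name,
    # so look the name up in the compiled pattern's groupindex first.
    if preferred in m.re.groupindex and m.group(preferred) is not None:
        return m.group(preferred)
    return m.group()

# Hypothetical sample line.
line = "a TAG1[x@11.22.33.44] b"
with_group = extract(r'TAG1\[.+?@(?P<usethis>\d{1,3}(?:\.\d{1,3}){3})', line)
without_group = extract(r'TAG\d', line)
print(with_group, without_group)
```

Patterns that do not name a usethis group keep working unchanged, since the helper falls back to the whole match.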


Why do you want to use named groups or lookbehinds at all? What is wrong with re.search(r'TAG\[.*@(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\]', line)?


Almost, but I think you need to change the .* at the start to .*?, since you may have multiple TAGs on a single line (as in the example).

re.search(r'TAG(\d+)\[.*?@(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\]', line)

The tag ID will be in the first capture group (r.group(1)) and the IP address in the second (r.group(2)).
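To see why the lazy quantifier matters, here is a small comparison on an invented line with two tags. The variable names are only for illustration.

```python
import re

ip = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

# Hypothetical sample line with two unnumbered tags.
line = "a TAG[x@11.22.33.44] b TAG[y@55.66.77.88] c"

# Greedy .* runs to the end of the line and backtracks to the LAST address.
greedy = re.search(r'TAG\[.*@(' + ip + r')\]', line).group(1)
# Lazy .*? stops at the first address after the tag.
lazy = re.search(r'TAG\[.*?@(' + ip + r')\]', line).group(1)
print(greedy, lazy)

# With numbered tags, group(1) is the tag ID and group(2) the address.
numbered = "a TAG1[x@11.22.33.44] b TAG2[y@55.66.77.88] c"
m = re.search(r'TAG(\d+)\[.*?@(' + ip + r')\]', numbered)
tag_id, first_ip = m.group(1), m.group(2)
print(tag_id, first_ip)
```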
