Python regular expressions: search and replace weirdness

2023-01-17 16:02 Q&A author:

I could really use some help with a Python regular expression problem. You'd expect the result of

import re
re.sub("s (.*?) s", "no", "this is a string")

to be "this is no string", right? But in reality it's "thinotring". The sub function uses the entire pattern as the group to replace, instead of just the group I actually want to replace.

All re.sub examples deal with simple word replacement, but what if you want to change something depending on the rest of the string? Like in my example...

Any help would be greatly appreciated.

Edit:

The look-behind and look-ahead tricks won't work in my case, as those need to be fixed width. Here is my actual expression:

re.sub(r"<a.*?href=['\"]((?!http).*?)['\"].*?>", 'test', string)

I want to use it to find all links in a string that don't begin with http, so I can put a prefix in front of those links (to make them absolute rather than relative).

Your regex matches everything from the "s " at the end of "this" to the " s" at the start of "string" (that is, "s is a s"), so if you replace the match with "no", you get "thinotring".

The parentheses don't limit the match; they capture the text matched by whatever is inside them in a numbered group, which you can refer to with a backreference. In your example, group number 1 would contain "is a". You can refer to it later in the same regex, or in the replacement string, using a backslash and the group number: \1.
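To make the difference between the whole match and the captured group concrete, here is a small sketch using the pattern and input from the question. Nothing is assumed beyond the standard re module.

```python
import re

text = "this is a string"
m = re.search(r"s (.*?) s", text)

print(repr(m.group(0)))  # the whole match: 's is a s'
print(repr(m.group(1)))  # only the captured group: 'is a'

# re.sub always replaces group(0), the whole match, never just group(1).
result = re.sub(r"s (.*?) s", "no", text)
print(result)  # thinotring
```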

What you probably want is lookaround:

re.sub(r"(?<=s ).*?(?= s)", "no", "this is a string")

(?<=s ) means: Assert that it is possible to match "s " (an s followed by a space) immediately before the current position in the string, but don't make it part of the match.

Same for (?= s), but it asserts that the string will continue with " s" after the current position.
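As a quick check of what the lookaround version consumes, here is a minimal sketch on the question's input. Note that the first "s " in the string belongs to "this", so the text that gets replaced is "is a".

```python
import re

text = "this is a string"
pattern = r"(?<=s ).*?(?= s)"

# The "s " before and the " s" after are only asserted, not consumed.
m = re.search(pattern, text)
print(repr(m.group(0)))  # 'is a'

replaced = re.sub(pattern, "no", text)
print(replaced)  # this no string
```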

Be advised that lookbehind in Python is limited to strings of fixed length. So if that is a problem, you can sort of work around this using...backreferences!

re.sub(r"(s ).*?( s)", r"\1no\2", "this is a string")

OK, this is a contrived example, but it shows what you can do. From your edit, it's becoming apparent that you're trying to parse HTML with regex. Now that is not such a good idea. Search SO for "regex html" and you'll see why.

If you still want to do it:

re.sub(r"(<a.*?href=['\"])((?!http).*?['\"].*?>)", r'\1http://\2', string)

might work. But this is extremely brittle.
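Here is a rough sketch of that capture-group approach on concrete input, with the double quotes inside the raw string escaped so the pattern is valid Python. The sample html and the prefix are made up for illustration, and a replacement function is used instead of a template string so that the prefix is inserted literally.

```python
import re

# Hypothetical sample input and prefix, for illustration only.
html = '<a href="/about">About</a> <a href="http://example.com/x">X</a>'
prefix = "http://example.com"

# Group 1: everything up to and including the opening quote of href.
# Group 2: the (non-http) URL, its closing quote, and the rest of the tag.
pattern = r"(<a.*?href=['\"])((?!http).*?['\"].*?>)"

# Rebuild the tag with the prefix inserted between the two groups.
result = re.sub(pattern, lambda m: m.group(1) + prefix + m.group(2), html)
print(result)
# <a href="http://example.com/about">About</a> <a href="http://example.com/x">X</a>
```

This is still as brittle as the regex itself: unusual attribute order, line breaks inside tags, or unquoted attribute values can all break it.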

Use (?<=...) and (?=...) to match parts of the string but not replace them:

re.sub("(?<=s )(.*?)(?= s)", "no", "this is a string")

EDIT: This returns "this no string", so not quite what you want... :-(

For your updated question, try this:

re.sub(r"(?<=href=['\"])((?!http).*?)(?=['\"].*?>)", 'test', string)

Isn't it enough to check href=" before a link?

Your expression, while nasty looking, does work. The problem is that you are not capturing the result of re.sub: it returns the replaced string and doesn't modify the string passed as a parameter.

import re

new_string = re.sub(r"<a.*?href=['\"]((?!http).*?)['\"].*?>", 'test', string)
print(new_string)

Check it here on IDEone.com: http://ideone.com/ufaTw
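In case the IDEone link is unavailable, here is a self-contained sketch of the same point. The sample string is made up for illustration. Python strings are immutable, so the input is untouched unless you keep the return value.

```python
import re

# Hypothetical sample input, for illustration only.
string = '<a class="nav" href="/home">Home</a>'
pattern = r"<a.*?href=['\"]((?!http).*?)['\"].*?>"

re.sub(pattern, "test", string)   # return value discarded
print(string)                     # unchanged: <a class="nav" href="/home">Home</a>

new_string = re.sub(pattern, "test", string)
print(new_string)                 # testHome</a>
```

The second output also shows that the whole opening tag is replaced, not just the captured URL.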

BTW, you're probably better off using Beautiful Soup or similar to systematically search and replace HTML; using regex for this is a bad idea.
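As a rough illustration of the parser-based route, here is a sketch that uses the standard library's html.parser rather than Beautiful Soup, so it runs without installing anything. The class name RelativeLinkFinder and the sample html are made up for this example. It only finds the relative links and does not rewrite them.

```python
from html.parser import HTMLParser


class RelativeLinkFinder(HTMLParser):
    """Collects href values of <a> tags that do not start with 'http'."""

    def __init__(self):
        super().__init__()
        self.relative_links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and not value.startswith("http"):
                self.relative_links.append(value)


# Hypothetical sample input, for illustration only.
html = (
    '<p><a href="/about">About</a> '
    "<a title=\"x\" href='http://example.com'>Ext</a> "
    '<a href="docs/index.html">Docs</a></p>'
)

finder = RelativeLinkFinder()
finder.feed(html)
print(finder.relative_links)  # ['/about', 'docs/index.html']
```

Unlike the regex, this does not care about attribute order or which quote style a tag uses.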

It's a pretty standard regex system - the only problem with it is that the syntax is much wordier than Perl. O:-)

Another option would be to use [^>]* instead of .*, since you only want results that are contained within a single link tag. That could fail if you have a link that has multiple hrefs (as far as I know that shouldn't happen), but otherwise it would work.
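To show why this matters, here is a small sketch comparing the two variants. The sample html is made up: an absolute link followed by a relative one. With .*? the match can start at the first tag and run on into the second; with [^>]* it cannot leave the tag it started in.

```python
import re

# Hypothetical sample input: an absolute link followed by a relative one.
html = '<a href="http://example.com">Ext</a> <a href="/about">About</a>'

dot_pattern = r"<a.*?href=['\"]((?!http).*?)['\"].*?>"
tag_pattern = r"<a[^>]*?href=['\"]((?!http)[^>]*?)['\"][^>]*>"

# .*? may cross tag boundaries to find an href that passes (?!http).
print(re.search(dot_pattern, html).group(0))
# <a href="http://example.com">Ext</a> <a href="/about">

# [^>]* stops at the first '>', so the match stays inside one tag.
print(re.search(tag_pattern, html).group(0))
# <a href="/about">
```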

Ok, look-around was possible, just needed a small rewrite. This works:

def absolutize(string, prefix):
    return re.sub(r"(?<=href=['\"])((?!http).*?)(?=['\"])", prefix+r'\1', string)
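For anyone who wants to try it, here is a self-contained usage sketch. The function is repeated so the snippet runs on its own, and the sample html and prefix are made up for illustration.

```python
import re


def absolutize(string, prefix):
    # Look-behind for href=" or href=' (fixed width), then capture a URL
    # that does not start with http, up to the next quote.
    return re.sub(r"(?<=href=['\"])((?!http).*?)(?=['\"])", prefix + r"\1", string)


# Hypothetical sample input and prefix, for illustration only.
html = '<a href="/about">About</a> <a href="http://example.com/x">X</a>'
result = absolutize(html, "http://example.com")
print(result)
# <a href="http://example.com/about">About</a> <a href="http://example.com/x">X</a>
```

Two caveats: the prefix is used as part of a replacement template, so any backslashes in it would be interpreted by re.sub, and anything that doesn't start with http (an empty href, "#top", "mailto:...") gets the prefix too.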

Still, stupid Python regex system... :(
