Regex regarding symbols in URLs

2023-01-15 16:48

I want to collapse consecutive repeated symbols into just one, for example turning

this is a dog???

into

this is a dog?

I'm using

import re
text = re.sub(r"([^\s\w])(\s*\1)+", r"\1", text)  # raw strings, so \1 stays a backreference

However, I noticed that this also collapses repeated symbols inside URLs that may occur in my text, such as

http://example.com/this--is-a-page.html

Can someone give me some advice on how to alter my regex?

So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.

Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.

Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).

So you might come up with something like

(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+

which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
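To illustrate the lookbehind limitation, here is a minimal sketch assuming the standard library re module. The two patterns are simplified stand-ins for the one above, not the full expression.

```python
import re

# A fixed-width lookbehind ("http" is always 4 characters) compiles fine.
fixed = re.compile(r"(?<!http)-+")

# A variable-width lookbehind (\S+ can match any length) is rejected
# by the re module when the pattern is compiled.
try:
    re.compile(r"(?<!\bhttp\S+)-+")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(lookbehind_error)
```

Regex engines that do allow variable-width lookbehind would accept the second pattern, so this restriction is specific to the engine in use.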

All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
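A rough sketch of that parse, modify, write-back idea, assuming the standard library html.parser is good enough for the page in question. The class name BodyTextCollapser and the sample input are made up for illustration, and the markup is re-emitted only approximately (end tags are normalized, text is re-escaped).

```python
import re
from html import escape
from html.parser import HTMLParser

# The asker's pattern: one symbol followed by repeats of itself.
REPEATED = re.compile(r"([^\s\w])(?:\s*\1)+")


class BodyTextCollapser(HTMLParser):  # hypothetical helper name
    """Re-emit HTML, collapsing repeated symbols in text nodes only."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        if self.skip_depth:
            # Leave script and style content untouched.
            self.out.append(data)
        else:
            # Text arrives unescaped, so escape it again after the substitution.
            self.out.append(escape(REPEATED.sub(r"\1", data), quote=False))


html_in = "<p>this is a dog???</p><script>if (a && b) {}</script>"
parser = BodyTextCollapser()
parser.feed(html_in)
parser.close()  # flush any buffered text
result = "".join(parser.out)
print(result)
```

This only keeps the regex away from tags, attributes and scripts. URLs that appear in the visible text would still be collapsed, so the URL question remains.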

EDIT:

OK, you're already working on the parsed text, but it still might contain URLs.

Then try the following:

result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex

    # Either match a URL 
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])

    # or
    |

    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""", 

    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)

Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...

Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
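One possible way around the unmatched-group error is to pass a function as the replacement instead of a template string, so the group that did not participate is never expanded. The sketch below assumes a reasonably recent Python 3, where \s* followed by a backreference inside a repeated group compiles. The names collapse and collapse_repeats and the sample string are made up for illustration. (Since Python 3.5, re.sub replaces unmatched groups in a template with an empty string, so the template version above may also work there.)

```python
import re

PATTERN = re.compile(
    r"""(?ix)  # case-insensitive, verbose regex
    # Either match a URL (protocol optional if it starts with www. or ftp.)
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
        [-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    |
    # or match a repeated non-word character
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+
    """
)


def collapse(match):
    # Exactly one of the two named groups took part in the match;
    # the other one is None, so "or" picks whichever matched.
    return match.group("URL") or match.group("rpt")


def collapse_repeats(subject):
    return PATTERN.sub(collapse, subject)


sample = "this is a dog??? see http://example.com/this--is-a-page.html !!"
print(collapse_repeats(sample))
```

URLs are matched first and written back unchanged, which is why the double hyphen in the path survives while the repeated punctuation outside the URL is collapsed.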
