Match all "http" only URLs without additional characters

2023-01-07 03:56 Q&A author:

I have tried the below expressions.

(http:\/\/.*?)['\"\< \>]


(http:\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;\"]*[-a-zA-Z0-9+&@#\/%=~_|\"])

The first one works well, but the match always includes the extra terminating character after the URL.

E.g.:

http://domain.com/path.html" 

http://domain.com/path.html<

Notice the trailing " and <.

I don't want them included with the URLs.

You can use lookahead instead of making ['\"\< >] part of your match, i.e.:

(http:\/\/.*?)(?=['\"\< >])

Generally speaking, whereas ab matches ab, a(?=b) matches a (if it's followed by b).
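
To make the difference concrete, here is a minimal sketch in Python's re module (the question doesn't say which engine is in use, so Python is an assumption, and the sample text is made up around the two URLs from the question). It compares the original consuming pattern with the lookahead variant:

```python
import re

# Sample text is invented for illustration; the URLs are the ones from the question.
text = 'see http://domain.com/path.html" and http://domain.com/path.html< here'

# Original attempt: the terminator is consumed, so it is part of the overall match.
# (Python does not need the forward slashes escaped.)
consuming = re.compile(r"""(http://.*?)['"< >]""")

# Lookahead variant: the terminator must follow, but it is not consumed.
lookahead = re.compile(r"""http://.*?(?=['"< >])""")

with_terminator = [m.group(0) for m in consuming.finditer(text)]
without_terminator = lookahead.findall(text)

print(with_terminator)
print(without_terminator)
```

The first list still carries the trailing " and <; the second contains only the URLs.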

References

regular-expressions.info/Lookarounds

Capturing group option

Lookarounds are not supported by all flavors. More widely supported are capturing groups.

Generally speaking, whereas (a)b still matches ab, it also captures a in group 1.
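
As a rough illustration of the capturing group route, again assuming Python's re module and a made-up sample string, the original pattern from the question already works if you read group 1 instead of the whole match:

```python
import re

# Hypothetical input; only the pattern comes from the question.
text = 'href="http://domain.com/path.html" <a>http://domain.com/other.html</a>'

pattern = re.compile(r"""(http://.*?)['"< >]""")

first = pattern.search(text)
whole = first.group(0)  # overall match, terminator included
url = first.group(1)    # group 1, just the URL

# With exactly one capturing group, findall returns the group 1 text of each match.
all_urls = pattern.findall(text)

print(whole)
print(url)
print(all_urls)
```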

References

regular-expressions.info/Round Brackets for Grouping

Negated character class option

Depending on the need, a negated character class is often much better than a reluctant .*? (followed, in this case, by a lookahead to assert the terminator pattern).

Let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with three patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.

We use the following as input:

eeAiiZooAuuZZeeeZZfff

We use 3 different patterns:

A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
- This is the greedy variant; group 1 matched and captured iiZooAuuZZeee
A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
- This is the reluctant variant; group 1 matched and captured iiZooAuu
A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com)
- This is the negated character class variant; group 1 matched and captured uu

Here's a visual representation of what they matched:

         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
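
The three results above can be reproduced with a short script. This is a sketch using Python's re module rather than the original ideone.com snippets, which aren't shown here; the dictionary keys are just labels chosen for this example:

```python
import re

text = "eeAiiZooAuuZZeeeZZfff"

# Labels are illustrative; the patterns are the three from the answer.
patterns = {
    "greedy": r"A(.*)ZZ",
    "reluctant": r"A(.*?)ZZ",
    "negated": r"A([^Z]*)ZZ",
}

results = {}
for name, pattern in patterns.items():
    m = re.search(pattern, text)
    # Store (whole match, group 1) for each variant.
    results[name] = (m.group(0), m.group(1))
    print(f"{name:9} match={m.group(0)!r} group1={m.group(1)!r}")
```

Note that the negated character class variant cannot start at the first A: [^Z]* stops at the lone Z after ii, the following ZZ cannot match there, and so the engine moves on to the second A.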

References

regular-expressions.info/Character Class and Repetition: An Alternative to Laziness

Related questions

Difference between .*? and .* for regex

You need to use "(?=regex)" (a lookahead), which checks that a particular pattern follows but doesn't include it in the result:

http:\/\/.*?(?=['\"\< >])

Hmmm, I'd probably do this simply by saying "keep going until you get an unwanted character", like so:

http://[^'"< >]*

Escaped version (based on Q - not sure what engine this is):

http:\/\/[^'\"\< >]*

However, the lookahead solution by polygenelubricants is more flexible if some of those characters might appear inside the URL (but not at the end): the lookahead can be extended to describe the terminator more precisely.
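
For comparison, here is a small sketch of the negated character class next to the lookahead version, assuming Python's re module and an invented input string. One practical difference it shows: the negated class also matches a URL that runs to the end of the input, whereas the lookahead version requires one of the terminator characters to actually follow.

```python
import re

# Hypothetical input: one URL inside an attribute, one at the very end of the text.
text = 'a <a href="http://domain.com/path.html">x</a> and http://domain.com/end'

# "Keep going until you get an unwanted character."
negated = re.compile(r"""http://[^'"< >]*""")

# Lookahead version from the other answers, for comparison.
lookahead = re.compile(r"""http://.*?(?=['"< >])""")

negated_urls = negated.findall(text)
lookahead_urls = lookahead.findall(text)

print(negated_urls)    # both URLs
print(lookahead_urls)  # only the URL that is followed by a terminator
```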
