How can I match a URL but exclude terminators from the match?

2023-02-19 21:02 问答作者：

I want to match urls in text and replace them with anchor tags, but I want to exclude some terminators just like how Twitter matches urls in tweets.

So far I've got this, but it's obviously not working too well.

(http[s]?\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?)

EDIT: Some example urls. In all cases below I only want to match "http://www.example.com"

http://www.example.com.

http://www.example.com:

"http://www.example.com"

http://www.example.com;

http://www.example.c开发者_运维问答om!

[http://www.example.com]

{http://www.example.com}

http://www.example.com*

I looked into this very issue last year and developed a solution that you may want to look at - See: URL Linkification (HTTP/FTP) This link is a test page for the Javascript solution with many examples of difficult-to-linkify URLs.

My regex solution, written for both PHP and Javascript - (but could easily be translated to Ruby) is not simple (but neither is the problem as it turns out.) For more information I would recommend also reading:

The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber

The comments following Jeff's blog post are a must read if you want to do this right...

Ruby's URI module has a extract method that is used to parse out URLs from text. Parsing the returned values lets you piggyback on the heuristics in the module to extract the scheme and host information from a URL, avoiding reinventing the wheel.

text = '
http://www.example.com.
http://www.example.com:
"http://www.example.com"
http://www.example.com;
http://www.example.com!
[http://www.example.com]
{http://www.example.com}
http://www.example.com*
http://www.example.com/foo/bar?q=foobar
http://www.example.com:81
'

require 'uri'

puts URI::extract(text).map{ |u| uri = URI.parse(u); "#{ uri.scheme }://#{ uri.host[/(^.+?)\.?$/, 1] }" }

# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com

The only gotcha, is that a period '.' is a legitimate character in a host name, so URI#host won't strip it. Those get caught in the map statement where the URL is rebuilt. Note that URI is stripping off the path and query information.

A pragmatic and easy understandable solution is:

regex = %r!"(https?://[-.\w]+\.\w{2,6})"!

Some notes:

With %r we can choose the start and end delimiter. In this case I used exclamation mark, since I want to use slash unescaped in the regex.
The optional quantifier (i.e. '?') binds only to the preceding expression, in this case 's'. There's no need to put the 's' in a character class [s]?. It's the same as s?.
Inside the character class [-.\w] we don't need to escape dash and dot in order to make them match dot and dash literally. Dash should be first, however, to not mean range.
\w matches [A-Za-z0-9_] in Ruby. It's not exactly the full definition of URL characters, but combined with dash and dot it may be enough for our needs.
Top domains are between 2 and 6 characters long, e.g. '.se' and '.travel'
I'm not sure what you mean by I want to exclude some terminators but this regex matches only the wanted one in your example.
We want to use the first capture group, e.g. like this:

if input =~ %r!"(https?://[-.\w]+.\w{2,6})"!

match = $~[1]

else

match = ""

end

What about this?

%r|https?://[-\w.]*\w|

继续阅读：regex ruby

How can I match a URL but exclude terminators from the match?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？