Find a url for a file in html using a regular expression

2023-01-31 00:16

I've set myself a somewhat ambitious first task in learning regular expressions (and one which relates to a problem I'm trying to solve). I need to find every URL that ends in .m4v in a big HTML string.

My first attempt was this, for jpg files:

http.*jpg

Which seems correct at first glance, but of course returns stuff like this:

http://domain.com/page.html" title="Misc"><img src="http://domain.com/image.jpg

Which does match the expression in theory. So really, I need to put something in http.*m4v that says 'only the closest instance between http and m4v'. Any ideas?

As you've noticed, an expression such as the following is greedy:

http:.*\.jpg

That means it reads as much input as possible while satisfying the expression.

It's the "*" operator that makes it greedy. There's a standard regex technique for making it non-greedy: put the "?" modifier after the "*".

http:.*?\.jpg

Now it will match as little as possible while still satisfying the expression (i.e. it will stop searching at the first occurrence of ".jpg").
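To make the difference concrete, here is a minimal Python sketch comparing the two patterns. The sample string and the variable names are made up for illustration; the same patterns should behave the same way in PHP's preg functions.

```python
import re

# Hypothetical sample: two image URLs in one HTML string.
html = '<img src="http://a.com/one.jpg"><img src="http://a.com/two.jpg">'

# Greedy: ".*" runs to the end of the string, then backs up to the LAST ".jpg".
greedy = re.findall(r'http:.*\.jpg', html)

# Non-greedy: ".*?" grows one character at a time and stops at the FIRST ".jpg".
lazy = re.findall(r'http:.*?\.jpg', html)

print(greedy)  # ['http://a.com/one.jpg"><img src="http://a.com/two.jpg']
print(lazy)    # ['http://a.com/one.jpg', 'http://a.com/two.jpg']
```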

Of course, if you have a .jpg in the middle of a URL, like:

http://mydomain.com/some.jpg-folder/foo.jpg

It will not match the full URL.

You'll want to define the end of the URL as something that can't be considered part of the URL, such as a space, a new line, or (if the URL is nested inside parentheses) a closing parenthesis. However, this can't be solved with just one little regex when the URL is embedded in written language, since URLs there are often ambiguous.
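For the HTML case in the question, one way to sketch this idea in Python is to allow only characters that cannot end a URL inside markup. The delimiter set (whitespace, quotes, angle brackets), the name M4V_URL, and the sample string are assumptions for illustration, not a complete URL grammar.

```python
import re

# Assumption: inside HTML, a URL ends at whitespace, a quote, or an angle bracket.
# The negative lookahead requires ".m4v" to be the very end of the URL.
M4V_URL = re.compile(r'''https?://[^\s"'<>]+\.m4v(?![^\s"'<>])''', re.IGNORECASE)

# Hypothetical sample: a page link followed by a video whose path contains ".m4v" twice.
html = ('<a href="http://domain.com/page.html" title="Misc">'
        '<video src="http://domain.com/some.m4v-folder/clip.m4v"></video></a>')

found = M4V_URL.findall(html)
print(found)  # ['http://domain.com/some.m4v-folder/clip.m4v']
```

Because quotes are excluded from the character class, a match can no longer start at page.html and run on into a later attribute, and the ".m4v" in the middle of the path no longer cuts the URL short. A URL with a query string after ".m4v" would not match under these assumptions.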

Take for example:

At this page, http://mysite.com/puppy.html, there's a cute little puppy dog.

The comma could technically be a part of a URL. You have to deal with a lot of ambiguities like this when looking for URLs in written text, and it's hard not to have bugs due to the ambiguities.

EDIT | Here's an example of a regex in PHP that is a quick and dirty solution, being greedy only where needed and trying to deal with the English language:

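<?php

// Sample sentence: the comma after "ball" is punctuation, not part of the URL.
$str = "Checkout http://www.foo.com/test?items=bat,ball, for info about bats and balls";

// Scheme, host labels, then a greedy path. The lookahead requires whitespace
// or punctuation right after the path, and \b forces the match to end on a
// word character, so the trailing comma is left out.
preg_match('/https?:\/\/([a-zA-Z0-9][a-zA-Z0-9-]*)(\.[a-zA-Z0-9-]+)*((\/[^\s]*)(?=[\s\.,;!\?]))\b/i', $str, $matches);

var_dump($matches);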

It outputs:

array(5) {
  [0]=>
  string(38) "http://www.foo.com/test?items=bat,ball"
  [1]=>
  string(3) "www"
  [2]=>
  string(4) ".com"
  [3]=>
  string(20) "/test?items=bat,ball"
  [4]=>
  string(20) "/test?items=bat,ball"
}

The explanation is in the comments.

Perl, Ruby, PHP and JavaScript should all work with this:

/(http:\/\/(?:(?:(?!http:\/\/).))+\.jpg)/

The URLs will be stored in the matched groups. Tested this out against "http://a.com/b.jpg-folder/c.jpg http://mydomain.com/some.jpg-folder/foo.jpg" and it worked correctly without being too greedy.
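As a rough check of the same idea, here is a Python rendering of the lookahead pattern run against that test string and against the fragment from the question. The variable names are illustrative only.

```python
import re

# Each "." is consumed only if "http://" does not start at that position,
# so a single match can never run across the start of another URL.
pattern = re.compile(r'http://(?:(?!http://).)+\.jpg')

two_urls = "http://a.com/b.jpg-folder/c.jpg http://mydomain.com/some.jpg-folder/foo.jpg"
fragment = 'http://domain.com/page.html" title="Misc"><img src="http://domain.com/image.jpg'

print(pattern.findall(two_urls))
# ['http://a.com/b.jpg-folder/c.jpg', 'http://mydomain.com/some.jpg-folder/foo.jpg']
print(pattern.findall(fragment))
# ['http://domain.com/image.jpg']
```

In the second case the attempt starting at page.html fails because no ".jpg" appears before the next "http://", so only the image URL is returned.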
