Problem getting all URLs in a page using regex

2023-03-29 03:10

I have a web page's source code stored in $page, and I need to extract all the URLs from it.

The problem is that some URLs are not in an <a> tag's href attribute but in JavaScript code.

For example, I want to extract all the URLs from this source code:

    Click <a style="vertical-align:middle;cursor:pointer;text-decoration:underline;color:red;" onClick="return downme('http://www.AAAAA.com/atnbc1i7b/part1.html')">

            Here</a> to go to download page

<a href="http://www.UUUU.com/register">Hi all</a>

I use this regex:

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $page, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        print_r($match);
    }
}

The output contains only

http://www.UUUU.com/register

but the other link,

http://www.AAAAA.com/atnbc1i7b/part1.html

does not appear.

Please help. Thanks.

In the first example you have:

<a href="http://www.UUUU.com/register">

so the regexp works.

But in the second you have:

<a style="vertical-align:middle;cursor:pointer;text-decoration:underline;color:red;" onClick="return downme('http://www.AAAAA.com/atnbc1i7b/part1.html')">

which does not match, because your regexp is:

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

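As you can see, the regexp requires href=.

Change href= to onClick= and try again; that should get this tag to match. Note that the onClick value here is return downme('...'), so the URL is wrapped in extra text and the capture group ([^\" >]*?) would need adjusting as well to pick up the URL itself.

If you need both href and onClick, use (href|onClick).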
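As a rough illustration of the (href|onClick) idea, here is a minimal sketch. It is not the asker's original pattern: it assumes attribute values are double-quoted and that only http/https URLs are wanted, and the names $pattern and $found are just illustrative.

```php
<?php
// Sample markup from the question.
$page = <<<'HTML'
Click <a style="vertical-align:middle;cursor:pointer;text-decoration:underline;color:red;" onClick="return downme('http://www.AAAAA.com/atnbc1i7b/part1.html')">
Here</a> to go to download page
<a href="http://www.UUUU.com/register">Hi all</a>
HTML;

// Assumption: attribute values are double-quoted.
// Group 1: the attribute name (href or onClick, matched case-insensitively).
// Group 2: the first http(s) URL inside that attribute's value.
$pattern = '~<a\s[^>]*?\b(href|onclick)\s*=\s*"[^"]*?(https?://[^"\'\s)]+)~i';

$found = [];
if (preg_match_all($pattern, $page, $matches, PREG_SET_ORDER)) {
    // Keep only the URL capture from each match.
    $found = array_column($matches, 2);
}

foreach ($found as $url) {
    echo $url, "\n";
}
```

Because the pattern skips any text between the opening quote and the URL, it picks the URL out of return downme('...') as well as out of a plain href value. Single-quoted or unquoted attributes would need a different pattern.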

Instead of matching on <a href, try matching on the URL itself:

$regexp = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

I haven't been able to test this, but if you run your file through it, it should match anything that resembles a URL, whether it's in an href, an onclick, or just in the text.

EDIT: found a better regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls

URL: find in full text (protocol optional). This matches URLs like www.domain.com and ftp.domain.com without the http: or ftp: protocol. The final character class makes sure that if a URL is part of some text, punctuation such as a comma or full stop after the URL is not interpreted as part of the URL.

$html = <<< EOF
Click <a style="vertical-align:middle;cursor:pointer;text-decoration:underline;color:red;" onClick="return downme('http://www.AAAAA.com/atnbc1i7b/part1.html')">
Here</a> to go to download page
<a href="http://www.UUUU.com/register">Hi all</a>
EOF;

preg_match_all('/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i', $html, $urls, PREG_PATTERN_ORDER);
// $urls[0] holds the full matches; print one URL per line.
foreach ($urls[0] as $url) {
    echo $url, "\n";
}

/* Output:
http://www.AAAAA.com/atnbc1i7b/part1.html
http://www.UUUU.com/register
*/
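If the same URL can appear more than once in a page, it may help to wrap the pattern above in a small function that removes duplicates. The sketch below reuses that pattern unchanged; the function name extract_urls and the sample text (including www.example.com) are made up for illustration.

```php
<?php
/**
 * Hypothetical helper: returns the unique URL-like strings found in $text,
 * in order of first appearance.
 */
function extract_urls(string $text): array
{
    // Same pattern as in the answer above.
    $pattern = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i';

    // preg_match_all returns 0 for no matches and false on error.
    if (!preg_match_all($pattern, $text, $matches)) {
        return [];
    }

    // $matches[0] holds the full matches; drop duplicates and reindex.
    return array_values(array_unique($matches[0]));
}

// Illustrative input: one URL repeated, one without a protocol.
$sample = 'See http://www.UUUU.com/register, or www.example.com. Again: http://www.UUUU.com/register';
print_r(extract_urls($sample));
```

The trailing comma and full stop in the sample are left out of the matches because the final character class of the pattern does not allow them.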
