
Regex to extract URLs

I have several web pages to parse, and they contain links like these:

<a href="/news/monde/0,,3204267-VU5WX0lEIDUy,00.html" class="S48">Jean-Paul II opéré "avec succès" (24/02/2005)</a>

<a href="javascript:VerifCookie('4','/news/economie/0,,3204461-VU5WX0lEIDUy,00.html',700,600,52);" class="S48">Que peut-il se passer si le pape est incapable d'assurer sa tâche ? (24/02/2005)</a>

As you can see, the second one wraps the URL in a JavaScript call. I want to get rid of that wrapper while still matching the first type of link. So I wrote a regex in Perl:

/<a href="[^\/]*?([^<']+?)[^"]*?" class="S48">([^<>]+?)<\/a>/

to capture the URL without the JavaScript wrapper, along with the title. But this regex only captures the title correctly; the URL it captures is just "/" or "j".

Any suggestions?


This regex:

!<a\s*href\s*=\s*".*?(/.*\.html).*"\s+class="S48">([^<>]+?)</a>!i

applied to your input produces these results for group 1:

/news/monde/0,,3204267-VU5WX0lEIDUy,00.html
/news/economie/0,,3204461-VU5WX0lEIDUy,00.html

and these for group 2:

Jean-Paul II opéré "avec succès" (24/02/2005)
Que peut-il se passer si le pape est incapable d'assurer sa tâche ? (24/02/2005)
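
To check these results yourself, here is a minimal sketch that runs the same pattern over both sample links. It is written in Python rather than Perl (the syntax used by this pattern is the same in both), and the names `LINK_RE`, `SAMPLE` and `extract_links` are made up for the illustration.

```python
import re

# Same pattern as above; Perl's !...!i delimiters become re.IGNORECASE.
LINK_RE = re.compile(
    r'<a\s*href\s*=\s*".*?(/.*\.html).*"\s+class="S48">([^<>]+?)</a>',
    re.IGNORECASE,
)

# The two links from the question, one per line.
SAMPLE = """<a href="/news/monde/0,,3204267-VU5WX0lEIDUy,00.html" class="S48">Jean-Paul II opéré "avec succès" (24/02/2005)</a>
<a href="javascript:VerifCookie('4','/news/economie/0,,3204461-VU5WX0lEIDUy,00.html',700,600,52);" class="S48">Que peut-il se passer si le pape est incapable d'assurer sa tâche ? (24/02/2005)</a>"""


def extract_links(html):
    """Return a list of (url, title) tuples, one per matching link."""
    return LINK_RE.findall(html)


if __name__ == "__main__":
    for url, title in extract_links(SAMPLE):
        print(url, "|", title)
```

Because `.` does not match a newline by default, each match stays within one line, so this relies on every link sitting on its own line as in the sample.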

Of course, this only works with your specific input. I would strongly suggest avoiding regexes for parsing .xml, .html, .xsl and similar files. There are far better tools for this job, such as a real HTML or XML parser.
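
As one illustration of the parser-based route, the sketch below uses Python's standard-library `html.parser` to find the anchors and only falls back to a small regex to pull the path out of the `href` value. It assumes the links of interest are `<a>` tags whose `class` is exactly `S48` and whose path ends in `.html`; `S48LinkParser`, `PATH_RE` and `parse_links` are hypothetical names, not part of any library.

```python
import re
from html.parser import HTMLParser

# Assumption: the wanted path starts at the first "/" in the href and ends in ".html".
PATH_RE = re.compile(r"/[^'\"]*\.html")


class S48LinkParser(HTMLParser):
    """Collects (path, title) pairs for <a class="S48"> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text chunks seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "S48":
            self._href = attributes.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            match = PATH_RE.search(self._href)
            if match:
                title = "".join(self._text).strip()
                self.links.append((match.group(0), title))
            self._href = None


def parse_links(html):
    """Parse an HTML string and return the collected (path, title) pairs."""
    parser = S48LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

The parser takes care of attribute quoting, spacing and attribute order, so the regex only has to handle the small, predictable `href` string rather than the whole tag.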

A much shorter version:

/.*?(\/.*\.html).*?>([^<]+)/i

It produces the same results.
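
As for why the pattern in the question captures only "/" or "j": the lazy group `([^<']+?)` is satisfied by a single character, and the `[^"]*?` that follows is free to consume the rest of the attribute. The sketch below reproduces that in Python and then tries one possible repair in the same style. `FIXED` is a hypothetical variant written for this illustration, not something taken from the question, and it assumes the path always ends in `.html`.

```python
import re

# The pattern from the question (Perl's \/ escapes are not needed here).
ORIGINAL = re.compile(
    r'''<a href="[^/]*?([^<']+?)[^"]*?" class="S48">([^<>]+?)</a>'''
)

# Hypothetical repair: skip everything up to the first "/", then capture
# from that "/" through ".html" without crossing a quote character.
FIXED = re.compile(
    r'''<a href="[^"/]*(/[^"']+\.html)[^"]*" class="S48">([^<>]+?)</a>'''
)

PLAIN = '<a href="/news/monde/0,,3204267-VU5WX0lEIDUy,00.html" class="S48">Title A</a>'
WRAPPED = (
    "<a href=\"javascript:VerifCookie('4','/news/economie/0,,3204461-VU5WX0lEIDUy,00.html',700,600,52);\""
    ' class="S48">Title B</a>'
)


def first_url(pattern, html):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    for link in (PLAIN, WRAPPED):
        print(first_url(ORIGINAL, link), "->", first_url(FIXED, link))
```

The titles here are shortened placeholders; only the `href` values come from the question.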
