Regex to extract URLs
I have several web pages to parse, and there are links like
<a href="/news/monde/0,,3204267-VU5WX0lEIDUy,00.html" class="S48">Jean-Paul II opéré "avec succès" (24/02/2005)</a>
<a href="javascript:VerifCookie('4','/news/economie/0,,3204461-VU5WX0lEIDUy,00.html',700,600,52);" class="S48">Que peut-il se passer si le pape est incapable d'assurer sa tâche ? (24/02/2005)</a>
As you can see, the second one wraps the URL in a JavaScript call. I want to get rid of that wrapper while still matching the first type. So I wrote a regex in Perl:
/<a href="[^\/]*?([^<']+?)[^"]*?" class="S48">([^<>]+?)<\/a>/
to capture the URL without the JavaScript wrapper, and also the title. But this regex only captures the title correctly; the URL it captures is just "/" or "j".
Any suggestions?
This regex:
!<a\s*href\s*=\s*".*?(/.*\.html).*"\s+class="S48">([^<>]+?)</a>!i
applied to your input produces these results for group 1:
/news/monde/0,,3204267-VU5WX0lEIDUy,00.html
/news/economie/0,,3204461-VU5WX0lEIDUy,00.html
and these for group 2:
Jean-Paul II opéré "avec succès" (24/02/2005)
Que peut-il se passer si le pape est incapable d'assurer sa tâche ? (24/02/2005)
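For reference, here is a minimal sketch of how the regex above could be applied in a Perl script. The helper name extract_links and the here-doc holding the two sample links are assumptions made for illustration; only the pattern itself comes from this answer.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not from the question): collects [url, title]
# pairs for every class="S48" link found in the given HTML string.
sub extract_links {
    my ($html) = @_;
    my @links;
    # /g lets the loop continue from the end of the previous match.
    # "." does not match a newline, so each match stays on one line.
    while ($html =~ m!<a\s*href\s*=\s*".*?(/.*\.html).*"\s+class="S48">([^<>]+?)</a>!ig) {
        push @links, [$1, $2];
    }
    return @links;
}

# The two sample links from the question.
my $html = <<'HTML';
<a href="/news/monde/0,,3204267-VU5WX0lEIDUy,00.html" class="S48">Jean-Paul II opéré "avec succès" (24/02/2005)</a>
<a href="javascript:VerifCookie('4','/news/economie/0,,3204461-VU5WX0lEIDUy,00.html',700,600,52);" class="S48">Que peut-il se passer si le pape est incapable d'assurer sa tâche ? (24/02/2005)</a>
HTML

binmode STDOUT, ':encoding(UTF-8)';
for my $link (extract_links($html)) {
    print "$link->[0] => $link->[1]\n";
}
```

The lazy .*? skips everything up to the first "/", which is how the javascript:VerifCookie(...) prefix gets dropped without a separate case for the plain links.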
Of course, this only works for your specific input. I would strongly suggest avoiding regexes for parsing .xml, .html, .xsl, etc. There are far better tools for this job.
There is also a much shorter version:
/.*?(\/.*\.html).*?>([^<]+)/i
It produces the same results.
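A small sketch of using the shorter pattern one line at a time, assuming each link sits on its own line as in the question. The helper name parse_link_line is hypothetical.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not from the question): applies the shorter
# pattern to one line and returns (url, title), or an empty list
# when the line does not match.
sub parse_link_line {
    my ($line) = @_;
    return $line =~ /.*?(\/.*\.html).*?>([^<]+)/i ? ($1, $2) : ();
}

# The two sample links from the question, one per element.
my @lines = (
    q{<a href="/news/monde/0,,3204267-VU5WX0lEIDUy,00.html" class="S48">Jean-Paul II opéré "avec succès" (24/02/2005)</a>},
    q{<a href="javascript:VerifCookie('4','/news/economie/0,,3204461-VU5WX0lEIDUy,00.html',700,600,52);" class="S48">Que peut-il se passer si le pape est incapable d'assurer sa tâche ? (24/02/2005)</a>},
);

binmode STDOUT, ':encoding(UTF-8)';
for my $line (@lines) {
    my ($url, $title) = parse_link_line($line) or next;
    print "$url => $title\n";
}
```

Note that this pattern no longer checks for class="S48" or even for an a tag, so it will match any line where a "/"-prefixed path ending in .html is followed by a ">" and some text. That is fine for input shaped like the samples, but it is looser than the longer pattern.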