What regex can I use to extract URLs from a Google search?

2022-12-17 17:27 问答作者：

I'm using Delphi with the JCLRegEx and want to capture all the result URL's from a google search. I looked at HackingSearch.com and they have an example RegEx that looks right, but I cannot get any results when I try it.

I'm using it similar to:

Var re:JVCLRegEx;
    I:Integer; 
Begin
  re := TJclRegEx.Create;

  With re do try
    Compile('class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/开发者_如何转开发div><[li|\/ol]',false,false);  
    If match(memo1.lines.text) then begin
      For I := 0 to captureCount -1 do
        memo2.lines.add(captures[1]);
    end;
  finally free;
  end;
  freeandnil(re);
end;

Regex is available at hackingsearch.com

I'm using the Delphi Jedi version, since everytime I install TPerlRegEx I get a conflict with the two...

Offtopic: You can try Google AJAX Search API: http://code.google.com/apis/ajaxsearch/documentation/

Below is a relevant section from Google search results for the term python tuple. (I modified it to fit the screen here by adding new lines here and there, but I tested your regex on the raw string obtained from Google's source as revealed by Firebug). Your regex gave no matches for this string.

<li class="g w0">
  <h3 class="r">
    <a onmousedown="return rwt(this,'','','res','2','AFQjCNG5WXSP8xy6BkJFyA2Emg8JrFW2_g','&amp;sig2=4MpG_Ib3MrwYmIG6DbZjSg','0CBUQFjAB')" 
      class="l" href="http://www.korokithakis.net/tutorials/python">Learn <em>Python</em> in 10 minutes | Stavros's Stuff</a>
  </h3>
  <span style="display: inline-block;">
    <button class="w10">
    </button>
    <button class="w20">
    </button>
  </span>
  <span class="m">&nbsp;<span dir="ltr">- 2 visits</span>&nbsp;<span dir="ltr">- Jan 21</span></span>
  <div class="s">
  The data structures available in <em>python</em> are lists, <em>tuples</em>
   and dictionaries. Sets are available in the sets library (but are built-in in <em>
  Python</em> 2.5 and <b>...</b><br>
  <cite>
    www.korokithakis.net/tutorials/<b>
    python</b>
     - 
  </cite>
  <span class="gl">
    <a onmousedown="return rwt(this,'','','clnk','2','AFQjCNFVaSJCprC5enuMZ9Nt7OZ8VzDkMg','&amp;sig2=4qxw5AldSTW70S01iulYeA')" 
      href="http://74.125.153.132/search?q=cache:oeYpHokMeBAJ:www.korokithakis.net/tutorials/python+python+tuple&amp;cd=2&amp;hl=en&amp;ct=clnk&amp;client=firefox-a">
      Cached
    </a>
     - <button title="Comment" class="wci">
    </button>
    <button class="w4" title="Promote">
    </button>
    <button class="w5" title="Remove">
    </button>
  </span>
  </div>
  <div class="wce">
  </div>
  <!--n-->
  <!--m-->
</li>

FWIW, I guess one of the many reasons is that there is no <Va> in this result at all. I copied the full html source from Firebug and tried to match it with your regex - didn't get any match at all.

Google might change the way they display the results from time to time - at a given time, it can vary depending on factors like your logged in status, web history etc. The particular regex you came up with might be working for you for now, but in the long run it will become difficult to maintain. People suggest using html parser instead of giving a regex because they know that the solution won't be stable.

If you need to debug regular expressions in any language you need to look at RegExBuddy, its not free but it will pay for itself in a day.

class=r?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>

works for now.

继续阅读：delphi html-parsing jvcl regex

What regex can I use to extract URLs from a Google search?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？