Extract Google Search Results
I would like to periodically check what sub-domains are being listed by Google.
To obtain a list of sub-domains, I type 'site:example.com' into the Google search box - this lists all the sub-domain results (over 20 pages for our domain).
What is the best way to extract only the URLs returned by the 'site:example.com' search?
I was thinking of writing a little Python script that will run the above search and regex the URLs out of the search results (repeating on all result pages). Is this a good start? Could there be a better methodology?
Cheers.
Regex is a bad idea for parsing HTML. It's cryptic to read and relies on well-formed HTML.
Try BeautifulSoup for Python. Here's an example script that returns URLs from the first 10 pages of a site:domain.com Google query.
import sys     # Used to add the BeautifulSoup folder to the import path
import urllib2 # Used to read the HTML document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder at the level of this Python script,
    ### so I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### The "start" variable will be used to iterate through 10 pages.
    for start in range(0, 10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start * 10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### It looks like Google wraps result URLs in <cite> tags,
        ### so for each <cite> tag on each page, print its contents (the URL).
        for cite in soup.findAll('cite'):
            print cite.text
Output:
stackoverflow.com/
stackoverflow.com/questions
stackoverflow.com/unanswered
stackoverflow.com/users
meta.stackoverflow.com/
blog.stackoverflow.com/
chat.meta.stackoverflow.com/
...
Of course, you could append each result to a list so you can parse it for subdomains. I just got into Python and scraping a few days ago, but this should get you started.
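For example, here is a minimal sketch of that last step (illustrative only; it assumes you appended each cite text to a results list instead of printing it):

# "results" stands in for the list you would build in the scraping loop above.
results = [
    "stackoverflow.com/questions",
    "meta.stackoverflow.com/",
    "chat.meta.stackoverflow.com/",
]

# Reduce the collected URLs to the set of unique (sub)domains.
subdomains = set()
for url in results:
    host = url.split('/')[0]   # keep only the hostname part
    subdomains.add(host)

for host in sorted(subdomains):
    print(host)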
The Google Custom Search API can deliver results in Atom XML format.
See: Getting Started with Google Custom Search
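As a minimal sketch of querying that API with requests, assuming you have created an API key and a custom search engine ID (cx) in the Google developer console (note that the current JSON API returns JSON rather than Atom):

import requests

API_KEY = "YOUR_API_KEY"    # placeholder - create one in the Google developer console
ENGINE_ID = "YOUR_CX_ID"    # placeholder - your custom search engine ID

params = {
    "key": API_KEY,
    "cx": ENGINE_ID,
    "q": "site:example.com",
}
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)

# Each organic result is an entry in the "items" array with a "link" field.
for item in response.json().get("items", []):
    print(item["link"])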
Another way of doing it is with requests and bs4:
import requests
from bs4 import BeautifulSoup   # the 'lxml' parser below requires lxml to be installed

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'site:minecraft.fandom.com'}

html = requests.get('https://www.google.com/search',
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

# Each organic result sits in a div with class "tF2Cxc";
# its first <a> tag holds the result URL.
for container in soup.find_all('div', class_='tF2Cxc'):
    link = container.find('a')['href']
    print(link)
Output:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
To get these results from each page using pagination:
from bs4 import BeautifulSoup
import requests, urllib.parse

def print_extracted_data_from_url(url):
    headers = {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    # The current page number is shown in the element with class "YyVfkd".
    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')
    print()

    for container in soup.find_all('div', class_='tF2Cxc'):
        head_link = container.a['href']
        print(head_link)

    # The "Next" pagination link has id "pnnext"; this returns None on the last page.
    return soup.select_one('a#pnnext')

def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com')

    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com', next_page_node['href'])
        next_page_node = print_extracted_data_from_url(next_page_url)

scrape()
Part of the output:
Results via beautifulsoup
Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
Alternatively, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "site:minecraft.fandom.com",
    "api_key": os.getenv('API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
Output:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
Using pagination:
import os
from serpapi import GoogleSearch

def scrape():
    params = {
        "engine": "google",
        "q": "site:minecraft.fandom.com",
        "api_key": os.getenv("API_KEY"),
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(f"Current page: {results['serpapi_pagination']['current']}")

    for result in results["organic_results"]:
        print(f"Title: {result['title']}\nLink: {result['link']}\n")

    # Keep requesting the next page until the API reports no 'next' link.
    while 'next' in results['serpapi_pagination']:
        search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
        results = search.get_dict()

        print(f"Current page: {results['serpapi_pagination']['current']}")

        for result in results["organic_results"]:
            print(f"Title: {result['title']}\nLink: {result['link']}\n")

scrape()
Disclaimer, I work for SerpApi.