Extract Google Search Results
I would like to periodically check what sub-domains are being listed by Google.
To obtain a list of sub-domains, I type 'site:example.com' into the Google search box - this lists all the sub-domain results (over 20 pages for our domain).
What is the best way to extract only the URLs returned by the 'site:example.com' search?
I was thinking of writing a little Python script that will run the above search and regex the URLs out of the search results (repeating on all result pages). Is this a good start? Could there be a better methodology?
Cheers.
Regex is a bad idea for parsing HTML. It's cryptic to read and relies on well-formed HTML.
Try BeautifulSoup for Python. Here's an example script that returns URLs from the first 10 pages of a site:domain.com Google query.
import sys     # Used to add the BeautifulSoup folder to the import path
import urllib2 # Used to read the HTML document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder at the level of this Python script,
    ### so I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### The "start" variable will be used to iterate through 10 pages.
    for start in range(0, 10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start * 10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### It looks like Google wraps result URLs in <cite> tags,
        ### so for each <cite> tag on each page, print its contents (the URL).
        for cite in soup.findAll('cite'):
            print cite.text
Output:
stackoverflow.com/
stackoverflow.com/questions
stackoverflow.com/unanswered
stackoverflow.com/users
meta.stackoverflow.com/
blog.stackoverflow.com/
chat.meta.stackoverflow.com/
...
Of course, you could append each result to a list so you can parse it for subdomains. I just got into Python and scraping a few days ago, but this should get you started.
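For example, here is a minimal sketch of that last step (illustrative only; it assumes you appended each cite text to a results list instead of printing it):

# "results" stands in for the list you would build in the scraping loop above.
results = [
    "stackoverflow.com/questions",
    "meta.stackoverflow.com/",
    "chat.meta.stackoverflow.com/",
]

# Reduce the collected URLs to the set of unique (sub)domains.
subdomains = set()
for url in results:
    host = url.split('/')[0]   # keep only the hostname part
    subdomains.add(host)

for host in sorted(subdomains):
    print(host)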
The Google Custom Search API can deliver results in Atom XML format.
See: Getting Started with Google Custom Search
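As a minimal sketch of querying that API with requests, assuming you have created an API key and a custom search engine ID (cx) in the Google developer console (note that the current JSON API returns JSON rather than Atom):

import requests

API_KEY = "YOUR_API_KEY"    # placeholder - create one in the Google developer console
ENGINE_ID = "YOUR_CX_ID"    # placeholder - your custom search engine ID

params = {
    "key": API_KEY,
    "cx": ENGINE_ID,
    "q": "site:example.com",
}
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)

# Each organic result is an entry in the "items" array with a "link" field.
for item in response.json().get("items", []):
    print(item["link"])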
Another way of doing it is with requests and bs4:
import requests
from bs4 import BeautifulSoup   # the 'lxml' parser below requires lxml to be installed

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'site:minecraft.fandom.com'}

html = requests.get('https://www.google.com/search',
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

# Each organic result sits in a div with class "tF2Cxc";
# its first <a> tag holds the result URL.
for container in soup.find_all('div', class_='tF2Cxc'):
    link = container.find('a')['href']
    print(link)
Output:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
To get these results from each page using pagination:
from bs4 import BeautifulSoup
import requests, urllib.parse

def print_extracted_data_from_url(url):
    headers = {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    # The current page number is shown in the element with class "YyVfkd".
    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')
    print()

    for container in soup.find_all('div', class_='tF2Cxc'):
        head_link = container.a['href']
        print(head_link)

    # The "Next" pagination link has id "pnnext"; this returns None on the last page.
    return soup.select_one('a#pnnext')

def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com')

    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com', next_page_node['href'])
        next_page_node = print_extracted_data_from_url(next_page_url)

scrape()
Part of the output:
Results via beautifulsoup
Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
Alternatively, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "site:minecraft.fandom.com",
    "api_key": os.getenv('API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
Output:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
Using pagination:
import os
from serpapi import GoogleSearch

def scrape():
    params = {
        "engine": "google",
        "q": "site:minecraft.fandom.com",
        "api_key": os.getenv("API_KEY"),
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(f"Current page: {results['serpapi_pagination']['current']}")

    for result in results["organic_results"]:
        print(f"Title: {result['title']}\nLink: {result['link']}\n")

    # Keep requesting the next page until the API reports no 'next' link.
    while 'next' in results['serpapi_pagination']:
        search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
        results = search.get_dict()

        print(f"Current page: {results['serpapi_pagination']['current']}")

        for result in results["organic_results"]:
            print(f"Title: {result['title']}\nLink: {result['link']}\n")

scrape()
Disclaimer, I work for SerpApi.