Python 3 web scraping options
I'm new to Python so I'm sorry if this is a newbie question.
I'm trying to build a program involving web scraping, and I've noticed that Python 3 seems to have significantly fewer web-scraping modules than the Python 2.x series.
Beautiful Soup, mechanize, and scrapy -- the three modules recommended to me -- all seem to be incompatible with Python 3.
I'm wondering if anyone on this forum has a good option for web scraping using Python 3.
Any suggestions would be greatly appreciated.
Thanks, Will
lxml.html works on Python 3, and gets you HTML parsing, at least.
BeautifulSoup 4, which is in the works, should support Python 3 (I've done some work on this).
I'm kind of new to this too, but I found BeautifulSoup 4 to be really good, and I'm learning and using it together with the requests and lxml modules. The requests module fetches the URL, and lxml does the parsing (you can also use the built-in html.parser for parsing, but lxml is generally faster).
Simple usage is:
import requests
from bs4 import BeautifulSoup
url = 'someUrl'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
A slightly less simple example, collecting the hrefs from the HTML:
links = set()
for link in soup.find_all('a'):
    # skip <a> tags that have no href attribute
    if 'href' in link.attrs:
        links.add(link['href'])
Then you will have a set of the unique links from your URL.
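If you want to try the link collection without hitting a real site, here is a minimal sketch that parses an inline HTML string instead. The sample markup, the base URL http://example.com/, and the helper name extract_links are all made up for illustration. It uses the built-in html.parser so lxml is not required, and urljoin from the standard library to turn relative hrefs into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample page; in real use this would be response.text
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="/about">About (duplicate)</a>
  <a href="http://example.org/docs">Docs</a>
  <a name="top">anchor without href</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return the set of absolute URLs found in <a href=...> tags."""
    soup = BeautifulSoup(html, 'html.parser')
    links = set()
    # href=True skips <a> tags that have no href attribute
    for a in soup.find_all('a', href=True):
        links.add(urljoin(base_url, a['href']))
    return links


links = extract_links(SAMPLE_HTML, 'http://example.com/')
print(sorted(links))
```

Because the result is a set, the duplicate /about link only shows up once, and the anchor without an href is ignored.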
Another example shows how to parse specific parts of the HTML, e.g. if you wish to parse all <p> tags that have the class testClass:
list_of_p = []
for p in soup.find_all('p', {'class': 'testClass'}):
    # iterating over a tag yields its child nodes (text and nested tags)
    for item in p:
        list_of_p.append(item)
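If you only want the text of each matching paragraph rather than its child nodes, something like the following sketch might be closer to what you need. The sample markup is invented for illustration; class_ is the keyword BeautifulSoup 4 provides for filtering on the class attribute, and get_text() flattens nested tags into plain text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
SAMPLE_HTML = """
<div>
  <p class="testClass">First <b>bold</b> paragraph</p>
  <p class="other">Skipped</p>
  <p class="testClass extra">Second paragraph</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# class_ matches any <p> whose list of classes contains testClass
paragraphs = [p.get_text() for p in soup.find_all('p', class_='testClass')]
print(paragraphs)
```

Note that a tag with several classes (like "testClass extra" here) still matches, since the filter checks each class individually.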
There is a lot more you can do with it, and it is about as easy as it looks.