Python 3 web scraping options

I'm new to Python so I'm sorry if this is a newbie question.

I'm trying to build a program that involves web scraping, and I've noticed that Python 3 seems to have significantly fewer web-scraping modules than the Python 2.x series.

Beautiful Soup, mechanize, and scrapy, the three modules recommended to me, all seem to be incompatible with Python 3.

I'm wondering if anyone on this forum has a good option for web scraping using Python 3.

Any suggestions would be greatly appreciated.

Thanks, Will


lxml.html works on Python 3, and gets you HTML parsing, at least.

BeautifulSoup 4, which is in the works, should support Python 3 (I've done some work on this).


I'm fairly new to this, but I found BeautifulSoup 4 to be really good, and I'm learning and using it together with the requests and lxml modules. The requests module fetches the URL, and lxml does the parsing (you can also use the built-in html.parser for parsing, but I believe lxml is faster).

Simple usage is:

import requests
from bs4 import BeautifulSoup

url = 'someUrl'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'lxml')

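A slightly more involved example, collecting the hrefs from the HTML:

links = set()
for link in soup.find_all('a'):
    # Skip <a> tags that have no href attribute.
    if 'href' in link.attrs:
        # Store the href value itself, not the whole tag.
        links.add(link['href'])

You then have a set of the unique link targets found on the page.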
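If installing third-party packages is not an option, the same link collection can be sketched with only the Python 3 standard library. This is a minimal sketch, not a replacement for BeautifulSoup: the class name LinkCollector, the sample HTML, and the example.com base URL are made up for illustration, and urljoin is used here to turn relative hrefs into absolute URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the unique, absolute href values of all <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative links against the page URL.
                self.links.add(urljoin(self.base_url, value))


# Hypothetical sample page; in practice this would be the fetched HTML text.
sample_html = '''
<a href="/about">About</a>
<a href="/about">About again</a>
<a href="https://example.org/x">External</a>
<a name="anchor">No href</a>
'''

collector = LinkCollector('https://example.com/index.html')
collector.feed(sample_html)
unique_links = sorted(collector.links)
print(unique_links)
```

The duplicate /about link is stored once because links is a set, and the anchor without an href is skipped.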

Another example shows how to parse specific parts of the HTML, e.g. if you want to collect the contents of all <p> tags that have the class testClass:

list_of_p = []
for p in soup.find_all('p', {'class': 'testClass'}):
    # Iterating over a tag yields its direct children (strings and nested tags).
    for item in p:
        list_of_p.append(item)
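The loop above collects the child nodes of each matching <p>, which can be a mix of strings and nested tags. If only the text is wanted, a variant along these lines may be closer to what you need. It is a sketch on a made-up HTML snippet, uses the built-in html.parser so that lxml is not required, and the names sample_html and paragraph_texts are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used instead of a fetched page.
sample_html = '''
<p class="testClass">First <b>bold</b> paragraph</p>
<p class="other">Skipped</p>
<p class="testClass extra">Second paragraph</p>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches any <p> whose class list contains "testClass";
# get_text() flattens nested tags into a single string.
paragraph_texts = [p.get_text() for p in soup.find_all('p', class_='testClass')]
print(paragraph_texts)
```

Note that the third paragraph matches even though it carries a second class, because BeautifulSoup compares against each class in the attribute separately.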

There is a lot more you can do with it, and it is as easy as it looks.
