Getting BeautifulSoup to find a specific <p>
I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph.
The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html.
I can't get the abstract out of that page, however. I'm searching for everything between the <p class="lead">...</p>
tags, but I can't seem to figure out how to isolate them. I thought it would be something simple like:
from BeautifulSoup import BeautifulSoup
import re
import urllib2
address="http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
abstract = soup.find('p', attrs={'class' : 'lead'})
print abstract
Using Python 2.5, BeautifulSoup 3.0.8, running this returns 'None'. I have no option of using anything else that needs to be compiled/installed (like lxml). Is BeautifulSoup confused, or am I?
The HTML on that page is badly malformed: xml.dom.minidom cannot parse it, and BeautifulSoup does not parse it well either.
I removed the <!-- ... --> comment blocks and parsed the result again with BeautifulSoup. That works better, and soup.find('p', attrs={'class' : 'lead'}) now returns the paragraph.
Here's the code I tried:
>>> html =re.sub(re.compile("<!--.*?-->",re.DOTALL),"",html)
>>>
>>> soup=BeautifulSoup(html)
>>>
>>> soup.find('p', attrs={'class' : 'lead'})
<p class="lead">The class of exotic Jupiter-mass planets that orb .....
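The comment-stripping step above can be tried offline before pointing it at the live page. This is a minimal sketch using only the standard library; SAMPLE_HTML and strip_comments are hypothetical names made up for illustration, and the sample markup is not the real Nature page.

```python
import re

# Non-greedy match of one HTML comment; DOTALL lets it span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with every <!-- ... --> block removed."""
    return COMMENT_RE.sub("", html)


# Hypothetical sample: a comment that hides a decoy lead paragraph.
SAMPLE_HTML = (
    "<html><body>\n"
    "<!-- navigation\n"
    "     <p class=\"lead\">not the abstract</p> -->\n"
    "<p class=\"lead\">The abstract text.</p>\n"
    "</body></html>"
)

cleaned = strip_comments(SAMPLE_HTML)
```

The cleaned string can then be handed to BeautifulSoup(cleaned) exactly as in the session above. Whether this is enough for other journal sites depends on how their markup is broken, so treat it as a first thing to try rather than a general fix.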
Here's a non-BeautifulSoup way to get the abstract:
import urllib2
address="http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
html = urllib2.urlopen(address).read()
# Split on closing </p> tags and keep the chunk that opens the lead paragraph.
for para in html.split("</p>"):
    if '<p class="lead">' in para:
        abstract=para.split('<p class="lead">')[1:][0]
        print ' '.join(abstract.split("\n"))
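The same split-based idea can be wrapped in a function so it is easy to test against a small string before fetching anything. This is only a sketch: extract_lead and the sample markup are hypothetical, and it assumes the opening tag appears literally as <p class="lead"> with no extra attributes or different quoting.

```python
def extract_lead(html, open_tag='<p class="lead">'):
    """Return the text of the first paragraph opened by open_tag, or None."""
    for chunk in html.split("</p>"):
        if open_tag in chunk:
            # Keep only what follows the opening tag in this chunk.
            text = chunk.split(open_tag, 1)[1]
            # Collapse newlines and repeated spaces into single spaces.
            return " ".join(text.split())
    return None


# Hypothetical sample markup, not the real Nature page.
SAMPLE = (
    '<div><p>intro</p>\n'
    '<p class="lead">Hot Jupiters\norbit close.</p>'
    '<p>other</p></div>'
)

lead = extract_lead(SAMPLE)
missing = extract_lead("<p>no lead paragraph here</p>")
```

Returning None when nothing matches lets the caller tell "no abstract on this page" apart from an empty abstract. Any tags nested inside the paragraph are left in the returned string.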
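# Note: the class_ keyword needs BeautifulSoup 4 (bs4); it is not available in BeautifulSoup 3.
# recursively_translate, translator and input_lang are defined elsewhere.
to_p_tag = soup.findAll('p', class_='lead')
if(len(to_p_tag) == 0):
    print("<p class='lead' /> not found")
else:
    for p in to_p_tag:
        recursively_translate(translator, p, input_lang)
        # translated_p = translator.translate(to_p_tag.text, dest=input_lang)
        # lxml1 = lxml1.replace(to_p_tag.text,translated_p.text)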