I cannot scrape anything with BeautifulSoup
I'm using BeautifulSoup to scrape some web content.
I'm learning with this example code, but I always get a "None" response.
Code:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html').read())
post = soup.find('div', attrs={'id': 'topmenucontainer'})
print post
Any idea what I'm doing wrong?
Thanks!!
I don't think you are doing anything wrong.
It is the second script tag that is confusing BeautifulSoup. The tag looks like this:
<script type='text/javascript'>
<!--//--><![CDATA[//><!--
var arVersion = navigator.appVersion.split("MSIE")
var version = parseFloat(arVersion[1])
function fixPNG(myImage)
{
if ((version >= 5.5) && (version < 7) && (document.body.filters))
{
var imgID = (myImage.id) ? "id='" + myImage.id + "' " : ""
var imgClass = (myImage.className) ? "class='" + myImage.className + "' " : ""
var imgTitle = (myImage.title) ?
"title='" + myImage.title + "' " : "title='" + myImage.alt + "' "
var imgStyle = "display:inline-block;" + myImage.style.cssText
var strNewHTML = "<span " + imgID + imgClass + imgTitle
+ " style=\"" + "width:" + myImage.width
+ "px; height:" + myImage.height
+ "px;" + imgStyle + ";"
+ "filter:progid:DXImageTransform.Microsoft.AlphaImageLoader"
+ "(src=\'" + myImage.src + "\', sizingMethod='scale');\"></span>"
myImage.outerHTML = strNewHTML
}
}
//--><!]]>
</script>
but BeautifulSoup seems to think it is still inside a comment or something similar, and includes the rest of the file as the content of the script tag.
Try:
print str(soup.findAll('script')[1])[:2000]
and you'll see what I mean.
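To confirm that the target div really is present in the raw markup (so the parser, not the page, is at fault), a small independent check can help. This is only a sketch: it uses Python 3's standard-library `html.parser` rather than BeautifulSoup, and the `IdFinder` class and the shortened `sample` markup are hypothetical stand-ins for the real page.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Hypothetical helper: records whether any start tag carries the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if dict(attrs).get('id') == self.target_id:
            self.found = True


# Shortened stand-in for the page: a script wrapped in the same
# comment/CDATA markers, followed by the div we are looking for.
sample = (
    "<script type='text/javascript'>\n"
    "<!--//--><![CDATA[//><!--\n"
    "var version = 1\n"
    "//--><!]]>\n"
    "</script>\n"
    "<div id='topmenucontainer'>menu</div>"
)

finder = IdFinder('topmenucontainer')
finder.feed(sample)
print(finder.found)
```

If this reports the id as present while `soup.find` returns None, the problem lies in how the script tag is being parsed rather than in the lookup itself.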
If you remove the CDATA markers, the page should parse correctly:
soup = BeautifulSoup(
urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html')
.read()
.replace('<![CDATA[', '').replace('<!]]>', ''))
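The same clean-up can be tried on a plain string before involving the network or the parser. The following is a minimal sketch in Python 3; the helper name `strip_cdata_markers` and the shortened `sample` markup are made up for illustration, and only the two markers from the snippet above are removed.

```python
def strip_cdata_markers(html):
    """Remove the CDATA open/close markers, leaving the script body intact."""
    for marker in ('<![CDATA[', '<!]]>'):
        html = html.replace(marker, '')
    return html


# Shortened stand-in for the problematic part of the page.
sample = (
    "<script type='text/javascript'>\n"
    "<!--//--><![CDATA[//><!--\n"
    "var version = 1\n"
    "//--><!]]>\n"
    "</script>\n"
    "<div id='topmenucontainer'>menu</div>"
)

cleaned = strip_cdata_markers(sample)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup in place of the raw response body.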
Something is odd about your HTML. BeautifulSoup tries its best, but sometimes it just can't parse a page.
Try moving the first <link> element inside the <head>; that might help.
You could try the lxml library instead:
from lxml.html import parse
doc = parse('http://java.sun.com').getroot()
post = doc.cssselect('div#topmenucontainer')