I cannot scrape anything with BeautifulSoup

Posted: 2023-03-29 03:31

I'm using BeautifulSoup to scrape some web content.

I'm learning with this example code, but I always get a "None" response.

Code:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html').read())

post = soup.find('div', attrs={'id': 'topmenucontainer'})

print post

Any idea what I'm doing wrong?

Thanks!!

I don't think you are doing anything wrong.

It is the second script tag that is confusing BeautifulSoup. The tag looks like this:

<script type='text/javascript'>
<!--//--><![CDATA[//><!--
var arVersion = navigator.appVersion.split("MSIE")
var version = parseFloat(arVersion[1])

function fixPNG(myImage) 
{
    if ((version >= 5.5) && (version < 7) && (document.body.filters)) 
    {
       var imgID = (myImage.id) ? "id='" + myImage.id + "' " : ""
       var imgClass = (myImage.className) ? "class='" + myImage.className + "' " : ""
       var imgTitle = (myImage.title) ? 
                     "title='" + myImage.title  + "' " : "title='" + myImage.alt + "' "
       var imgStyle = "display:inline-block;" + myImage.style.cssText
       var strNewHTML = "<span " + imgID + imgClass + imgTitle
                  + " style=\"" + "width:" + myImage.width 
                  + "px; height:" + myImage.height 
                  + "px;" + imgStyle + ";"
                  + "filter:progid:DXImageTransform.Microsoft.AlphaImageLoader"
                  + "(src=\'" + myImage.src + "\', sizingMethod='scale');\"></span>"
       myImage.outerHTML = strNewHTML     
    }
}
//--><!]]>
</script>

but BeautifulSoup seems to think it is still inside a comment or something similar, and it includes the rest of the file as the content of the script tag.

Try:

print str(soup.findAll('script')[1])[:2000]

and you'll see what I mean.

If you remove the CDATA markers, you should find that the page parses correctly:

soup = BeautifulSoup(
    urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html')
    .read()
    .replace('<![CDATA[', '').replace('<!]]>', ''))
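The same cleanup can be sketched as a small standalone helper. This is only an illustration, assuming Python 3 (where `urllib2` no longer exists); the name `strip_cdata_markers` and the sample string are made up for the example, and the two marker strings are the ones used in the answer above.

```python
# Marker strings taken from the answer above; the closing form "<!]]>" is
# the one this particular page uses, not a general rule.
CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "<!]]>"


def strip_cdata_markers(html):
    """Return html with the CDATA open/close markers removed (hypothetical helper)."""
    return html.replace(CDATA_OPEN, "").replace(CDATA_CLOSE, "")


# A shortened stand-in for the script tag quoted earlier.
snippet = "<script>\n<!--//--><![CDATA[//><!--\nvar x = 1\n//--><!]]>\n</script>"
cleaned = strip_cdata_markers(snippet)
print(cleaned)
```

The cleaned string can then be handed to the parser in place of the raw page source, as in the snippet above.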

There is something weird with your HTML. BeautifulSoup tries its best, but sometimes it just can't parse it.

Try moving the first <link> element inside the <head>; that might help.

You could try the lxml library instead:

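from lxml.html import parse
# parse() accepts a URL; here it points at the page from the question
doc = parse('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html').getroot()
# cssselect() returns a list of matching elements (empty if nothing matches)
post = doc.cssselect('div#topmenucontainer')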
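If installing lxml is not an option, one rough way to check whether the element is present in the markup at all is the standard library's `html.parser`. This is a swapped-in technique, not what either answer proposes, and it assumes Python 3; `IdFinder`, `find_tag_by_id` and the sample HTML are hypothetical names and data made up for the sketch.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Remember the first start tag whose id attribute equals wanted_id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found_tag = None  # tag name of the first match, or None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if self.found_tag is None and dict(attrs).get("id") == self.wanted_id:
            self.found_tag = tag


def find_tag_by_id(html, wanted_id):
    """Return the name of the first tag with the given id, or None."""
    finder = IdFinder(wanted_id)
    finder.feed(html)
    finder.close()
    return finder.found_tag


# Made-up sample that mimics the CDATA-wrapped script from the page.
sample = (
    "<html><head><script type='text/javascript'>\n"
    "<!--//--><![CDATA[//><!--\n"
    "var version = 1\n"
    "//--><!]]>\n"
    "</script></head>"
    "<body><div id='topmenucontainer'>menu</div></body></html>"
)
print(find_tag_by_id(sample, "topmenucontainer"))
```

This only confirms that the tag exists in the source; it does not build a tree, so it is a diagnostic aid rather than a replacement for BeautifulSoup or lxml.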
