Cannot prettify scraped html in BeautifulSoup

2022-12-15 14:17 Q&A author:

I have a small script that uses urllib2 to get the contents of a site, finds all the link tags, adds a small piece of HTML at the top and bottom, and then tries to prettify the result. It keeps raising TypeError: sequence item 1: expected string, Tag found. I have looked around, but I can't find the issue. As always, any help is much appreciated.

import urllib2
from BeautifulSoup import BeautifulSoup
import re

reddit = 'http://www.reddit.com'
pre = '<html><head><title>Page title</title></head>'
post = '</html>'
site = urllib2.urlopen(reddit)
html=site.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a')
tags.insert(0,pre)
tags.append(post)
soup1 = BeautifulSoup(''.join(tags))
print soup1.prettify()

This is the trace back:

Traceback (most recent call last):
  File "C:\Python26\bea.py", line 21, in <module>
    soup1 = BeautifulSoup(''.join(tags))
TypeError: sequence item 1: expected string, Tag found

This works for me:

soup1 = BeautifulSoup(''.join(str(t) for t in tags))
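The reason this helps: findAll returns Tag objects rather than strings, and ''.join() accepts only strings, so the join fails on the first Tag it meets (item 1, right after the pre string). Here is a minimal sketch of that behaviour. FakeTag is a hypothetical stand-in, not BeautifulSoup's real Tag class, and the sketch uses only the standard library and Python 3 syntax, while the code in this thread is Python 2.

```python
class FakeTag(object):
    """Hypothetical stand-in for a BeautifulSoup Tag; not the real class."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        # Assumption: like a real Tag, str() gives back the tag's markup.
        return self.markup


pre = '<html><head><title>Page title</title></head>'
post = '</html>'

# Same shape as the list in the question: string, tag object, string.
tags = [pre, FakeTag('<a href="/r/pics/">pics</a>'), post]

# Joining directly fails because item 1 is not a string.
try:
    ''.join(tags)
    join_failed = False
except TypeError:
    join_failed = True

# Converting every item to a string first lets the join succeed.
joined = ''.join(str(t) for t in tags)

print(join_failed)
print(joined)
```

Calling str() on the items that are already strings is harmless, which is why the one-line fix can convert every element without checking its type.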

This pyparsing solution gives some decent output, too:

from pyparsing import makeHTMLTags, originalTextFor, SkipTo, Combine

# makeHTMLTags defines HTML tag patterns for given tag string
aTag,aEnd = makeHTMLTags("A")

# makeHTMLTags by default returns a structure containing
# the tag's attributes - we just want the original input text
aTag = originalTextFor(aTag)
aEnd = originalTextFor(aEnd)

# define an expression for a full link, and use a parse action to
# combine the returned tokens into a single string
aLink = aTag + SkipTo(aEnd) + aEnd
aLink.setParseAction(lambda tokens : ''.join(tokens))

# extract links from the input html
links = aLink.searchString(html)

# build list of strings for output
out = []
out.append(pre)
out.extend(['  '+lnk[0] for lnk in links])
out.append(post)

print '\n'.join(out)

prints:

<html><head><title>Page title</title></head>
  <a href="http://www.reddit.com/r/pics/" >pics</a>
  <a href="http://www.reddit.com/r/reddit.com/" >reddit.com</a>
  <a href="http://www.reddit.com/r/politics/" >politics</a>
  <a href="http://www.reddit.com/r/funny/" >funny</a>
  <a href="http://www.reddit.com/r/AskReddit/" >AskReddit</a>
  <a href="http://www.reddit.com/r/WTF/" >WTF</a>
  .
  .
  .
  <a href="http://reddit.com/help/privacypolicy" >Privacy Policy</a>
  <a href="#" onclick="return hidecover(this)">close this window</a>
  <a href="http://www.reddit.com/feedback" >volunteer to translate</a>
  <a href="#" onclick="return hidecover(this)">close this window</a>
</html>

soup1 = BeautifulSoup(''.join(unicode(tag) for tag in tags))

There was a small syntax error in Jonathan's answer; here's the corrected version:

    soup1 = BeautifulSoup(''.join([unicode(tag) for tag in tags]))
