
How to remove carriage returns and newlines from the title tag of a page? (Google App Engine - Python)

I have this code to extract title:

import urllib
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
title = str(soup.html.head.title.string).lstrip("\r\n").rstrip("\r\n")

Some sites put a carriage return or newline before and after the title text (why?), and to remove them I added:

.lstrip("\r\n").rstrip("\r\n")

This works for instance with http://www.readwriteweb.com/ but not with http://poundwire.com/. Can you tell why one is working and the other is not?

Update

Following up on the comment by Steve Jessop, I'm now using replace and it seems to work:

title = str(soup.html.head.title.string).replace("\t", "").replace("\r", "").replace("\n", "")

Let me know if there is a better way. Thanks.

Update 2

I found this answer and it seems better:

title = " ".join(str(soup.html.head.title.string).split())
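
For reference, a minimal sketch of what the split/join version does. The helper name normalize_title and the sample string are made up for illustration; the only behaviour relied on is that split() with no arguments splits on any run of whitespace.

```python
def normalize_title(raw_title):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, \r, \n) and discards empty pieces at both ends,
    # so joining with a single space collapses everything in one pass.
    return " ".join(raw_title.split())


# Hypothetical title text with leading, trailing and embedded whitespace.
sample = "\r\n\t  Example   Site \n - Home\r\n  "
cleaned = normalize_title(sample)
print(repr(cleaned))
```

Unlike the chained replace() calls, this keeps one space between words that were separated only by a newline or tab, instead of gluing them together.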


Try using str(title).strip(), which trims all whitespace (not just \r and \n) from the start and end of the string.


On poundwire, there's a tab character inside the <title> tag. There are also some spaces (the indenting that you'll probably see if you "view source") which you probably want removed too.
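
To see why the tab matters, here is a small sketch with a made-up title string (not the actual poundwire markup). lstrip and rstrip treat their argument as a set of characters and stop at the first character that is not in that set:

```python
# Hypothetical title text: a newline, then a tab and spaces of indentation.
raw = "\n\t  Some Title\n"

# Only \r and \n are in the strip set, so stripping stops at the tab.
only_crlf = raw.lstrip("\r\n").rstrip("\r\n")

# strip() with no argument removes every kind of leading/trailing whitespace.
all_whitespace = raw.strip()

print(repr(only_crlf))
print(repr(all_whitespace))
```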

Like samplebias says, use strip() to remove whitespace at both ends of the string. And get a text editor with a "visible whitespace" mode, switch that mode on, and never turn it off again, ever :-)

Btw, if you're on Google App Engine that means you're on Python 2.5, which in turn means str is a non-Unicode (byte) string type. BeautifulSoup goes to some lengths to coerce its input into Unicode, so it seems a shame to call str() on the result and throw an exception (a UnicodeEncodeError) when you hit a page whose title contains non-ASCII characters.
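
A rough sketch of that failure and of the alternative, using a made-up title. On Python 2, str() on a unicode object encodes it with the default codec, which is normally ASCII; the explicit encode("ascii") below stands in for that step so the snippet behaves the same on Python 2 and 3:

```python
# Hypothetical non-ASCII title, as BeautifulSoup would hand it back (unicode).
title = u"\n  Caf\u00e9 menu \n"

# Stand-in for Python 2's str(title): encoding to ASCII fails on the e-acute.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Staying in Unicode avoids the problem; split/join/strip work unchanged.
cleaned = u" ".join(title.split())

print(ascii_ok)
print(repr(cleaned))
```

In other words, dropping the str() call and cleaning the unicode value directly should be enough here.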

Edit: a third case:

$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib
>>> soup = BeautifulSoup(urllib.urlopen('http://code.google.com/p/google-refine/'))
>>> soup.html.head.title.string
u'\n google-refine -\n \n \n Google Refine, a power tool for working with messy data (formerly Freebase Gridworks) - Google Project Hosting\n '
>>>

So, the space right at the end means that your rstrip doesn't remove the \n near the end.
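
The same thing as a quick check, using a string shaped like the title in the session above (typed in by hand here, so the exact spacing is an assumption):

```python
title = (u"\n google-refine -\n \n \n Google Refine, a power tool for working "
         u"with messy data (formerly Freebase Gridworks) - Google Project Hosting\n ")

# The last character is a space, which is not in the strip set,
# so rstrip("\r\n") removes nothing at all.
after_rstrip = title.rstrip("\r\n")

# strip() clears both ends but leaves the newlines in the middle.
after_strip = title.strip()

# split/join also collapses the interior runs of whitespace.
collapsed = u" ".join(title.split())

print(after_rstrip == title)
print(repr(collapsed))
```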
