
Simple Python question about urlopen

I am trying to write a program that deletes all the tags in an HTML document, so I wrote this:

import urllib
loc_left = 0
while loc_left != -1 :
    html_code = urllib.urlopen("http://www.python.org/").read()

    loc_left = html_code.find('<')
    loc_right = html_code.find('>')

    str_in_braket = html_code[loc_left, loc_right + 1]

    html_code.replace(str_in_braket, "")

But it shows the error message below:

lee@Lee-Computer:~/pyt$ python html_braket.py
Traceback (most recent call last):
  File "html_braket.py", line 1, in <module>
    import urllib
  File "/usr/lib/python2.6/urllib.py", line 25, in <module>
    import string
  File "/home/lee/pyt/string.py", line 4, in <module>
    html_code = urllib.urlopen("http://www.python.org/").read()
AttributeError: 'module' object has no attribute 'urlopen'

Interestingly, if I type the same code into the Python interpreter, the error above doesn't show up.


You've named a script string.py. The urllib module imports it, thinking it's the string module from the stdlib. Your string.py then runs while urllib is still only partially defined, so it uses an attribute (urlopen) that doesn't exist yet. Name your script something else.
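To see whether a script directory contains files that could shadow modules like this, a quick check along these lines may help. This is only a sketch: the function name shadowing_candidates and the default list of module names are made up for illustration, and it only looks for same-named .py files in one directory, which is where Python looks first when you run a script from it.

```python
import os

def shadowing_candidates(script_dir, names=("string", "urllib")):
    """Return the names from `names` that have a same-named .py file in script_dir.

    `names` is an illustrative default, not a complete list of stdlib modules.
    """
    found = []
    for name in names:
        # A local string.py is found before the stdlib string module
        # when a script in this directory is run.
        if os.path.isfile(os.path.join(script_dir, name + ".py")):
            found.append(name)
    return found
```

Calling shadowing_candidates("/home/lee/pyt") on the directory from the traceback would be expected to report "string". Another quick check is to import string in a script there and print string.__file__ to see which file was actually loaded.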


Step one is to download the document so you can have it contained in a string:

import urllib
html_code = urllib.urlopen("http://www.python.org/").read() # <-- Note: this does not give me any sort of error

Then you have two pretty nice options which will be robust since they actually parse the HTML document, rather than simply looking for '<' and '>' characters:

Option 1: Use Beautiful Soup

from BeautifulSoup import BeautifulSoup

''.join(BeautifulSoup(html_code).findAll(text=True))

Option 2: Use the built-in Python HTMLParser class

from HTMLParser import HTMLParser

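class TagStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []  # text chunks collected so far
    def handle_data(self, d):
        # Called for the text between tags; the tags themselves are skipped
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    # Feed the whole document to the parser and return only the text
    s = TagStripper()
    s.feed(html)
    return s.get_data()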

Example using option 2:

In [22]: strip_tags('<html>hi</html>')
Out[22]: 'hi'

If you already have BeautifulSoup available, then that's pretty simple. Pasting in the TagStripper class and strip_tags function is also pretty straightforward.
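The code above is for Python 2, which is what the traceback in the question shows. If you are on Python 3, the parser class lives in html.parser instead (and urlopen is in urllib.request). A rough Python 3 version of option 2 might look like the sketch below. It reuses the TagStripper and strip_tags names from above, and note that it keeps the text inside script and style elements too, which may or may not be what you want.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.fed = []  # text chunks collected so far

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped
        self.fed.append(data)

    def get_data(self):
        return "".join(self.fed)

def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()
```

With this, strip_tags('<html>hi</html>') gives 'hi', the same as the Python 2 example.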

Good luck!
