Python: Fetching and parsing text from html files

I'm trying to work on a project about page ranking.

I want to make an index (dictionary) which looks like this:

file1.html -> [[cat, ate, food, drank, milk], [file2.html, file3.html]]

file2.html -> [[dog, barked, ran, away], [file1.html, file4.html]]

Fetching links is easy - look for anchor tags.

My question is - how do I fetch text? The text in the html files is not enclosed within any tags like <p>

Thanks in advance for all the help


Use an HTML parser - something like BeautifulSoup.


If the text isn't enclosed in tags, is it really HTML?
As Amber says, you'll have an easier job of this using some HTML parser like BeautifulSoup.

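The example below demonstrates a simple method for returning the text within tags, here the anchor tags.
As far as I know, the same approach works for any tag, as long as the tag's first child is text: contents[0] returns the first child, whatever it is, and raises an IndexError on an empty tag.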

>>> from BeautifulSoup import BeautifulSoup as bs
>>> html = '''
... <div><a href="/link1">link1 contents</a></div>
... <div><a href="/link2">link2 contents</a></div>
... '''
>>> soup = bs(html)
>>> for anchor_tag in soup.findAll('a'):
...   print anchor_tag.contents[0]
... 
link1 contents
link2 contents
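
For the case in the question, where the text is not wrapped in a tag such as <p>, here is a minimal sketch that collects both the words and the links of one page. It uses Python 3's standard-library html.parser rather than BeautifulSoup, so it is a swapped-in technique and not the method shown above. The names TextAndLinkParser and parse_page are made up for illustration, and treating a "word" as a lowercased, whitespace-separated token is an assumption.

```python
from html.parser import HTMLParser


class TextAndLinkParser(HTMLParser):
    """Collects text words and anchor hrefs from one HTML document (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called for every run of text, whether or not a tag such as <p> wraps it.
        if not self._skip_depth:
            # Assumption: a word is any lowercased whitespace-separated token.
            self.words.extend(data.lower().split())


def parse_page(html):
    """Return [[words], [links]] for one page, the shape asked for in the question."""
    parser = TextAndLinkParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return [parser.words, parser.links]


sample = 'cat ate food <a href="file2.html">drank</a> milk <a href="file3.html"></a>'
print(parse_page(sample))
```

For the sample string this prints [['cat', 'ate', 'food', 'drank', 'milk'], ['file2.html', 'file3.html']]. Storing that list under the file name in a dictionary gives one entry of the index from the question.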

Apart from that I can imagine that you'd want a dictionary with a count of how many times a certain term appeared in some HTML document. defaultdict is good for that kind of thing:

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> for anchor_tag in soup.findAll('a'):
...   d[anchor_tag.contents[0]] += 1
... 
>>> d
defaultdict(<type 'int'>, {u'link1 contents': 1, u'link2 contents': 1})
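
The same counting idea can be applied to individual words per file instead of whole anchor strings. This is only a sketch in Python 3: the pages dictionary is hypothetical sample data standing in for words you have already extracted, and count_terms is a made-up helper name.

```python
from collections import defaultdict

# Hypothetical input: words already extracted from each file.
pages = {
    "file1.html": ["cat", "ate", "food", "cat", "drank", "milk"],
    "file2.html": ["dog", "barked", "ran", "away", "dog"],
}


def count_terms(words):
    """Return a dict mapping each word to how many times it occurs."""
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1  # missing keys start at 0
    return dict(counts)


# One term-count dictionary per file.
term_counts = {name: count_terms(words) for name, words in pages.items()}
print(term_counts["file1.html"]["cat"])
```

collections.Counter(words) would give the same counts in one call; defaultdict(int) is used here only to stay close to the example above.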

Hopefully that gives you some ideas to run with. Come back and open another question if you run into other issues.
