Python: Fetching and parsing text from HTML files
I'm trying to work on a project about page ranking.
I want to make an index (dictionary) which looks like this:
file1.html -> [[cat, ate, food, drank, milk], [file2.html, file3.html]]
file2.html -> [[dog, barked, ran, away], [file1.html, file4.html]]
Fetching links is easy - look for anchor tags.
My question is - how do I fetch text? The text in the html files is not enclosed within any tags like <p>.
Thanks in advance for all the help
Use an HTML parser - something like BeautifulSoup.
If the text isn't enclosed in tags is it really HTML?
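For what it's worth, an HTML parser reports loose text as well as text inside tags, so text that is not wrapped in <p> is still reachable. Here is a minimal sketch of building one index entry in the shape shown in the question. It uses html.parser from the standard library in place of BeautifulSoup, only so that it runs without third-party packages. PageIndexer, index_entry and the regex tokenizer are illustrative choices of mine, not anything from the question.

```python
from html.parser import HTMLParser
import re


class PageIndexer(HTMLParser):
    """Collects visible words and outgoing links from one HTML document."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called for every run of text, whether or not a tag such as <p> wraps it.
        if not self._skip_depth:
            self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def index_entry(html):
    """Return [words, links] for one document (hypothetical helper)."""
    parser = PageIndexer()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return [parser.words, parser.links]


# Assumed sample page: bare text with two anchors and no <p> tags.
sample = 'cat ate food <a href="file2.html">drank</a> milk <a href="file3.html"></a>'
index = {"file1.html": index_entry(sample)}
print(index)
```

In a real crawl you would read each file from disk and call index_entry on its contents. The tokenizer here keeps only ASCII letters and digits, which is an assumption that may not suit every page.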
As Amber says, you'll have an easier job of this using some HTML parser like BeautifulSoup.
The example below demonstrates a simple method for returning text within tags.
This method works for any tag AFAIK.
>>> from bs4 import BeautifulSoup
>>> html = '''
... <div><a href="/link1">link1 contents</a></div>
... <div><a href="/link2">link2 contents</a></div>
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> for anchor_tag in soup.find_all('a'):
...     print(anchor_tag.contents[0])
...
link1 contents
link2 contents
Beyond that, you will probably want a dictionary that counts how many times each term appears in an HTML document. defaultdict is good for that kind of thing:
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> for anchor_tag in soup.find_all('a'):
...     d[anchor_tag.contents[0]] += 1
...
>>> d
defaultdict(<class 'int'>, {'link1 contents': 1, 'link2 contents': 1})
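The same defaultdict pattern extends from anchor text to every word on a page. This is a rough sketch, assuming the page text has already been extracted into a plain string. The term_counts helper and the regex tokenizer are illustrative choices of mine, not part of any library.

```python
from collections import defaultdict
import re


def term_counts(text):
    """Count how often each lowercased word occurs in a string of page text."""
    counts = defaultdict(int)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        counts[word] += 1  # missing keys start at 0, so no membership check is needed
    return counts


counts = term_counts("The cat ate food. The cat drank milk.")
print(dict(counts))
# prints {'the': 2, 'cat': 2, 'ate': 1, 'food': 1, 'drank': 1, 'milk': 1}
```

If you only need the counts, collections.Counter does the same job in one call. defaultdict is kept here to match the example above.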
Hopefully that gives you some ideas to run with. Come back and open another question if you run into other issues.