Processing an HTML file using Python
I want to remove all the tags in an HTML file, and I am using Python's re module for that.
For example, given the line <h1>Hello World!</h1>, I want to keep only "Hello World!". To remove the tags, I used re.sub('<.*>','',string). The result is an empty string, because the regexp matches from the first angle bracket to the last one and removes everything in between. How can I get around this issue?
You can make the match non-greedy: '<.*?>'
You also need to be careful: HTML is a crafty beast and can thwart your regexes.
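To make the difference concrete, here is a small sketch comparing the greedy and non-greedy patterns on the line from the question. The third input (a '>' inside an attribute value) is a made-up example, included only to show one way HTML can still trip up the non-greedy regex.

```python
import re

line = "<h1>Hello World!</h1>"

# Greedy: '.*' runs to the LAST '>' on the line, so the whole line is one match.
greedy = re.sub(r"<.*>", "", line)

# Non-greedy: '.*?' stops at the FIRST '>' after each '<',
# so each tag is matched and removed separately.
non_greedy = re.sub(r"<.*?>", "", line)

# Hypothetical tricky input: a '>' inside a quoted attribute value.
# The regex ends the "tag" too early and leaves part of the attribute behind.
tricky = '<a title="x > y">link</a>'
tricky_result = re.sub(r"<.*?>", "", tricky)

print(repr(greedy))
print(repr(non_greedy))
print(repr(tricky_result))
```

The non-greedy version is fine for simple, well-behaved input like the question's example, but the last case is why the other answers suggest a real parser.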
Parse the HTML using BeautifulSoup, then only retrieve the text.
Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy
Off-topic: the approach that uses regular expressions is error-prone. It cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
Use a parser, either lxml or BeautifulSoup:
import lxml.html
# mystring holds the HTML source; text_content() returns the text with all tags removed
print(lxml.html.fromstring(mystring).text_content())
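If installing lxml or BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. Note that this is a swapped-in technique, not what the answer above uses, and the names TextExtractor and strip_tags are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves never reach here.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<h1>Hello World!</h1>"))
print(strip_tags('<a title="x > y">link</a>'))
```

Because the parser understands quoted attribute values, a '>' inside an attribute does not confuse it. One caveat of this sketch: the contents of script and style elements are also delivered as text, so a real page would probably need extra filtering.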
Related questions:
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
Beautiful Soup is great for parsing HTML!
You might not need it now, but it's worth learning to use. It will help you in the future too.