Python regular expression slicing
I am trying to get a web page using the following sample code:
from urllib import urlopen
print urlopen("http://www.php.net/manual/en/function.gettext.php").read()
Now I can get the whole web page in a variable. I want to get the part of the page that looks something like this:
<div class="methodsynopsis dc-description">
<span class="type">string</span> <span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
)</div>
so that I can generate a file to use in another application. I want to be able to extract the words "string", "gettext" and "$message".
Why don't you try using BeautifulSoup?
- http://www.crummy.com/software/BeautifulSoup/
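Example code:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
# Fetch the page (same URL as in the question) and parse it
htmldoc = urlopen("http://www.php.net/manual/en/function.gettext.php").read()
soup = BeautifulSoup(htmldoc)
# "class" is a Python keyword, so the attribute filter is passed as a dict
allSpans = soup.findAll('span', {'class': 'type'})
for element in allSpans:
    print element.string  # replace with your own processing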
When extracting information from HTML, it isn't recommended to just hack some regexes together. The right way to do it is to use a proper HTML parsing module. Python has several good modules for this purpose - in particular I recommend BeautifulSoup.
Don't be put off by the name - it's a serious module used by a lot of people with great success. The documentation page has a lot of examples that should help you get started with your particular needs.
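If installing a third-party module is not an option, the same "use a real parser" idea can be sketched with the standard library's html.parser (Python 3). This is only a rough illustration, not BeautifulSoup's API: SAMPLE_HTML is the snippet from the question, and SynopsisParser, WANTED_CLASSES and VOID_TAGS are names made up for this example. It assumes the interesting text always sits inside an element whose class is exactly "type", "methodname" or "parameter".

```python
from html.parser import HTMLParser

# Sample markup taken from the question; a real run would feed the downloaded page.
SAMPLE_HTML = """
<div class="methodsynopsis dc-description">
<span class="type">string</span> <span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
)</div>
"""

# Assumption: these class names mark the pieces we want to extract.
WANTED_CLASSES = {"type", "methodname", "parameter"}

# Tags that never get a closing tag, so they must not be tracked as "open".
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SynopsisParser(HTMLParser):
    """Collect (class, text) pairs for text inside elements of interest."""

    def __init__(self):
        super().__init__()
        self.stack = []  # class attribute of each currently open element
        self.found = []  # (class name, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the nearest enclosing element of interest.
        for cls in reversed(self.stack):
            if cls in WANTED_CLASSES:
                self.found.append((cls, text))
                break


parser = SynopsisParser()
parser.feed(SAMPLE_HTML)
for cls, text in parser.found:
    print(cls, text)
```

For the live page you would feed the decoded response body instead of SAMPLE_HTML, for example the result of urllib.request.urlopen(...).read().decode(...). Real-world pages with unbalanced tags can confuse the simple stack used here, which is one more reason to prefer BeautifulSoup when it is available.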