Python regular expression slicing

I am trying to get a web page using the following sample code:

from urllib import urlopen
print urlopen("http://www.php.net/manual/en/function.gettext.php").read()

Now I can get the whole web page into a variable. I want to extract the part of the page that looks like this:

<div class="methodsynopsis dc-description">
   <span class="type">string</span><span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
   )</div>

That way I can generate a file to use in another application. I want to be able to extract the words "string", "gettext" and "$message".


Why don't you try using BeautifulSoup?

  • http://www.crummy.com/software/BeautifulSoup/

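Example code:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(htmldoc)
# "class" is a reserved word in Python, so pass it in an attrs dict
allSpans = soup.findAll('span', {'class': 'type'})
for element in allSpans:
    print element.string  # text inside each matching <span>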


When extracting information from HTML, it isn't recommended to just hack some regexes together. The right way to do it is to use a proper HTML parsing module. Python has several good modules for this purpose - in particular I recommend BeautifulSoup.

Don't be put off by the name - it's a serious module used by a lot of people with great success. The documentation page has a lot of examples that should help you get started with your particular needs.
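
If installing a third-party module isn't an option, the same parser-based idea can be sketched with the standard library. Note that this is not BeautifulSoup: it is a rough Python 3 illustration built on html.parser, and the names SAMPLE, WANTED, SynopsisParser and extract_synopsis are made up for this example. It assumes the markup looks like the snippet in the question and that every opened tag is closed, so treat it as a starting point rather than a general solution.

```python
from html.parser import HTMLParser

# Sample markup copied from the question (PHP manual synopsis for gettext).
SAMPLE = '''<div class="methodsynopsis dc-description">
   <span class="type">string</span><span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
   )</div>'''

# Class names whose text we want to collect (taken from the markup above).
WANTED = {"type", "methodname", "parameter"}


class SynopsisParser(HTMLParser):
    """Collect (class, text) pairs for text nested inside a WANTED element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # class attribute (or None) of each open tag
        self.found = []   # (class, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Assumption: every tag is closed later; unclosed tags such as a
        # bare <br> would need special handling in real pages.
        self.stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the innermost enclosing wanted element.
        for cls in reversed(self.stack):
            if cls in WANTED:
                self.found.append((cls, text))
                break


def extract_synopsis(html):
    """Return the (class, text) pairs found in the given HTML string."""
    parser = SynopsisParser()
    parser.feed(html)
    parser.close()
    return parser.found


print(extract_synopsis(SAMPLE))
```

Run against the snippet from the question, this prints [('type', 'string'), ('methodname', 'gettext'), ('type', 'string'), ('parameter', '$message')]. The first 'type' is the return type and the second is the parameter type. For whole pages fetched from the web, a more forgiving parser such as BeautifulSoup is still the safer choice.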
