
Extract data from HTML in PHP or Python

I need to extract this data and plot a simple graph from it.

Something like Equity Share Capital -> array (30.36, 17, 17 .... etc) would help.

<html:tr>
<html:td>Equity Share Capital</html:td>
<html:td class="numericalColumn">30.36</html:td>
<html:td class="numericalColumn">17.17</html:td>
<html:td class="numericalColumn">15.22</html:td>
<html:td class="numericalColumn">9.82</html:td>
<html:td class="numericalColumn">9.82</html:td>
</html:tr>

How do I go about this task in PHP or Python?


A good place to start would be the Python module BeautifulSoup, which parses the markup into a tree of tags that you can search.

Assuming you've loaded the data into a variable called raw:

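from bs4 import BeautifulSoup  # BeautifulSoup 4 (the beautifulsoup4 package)

soup = BeautifulSoup(raw, "html.parser")

vals = []
for cell in soup.find_all("html:td"):
    # The label must match the cell text exactly, including case.
    if cell.get_text(strip=True) == "Equity Share Capital":
        # Collect the numeric cells from the same row.
        vals = [td.get_text(strip=True)
                for td in cell.parent.find_all("html:td", class_="numericalColumn")]

print(vals)

This gives:

['30.36', '17.17', '15.22', '9.82', '9.82']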

Note that this is a list of strings, so convert the values to float (or whatever type you need) before processing them.
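
The conversion could look something like this. It is only a minimal sketch: the list is copied from the output above, `to_floats` is a made-up helper name (not part of BeautifulSoup), and stripping thousands separators is an assumption about how other rows might be formatted.

```python
# Strings as returned by the parser (copied from the output above).
vals = ['30.36', '17.17', '15.22', '9.82', '9.82']

def to_floats(strings):
    """Convert numeric strings to floats, skipping empty cells (None or blank)."""
    result = []
    for s in strings:
        if s is None or not s.strip():
            continue
        # Assumption: values such as "1,234.50" may appear, so drop the commas first.
        result.append(float(s.strip().replace(",", "")))
    return result

numbers = to_floats(vals)
print(numbers)  # [30.36, 17.17, 15.22, 9.82, 9.82]
```

The resulting list of floats can be handed straight to whatever plotting library you use for the graph.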

There are many ways to do this with BeautifulSoup. In my experience, though, a quick hack like this is often good enough to get the job done!


BeautifulSoup


Don't forget lxml in Python. It also works well to extract data. It's harder to install but faster. http://pypi.python.org/pypi/lxml/2.2.8
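
If installing a third-party package is a hurdle, the standard library's html.parser module can probably cope with markup as simple as the row in the question. This is only a rough sketch under a few assumptions: `RowCollector` is a made-up class name, the first cell of each row is taken as the label, and every other cell is assumed to hold a plain number.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect each row as {label: [float, ...]} (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._cells = None     # cell texts of the current row, None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and keeps the "html:" prefix.
        if tag == "html:tr":
            self._cells = []
        elif tag == "html:td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between cells is ignored.
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "html:td":
            self._in_cell = False
        elif tag == "html:tr" and self._cells is not None:
            if self._cells:
                # Assumption: first cell is the label, the rest are numbers.
                label, *values = [c.strip() for c in self._cells]
                self.rows[label] = [float(v) for v in values]
            self._cells = None

raw = """<html:tr>
<html:td>Equity Share Capital</html:td>
<html:td class="numericalColumn">30.36</html:td>
<html:td class="numericalColumn">17.17</html:td>
<html:td class="numericalColumn">15.22</html:td>
<html:td class="numericalColumn">9.82</html:td>
<html:td class="numericalColumn">9.82</html:td>
</html:tr>"""

parser = RowCollector()
parser.feed(raw)
parser.close()
print(parser.rows)  # {'Equity Share Capital': [30.36, 17.17, 15.22, 9.82, 9.82]}
```

A cell that does not contain a number would raise ValueError here, so real pages would need some checking around the float() call.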
