Extract data from HTML in PHP or Python
I need to extract this data and display a simple graph out of it.
Something like Equity Share Capital -> array (30.36, 17, 17 .... etc)
would help.
<html:tr>
<html:td>Equity Share Capital</html:td>
<html:td class="numericalColumn">30.36</html:td>
<html:td class="numericalColumn">17.17</html:td>
<html:td class="numericalColumn">15.22</html:td>
<html:td class="numericalColumn">9.82</html:td>
<html:td class="numericalColumn">9.82</ht开发者_JAVA技巧ml:td>
</html:tr>
How do I go about this task in PHP or Python?
A good place to start looking would be the python module BeautifulSoup which extracts the text and places it into a table.
Assuming you've loaded the data into a variable called raw
:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(raw)
for x in soup.findAll("html:td"):
if x.string == "Equity share capital":
VALS = [y.string for y in x.parent.findAll() if y.has_key("class")]
print VALS
This gives:
[u'30.36', u'17.17', u'15.22', u'9.82', u'9.82']
Which you'll note is a list of unicode strings, make sure to convert them to whatever type you desire before processing.
There are many ways to do this via BeautifulSoup. The nice thing I've found however is the quick hack is often good enough (TM) to get the job done!
BeautifulSoup
Don't forget lxml in Python. It also works well to extract data. It's harder to install but faster. http://pypi.python.org/pypi/lxml/2.2.8
精彩评论