Printing certain HTML Python Mechanize

I'm writing a small Python script to log on to a website automatically, but I'm stuck.

I want to print a small part of the HTML to the terminal. It is located within this tag in the HTML of the site:

<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td><td>&nbsp;<a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>

But how do I extract and print just the name, John Appleseed?

I'm using Python's Mechanize on a Mac, by the way.


Mechanize is only good for fetching the HTML. Once you want to extract information from that HTML, you could use, for example, BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)

Depending on where the <td> is located in the HTML (it's unclear from your question), you could use the following code:

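# BeautifulSoup 4 (package "bs4"); the older BeautifulSoup 3 import was
# "from BeautifulSoup import BeautifulSoup"
from bs4 import BeautifulSoup

html = ...  # this is the html you've fetched

soup = BeautifulSoup(html, "html.parser")
# use this (gets all <td> elements)
cols = soup.find_all('td')
# or this (gets only <td> elements with class='h3')
cols = soup.find_all('td', attrs={"class": "h3"})
# print the text of the first <td> element, with surrounding
# whitespace (including the &nbsp; characters) stripped
print(cols[0].get_text(strip=True))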
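
If installing a third-party parser is not an option, the same idea can be sketched with the standard library's html.parser module. This is not what the answer above uses; it is a minimal illustration that assumes the name sits in a <td> with class h3, as in the snippet from the question. The class name CellTextParser and the variable names are made up for this example.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> whose class attribute is 'h3'."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "h3":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


# The snippet from the question stands in for the fetched page.
page = """<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td><td>&nbsp;<a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>"""

parser = CellTextParser()
parser.feed(page)
parser.close()

# str.strip() also removes the non-breaking spaces produced by &nbsp;.
names = [cell.strip() for cell in parser.cells]
print(names[0])
```

On a real page there may be several <td class=h3> cells, so it is worth inspecting the whole names list rather than assuming the first entry is the one you want.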


As you have not provided the full HTML of the page, the only options right now are str.find() or regular expressions.
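
A rough sketch of the regular-expression route, using only the snippet from the question. The pattern is an assumption about the markup (a <td> whose class is h3, quoted or unquoted) and will likely need adjusting for the real page; regular expressions are brittle against HTML in general.

```python
import html
import re

# The snippet from the question stands in for the fetched page.
page = """<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td><td>&nbsp;<a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>"""

# Match the first <td> with class h3 and capture everything up to its </td>.
pattern = re.compile(
    r"""<td\b[^>]*\bclass=["']?h3["']?[^>]*>(.*?)</td>""",
    re.IGNORECASE | re.DOTALL,
)

match = pattern.search(page)
if match:
    # html.unescape turns &nbsp; into U+00A0, which str.strip() removes.
    name = html.unescape(match.group(1)).strip()
else:
    name = None

print(name)
```

If the cell ever contains nested tags, the captured group will include them, which is one reason a real parser is usually the safer choice.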

However, the standard way to locate an element like this is XPath. See this question: How to use Xpath in Python?

You can obtain the XPath for an element using the "Inspect Element" feature of Firefox.

For example, to find the XPath for the username on the Stack Overflow site:

  • Open Firefox, log in to the website, right-click the username (shadyabhi in my case) and select Inspect Element.
  • Hover over the highlighted tag, or right-click it, and choose "Copy XPath".


You can use a parser to extract any information from a document. I suggest using the lxml module.

Here is an example:

from io import StringIO  # Python 3; in Python 2 this was "from StringIO import StringIO"

from lxml import etree

parser = etree.HTMLParser()

tree = etree.parse(StringIO("""<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td><td>&nbsp;<a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>"""), parser)


>>> tree.xpath("string()").strip()
'John Appleseed'

More information is available in the lxml documentation.
