How to extract certain parts of a web page in Python

Target web page: http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm

The section I want to extract:

  <tr>
  <td>Skilled &ndash; Independent (Residence) subclass 885<br />online</td>
  <td>N/A</td>
  <td>N/A</td>
  <td>N/A</td>
  <td>15 May 2011</td>
  <td>N/A</td>
  </tr>

Once the code finds this row by searching for the keyword "subclass 885<br />online", it should print the date in the 5th <td> tag, which is "15 May 2011" as shown above.

It's just a monitor for myself to keep an eye on the progress of my immigration application.


"Beau--ootiful Soo--oop!

Beau--ootiful Soo--oop!

Soo--oop of the e--e--evening,

Beautiful, beauti--FUL SOUP!"

--Lewis Carroll, Alice's Adventures in Wonderland

I think this is exactly what he had in mind!

The Mock Turtle would probably do something like this:

>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> url = 'http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm'
>>> page = urllib2.urlopen(url)
>>> soup = BeautifulSoup(page)
>>> for row in soup.html.body.findAll('tr'):
...     data = row.findAll('td')
...     if data and 'subclass 885online' in data[0].text:
...         print data[4].text
... 
15 May 2011

But I'm not sure it would help, since that date has already passed!

Good luck with the application!
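
A note for Python 3 readers: the session above is Python 2 (`urllib2`, the old `BeautifulSoup` module, the `print` statement). The same row scan can be sketched with the standard library alone. This is only a rough sketch under some assumptions: it swaps Beautiful Soup for `html.parser`, the names `RowCollector`, `find_date` and `SAMPLE` are made up for illustration, and it parses an inline copy of the row from the question instead of fetching the live page.

```python
from html.parser import HTMLParser

# Inline copy of the row from the question (assumption: the live page
# wraps rows like this one in a <table>).
SAMPLE = """
<table>
<tr>
<td>Skilled &ndash; Independent (Residence) subclass 885<br />online</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>15 May 2011</td>
<td>N/A</td>
</tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join fragments split by inline tags such as <br />.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def find_date(html, keyword="subclass 885online", column=4):
    """Return the text of cell `column` in the first row whose first cell
    contains `keyword`, or None if no row matches."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    for cells in parser.rows:
        if len(cells) > column and keyword in cells[0]:
            return cells[column]
    return None


print(find_date(SAMPLE))
```

To run it against the real page, the HTML string would come from something like `urllib.request.urlopen(url).read().decode()`; the encoding and the current layout of that page are not something this sketch checks.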


You might want to use this as a starting point:

Python 2.6.7 (r267:88850, Jun 13 2011, 22:03:32) 
[GCC 4.6.1 20110608 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2, re
>>> from BeautifulSoup import BeautifulSoup
>>> urllib2.urlopen('http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm')
<addinfourl at 139158380 whose fp = <socket._fileobject object at 0x84aa2ac>>
>>> html = _.read()
>>> soup = BeautifulSoup(html)
>>> soup.find(text = re.compile('\\bsubclass 885\\b')).parent.parent.find('td', text = re.compile(' [0-9]{4}$'))
u'15 May 2011'
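
The two patterns in that one-liner can be tried on their own, without the network or Beautiful Soup. The snippet below is just an illustrative check: the `texts` list is a hand-copied stand-in for the text nodes of the row in the question (the `<br />` splits the first cell into two nodes), and the variable names are made up.

```python
import re

# The same two patterns as in the session, written as raw strings.
keyword = re.compile(r"\bsubclass 885\b")
trailing_year = re.compile(r" [0-9]{4}$")

# Stand-in for the text nodes of the target row, copied from the question.
texts = [
    "Skilled \u2013 Independent (Residence) subclass 885",
    "online",
    "N/A", "N/A", "N/A",
    "15 May 2011",
    "N/A",
]

# The first pattern locates the row; the second picks out the cell
# that ends in a space followed by a four-digit year.
has_keyword = any(keyword.search(t) for t in texts)
dates = [t for t in texts if trailing_year.search(t)]

print(has_keyword, dates)
```

Note that the second pattern only looks for a trailing four-digit number, so it would also match any other cell in the row that happens to end that way.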


There is a library called Beautiful Soup that does what you are asking for: http://www.crummy.com/software/BeautifulSoup/
