开发者

python fetching multiple pages using post and cookies

Got a test site that I'm fetching. The site uses the POST method, as well as cookies. (Not sure that the cookies are critical, but i'm inclined to think they are..)

The app presents a page, with a "next button" to generate the subsequent pages. I've used LiveHttpHeaders/Firefof to determine what the post data should be in the query, as well as the fact that the cookies are being set. I've also verified that the page doesn't work if cookies are dosabled by the browser.

I'm trying to figure out what I've missed/screwed up in my test. The sample code presents the query/post data for both the 1st and 2nd page that I'm trying to fetch.

I've searched the net, as well as triedn numerous different possible attempts, so I'm pretty sure that I missing something simple..

Any thoughts/comments are appreciated..

#!/usr/bin/python

#test python script
import re
import urllib
import urllib2
import sys, string, os
from  mechanize import Browser
import mechanize
import cookielib            
########################
#
# Parsing App Information
########################

# datafile

cj = "p"
COOKIEFILE = 'cookies.lwp'
#cookielib = 1

urlopen = urllib2.urlopen
#cj = urllib2.cookielib.LWPCookieJar()       
cj = cookielib.LWPCookieJar()       
#cj = ClientCookie.LWPCookieJar()       
Request = urllib2.Request
br = Browser()

if cj != None:
  print "sss"
#install the CookieJar for the default CookieProcessor
  if os.path.isfile(COOKIEFILE):
      cj.load(COOKIEFILE)
      print "foo\n"
  if cookielib:
      opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
      urllib2.install_opener(opener)
      print "foo2\n"

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values1 = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }



if __name__ == "__main__":
# main app

  baseurl="https://pisa.ucsc.edu/class_search/index.php"
  print "b = ",baseurl
  print "b = ",headers
  query="action=results&binds%5B%3Aterm%5D=2100&binds%5B%3Areg_status%5D=O&binds%5B%3Asubject%5D=&binds%5B%3Acatalog_nbr_op%5D=%3D&binds%5B%3Acatalog_nbr%5D=&binds%5B%3Atitle%5D=&binds%5B%3Ainstr_name_op%5D=%3D&binds%5B%3Ainstructor%5D=&binds%5B%3Age%5D=&binds%5B%3Acrse_units_op%5D=%3D&binds%5B%3Acrse_units_from%5D=&binds%5B%3Acrse_units_to%5D=&binds%5B%3Acrse_units_exact%5D=&binds%5B%3Adays%5D=&binds%5B%3Atimes%5D=&binds%5B%3Aacad_career%5D="


  request = urllib2.Request(baseurl, query, headers)
  response = urllib2.urlopen(request)

  print "gggg \n"
  #print req
  print "\n gggg 555555\n"

  print "res = ",response
  x1 = response.read()
  #x1 = res.read()
  print x1
  #sys.exit()

  cj.save(COOKIEFILE)    # resave cookies
  if cj is None:
      print "We don't have a cookie library available - sorry."
      print "I can't show you any cookies."
  else:
      print 'These are the cookies we have received so far :'
      for index, cookie in enumerate (cj):
          print index, '  :  ', cookie

  cj.save(COOKIEFILE)  

  print "ffgg \n"
  for index, cookie in enumerate (cj):
       print index开发者_开发问答, '  :  ', cookie


  #baseurl ="http://students.yale.edu/oci/resultList.jsp"
  baseurl="https://pisa.ucsc.edu/class_search/index.php"

  query="action=next&Rec_Dur=100&sel_col%5Bclass_nbr%5D=1&sel_col%5Bclass_id%5D=1&sel_col%5Bclass_title%5D=1&sel_col%5Btype%5D=1&sel_col%5Bdays%5D=1&sel_col%5Btimes%5D=1&sel_col%5Binstr_name%5D=1&sel_col%5Bstatus%5D=1&sel_col%5Benrl_cap%5D=1&sel_col%5Benrl_tot%5D=1&sel_col%5Bseats_avail%5D=1&sel_col%5Blocation%5D=1"

  request = urllib2.Request(baseurl, query, headers)
  response = urllib2.urlopen(request)

  print "gggg \n"
  #print req
  print "\n gggg 555555\n"

  print "res = ",response
  x1 = response.read()
  #x1 = res.read()
  print x1
  sys.exit()


  req = Request(baseurl, query, headers)
  print "gggg \n"
  #print req
  print "\n gggg 555555\n"
  #br.open(req)

  res = urlopen(req)
  print "gggg 000000000000\n"
  x1 = res.read()
  print x1


  sys.exit()

thanks for any thoughts/pointers...

and yeah.. i know.. the script/test is really bad!

-tom


I'm really not sure what's going wrong. It may not be helpful, but a different way to do it:

Open a table-row for each 1st page submit then keep only the rowId in the cookie and add to the row with subsequent accesses.

tf.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜