Reading website XML using Google App Engine & Python

I'm trying to read some XML from the World of Warcraft Armory (yeah, I'm one of those). A URL such as the one in the code below returns the XML in Firefox (you need to view source to see it) but not in other browsers such as Chrome. I don't fully understand why, though that's an aside.

Anyway, I have this code, which works fine when I run the app locally, but now that I'm migrating to Google App Engine it doesn't, and I don't know why. It seems to be failing to fetch the XML. I've used Beautiful Soup to parse the XML in the full code.

import urllib2,urllib
import socket
from BeautifulSoup import BeautifulStoneSoup

class Object:
    def __init__(self):
        self.data = {}
        self.userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"

    def _getXml(self):
        strFile = ""
        try:
            url = "http://eu.wowarmory.com/guild-info.xml?r=dentarg&n=penance"
            values = {}
            headers = { 'User-Agent' : self.userAgent }
            data = urllib.urlencode(values)
            socket.setdefaulttimeout(2)
            req = urllib2.Request(url, data, headers)
            response = urllib2.urlopen(req)
            strFile = response.read()
        except Exception, e:
            raise e
        finally:
            return strFile

    def getObject(self):
        soup = BeautifulStoneSoup( self._getXml() )
        return soup.guildheader["faction"]

Here's the main section:

from google.appengine.ext import webapp
from google.appengine.ext.webapp import util
from library import Object


class MainHandler(webapp.RequestHandler):
    def get(self):
        test = Object().getObject()
        self.response.out.write(test)


def main():
    application = webapp.WSGIApplication([('/', MainHandler)],
                                         debug=True)
    util.run_wsgi_app(application)


if __name__ == '__main__':
    main()

I've simplified the code to better illustrate the problem. I'd be very grateful for any help.


OK, I've played around with http://shell.appspot.com/ (FYI, you can download the source and integrate it with your project for further experiments), and this seems to do the trick:

from google.appengine.api import urlfetch

headers = { 'User-Agent' : "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4" }
resp = urlfetch.fetch(url="http://eu.wowarmory.com/guild-info.xml?r=dentarg&n=penance", method=urlfetch.GET, headers=headers)
print resp.content


urllib2.Request does a POST when you pass the data parameter. Is that what the server is expecting or do you need to do a GET?
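
To see the POST/GET difference without touching the network, here is a small sketch that only builds the request objects and asks each one which method it would use. The helper name build_request and the shortened User-Agent string are my own placeholders, not part of the question's code, and the try/except import is only there so the snippet runs on both Python 2 (urllib2) and Python 3 (urllib.request).

```python
try:  # Python 3
    from urllib.request import Request
    from urllib.parse import urlencode
except ImportError:  # Python 2, as used in the question
    from urllib2 import Request
    from urllib import urlencode

URL = "http://eu.wowarmory.com/guild-info.xml?r=dentarg&n=penance"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened placeholder value


def build_request(url, values=None):
    """Build a Request; hypothetical helper for illustration only.

    values=None means no body is attached at all. An empty dict still
    encodes to an empty string, which counts as a body.
    """
    data = None if values is None else urlencode(values).encode("ascii")
    return Request(url, data, HEADERS)


# Mirrors the question: values = {} is encoded and passed as data.
with_empty_data = build_request(URL, {})
# Omits the data argument entirely.
without_data = build_request(URL)

print(with_empty_data.get_method())
print(without_data.get_method())
```

The first request reports POST even though the encoded body is empty, because the check is whether data is None, not whether it is empty. Dropping the data argument (or passing None) gives a GET, which is what the urlfetch call above sends explicitly.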

Also, going to that URL now just gives a "we've moved" message.


Blizzard have altered the old armory to a new site layout and format. You probably need to parse the HTML directly now.
