开发者

Parsing XML with Python xml.sax: How does one "keep track" of where in the tree you are?

I need to regularly export XML files from our Administration software.

This is the first time I'm using XML Parsing in Python. The XML with xml.sax isn't terribly difficult, but what is the best way to "keep track" of where in the XML tree you are?

For example, I have a list of our customers. I want to extract the Telephone by , but there are multiple places where occurs:

eExact -> Accounts -> Account -> Contacts -> Contact -> Addresses -> Address  -> Phone
eExact -> Accounts -> Account -> Contacts -> Contact -> Phone
eExact -> Accounts -> Account -> Phone

So I need to keep to keep track of where exactly in the XML tree I am in order to get the right Phone number

As far as I can figure out from the xml.sax documentation at the Python website, there is no "easy" method or variable that is set.

So, this is what I did:

import xml.sax

class Exact(xml.sax.handler.ContentHandler):
  def __init__(self):
    self.curpath = []

  def startElement(self, name, attrs):
    self.curpath.append(name)

    if name == 'Phone':
      print self.curpath, name

  def endElement(self, name):
    self.curpath.pop()

if __name__ == '__main__':
  parser = xml.sax.make_parser()
  handler = Exact()
  parser.setContentHandler(handler)
  parser.parse(open('/home/cronuser/xml/mount/daily/debtors.xml'))

This is not very difficult, but since I don't have a lot of experience with XML I wonder if this the "generally accepted" or "best possible" manner?

Thanks开发者_运维问答 :)


I used sax too, but then I found a better tool: iterparse from ElementTree.

It is similar to sax, but you can retrieve elements with contents, to free the memory you just have to clear element once retrieved.


Is there a specific reason you need to use SAX for this?

Because, if loading the entire XML file into an in-memory object model is acceptable, you'd probably find it a lot easier to use the ElementTree DOM API.

(If you don't need the ability to retrieve a parent node when given a child, cElementTree in the Python standard library should do the trick nicely. If you do, the LXML library provides an ElementTree implementation that'll give you parent references. Both use compiled C modules for speed.)


Thanks or all the comments.

I looked at ElementTree's iterparse, but by then I already did quite some code in xml.sax. Since the immediate advantage of iterparse is small to nonexistent I choose to just stick with xml.sax. It's already a large advantage over the current solution.

Alright, so this is what I did in the end.

class Exact(xml.sax.handler.ContentHandler):
    def __init__(self, stdpath):
        self.stdpath = stdpath

        self.thisrow = {}
        self.curpath = []
        self.getvalue = None

        self.conn = MySQLConnect()
        self.table = None
        self.numrows = 0

    def __del__(self):
        self.conn.close()

        print '%s rows affected' % self.numrows

    def startElement(self, name, att):
        self.curpath.append(name)

    def characters(self, data):
        if self.getValue is not None:
            self.thisrow[self.getValue.strip()] = data.strip()
            self.getValue = None

    def endElement(self, name):
        self.curpath.pop()

        if name == self.stdpath[len(self.stdpath) - 1]:
            self.EndRow()
            self.thisrow = { }

    def EndRow(self):
        self.numrows += MySQLInsert(self.conn, self.thisrow, True, self.table)
        #for k, v in self.thisrow.iteritems():
        #   print '%s: %s,' % (k, v),
        #print ''

    def curPath(self, full=False):
        if full:
            return ' > '.join(self.curpath)
        else:
            return ' > '.join(self.curpath).replace(' > '.join(self.stdpath) + ' > ', '')

I then subclass this a number of times for different XML files:

class Debtors(sqlimport.Exact):
    def startDocument(self):
        self.table = 'debiteuren'
        self.address = None

    def startElement(self, name, att):
        sqlimport.Exact.startElement(self, name, att)

        if self.curPath(True) == ' > '.join(self.stdpath):
            self.thisrow = {}
            self.thisrow['debiteur'] = att.get('code').strip()
        elif self.curPath() == 'Name':
            self.getValue = 'naam'
        elif self.curPath() == 'Phone':
            self.getValue = 'telefoon1'
        elif self.curPath() == 'ExtPhone':
            self.getValue = 'telefoon2'
        elif self.curPath() == 'Contacts > Contact > Addresses > Address':
            if att.get('type') == 'V':
                self.address = 'Contacts > Contact > Addresses > Address '
        elif self.address is not None:
            if self.curPath() == self.address + '> AddressLine1':
                self.getValue = 'adres1'
            elif self.curPath() == self.address + '> AddressLine2':
                self.getValue = 'adres2'
        else:
            self.getValue = None

if __name__ == '__main__':
    handler = Debtors(['Debtors', 'Accounts', 'Account'])
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    parser.parse(open('myfile.xml', 'rb'))

... and so forth ...


I think the simplest solution is exactly what you are doing in your example - maintain a stack of nodes.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜