Parsing XML with Python xml.sax: How does one "keep track" of where in the tree you are?
I need to regularly export XML files from our Administration software.
This is the first time I'm using XML parsing in Python. Parsing the XML with xml.sax
isn't terribly difficult, but what is the best way to "keep track" of where in the XML tree you are?
For example, I have a list of our customers. I want to extract the telephone number from the Phone element, but there are multiple places where Phone occurs:
eExact -> Accounts -> Account -> Contacts -> Contact -> Addresses -> Address -> Phone
eExact -> Accounts -> Account -> Contacts -> Contact -> Phone
eExact -> Accounts -> Account -> Phone
So I need to keep track of where exactly in the XML tree I am in order to get the right Phone number.
As far as I can figure out from the xml.sax documentation at the Python website, there is no "easy" method or variable that is set.
So, this is what I did:
    import xml.sax

    class Exact(xml.sax.handler.ContentHandler):
        def __init__(self):
            # Stack of element names from the root down to the current element
            self.curpath = []

        def startElement(self, name, attrs):
            self.curpath.append(name)
            if name == 'Phone':
                print(self.curpath, name)

        def endElement(self, name):
            self.curpath.pop()

    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        handler = Exact()
        parser.setContentHandler(handler)
        parser.parse(open('/home/cronuser/xml/mount/daily/debtors.xml'))
This is not very difficult, but since I don't have a lot of experience with XML, I wonder if this is the "generally accepted" or "best possible" manner?
Thanks :)
I used sax too, but then I found a better tool: iterparse from ElementTree.
It is similar to SAX, but you retrieve whole elements together with their contents. To free the memory, you just have to clear each element once you have processed it.
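As a rough sketch of what that could look like for the Phone example from the question (Python 3 syntax; the sample document, its values, and the name account_phones are made up for illustration), the same "stack of tag names" idea carries over to iterparse:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the paths listed in the question.
SAMPLE = b"""<eExact>
  <Accounts>
    <Account code="1">
      <Phone>111</Phone>
      <Contacts>
        <Contact>
          <Phone>222</Phone>
          <Addresses>
            <Address type="V"><Phone>333</Phone></Address>
          </Addresses>
        </Contact>
      </Contacts>
    </Account>
    <Account code="2">
      <Phone>444</Phone>
    </Account>
  </Accounts>
</eExact>"""


def account_phones(source):
    """Yield (account code, phone) for Phone elements directly under Account."""
    wanted = ["eExact", "Accounts", "Account", "Phone"]
    path = []    # stack of tag names from the root to the current element
    code = None  # code attribute of the Account currently being read
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            if elem.tag == "Account":
                code = elem.get("code")
        else:
            # On "end" the element is complete, so its text is available.
            if path == wanted:
                yield code, (elem.text or "").strip()
            path.pop()
            if elem.tag == "Account":
                elem.clear()  # drop the finished Account subtree to save memory


phones = list(account_phones(io.BytesIO(SAMPLE)))
print(phones)
```

Only the Phone directly under Account is reported here; the ones nested under Contact and Address are skipped because their path differs.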
Is there a specific reason you need to use SAX for this?
Because, if loading the entire XML file into an in-memory object model is acceptable, you'd probably find it a lot easier to use the ElementTree DOM API.
(If you don't need the ability to retrieve a parent node when given a child, cElementTree in the Python standard library should do the trick nicely. If you do, the LXML library provides an ElementTree implementation that'll give you parent references. Both use compiled C modules for speed.)
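To illustrate, here is a minimal sketch with the standard library's xml.etree.ElementTree (Python 3 syntax). It assumes the file fits in memory; the sample document and the dictionary keys are invented for the example. The path expressions passed to findall and findtext select each Phone by its position, so no manual tracking is needed:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the paths listed in the question.
SAMPLE = """<eExact>
  <Accounts>
    <Account code="1">
      <Phone>111</Phone>
      <Contacts>
        <Contact>
          <Phone>222</Phone>
          <Addresses>
            <Address type="V"><Phone>333</Phone></Address>
          </Addresses>
        </Contact>
      </Contacts>
    </Account>
    <Account code="2">
      <Phone>444</Phone>
    </Account>
  </Accounts>
</eExact>"""

root = ET.fromstring(SAMPLE)  # root is the <eExact> element

rows = []
for account in root.findall("Accounts/Account"):
    rows.append({
        "code": account.get("code"),
        # "Phone" matches direct children of the Account only
        "account_phone": account.findtext("Phone"),
        "contact_phones": [
            p.text for p in account.findall("Contacts/Contact/Phone")
        ],
        "address_phones": [
            p.text
            for p in account.findall("Contacts/Contact/Addresses/Address/Phone")
        ],
    })

for row in rows:
    print(row)
```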
Thanks for all the comments.
I looked at ElementTree's iterparse, but by then I had already written quite a bit of code in xml.sax. Since the immediate advantage of iterparse is small to nonexistent, I chose to just stick with xml.sax. It's already a large improvement over the current solution.
Alright, so this is what I did in the end.
    import xml.sax

    # MySQLConnect and MySQLInsert are my own helper functions (not shown here).

    class Exact(xml.sax.handler.ContentHandler):
        def __init__(self, stdpath):
            xml.sax.handler.ContentHandler.__init__(self)
            self.stdpath = stdpath    # path to the element that represents one row
            self.thisrow = {}         # column name -> value for the current row
            self.curpath = []         # stack of element names down to the current one
            self.getValue = None      # column to fill from the next text node
            self.conn = MySQLConnect()
            self.table = None
            self.numrows = 0

        def __del__(self):
            self.conn.close()
            print('%s rows affected' % self.numrows)

        def startElement(self, name, att):
            self.curpath.append(name)

        def characters(self, data):
            if self.getValue is not None:
                self.thisrow[self.getValue.strip()] = data.strip()
                self.getValue = None

        def endElement(self, name):
            self.curpath.pop()
            # The last entry of stdpath is the row element; flush when it closes
            if name == self.stdpath[-1]:
                self.EndRow()
                self.thisrow = {}

        def EndRow(self):
            self.numrows += MySQLInsert(self.conn, self.thisrow, True, self.table)
            # for k, v in self.thisrow.items():
            #     print('%s: %s,' % (k, v), end=' ')
            # print('')

        def curPath(self, full=False):
            if full:
                return ' > '.join(self.curpath)
            else:
                # Path relative to the row element
                return ' > '.join(self.curpath).replace(' > '.join(self.stdpath) + ' > ', '')
I then subclass this a number of times for different XML files:
    import xml.sax
    import sqlimport  # the module that contains the Exact class above

    class Debtors(sqlimport.Exact):
        def startDocument(self):
            self.table = 'debiteuren'
            self.address = None

        def startElement(self, name, att):
            sqlimport.Exact.startElement(self, name, att)

            if self.curPath(True) == ' > '.join(self.stdpath):
                # A new row element starts here
                self.thisrow = {}
                self.thisrow['debiteur'] = att.get('code').strip()
            elif self.curPath() == 'Name':
                self.getValue = 'naam'
            elif self.curPath() == 'Phone':
                self.getValue = 'telefoon1'
            elif self.curPath() == 'ExtPhone':
                self.getValue = 'telefoon2'
            elif self.curPath() == 'Contacts > Contact > Addresses > Address':
                if att.get('type') == 'V':
                    self.address = 'Contacts > Contact > Addresses > Address '
            elif self.address is not None:
                if self.curPath() == self.address + '> AddressLine1':
                    self.getValue = 'adres1'
                elif self.curPath() == self.address + '> AddressLine2':
                    self.getValue = 'adres2'
            else:
                self.getValue = None

    if __name__ == '__main__':
        handler = Debtors(['Debtors', 'Accounts', 'Account'])
        parser = xml.sax.make_parser()
        parser.setContentHandler(handler)
        parser.parse(open('myfile.xml', 'rb'))
... and so forth ...
I think the simplest solution is exactly what you are doing in your example - maintain a stack of nodes.
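A compact version of that idea might look like the sketch below (Python 3 syntax). The class name PathHandler, the sample document, and the chosen target path are all hypothetical. One detail worth noting: a SAX parser is allowed to deliver a single text node through several characters() calls, so the text is buffered and only joined when the element closes.

```python
import xml.sax

# Hypothetical sample, shaped like the paths listed in the question.
SAMPLE = b"""<eExact>
  <Accounts>
    <Account code="1">
      <Phone>111</Phone>
      <Contacts>
        <Contact>
          <Phone>222</Phone>
          <Addresses>
            <Address type="V"><Phone>333</Phone></Address>
          </Addresses>
        </Contact>
      </Contacts>
    </Account>
  </Accounts>
</eExact>"""


class PathHandler(xml.sax.handler.ContentHandler):
    """Collect the text of every element whose full path equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = list(wanted)
        self.path = []    # stack of currently open element names
        self.buffer = []  # text chunks of the wanted element being read
        self.found = []   # collected text values

    def startElement(self, name, attrs):
        self.path.append(name)
        if self.path == self.wanted:
            self.buffer = []

    def characters(self, data):
        # May be called more than once per text node, so append chunks.
        if self.path == self.wanted:
            self.buffer.append(data)

    def endElement(self, name):
        if self.path == self.wanted:
            self.found.append("".join(self.buffer).strip())
        self.path.pop()


handler = PathHandler(
    ["eExact", "Accounts", "Account", "Contacts", "Contact", "Phone"]
)
xml.sax.parseString(SAMPLE, handler)
print(handler.found)
```

Comparing the whole stack against a target list avoids building and comparing joined path strings, and it cannot accidentally match a Phone at a different depth.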