Python: Create multiple file objects while reading a file

I am reading a large file that contains multiple <xml>..</xml> elements, one after another. Since XML parsers expect a single root element and fail on such input, I would like to efficiently produce a new file object for each <xml>..</xml> block.

I started subclassing the file object in Python, but got stuck there. I think I have to intercept each line starting with </xml> and return a new file object, maybe by using yield.

Can someone point me in the right direction?

Here is my current code fragment:

#!/usr/bin/env python

from lxml import etree
from StringIO import StringIO

class handler(file):
  def __init__(self, name, mode):
    file.__init__(self, name, mode)

  def next(self):
    return file.next(self)

  def listXmls(self):
    output = StringIO()
    line = self.next()
    while line is not None:
      output.write(line.strip())
      if line.strip() == '</xml>':
        yield output
        output = StringIO()
      try:
        line = self.next()
      except StopIteration:
        break
    output.close()

f = handler('myxml.xml', 'r')
for elem in f.listXmls():
  print 'm' + elem.getvalue() + 'm'
  context = etree.iterparse(elem, events=('end',), tag='id')
  for event, element in context:
    print element.tag

Thanks!

SOLUTION (still interested in a better version):

#!/usr/bin/env python

from lxml import etree
from StringIO import StringIO

class handler(file):
  # The inherited __init__ and next are sufficient, so no overrides are needed.

  def listXmls(self):
    # Yield one StringIO per document. A line starting with '<?xml'
    # marks the beginning of the next document.
    output = StringIO()
    for line in self:
      if line.startswith('<?xml') and output.tell() > 0:
        # Rewind so the parser reads the buffer from the start.
        output.seek(0)
        yield output
        output = StringIO()
      output.write(line)
    # Emit the last document, which has no following '<?xml' line.
    if output.tell() > 0:
      output.seek(0)
      yield output

f = handler('myxml.xml', 'r')
for elem in f.listXmls():
  context = etree.iterparse(elem, events=('end',), tag='id')
  for event, element in context:
    print element.tag
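
A possible "better version", offered only as a sketch: the splitting does not need a file subclass, since a plain generator over any iterable of lines can do the same job. The code below assumes Python 3, assumes every document starts on a line beginning with '<?xml' (as in the solution above), and uses the standard library's xml.etree.ElementTree in place of lxml so that it has no dependencies. The name iter_xml_documents is made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def iter_xml_documents(lines, marker="<?xml"):
    """Yield one io.StringIO per document found in an iterable of lines.

    Assumption: every document starts on a line beginning with `marker`.
    """
    buffer = []
    for line in lines:
        # A new declaration closes the previous document, if there is one.
        if line.startswith(marker) and buffer:
            yield io.StringIO("".join(buffer))
            buffer = []
        buffer.append(line)
    # The last document has no following declaration line.
    if buffer:
        yield io.StringIO("".join(buffer))


# Usage with an in-memory sample; an open text file works the same way.
sample = io.StringIO(
    '<?xml version="1.0"?>\n<xml><id>1</id></xml>\n'
    '<?xml version="1.0"?>\n<xml><id>2</id></xml>\n'
)
ids = []
for doc in iter_xml_documents(sample):
    for event, element in ET.iterparse(doc, events=("end",)):
        if element.tag == "id":
            ids.append(element.text)
print(ids)
```

Only one document is held in memory at a time, and because the generator accepts any iterable of lines it can be tried on a list or a StringIO without touching the disk.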


While not a direct answer to your question, this may solve your problem anyway: Simply adding another <xml> at the beginning and another </xml> at the end will probably make your XML parser accept the document:

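from lxml import etree
document = "<xml>a</xml> <xml>b</xml>"
# Wrap the fragments in one root element so the parser sees a single document.
document = "<xml>" + document + "</xml>"
for subdocument in etree.XML(document):
    # Each child of the wrapper is one of the original <xml>..</xml> blocks.
    print(subdocument.text)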
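
One caveat with the wrapping trick: if each block carries its own <?xml ...?> declaration (the '<?xml' check in the question's solution suggests it does), the wrapped text is no longer well-formed, because a declaration is only allowed at the very start of a document. A rough sketch of a workaround is to strip the declarations before wrapping. It assumes Python 3, uses the standard library's xml.etree.ElementTree rather than lxml, and the names parse_fragments and "wrapper" are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration such as <?xml version="1.0"?>.
XML_DECLARATION = re.compile(r"<\?xml\s[^>]*\?>")


def parse_fragments(text, root_tag="wrapper"):
    """Return the top-level elements of `text` by parsing it under one root."""
    # Remove the per-fragment declarations, which are illegal inside a document.
    body = XML_DECLARATION.sub("", text)
    root = ET.fromstring("<{0}>{1}</{0}>".format(root_tag, body))
    return list(root)


document = '<?xml version="1.0"?><xml>a</xml> <?xml version="1.0"?><xml>b</xml>'
texts = [element.text for element in parse_fragments(document)]
print(texts)
```

This reads and parses the whole input in one go, so for a very large file a line-by-line split is likely the better fit.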
