How to parse a file with text (with offset information) and binary data in Python

2023-01-26 11:12 Q&A author:

I have an XML file that contains a set of textual element tags (each gives the decimal offset and data length of the corresponding binary element), followed by the binary data of all the elements at the end. An example is as follows.

<?xml version="1.0" encoding="UTF-8"?>
<Package>
  <element>
        <offset>0</offset>
        <length>2961181</length>
        <checksum>4238515972</checksum>
        <format>gzip</format>
  </element>
  <element>
        <offset>2961181</offset>
        <length>5442</length>
        <checksum>4238515972</checksum>
        <format>bin</format>
  </element>
</Package>
BINARY_DATA

Please note that the offset is decimal and counts from the first byte after the header. How can I parse this file in Python, grab each element based on its offset, decompress it (if its format is gzip), and store it as a file?

Well, based on the replies from OmnipotentEntity and Jakob_B, I wrote the following short script, just to see whether it works for the first element:

import zlib

f = open("file.xml", "r")
text = f.read()
position = text.find("</Package>\n")
headerSize=position+ len("</Package>\n") + 1 
offset=0
f.seek(headerSize + offset) 
length = 2961181
bin_data = f.read(length)
zipped=1
if (zipped):
  ungziped_str = zlib.decompressobj().decompress('x\x9c' + bin_data)
  print(ungziped_str)
f.close()

However, I got the following error:

Traceback (most recent call last):
  File "file_parse.py", line 11, in ?
    ungziped_str = zlib.decompressobj().decompress('x\x9c' + bin_data)
zlib.error: Error -3 while decompressing: invalid block type

What is the problem? Is the input file incorrect, or is the code incorrect?

The trick is going to be stopping XML parsers from puking on the binary data. lxml lets you feed a line at a time to a parser, so you can watch for the last XML tag and stop there:

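from lxml import etree

def process(filename):
    # Feed the XML header to the parser one line at a time and stop at the
    # closing root tag, so the trailing binary data never reaches the parser.
    parser = etree.XMLParser()
    with open(filename, "rb") as f:  # binary mode: the tail is not valid text
        for line in f:
            parser.feed(line)
            if line == b"</Package>\n":
                break
    return parser.close()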

That returns an

>>> r = process("junk.xml")
>>> r
<Element Package at 9f14eb4>

which is an lxml object you can get the data out of. The second object's offset is here:

>>> r[1][0].text
'2961181'

and so on. That should be enough for you to build a workable solution. Beware the line ending on the Package tag, though: the comparison will not match if the file uses a different line ending, and there might be a better way to detect the end of the header.
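To make the idea concrete, here is a minimal sketch of the same line-by-line approach that also collects each element's fields and measures the header size. It is only an illustration under stated assumptions: it swaps lxml for the standard library's `xml.etree.ElementTree` (whose `XMLParser` also offers `feed()` and `close()`), the function name `read_header` is made up for this example, and it assumes the closing `</Package>` tag sits on its own line with the binary data starting right after that line's newline.

```python
import xml.etree.ElementTree as ET


def read_header(path, end_tag=b"</Package>"):
    """Parse the XML header; return (list of element dicts, header size in bytes).

    Hypothetical helper: assumes the closing tag is alone on its line and the
    binary data begins immediately after that line.
    """
    parser = ET.XMLParser()
    header_size = 0
    with open(path, "rb") as f:            # binary mode: the tail is not text
        for line in f:
            parser.feed(line)
            header_size += len(line)       # count every header byte, newline included
            if line.strip() == end_tag:    # strip() tolerates \n and \r\n endings
                break
    root = parser.close()
    elements = [
        {
            "offset": int(e.findtext("offset")),
            "length": int(e.findtext("length")),
            "checksum": int(e.findtext("checksum")),
            "format": e.findtext("format"),
        }
        for e in root.findall("element")
    ]
    return elements, header_size
```

Because the header size is accumulated from the raw line lengths, it already points at the first byte after the header, so no extra `+ 1` is needed when seeking.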

Why not run a search for the end tag using lxml? Then when the end tag is found just .seek() to that point and read binary data.

Determine header size.

Grab offset and data length using xml magic

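import zlib

# f: the file opened in binary mode ("rb"); headerSize, offset, length and
# zipped come from the two steps above
f.seek(headerSize + offset)
mydata = f.read(length)
if zipped:
    ungziped_str = zlib.decompressobj().decompress(b'x\x9c' + mydata)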

Then write to file as normal.

Source for gunzip magic http://codingrecipes.com/ungzip-a-string-in-python-gzinflate-in-python
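As a rough sketch of the seek, read, decompress, and write steps, something like the following might work. The function name `extract_element` and its parameters are invented for this example, and `header_size` is assumed to be the number of bytes up to and including the newline after `</Package>`. Instead of prepending `'x\x9c'` (which only suits raw deflate data), it asks zlib to auto-detect a gzip or zlib header via `wbits = zlib.MAX_WBITS | 32`. That is a swapped-in technique, on the assumption that elements marked `gzip` really are gzip streams.

```python
import zlib


def extract_element(path, header_size, offset, length, fmt, out_path):
    """Copy one binary element to out_path, decompressing it if fmt is 'gzip'.

    Hypothetical helper; returns the number of bytes written.
    """
    with open(path, "rb") as f:         # binary mode, so bytes are read unchanged
        f.seek(header_size + offset)    # offsets count from the first byte after the header
        data = f.read(length)
    if len(data) != length:
        raise ValueError("file is shorter than the header claims")
    if fmt == "gzip":
        # MAX_WBITS | 32 makes zlib accept either a gzip or a zlib header
        data = zlib.decompress(data, zlib.MAX_WBITS | 32)
    with open(out_path, "wb") as out:
        out.write(data)
    return len(data)
```

An "invalid block type" error can come from a read that starts one byte off, from reading the file in text mode, or from treating a gzip stream as raw deflate, so those are worth checking first if decompression fails.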
