Searching/reading binary data in Python

2023-01-07 02:13 Q&A author:

I'm reading in a binary file (a jpg in this case), and need to find some values in that file. For those interested, the binary file is a jpg and I'm attempting to pick out its dimensions by looking for the binary structure as detailed here.

I need to find FFC0 in the binary data, skip ahead some number of bytes, and then read 4 bytes (this should give me the image dimensions).

What's a good way of searching for the value in the binary data? Is there an equivalent of 'find', or something like re?

You could actually load the file into a string and search that string for the byte sequence 0xffc0 using the str.find() method. It works for any byte sequence.

The code to do this depends on a couple things. If you open the file in binary mode and you're using Python 3 (both of which are probably best practice for this scenario), you'll need to search for a byte string (as opposed to a character string), which means you have to prefix the string with b.

with open(filename, 'rb') as f:
    s = f.read()
s.find(b'\xff\xc0')

If you open the file in text mode in Python 3, you'd have to search for a character string:

with open(filename, 'r') as f:
    s = f.read()
s.find('\xff\xc0')

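though there's no particular reason to do this. It gives you no advantage over the previous way, and it can cause problems: in Python 3, text mode decodes the file's bytes as text, which can fail or alter the data for a binary file like a jpg, and on a platform that treats binary files and text files differently (e.g. Windows) line endings get translated as well.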

Python 2 doesn't make the distinction between byte strings and character strings, so if you're using that version, it doesn't matter whether you include or exclude the b in b'\xff\xc0'. And if your platform treats binary files and text files identically (e.g. Mac or Linux), it doesn't matter whether you use 'r' or 'rb' as the file mode either. But I'd still recommend using something like the first code sample above just for forward compatibility - in case you ever do switch to Python 3, it's one less thing to fix.
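To finish the task from the question with the bytes.find() approach, something like the following might work. It's only a sketch: the function name jpeg_size_from_sof0 and the sample bytes are made up for illustration, and it assumes the first FFC0 in the data really is the Start of Frame marker (a plain search can't tell, and progressive jpgs use FFC2 instead). After the 2-byte marker come a 2-byte segment length and a 1-byte sample precision, then the height and width as big-endian 16-bit values.

```python
import struct

def jpeg_size_from_sof0(data):
    """Return (width, height) read after the first FFC0 marker, or None."""
    pos = data.find(b'\xff\xc0')
    # Need the marker (2 bytes), length (2), precision (1), height (2), width (2).
    if pos == -1 or pos + 9 > len(data):
        return None
    # Skip marker + length + precision (5 bytes), then read two big-endian uint16s.
    height, width = struct.unpack('>HH', data[pos + 5:pos + 9])
    return width, height

# Hypothetical, hand-built bytes: SOI, then an SOF0 header for a 640x480 image.
sample = b'\xff\xd8' + b'\xff\xc0' + b'\x00\x11' + b'\x08' + b'\x01\xe0' + b'\x02\x80' + b'\x03'
print(jpeg_size_from_sof0(sample))  # (640, 480)
```

With a real file you'd pass in the bytes from f.read() on a file opened in 'rb' mode.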

Instead of reading the entire file into memory, searching it and then writing a new file out to disk you can use the mmap module for this. mmap will not store the entire file in memory and it allows for in-place modification.

#!/usr/bin/python

import mmap

with open("hugefile", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    # find() returns the byte offset of the first match, or -1 if absent
    print(mm.find(b'\x00\x09\x03\x03'))
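If you only need to search and not modify, a read-only mapping is probably enough. Here's a minimal Python 3 sketch; the helper name find_in_file and the throwaway temp file are just for illustration and are not part of the mmap module.

```python
import mmap
import os
import tempfile

def find_in_file(path, needle):
    """Return the byte offset of needle in the file at path, or -1."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)

# Demo on a throwaway file: 1000 filler bytes, then the marker.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as out:
        out.write(b'\x00' * 1000 + b'\xff\xc0' + b'tail')
    offset = find_in_file(path, b'\xff\xc0')
    print(offset)  # 1000
finally:
    os.remove(path)
```

Opening with 'rb' and ACCESS_READ means the file can't be changed by accident through the mapping.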

The bitstring module was designed for pretty much this purpose. For your case the following code (which I haven't tested) should help illustrate:

from bitstring import ConstBitStream
# Can initialise from files, bytes, etc.
s = ConstBitStream(filename='your_file')
# Search to Start of Frame 0 code on byte boundary
found = s.find('0xffc0', bytealigned=True)
if found:
    # find() returns a tuple holding the bit position of the match
    print("Found start code at byte offset %d." % (found[0] // 8))
    s0f0, length, bitdepth, height, width = s.readlist(
        'hex:16, uint:16, uint:8, 2*uint:16')
    print("Width %d, Height %d" % (width, height))

The re module does work with both string and binary data (str in Python 2 and bytes in Python 3), so you can use it as well as str.find for your task.
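As a small illustration of re on binary data, here's a sketch that lists every offset where a byte sequence occurs, which takes a loop if you do it with find(). The sample data is invented; re.escape() is used so the needle is matched literally even if it happens to contain bytes that are special in a regex.

```python
import re

# Hypothetical sample data containing the marker twice.
data = b'ab\xff\xc0cd\xff\xc0ef'

# A bytes pattern can only be used to search bytes, not str.
pattern = re.compile(re.escape(b'\xff\xc0'))
offsets = [m.start() for m in pattern.finditer(data)]
print(offsets)  # [2, 6]
```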

In Python 3.x you can search a byte string by another byte string like this:

>>> byte_array = b'this is a byte array\r\n\r\nXYZ\x80\x04\x95 \x00\x00\x00\x00\x00'
>>> byte_array.find('\r\n\r\n'.encode())
20
>>>

Use the find() method only if you need to know the position of the subsequence; if you don't, you can use the in operator, for example:

with open("foo.bin", 'rb') as f:
    if b'\x00' in f.read():
        print('The file is binary!')
    else:
        print('The file is not binary!')

Well, obviously there is PIL: an image opened with its Image module has size as an attribute. If you want to get the size exactly how you suggest, without loading the file, you are going to have to go through it byte by byte. Not the nicest way to do it, but it would work.
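Going through the file by hand can be done a segment at a time. A jpg starts with FFD8, and each following segment is a 2-byte marker plus a 2-byte big-endian length that counts itself. Below is a simplified sketch under those assumptions: jpeg_size and the sample bytes are made up for illustration, and it doesn't handle padding bytes or markers that carry no length.

```python
import io
import struct

# Start of Frame markers that carry dimensions (C4, C8 and CC are not SOF markers).
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}

def jpeg_size(f):
    """Walk the segments of binary file object f; return (width, height) or None."""
    if f.read(2) != b'\xff\xd8':  # every jpg starts with the SOI marker
        return None
    while True:
        header = f.read(4)  # marker (2 bytes) + segment length (2 bytes)
        if len(header) < 4 or header[0] != 0xFF:
            return None
        marker = header[1]
        length = struct.unpack('>H', header[2:4])[0]
        if marker in SOF_MARKERS:
            body = f.read(5)  # precision (1), height (2), width (2)
            if len(body) < 5:
                return None
            height, width = struct.unpack('>HH', body[1:5])
            return width, height
        # The length includes its own 2 bytes, so skip the remainder.
        f.seek(length - 2, 1)

# Hypothetical sample: SOI, a tiny APP0 segment, then an SOF2 header (200x100).
sample = b'\xff\xd8' + b'\xff\xe0\x00\x04ab' + b'\xff\xc2\x00\x11\x08\x00\x64\x00\xc8'
print(jpeg_size(io.BytesIO(sample)))  # (200, 100)
```

For a real image you'd pass a file opened with open(filename, 'rb'); only the headers up to the frame marker get read.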

For Python >=3.2:

import re

with open("filename.jpg", "rb") as f:
    byte = f.read()

matchObj = re.match( b'\xff\xd8.*\xff\xc0...(..)(..).*\xff\xd9', byte, re.MULTILINE|re.DOTALL)
if matchObj:
    # https://stackoverflow.com/q/444591
    print (int.from_bytes(matchObj.group(1), 'big')) # height
    print (int.from_bytes(matchObj.group(2), 'big')) # width
