JPG+Zip File Combination Problem with Zip Format
Hopefully you've heard of the neat hack that lets you combine a JPG and a Zip file into a single file that is valid (or at least readable) in both formats. Well, I realized that since JPG allows arbitrary data at the end, and ZIP at the beginning, you could stick one more format in there, in the middle. For the purposes of this question, assume the middle data is arbitrary binary data guaranteed not to conflict with the JPG or ZIP formats (meaning it doesn't contain the magic zip header 0x04034b50). Illustration:
0xFFD8 <- start jpg data end -> 0xFFD9 ... ARBITRARY BINARY DATA ... 0x04034b50 <- start zip file ... EOF
I am catting like this:
cat "mss_1600.jpg" filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb "null.bytes" "randomzipfile.zip" > temp.zip
This produces a 6,318 KB file. It does not open in 7-Zip. However, when I cat one less 'double' (so instead of 13 filea and b's, 12):
cat "mss_1600.jpg" filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb "null.bytes" "randomzipfile.zip" > temp.zip
It produces a 5,996 KB file that does open in 7-Zip.
So I know my arbitrary binary data doesn't contain the magic Zip file header to screw it up. I have reference files of the working jpg+data+zip and the non-working jpg+data+zip (use save-as, because the browser thinks they're images, and add the zip extensions yourself).
I want to know why it fails with 13 combinations and doesn't with 12. For bonus points, I need to get around this somehow.
I downloaded the source for 7-Zip and figured out what is causing this to happen.
In CPP/7zip/UI/Common/OpenArchive.cpp, you'll see the following:
// Static-SFX (for Linux) can be big.
const UInt64 kMaxCheckStartPosition = 1 << 22;
That means that only the first 4194304 bytes of the file will be searched for the header. If it isn't found there, 7-Zip considers it an invalid file.
You can double that limit by changing 1 << 22 to 1 << 23. I tested that change by rebuilding 7-Zip and it works.
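To check a combined file against that limit without rebuilding anything, here is a minimal Python sketch. It only approximates the behaviour described above: it looks for the first local file header signature and compares its offset with 1 << 22. The function names are made up for illustration, and the exact boundary handling inside 7-Zip is not modelled.

```python
ZIP_LOCAL_HEADER = b"PK\x03\x04"  # the bytes 50 4B 03 04
SEARCH_LIMIT = 1 << 22            # 4194304, the value of kMaxCheckStartPosition


def zip_signature_offset(data: bytes) -> int:
    """Return the offset of the first local file header signature, or -1."""
    return data.find(ZIP_LOCAL_HEADER)


def found_within_limit(data: bytes, limit: int = SEARCH_LIMIT) -> bool:
    """True if the signature starts inside the first `limit` bytes.

    Assumption: this is a rough stand-in for the search window,
    not a port of the 7-Zip code.
    """
    offset = zip_signature_offset(data)
    return 0 <= offset < limit
```

Running it on the bytes of the 12-pair and 13-pair files should show which side of the limit each one falls on.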
EDIT: To get around this issue, you can download the source, make the above change, and build it. I built it using VS 2008. Open the VS command prompt, navigate to extracted-source-location\CPP\7zip\Bundles and type 'nmake'. Then in the Alone directory run '7za t nonworking.jpg' and you should see 'Everything is Ok'.
Actually it is a two part answer really :)
Firstly, no matter what people say, zip files cannot technically be put verbatim at the end of files. The end of central directory record has a value which indicates the byte offset of the central directory from the start of the current disk (if you have only one .zip file, that means the current file). Now a lot of processors ignore this, although Windows' zip folder support doesn't, so you need to correct that value to make it work in Windows Explorer (not that you might care ;P). See the Zip APPNOTE for info on the file format. Basically, use a hex editor (or write a tool) to find the "offset of start of central directory with respect to the starting disk number" value. Then find the first "central file header signature" (hex 504b0102) and set the value to that signature's offset.
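As a rough sketch of the "write a tool" option, the Python below patches that one field. It assumes a plain (non-Zip64, single-disk) archive, that the last PK\x05\x06 in the file is the real end of central directory record, and that the first PK\x01\x02 in the file is the real start of the central directory (so the data in front of the zip must not contain that sequence). The function name is invented for this example.

```python
import struct

EOCD_SIG = b"PK\x05\x06"     # end of central directory record signature
CENTRAL_SIG = b"PK\x01\x02"  # central file header signature


def fix_central_directory_offset(data: bytes) -> bytes:
    """Return a copy of data with the EOCD's central directory offset corrected."""
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("no end of central directory record found")
    cd_start = data.find(CENTRAL_SIG)
    if cd_start < 0 or cd_start > eocd:
        raise ValueError("no central file header found before the EOCD")
    # The 4-byte little-endian offset field sits 16 bytes into the record:
    # signature (4) + disk numbers (2 + 2) + entry counts (2 + 2) + size (4).
    field = eocd + 16
    return data[:field] + struct.pack("<I", cd_start) + data[field + 4:]
```

This only fixes the pointer to the central directory; the per-file offsets inside the central directory are a separate matter.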
Now, alas, that doesn't fix 7zip, but that is due to the way 7zip tries to guess the file format. Basically it will only search the first ~4MiB for the binary sequence 504b0304; if it doesn't find it, it assumes the file isn't Zip and tries its other archive formats. This is why adding one more file breaks things: it pushes the sequence past the limit for the search.
Now, to fix it, what you need to do is add that hex string to the jpeg without breaking it. One way of doing this is to add the following hex data just after the FFD8 JPEG SOI header: FFEF0005504B030400 . That adds a custom block containing your sequence, so JPEG readers should just ignore it.
So for anyone else finding this question, here's the story:
Yes, Andy is literally correct as to why 7-Zip is failing on the file, but it doesn't help my problem since I can't exactly get people to use MY version of 7-Zip.
tyranid however got me the solution.
- First off, adding a small bytestring to the JPG as he suggests will let 7-Zip open it. However, it's slightly off from a valid JPG fragment: it needs to be FFEF00 07 504B030400. The length was off by 2 bytes, because the length field counts its own two bytes.
- This lets 7-Zip open it, but not extract files; it fails silently. This is because the entries in the central directory have internal pointers/offsets that point to the entry of the file. Since you put a bunch of stuff before that, you need to correct all those pointers!
- To have the zip open with Windows' built-in zip support, you need to, as tyranid says, correct the "offset of start of central directory with respect to the starting disk number". Here is a Python script to do the last two, although it's a fragment, not copypasta-ready-to-use
from struct import pack, unpack

# Now we need to read the file and rewrite all the zip headers. Fun!
# magicfilename is the path of the combined jpg+data+zip file (set elsewhere).
with open(magicfilename, 'rb') as torewrite:
    magicdata = torewrite.read()

# Change the central directory's offset
offsetOfCentralRepro = magicdata.find(b'\x50\x4B\x01\x02')  # this is the beginning of the central directory
# It so happens that on my files the offset is stored 6 bytes from the end:
# datadatadata OF FS ET !! 00 00 EOF, where OFFSET!! is the 4 offset bytes
# and 00 00 are the last two bytes (an empty comment length), then EOF.
start = len(magicdata) - 6
magicdata = magicdata[:start] + pack('<I', offsetOfCentralRepro) + magicdata[start+4:]

# Now change the individual offsets in the central directory entries
startOfCentralDirectoryEntry = magicdata.find(b'\x50\x4B\x01\x02', 0)  # find the first central directory entry
# find the first file entry (we start at 10 to skip past the fake entry in the jpg)
startOfFileDirectoryEntry = magicdata.find(b'\x50\x4B\x03\x04', 10)
while startOfCentralDirectoryEntry > 0:
    # Move a magic number of bytes past the start of the entry (really! It's 42!)
    # to reach the 4-byte offset of the matching file entry.
    startOfCentralDirectoryEntry = startOfCentralDirectoryEntry + 42
    # get the current offset just to output something to the terminal
    (oldoffset,) = unpack('<I', magicdata[startOfCentralDirectoryEntry : startOfCentralDirectoryEntry+4])
    print("Old Offset: ", oldoffset, " New Offset: ", startOfFileDirectoryEntry, " at ", startOfCentralDirectoryEntry)
    # now replace it
    magicdata = magicdata[:startOfCentralDirectoryEntry] + pack('<I', startOfFileDirectoryEntry) + magicdata[startOfCentralDirectoryEntry+4:]
    # now move to the next central directory entry, and the next file entry
    startOfCentralDirectoryEntry = magicdata.find(b'\x50\x4B\x01\x02', startOfCentralDirectoryEntry)
    startOfFileDirectoryEntry = magicdata.find(b'\x50\x4B\x03\x04', startOfFileDirectoryEntry+1)

# Finally write the rewritten headers' data
with open(magicfilename, 'wb') as towrite:
    towrite.write(magicdata)
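For the first bullet (getting the zip signature into the JPG itself), a small sketch of how the extra segment can be built so the length comes out right. The function name is made up; the marker FFEF and the five payload bytes are the ones from the answers above. It assumes the input really starts with the FFD8 marker and does nothing else to validate the image.

```python
import struct

SOI = b"\xff\xd8"  # JPEG start-of-image marker


def add_zip_marker_segment(jpeg: bytes) -> bytes:
    """Insert an FFEF segment holding the zip signature right after SOI."""
    if jpeg[:2] != SOI:
        raise ValueError("not a JPEG: missing FFD8 at the start")
    payload = b"PK\x03\x04\x00"  # 50 4B 03 04 00
    # The 2-byte big-endian length counts itself plus the payload: 2 + 5 = 7.
    segment = b"\xff\xef" + struct.pack(">H", len(payload) + 2) + payload
    return jpeg[:2] + segment + jpeg[2:]
```

The inserted bytes come out as FF EF 00 07 50 4B 03 04 00, which matches the corrected string above. Note that this shifts everything after it by 9 bytes, so do it before fixing up the zip offsets.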
You can produce hybrid JPG+ZIP files using DotNetZip. DotNetZip can save to a stream, and it is intelligent enough to recognize the original offset of a pre-existing stream before it begins writing zip content into it. Therefore in pseudo code, you can get a JPG+ZIP this way:
open stream on an existing JPG file for update
seek to the end of that stream
open or create a zip file
call ZipFile.Save to write zip content to the JPG stream
close
All the offsets are correctly figured. The same technique is used to produce a self-extracting archive. You can open the stream on the EXE, then seek to the end, and write the ZIP content to that stream. All the offsets are correctly calculated if you do it this way.
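For comparison, here is a sketch of the same idea in Python rather than DotNetZip. It relies on the standard library's zipfile module, whose append mode ("a") is documented to add a new archive to the end of an existing file that isn't a zip. The function name and the members argument are invented for this example, and it has not been checked against 7-Zip or Windows, only against zipfile itself.

```python
import zipfile


def append_zip_to_file(path: str, members: dict) -> None:
    """Append a new zip archive to an existing non-zip file (e.g. a JPG).

    members maps archive names to bytes content (hypothetical helper input).
    """
    # Mode "a" on a file that is not a zip seeks to the end and writes the
    # archive there; offsets are recorded relative to the start of the file.
    with zipfile.ZipFile(path, "a") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
```

Usage would be something like append_zip_to_file("photo.jpg", {"hello.txt": b"hi"}), after which the original JPG bytes are still at the front of the file and the zip follows them.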
Another thing - regarding one of the comments in another post... ZIP can have arbitrary data in the beginning and at the end of the file. There's no requirement as far as I know that the zip central directory needs to be at the end of the file, though that is typical.