Regex to match the first file in a rar archive file set in Python
I need to uncompress all the files in a directory, and for this I need to find the first file in each set. I'm currently doing this with a bunch of if statements and loops. Can I do this using a regex?
Here's a list of files that I need to match:
yes.rar
yes.part1.rar
yes.part01.rar
yes.part001.rar
yes.r01
yes.r001
These should NOT be matched:
no.part2.rar
no.part02.rar
no.part002.rar
no.part011.rar
no.r002
no.r02
I found a similar regex on this thread, but it seems that Python doesn't support variable-length lookarounds. A single-line regex would be complicated, but I'll document it well, so that's not a problem. It's just one of those problems you beat your head over.
Thanks in advance guys.
:)
Don't rely on the names of the files to determine which one is first. You're going to end up finding an edge case where you get the wrong file.
RAR's headers will tell you which file is the first one in the volume set, assuming the archives were created with a reasonably recent version of RAR.
HEAD_FLAGS (2 bytes), bit flags:
0x0100 - First volume (set only by RAR 3.0 and later)
So open up each file and examine the RAR headers, looking specifically for the flag that indicates which file is the first volume. This will never fail, as long as the archive isn't corrupt.
Update: I've just confirmed this by taking a look at some spanning archives in a hex editor. The file headers are constructed exactly as the specification quoted above indicates. It's just a matter of opening the files and reading the header for that flag. The file with that flag is the first volume.
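The header check described above could be sketched roughly like this. It assumes the older RAR 3.x/4.x layout: a 7-byte signature followed by the main archive header, which starts with HEAD_CRC (2 bytes), HEAD_TYPE (1 byte, 0x73), HEAD_FLAGS (2 bytes) and HEAD_SIZE (2 bytes), all little-endian. The function and constant names are made up for illustration, and archives in the newer RAR 5 format use a different signature and header layout, so this sketch simply reports them as unknown.

```python
import struct

# Assumption: RAR 3.x/4.x signature; RAR 5 archives start with different bytes.
RAR4_MAGIC = b"Rar!\x1a\x07\x00"
MAIN_HEAD_TYPE = 0x73        # HEAD_TYPE of the main archive header
FLAG_FIRST_VOLUME = 0x0100   # "First volume" bit quoted above


def read_main_head_flags(data):
    """Return HEAD_FLAGS from the leading bytes of a file, or None if the
    bytes do not look like a RAR 3.x/4.x archive."""
    if len(data) < 14 or not data.startswith(RAR4_MAGIC):
        return None
    # HEAD_CRC (H), HEAD_TYPE (B), HEAD_FLAGS (H), HEAD_SIZE (H)
    _crc, head_type, head_flags, _size = struct.unpack_from("<HBHH", data, 7)
    if head_type != MAIN_HEAD_TYPE:
        return None
    return head_flags


def is_first_volume(path):
    """True if the file at `path` carries the first-volume flag."""
    with open(path, "rb") as f:
        flags = read_main_head_flags(f.read(14))
    return flags is not None and bool(flags & FLAG_FIRST_VOLUME)
```

A standalone archive that is not part of a volume set may not carry this flag at all, so a caller would probably want to handle that case separately.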
There's no need to use lookbehind assertions for this. Since you start matching from the beginning of the string, you can do everything with lookaheads that you could with lookbehinds. This should work:
^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$
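As a quick check, the pattern above can be run against the file names from the question. The helper name `is_first_by_name` is just for illustration; the pattern itself is copied unchanged.

```python
import re

# Pattern copied verbatim from the answer above.
FIRST_VOLUME = re.compile(r"^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$")


def is_first_by_name(filename):
    """True if the file name looks like the first file of a RAR set."""
    return FIRST_VOLUME.match(filename) is not None


should_match = ["yes.rar", "yes.part1.rar", "yes.part01.rar",
                "yes.part001.rar", "yes.r01", "yes.r001"]
should_not_match = ["no.part2.rar", "no.part02.rar", "no.part002.rar",
                    "no.part011.rar", "no.r002", "no.r02"]

for name in should_match + should_not_match:
    print(name, is_first_by_name(name))
```

The match is case-sensitive as written; passing `re.IGNORECASE` to `re.compile` would be one way to accept names such as `YES.RAR` if that is needed.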
To capture the first part of the filename as you requested, you could do this:
^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$
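A small sketch of how the capturing version might be used to pull out the base name. The function name `first_volume_base` is hypothetical; the pattern is the one given above.

```python
import re

# Capturing pattern copied verbatim from the answer above.
BASE_NAME = re.compile(
    r"^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$"
)


def first_volume_base(filename):
    """Return the base name if `filename` is a first volume, else None."""
    match = BASE_NAME.match(filename)
    return match.group(1) if match else None


for name in ["yes.rar", "yes.part01.rar", "yes.r001", "no.part2.rar"]:
    print(name, first_volume_base(name))
```

Group 1 stops before a trailing `.partN.rar` suffix, so a name like `my.file.part1.rar` yields `my.file` rather than `my.file.part1`.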
Are you sure you want to match these cases?
yes.r01
Those are not the first archives: the .rar file always is.
The sequence is bla.rar, then bla.r00, and only then bla.r01. You'll probably extract the files twice if you match both .r01 and .rar as the first archive.
yes.r001
.r001 doesn't exist. Do you mean the .001 files that WinRAR supports? After .r99, it's .s00. If it does exist, then somebody manually renamed the files.
In theory, matching on filename should be as reliable as matching on the 0x0100 flag to find the first archive.
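To make the old-style naming sequence concrete, here is a minimal sketch that generates the expected name for the n-th volume, based only on the order described above (.rar first, then .r00 through .r99, then .s00). The function name is made up, and the behaviour past the letter z is an assumption: the sketch simply refuses to go that far.

```python
def old_style_volume_name(base, index):
    """Name of volume number `index` (0 = first) in old-style naming."""
    if index < 0:
        raise ValueError("index must be non-negative")
    if index == 0:
        return base + ".rar"            # the first volume is always .rar
    block, number = divmod(index - 1, 100)
    letter = chr(ord("r") + block)      # r00..r99, then s00..s99, ...
    if letter > "z":
        raise ValueError("index too large for this sketch")
    return f"{base}.{letter}{number:02d}"


print([old_style_volume_name("bla", i) for i in (0, 1, 2, 100, 101)])
```

Under this scheme a name like bla.r01 is the third file of the set, which is why matching it as a first volume leads to duplicate extraction.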