Trying to determine if file has been uuencoded
I am trying to process a large collection of txt files that are themselves containers for the files I actually want to process. The txt files use SGML tags to mark the boundaries of the individual contained files. Sometimes the contained files are binary files that have been uuencoded. I have solved the problem of decoding the uuencoded files, but as I mulled over my solution I realized that it is not general enough. That is, I have been using
if '\nbegin 644 ' in document['document']:
to test whether the file is uuencoded. I did some searching and have a vague understanding of what the 644 means (file permissions), and I have since found other examples of uuencoded files that would need
if '\nbegin 642 ' in document['document']:
or some other variant. Thus, my problem is: how do I make sure that I identify all of the subcontainers that hold uuencoded files?
One solution is to test every subcontainer:
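import codecs

uudecode = codecs.getdecoder("uu")
for document in documents:
    try:
        # The decoder returns (decoded_data, length_consumed)
        decoded_document, m = uudecode(document)
    except ValueError:
        # Raised when the input is not valid uuencoded data
        decoded_document = ''
    if len(decoded_document) == 0:
        pass  # more stuff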
This is not horrible, since CPU cycles are cheap, but I am going to be handling some 8 million documents.
Thus, is there a more robust way to recognize whether or not a particular string is the result of uuencoding?
Wikipedia says that every uuencoded file begins with this line
begin <perm> <name>
So probably a line matching the regexp ^begin [0-7]{3} (.*)$ denotes the beginning reliably enough.
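As a minimal sketch of that check in Python, assuming each subcontainer is available as a text string: the regexp is the one above, compiled with re.MULTILINE so that ^ and $ match at every line and the header can sit anywhere inside the SGML wrapper. The names UU_BEGIN and find_uu_header are made up for this illustration.

```python
import re

# "begin <perm> <name>" on a line of its own; <perm> is three octal digits.
# re.MULTILINE lets ^ and $ match at the start and end of every line.
UU_BEGIN = re.compile(r"^begin ([0-7]{3}) (.*)$", re.MULTILINE)


def find_uu_header(text):
    """Return (permissions, filename) from the first header line, or None."""
    match = UU_BEGIN.search(text)
    if match is None:
        return None
    # Drop a trailing carriage return in case the file uses CRLF line endings.
    return match.group(1), match.group(2).rstrip("\r")


sample = "<TEXT>\nbegin 644 report.pdf\nM)5!$1BTQ\n`\nend\n</TEXT>\n"
print(find_uu_header(sample))
```

A match only says that a header line is present, so the slower decode attempt would still be needed to confirm the payload, but it would only have to run on the documents that match.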
Two ways:
(1) On Unix-based systems, you can robustly use the file command.
http://unixhelp.ed.ac.uk/CGI/man-cgi?file
$ file foo
foo: uuencoded or xxencoded text
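If the documents are already in memory, the file command can also be fed through standard input instead of a path on disk. This is only a sketch, assuming file is installed and on the PATH; describe_with_file is a made-up helper name, and the exact wording of the description may vary between versions of file.

```python
import shutil
import subprocess


def describe_with_file(data):
    """Return file(1)'s description of a bytes object, or None if file is missing."""
    if shutil.which("file") is None:
        return None
    # "-b" omits the filename prefix; "-" tells file to read from stdin.
    result = subprocess.run(
        ["file", "-b", "-"],
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    return result.stdout.decode("utf-8", errors="replace").strip()


print(describe_with_file(b"begin 644 foo\nM)5!$1BTQ\n`\nend\n"))
```

Note that this starts one process per document, which is likely to be slow across millions of documents.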
(2) I also found the following (untested) Python code that looks like it will do what you want (at http://ubuntuforums.org/archive/index.php/t-1304548.html).
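#!/usr/bin/env python
# Assumes the libmagic Python bindings that provide magic.open()
import magic
import sys

filename = sys.argv[1]
ms = magic.open(magic.MAGIC_NONE)
ms.load()  # load the default magic database
ftype = ms.file(filename)  # textual description of the file type
print(ftype)
ms.close()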