Get Python's getaddresses() to decode encoded-word encoding
msg = \
"""To: =?ISO-8859-1?Q?Caren_K=F8lter?= <ck@example.dk>, bob@example.com
Cc: "James =?ISO-8859-1?Q?K=F8lter?=" <jk@example.dk>
Subject: hello
message body blah blah blah
"""
import email.parser, email.utils
import itertools
parser = email.parser.Parser()
parsed_message = parser.parsestr(msg)
address_fields = ('to', 'cc')
addresses = itertools.chain(*(parsed_message.get_all(field) for field in address_fields if parsed_message.has_key(field)))
address_list = set(email.utils.getaddresses(addresses))
print address_list
email.utils.getaddresses() doesn't seem to handle MIME RFC 2047 encoded-words in address fields automatically.
How can I get the expected result below?
actual result:
set([('', 'bob@example.com'), ('=?ISO-8859-1?Q?Caren_K=F8lter?=', 'ck@example.dk'), ('James =?ISO-8859-1?Q?K=F8lter?=', 'jk@example.dk')])
desired result:
set([('', 'bob@example.com'), (u'Caren K\xf8lter', 'ck@example.dk'), (u'James K\xf8lter', 'jk@example.dk')])
The function you want is email.header.decode_header, which returns a list of (decoded_string, charset) pairs. It's up to you to further decode them according to charset and join them back together again before passing them to email.utils.getaddresses or wherever.
You might think that this would be straightforward:
import email.header

def decode_rfc2047_header(h):
    return ' '.join(s.decode(charset or 'ascii')
                    for s, charset in email.header.decode_header(h))
But since message headers typically come from untrusted sources, you have to handle (1) badly encoded data; and (2) bogus character set names. So you might do something like this:
def decode_safely(s, charset='ascii'):
    """Return s decoded according to charset, but do so safely."""
    try:
        return s.decode(charset or 'ascii', 'replace')
    except LookupError: # bogus charset
        return s.decode('ascii', 'replace')
def decode_rfc2047_header(h):
    return ' '.join(decode_safely(s, charset)
                   for s, charset in email.header.decode_header(h))
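For anyone on Python 3, a rough port of the same idea might look like the sketch below. It assumes that `email.header.decode_header` hands back `bytes` chunks when the header contains an encoded-word and a plain `str` when it does not, and that unencoded chunks keep their own whitespace, which is why it joins with `''` rather than `' '`. The function names simply mirror the ones above; treat it as a sketch, not a vetted implementation.

```python
import email.header


def decode_safely(raw, charset=None):
    """Return raw as text, replacing undecodable bytes instead of raising."""
    if isinstance(raw, str):
        # Header had no encoded-words, so there is nothing to decode.
        return raw
    try:
        return raw.decode(charset or 'ascii', 'replace')
    except LookupError:
        # Bogus charset name: fall back to ASCII with replacement characters.
        return raw.decode('ascii', 'replace')


def decode_rfc2047_header(h):
    """Decode every RFC 2047 encoded-word in h and join the pieces."""
    return ''.join(decode_safely(raw, charset)
                   for raw, charset in email.header.decode_header(h))
```

Calling `decode_rfc2047_header('James =?ISO-8859-1?Q?K=F8lter?=')` should give the name with a real `ø` in it, and a made-up charset name falls back to ASCII instead of blowing up.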
Yeah, the email package interface really isn't very helpful a lot of the time.
Here, you have to call email.header.decode_header yourself on each name, and since that gives you a list of decoded tokens, you then have to stitch them back together again:
for name, address in email.utils.getaddresses(addresses):
    name = u' '.join(
        unicode(b, e or 'ascii') for b, e in email.header.decode_header(name)
    )
    ...
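Putting the pieces together for the message in the question, an end-to-end Python 3 sketch could look like this. It splits the address fields with `getaddresses` first and decodes each display name afterwards, since a decoded name may itself contain a comma or angle bracket that would confuse the address parser. `decode_name` and `decoded_addresses` are names made up for this example, and a blank line has been added between the headers and the body of the sample message.

```python
import email.header
import email.parser
import email.utils

MSG = """\
To: =?ISO-8859-1?Q?Caren_K=F8lter?= <ck@example.dk>, bob@example.com
Cc: "James =?ISO-8859-1?Q?K=F8lter?=" <jk@example.dk>
Subject: hello

message body blah blah blah
"""


def decode_name(name):
    """Decode RFC 2047 encoded-words in a display name, tolerating bad input."""
    parts = []
    for raw, charset in email.header.decode_header(name):
        if isinstance(raw, bytes):
            try:
                raw = raw.decode(charset or 'ascii', 'replace')
            except LookupError:  # unknown charset name
                raw = raw.decode('ascii', 'replace')
        parts.append(raw)
    return ''.join(parts)


def decoded_addresses(message_text, fields=('to', 'cc')):
    """Return a set of (decoded_name, address) pairs from the given fields."""
    parsed = email.parser.Parser().parsestr(message_text)
    raw_values = []
    for field in fields:
        # get_all returns the fallback list when the field is absent.
        raw_values.extend(parsed.get_all(field, []))
    # Split into (name, address) pairs first, then decode each name.
    return {(decode_name(name), addr)
            for name, addr in email.utils.getaddresses(raw_values)}


address_set = decoded_addresses(MSG)
```

`address_set` then holds the three (name, address) pairs from the question, with both Kølter names decoded and an empty name for bob@example.com.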
Thank you, Gareth Rees. Your answer was helpful in solving a problem case:
Input: 'application/octet-stream;\r\n\tname="=?utf-8?B?KFVTTXMpX0FSTE8uanBn?="'
The absence of whitespace around the encoded-word caused email.Header.decode_header to overlook it.  I'm too new to this to know if I've only made things worse, but this kludge, along with joining with a '' instead of ' ', fixed it:
if ' =?' not in h:
    h = h.replace('=?', ' =?').replace('?=', '?= ')
Output: u'application/octet-stream; name="(USMs)_ARLO.jpg"' 
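One caveat with the chained `replace` calls: a Q-encoded word whose payload starts with an escaped byte, such as `=?ISO-8859-1?Q?=F8lter?=`, contains `?=` right after the `Q`, so the second `replace` would insert a space in the middle of the word. A slightly more careful variant is to pad whole encoded-words with a regular expression. The sketch below is only that; `ENCODED_WORD` and `pad_encoded_words` are names invented here, the pattern is a simplification of the RFC 2047 grammar, and it pads unconditionally, so words that already had whitespace around them end up with extra spaces.

```python
import re

# Simplified encoded-word pattern: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r'=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=')


def pad_encoded_words(h):
    """Surround each complete encoded-word in h with single spaces."""
    return ENCODED_WORD.sub(lambda m: ' ' + m.group(0) + ' ', h)
```

Because the pattern only matches a complete `=?...?=` token, the `?=` inside `?Q?=F8` is left alone.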
 