Is there a way to decode q-encoded strings in Ruby?

I'm working with mails, and names and subjects sometimes come q-encoded, like this:

=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?=

Is there a way to decode them in Ruby? It seems TMail should take care of it, but it's not doing it.


I use this to parse email subjects:

require "base64"

str = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
if m = /=\?([A-Za-z0-9\-]+)\?(B|Q)\?([!->@-~]+)\?=/i.match(str)
  case m[2].upcase # the regex is case-insensitive, so normalise the encoding letter
  when "B" # Base64 encoded
    decoded = Base64.decode64(m[3])
  when "Q" # Q encoded
    decoded = m[3].unpack("M").first.gsub('_',' ')
  end
  # Iconv was removed from the standard library in Ruby 2.0; String#encode replaces it
  decoded.encode('utf-8', m[1]) # to convert to utf-8
end


Ruby includes a method of decoding Quoted-Printable strings:

puts "Pablo_Fern=C3=A1ndez".unpack "M"
# => Pablo_Fernández

But this doesn't seem to work on your entire string (including the =?UTF-8?Q? part at the beginning). Maybe you can work it out from there, though.
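
One way to "work it out from there" is to strip the =?charset?Q? ... ?= wrapper first and unpack only the text in between. The following is a minimal sketch, not a full RFC 2047 decoder: it assumes a single Q-encoded word, and the regular expression and variable names are illustrative choices rather than anything from TMail.

```ruby
encoded = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="

# Split the encoded-word into its fields: charset and encoded text.
# Only the Q encoding is handled in this sketch.
if (m = encoded.match(/\A=\?([^?]+)\?[Qq]\?([^?]*)\?=\z/))
  charset, text = m[1], m[2]
  # "_" stands for a space in Q-encoding, so turn it into "=20" before unpacking.
  # unpack1 returns a binary string; force_encoding tags it with the declared charset.
  name = text.gsub("_", "=20").unpack1("M").force_encoding(charset)
  puts name
end
```

Here force_encoding only labels the bytes; to get UTF-8 from another charset you would follow it with .encode("UTF-8").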


This is a pretty old question but TMail::Unquoter (or its new incarnation Mail::Encodings) does the job as well.

TMail::Unquoter.unquote_and_convert_to(str, 'utf-8' )

or

Mail::Encodings.unquote_and_convert_to( str, 'utf-8' )


Decoding line by line (unpack returns an array, so take its first element):

line.unpack("M").first

To decode encoded input read from a file given as the first argument, or from STDIN otherwise:

if ARGV[0]
  lines = File.read(ARGV[0]).lines
else
  lines = STDIN.each_line.to_a
end

puts lines.map { |c| c.unpack("M") }.join
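
As a small illustration of what the per-line approach does with soft line breaks (a trailing "=" at the end of a line), here is a sketch on a made-up two-line body. The sample text and the assumption that the content is UTF-8 are mine, not part of the script above.

```ruby
# Hypothetical sample input: the first line ends with a soft line break ("=").
body = "Fern=C3=A1ndez says hel=\nlo\n"

# Decode each physical line separately, then glue the pieces back together.
decoded = body.lines.map { |line| line.unpack1("M") }.join

# unpack returns binary (ASCII-8BIT) strings; tag the result as UTF-8,
# assuming that is the charset the sender used.
decoded = decoded.force_encoding("UTF-8")
puts decoded
```

The soft line break is dropped during decoding, so the two physical lines come out as one logical line.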


This might help anyone wanting to test an email. delivery.html_part is normally encoded, but can be decoded to a straight HTML body using .decoded.

test "email test" do
  UserMailer.confirm_email(user).deliver_now
  assert_equal 1, ActionMailer::Base.deliveries.size
  delivery = ActionMailer::Base.deliveries.last
  assert_equal "Please confirm your email", delivery.subject
  assert delivery.html_part.decoded =~ /Click the link below to confirm your email/ # DECODING HERE
end


The most efficient and up-to-date solution seems to be the value_decode method of the Mail gem.

> Mail::Encodings.value_decode("=?UTF-8?Q?Greg_of_Google?=")
=> "Greg of Google"

https://www.rubydoc.info/github/mikel/mail/Mail/Encodings#value_decode-class_method


Below is Ruby code you can cut and paste, if inclined. It will run its tests if executed directly, e.g. ruby ./copy-pasted.rb. As done in the code, I use this module as a refinement of the String core class.

A few remarks on the solution:

  1. Other solutions perform .gsub('_', ' ') on the unpacked string. However, I do not believe this is correct, as it can produce an incorrect decoding depending on the charset. RFC 2047 Section 4.2 (2) indicates "_ always represents hexadecimal 20", so it seems correct to first substitute =20 for _ and then rely on the unpack result. (This also makes the implementation more elegant.) This is also discussed in an answer to a related question.

  2. To be more instructive, I have written the regular expression in free-spacing mode to allow comments (I find this generally helpful for complex regular expressions). If you adjust the regular expression, take note that free-spacing mode changes the matching of white-space, which must then be done escaped or as a character class (as in the code). I've also added the regular expression on regex101, so you can read an explanation of the named capture groups, lazy quantifiers, etc. and experiment yourself.

  3. The regular expression will absorb spaces (the space character only, not TAB or newline) between multiple Q-encoded phrases in a single string, as shown by the test with multiple adjacent phrases. This is because RFC 2047 Section 5 (1) indicates that multiple Q-encoded phrases must be separated from each other by linear white-space. Depending on your use case, absorbing the white-space may not be desired.

  4. The regular expression's named capture for code permits unexpected encoding codes (other than [bBqQ]) so that a match will occur and the code can raise an error. This helps me to detect unexpected values when processing text. Change the named capture for code to [bBqQ] if you do not want this behaviour. (There will then be no match and the original string will be returned.)

  5. It makes use of the global Regexp.last_match as a convenience in the gsub block. You may need to take care if using this in multi-threaded code; I have not given this any consideration.
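
To make remark 1 concrete, here is a small sketch of one case where the order of operations matters: a literal underscore that the sender encoded as =5F. The sample text is hypothetical and the variable names are only for illustration.

```ruby
# Hypothetical encoded text: "=5F" is an encoded literal underscore,
# while "_" is an encoded space.
text = "snake=5Fcase_name"

# Unpack first, then replace: the literal underscore is turned into a space too.
unpack_then_gsub = text.unpack1("M").gsub("_", " ")

# Replace first, then unpack: only the encoded spaces become spaces.
gsub_then_unpack = text.gsub("_", "=20").unpack1("M")

puts unpack_then_gsub # snake case name
puts gsub_then_unpack # snake_case name
```

Only the second form keeps the underscore that was actually part of the original text.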

Additional references and reading:

  • https://en.wikipedia.org/wiki/Quoted-printable
  • https://en.wikipedia.org/wiki/MIME#Encoded-Word

require "minitest/autorun"

module QuotedPrintableDecode
  class UnhandledCodeError < StandardError
    def initialize(code)
      super("Unhandled quoted printable code: '#{code}'.")
    end
  end

  @@qp_text_regex = %r{
    =\?                # Opening literal: `=?`
    (?<charset>[^\?]+) # Character set, e.g. "Windows-1252" in `=?Windows-1252?`
    \?                 # Literal: `?`
    (?<code>[a-zA-Z])  # Encoding, e.g. "Q" in `?Q?` ("B" means Base64); [BbQq] expected, others raise
    \?                 # Literal: `?`
    (?<text>[^\?]+?)   # Encoded text, lazy (non-greedy) matched, e.g. "Foo_bar" in `?Foo_bar?`
    \?=                # Closing literal: `?=`
    (?:[ ]+(?==\?))?   # Optional separating linear whitespace if another Q-encode follows
  }x                   # Free-spacing mode to allow above comments, also changes whitespace match

  refine String do
    def decode_q_p(to: "UTF-8")
      self.gsub(@@qp_text_regex) do
        code, from, text = Regexp.last_match.values_at(:code, :charset, :text)
        q_p_charset_to_charset(code, text, from, to)
      end
    end

    private

    def q_p_charset_to_charset(code, text, from, to)
      case code
        when "q", "Q"
          text.gsub("_", "=20").unpack("M")
        when "b", "B"
          text.unpack("m")
        else
          raise UnhandledCodeError.new(code)
      end.first.encode(to, from)
    end
  end
end

class TestQPDecode < Minitest::Test
  using QuotedPrintableDecode

  def test_decode_single_utf_8_phrase
    encoded = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
    assert_equal "J. Pablo Fernández", encoded.decode_q_p
  end

  def test_decoding_preserves_space_between_unencoded_phrase
    encoded = "=?utf-8?Q?Alfred_Sanford?= <me@example.com>"
    assert_equal "Alfred Sanford <me@example.com>", encoded.decode_q_p
  end

  def test_decoding_multiple_adjacent_phrases_absorbs_separating_whitespace
    encoded = "=?Windows-1252?Q?Foo_-_D?= =?Windows-1252?Q?ocument_World=9617=96520;_Recor?= =?Windows-1252?Q?d_People_to_C?= =?Windows-1252?Q?anada's_History?="
    assert_equal "Foo - Document World–17–520; Record People to Canada's History", encoded.decode_q_p
  end

  def test_decoding_string_without_encoded_phrases_preserves_original
    encoded = "Contains no QP phrases"
    assert_equal encoded, encoded.decode_q_p
  end

  def test_unhandled_code_raises
    klass = QuotedPrintableDecode::UnhandledCodeError
    message = "Unhandled quoted printable code: 'Z'."
    encoded = "=?utf-8?Z?Unhandled code Z?="

    raised_error = assert_raises(klass) { encoded.decode_q_p }
    assert_equal message, raised_error.message
  end
end