Is there a way to decode q-encoded strings in Ruby?

2023-01-11 22:11 Q&A author:

I'm working with mails, and names and subjects sometimes come q-encoded, like this:

=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?=

Is there a way to decode them in Ruby? It seems TMail should take care of it, but it's not doing it.

I use this to parse email subjects:

You could try the following:

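require "base64"

str = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
if m = /=\?([A-Za-z0-9\-]+)\?(B|Q)\?([!->@-~]+)\?=/i.match(str)
  case m[2].upcase # the regexp is case-insensitive, so normalize the letter
  when "B" # Base64 encoded
    decoded = Base64.decode64(m[3])
  when "Q" # Q encoded
    decoded = m[3].unpack("M").first.gsub('_',' ')
  else
    p "Could not find keyword!!!"
  end
  # Iconv was removed from the standard library in Ruby 2.0;
  # String#encode converts from the declared charset to utf-8.
  puts decoded.force_encoding(m[1]).encode('utf-8')
end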

Ruby includes a method of decoding Quoted-Printable strings:

puts "Pablo_Fern=C3=A1ndez".unpack "M"
# => Pablo_Fernández

But this doesn't seem to work on your entire string (including the =?UTF-8?Q? part at the beginning). Maybe you can work it out from there, though.
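
One way to work it out from there, as a minimal sketch: strip the =?charset?Q?...?= wrapper with a regular expression, decode the payload with unpack, then label the resulting bytes with the declared charset. The names decode_q_word and ENCODED_WORD are made up for this illustration, and the pattern assumes the string is exactly one Q-encoded word.

```ruby
# Hypothetical pattern: a string that is exactly one Q encoded-word.
ENCODED_WORD = /\A=\?([^?]+)\?[Qq]\?([^?]*)\?=\z/

# Hypothetical helper: returns the decoded word as UTF-8,
# or the input unchanged if it is not a single Q encoded-word.
def decode_q_word(word)
  m = ENCODED_WORD.match(word)
  return word unless m

  charset = m[1]
  payload = m[2].gsub("_", "=20")   # "_" stands for a space in Q-encoding
  bytes = payload.unpack1("M")      # quoted-printable decode, binary string
  bytes.force_encoding(charset).encode("UTF-8")
end

puts decode_q_word("=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?=") # J. Pablo Fernández
puts decode_q_word("=?ISO-8859-1?Q?Fern=E1ndez?=")          # Fernández
```

Anything that does not look like a single Q encoded-word comes back unchanged, so the helper can be applied to plain subjects as well.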

This is a pretty old question but TMail::Unquoter (or its new incarnation Mail::Encodings) does the job as well.

TMail::Unquoter.unquote_and_convert_to(str, 'utf-8' )

Mail::Encodings.unquote_and_convert_to( str, 'utf-8' )

Decoding on a line-by-line basis:

line.unpack("M")

Decode quoted-printable input, read from the file given as the first argument or otherwise from STDIN, and print the decoded output:

if ARGV[0]
  lines = File.read(ARGV[0]).lines
else
  lines = STDIN.each_line.to_a
end

puts lines.map { |c| c.unpack("M") }.join
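
The same idea can be sketched as a function that takes any IO-like object, which makes it easy to try out without a file. The name decode_qp_io and the charset keyword are assumptions of this sketch, not part of the answer above; it also assumes the whole input uses one charset.

```ruby
require "stringio"

# Hypothetical helper: decodes quoted-printable text line by line
# and labels the result with the given charset (UTF-8 by default).
def decode_qp_io(io, charset: "UTF-8")
  decoded = io.each_line.map { |line| line.unpack1("M") }.join
  decoded.force_encoding(charset)
end

# A line ending in "=" is a soft line break, so the two lines are joined.
input = StringIO.new("caf=C3=A9 au=\n lait\n")
puts decode_qp_io(input) # café au lait
```

Passing STDIN or File.open(ARGV[0]) instead of the StringIO should behave like the script above.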

This might help anyone wanting to test an email. delivery.html_part is normally encoded, but can be decoded to a straight HTML body using .decoded.

test "email test" do
  UserMailer.confirm_email(user).deliver_now
  assert_equal 1, ActionMailer::Base.deliveries.size
  delivery = ActionMailer::Base.deliveries.last
  assert_equal "Please confirm your email", delivery.subject
  assert delivery.html_part.decoded =~ /Click the link below to confirm your email/ # DECODING HERE
end

The most efficient and up-to-date solution seems to be the value_decode method of the Mail gem.

> Mail::Encodings.value_decode("=?UTF-8?Q?Greg_of_Google?=")
=> "Greg of Google"

https://www.rubydoc.info/github/mikel/mail/Mail/Encodings#value_decode-class_method

Below is Ruby code you can copy and paste, if so inclined. It runs its tests when executed directly with ruby, e.g. ruby ./copy-pasted.rb. As done in the code, I use this module as a refinement of the String core class.

A few remarks on the solution:

Other solutions perform .gsub('_', ' ') on the unpacked string. However, I do not believe this is correct: it can result in an incorrect decoding, depending on the charset. RFC 2047 Section 4.2 (2) indicates "_ always represents hexadecimal 20", so it seems correct to first substitute =20 for _ and then rely on the unpack result. (This also makes the implementation more elegant.) This is also discussed in an answer to a related question.
To be more instructive, I have written the regular expression in free-spacing mode to allow comments (I find this generally helpful for complex regular expressions). If you adjust the regular expression, take note that free-spacing mode changes the matching of white-space, which must then be done escaped or as a character class (as in the code). I've also added the regular expression on regex101, so you can read an explanation of the named capture groups, lazy quantifiers, etc. and experiment yourself.
The regular expression will absorb spaces (but not TAB or newline) between multiple Q-encoded phrases in a single string, as shown by the test with multiple adjacent phrases. This is because RFC 2047 Section 5 (1) indicates that multiple Q-encoded phrases must be separated from each other by linear white-space. Depending on your use case, absorbing the white-space may not be desired.
The named capture code in the regular expression permits unexpected encoding letters (other than [bBqQ]) so that a match will occur and the code can raise an error. This helps me to detect unexpected values when processing text. Change the named capture for code to [bBqQ] if you do not want this behaviour. (There will then be no match and the original string will be returned.)
It makes use of the global Regexp.last_match as a convenience in the gsub block. You may need to take care if using this in multi-threaded code; I have not given this any consideration.
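
To make the first remark concrete, here is a small sketch of one case where the order of the two steps matters. The payload is my own example, not taken from the RFC: it contains an encoded underscore (=5F) next to a literal _ that stands for a space.

```ruby
payload = "snake=5Fcase_name" # "=5F" is an encoded underscore, "_" is a space

# Unpack first, then replace underscores: the decoded "_" is replaced too.
unpack_then_gsub = payload.unpack1("M").gsub("_", " ")

# Replace "_" with "=20" first, then unpack: only the space marker changes.
gsub_then_unpack = payload.gsub("_", "=20").unpack1("M")

puts unpack_then_gsub # snake case name
puts gsub_then_unpack # snake_case name
```

Only the second order keeps the underscore that the sender actually encoded.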

Additional references and reading:

https://en.wikipedia.org/wiki/Quoted-printable
https://en.wikipedia.org/wiki/MIME#Encoded-Word

require "minitest/autorun"

module QuotedPrintableDecode
  class UnhandledCodeError < StandardError
    def initialize(code)
      super("Unhandled quoted printable code: '#{code}'.")
    end
  end

  @@qp_text_regex = %r{
    =\?                # Opening literal: `=?`
    (?<charset>[^\?]+) # Character set, e.g. "Windows-1252" in `=?Windows-1252?`
    \?                 # Literal: `?`
    (?<code>[a-zA-Z])  # Encoding, e.g. "Q" in `?Q?` (`B`ase64); [BbQq] expected, others raise
    \?                 # Literal: `?`
    (?<text>[^\?]+?)   # Encoded text, lazy (non-greedy) matched, e.g. "Foo_bar" in `?Foo_bar?`
    \?=                # Closing literal: `?=`
    (?:[ ]+(?==\?))?   # Optional separating linear whitespace if another Q-encode follows
  }x                   # Free-spacing mode to allow above comments, also changes whitespace match

  refine String do
    def decode_q_p(to: "UTF-8")
      self.gsub(@@qp_text_regex) do
        code, from, text = Regexp.last_match.values_at(:code, :charset, :text)
        q_p_charset_to_charset(code, text, from, to)
      end
    end

    private

    def q_p_charset_to_charset(code, text, from, to)
      case code
        when "q", "Q"
          text.gsub("_", "=20").unpack("M")
        when "b", "B"
          text.unpack("m")
        else
          raise UnhandledCodeError.new(code)
      end.first.encode(to, from)
    end
  end
end

class TestQPDecode < Minitest::Test
  using QuotedPrintableDecode

  def test_decode_single_utf_8_phrase
    encoded = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
    assert_equal "J. Pablo Fernández", encoded.decode_q_p
  end

  def test_decoding_preserves_space_between_unencoded_phrase
    encoded = "=?utf-8?Q?Alfred_Sanford?= <me@example.com>"
    assert_equal "Alfred Sanford <me@example.com>", encoded.decode_q_p
  end

  def test_decoding_multiple_adjacent_phrases_absorbs_separating_whitespace
    encoded = "=?Windows-1252?Q?Foo_-_D?= =?Windows-1252?Q?ocument_World=9617=96520;_Recor?= =?Windows-1252?Q?d_People_to_C?= =?Windows-1252?Q?anada's_History?="
    assert_equal "Foo - Document World–17–520; Record People to Canada's History", encoded.decode_q_p
  end

  def test_decoding_string_without_encoded_phrases_preserves_original
    encoded = "Contains no QP phrases"
    assert_equal encoded, encoded.decode_q_p
  end

  def test_unhandled_code_raises
    klass = QuotedPrintableDecode::UnhandledCodeError
    message = "Unhandled quoted printable code: 'Z'."
    encoded = "=?utf-8?Z?Unhandled code Z?="

    raised_error = assert_raises(klass) { encoded.decode_q_p }
    assert_equal message, raised_error.message
  end
end
