Is there a way to decode q-encoded strings in Ruby?
I'm working with mails, and names and subjects sometimes come q-encoded, like this:
=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?=
Is there a way to decode them in Ruby? It seems TMail should take care of it, but it's not doing it.
I use this to parse email subjects:
You could try the following:
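require "base64"
require "iconv" # removed from the standard library in Ruby 2.0; available as the iconv gem

str = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
if m = /=\?([A-Za-z0-9\-]+)\?(B|Q)\?([!->@-~]+)\?=/i.match(str)
  # The regex is case-insensitive, so normalize the encoding letter before comparing
  case m[2].upcase
  when "B" # Base64 encoded
    decoded = Base64.decode64(m[3])
  when "Q" # Q encoded
    decoded = m[3].unpack("M").first.gsub('_',' ')
  else
    p "Could not find keyword!!!"
  end
  Iconv.conv('utf-8',m[1],decoded) # to convert to utf-8
end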
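On Ruby 2.0 and later, where Iconv is no longer part of the standard library, the same idea can be sketched with String#force_encoding and String#encode. This is only a rough sketch, not the code from the answer above: the helper name decode_encoded_words and the constant ENCODED_WORD are made up for illustration, it reuses the regex from the answer, and it assumes the charset named in the header is one Ruby knows about (an unknown name raises ArgumentError).

```ruby
# Hypothetical helper (not part of any library): decodes every RFC 2047
# encoded word found in str and transcodes the result to UTF-8.
ENCODED_WORD = /=\?([A-Za-z0-9\-]+)\?(B|Q)\?([!->@-~]+)\?=/i

def decode_encoded_words(str)
  str.gsub(ENCODED_WORD) do
    charset, encoding, payload = $1, $2.upcase, $3
    bytes =
      if encoding == "B"
        payload.unpack("m").first                   # Base64
      else
        # "_" stands for a space in Q-encoding; rewrite it as =20 before unpacking
        payload.gsub("_", "=20").unpack("M").first  # quoted-printable
      end
    # Label the raw bytes with the declared charset, then transcode to UTF-8
    bytes.force_encoding(charset).encode("UTF-8")
  end
end

puts decode_encoded_words("=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?=")
# prints "J. Pablo Fernández"
```

Because it uses gsub, text outside the encoded words is left untouched, so it can be applied to a whole header value.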
Ruby includes a method of decoding Quoted-Printable strings:
puts "Pablo_Fern=C3=A1ndez".unpack "M"
# => Pablo_Fernández
But this doesn't seem to work on your entire string (including the =?UTF-8?Q? part at the beginning). Maybe you can work it out from there, though.
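One possible way to work it out from there, as a minimal sketch: peel off the =?charset?Q? wrapper with a regular expression and unpack only the payload. The regex and variable names here are illustrative assumptions, and the sketch handles a single encoded word only.

```ruby
header = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="

# Split the encoded word into charset, encoding letter and payload.
charset, _encoding, payload = header.match(/\A=\?([^?]+)\?(.)\?([^?]*)\?=\z/).captures

# unpack("M") returns a one-element array holding a binary (ASCII-8BIT) string,
# so label it with the declared charset before treating it as text.
raw  = payload.unpack("M").first
name = raw.force_encoding(charset)

puts name # the "_" space markers are still present at this point
```

The underscores that stand for spaces are not touched by unpack, so they still need to be handled separately.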
This is a pretty old question but TMail::Unquoter (or its new incarnation Mail::Encodings) does the job as well.
TMail::Unquoter.unquote_and_convert_to(str, 'utf-8' )
or
Mail::Encodings.unquote_and_convert_to( str, 'utf-8' )
Decoding on a line-per-line basis:
line.unpack("M")
Convert encoded strings read from STDIN, or from a file given as the first argument, into decoded output:
if ARGV[0]
  lines = File.read(ARGV[0]).lines
else
  lines = STDIN.each_line.to_a
end
puts lines.map { |c| c.unpack("M") }.join
This might help anyone wanting to test an email. delivery.html_part is normally encoded, but can be decoded to a straight HTML body using .decoded.
test "email test" do
  UserMailer.confirm_email(user).deliver_now
  assert_equal 1, ActionMailer::Base.deliveries.size
  delivery = ActionMailer::Base.deliveries.last
  assert_equal "Please confirm your email", delivery.subject
  assert delivery.html_part.decoded =~ /Click the link below to confirm your email/ # DECODING HERE
end
The most efficient and up-to-date solution seems to be the value_decode method of the Mail gem.
> Mail::Encodings.value_decode("=?UTF-8?Q?Greg_of_Google?=")
=> "Greg of Google"
https://www.rubydoc.info/github/mikel/mail/Mail/Encodings#value_decode-class_method
Below is Ruby code you can cut-and-paste, if inclined. It will run its tests if executed directly with ruby: ruby ./copy-pasted.rb. As done in the code, I use this module as a refinement to the String core class.
A few remarks on the solution:
- Other solutions perform .gsub('_', ' ') on the unpacked string. However, I do not believe this is correct, and it can result in an incorrect decoding depending on the charset. RFC 2047 Section 4.2 (2) indicates that "_ always represents hexadecimal 20", so it seems correct to first substitute =20 for _ and then rely on the unpack result. (This also makes the implementation more elegant.) This is also discussed in an answer to a related question.
- To be more instructive, I have written the regular expression in free-spacing mode to allow comments (I find this generally helpful for complex regular expressions). If you adjust the regular expression, take note that free-spacing mode changes the matching of white-space, which must then be done escaped or as a character class (as in the code). I've also added the regular expression on regex101, so you can read an explanation of the named capture groups, lazy quantifiers, etc. and experiment yourself.
- The regular expression will absorb space (TAB or newline) between multiple Q-encoded phrases in a single string, as shown with string test_4. This is because RFC 2047 Section 5 (1) indicates that multiple Q-encoded phrases must be separated from each other by linear white-space. Depending on your use-case, absorbing the white-space may not be desired.
- The regular expression's code named capture permits unexpected quoted printable codes (other than [bBqQ]) so that a match will occur and the code can raise an error. This helps me to detect unexpected values when processing text. Change the named capture for code to [bBqQ] if you do not want this behaviour. (There will then be no match and the original string will be returned.)
- It makes use of the global Regexp.last_match as a convenience in the gsub block. You may need to take care if using this in multi-threaded code; I have not given this any consideration.
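To make the first remark concrete, here is a small sketch comparing the two orderings. It illustrates a slightly different failure than the charset case mentioned above: a payload that contains a literal underscore, which Q-encoding writes as =5F. The payload string is made up for the example.

```ruby
# Q-encoded payload for the text "snake_case name": the literal underscore is
# written as =5F, while the bare "_" stands for a space.
payload = "snake=5Fcase_name"

# Unpack first, substitute afterwards: the decoded =5F can no longer be told
# apart from the space markers, so it is replaced as well.
unpack_then_gsub = payload.unpack("M").first.gsub("_", " ")

# Substitute first, unpack afterwards: only the space markers are rewritten.
gsub_then_unpack = payload.gsub("_", "=20").unpack("M").first

puts unpack_then_gsub # prints "snake case name"
puts gsub_then_unpack # prints "snake_case name"
```

Only the second ordering round-trips the original text.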
Additional references and reading:
- https://en.wikipedia.org/wiki/Quoted-printable
- https://en.wikipedia.org/wiki/MIME#Encoded-Word
require "minitest/autorun"

module QuotedPrintableDecode
  class UnhandledCodeError < StandardError
    def initialize(code)
      super("Unhandled quoted printable code: '#{code}'.")
    end
  end

  @@qp_text_regex = %r{
    =\?                  # Opening literal: `=?`
    (?<charset>[^\?]+)   # Character set, e.g. "Windows-1252" in `=?Windows-1252?`
    \?                   # Literal: `?`
    (?<code>[a-zA-Z])    # Encoding, e.g. "Q" in `?Q?` (`B`ase64); [BbQq] expected, others raise
    \?                   # Literal: `?`
    (?<text>[^\?]+?)     # Encoded text, lazy (non-greedy) matched, e.g. "Foo_bar" in `?Foo_bar?`
    \?=                  # Closing literal: `?=`
    (?:[ ]+(?==\?))?     # Optional separating linear whitespace if another Q-encode follows
  }x                     # Free-spacing mode to allow above comments, also changes whitespace match

  refine String do
    def decode_q_p(to: "UTF-8")
      self.gsub(@@qp_text_regex) do
        code, from, text = Regexp.last_match.values_at(:code, :charset, :text)
        q_p_charset_to_charset(code, text, from, to)
      end
    end

    private

    def q_p_charset_to_charset(code, text, from, to)
      case code
      when "q", "Q"
        text.gsub("_", "=20").unpack("M")
      when "b", "B"
        text.unpack("m")
      else
        raise UnhandledCodeError.new(code)
      end.first.encode(to, from)
    end
  end
end

class TestQPDecode < Minitest::Test
  using QuotedPrintableDecode

  def test_decode_single_utf_8_phrase
    encoded = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
    assert_equal "J. Pablo Fernández", encoded.decode_q_p
  end

  def test_decoding_preserves_space_between_unencoded_phrase
    encoded = "=?utf-8?Q?Alfred_Sanford?= <me@example.com>"
    assert_equal "Alfred Sanford <me@example.com>", encoded.decode_q_p
  end

  def test_decoding_multiple_adjacent_phrases_absorbs_separating_whitespace
    encoded = "=?Windows-1252?Q?Foo_-_D?= =?Windows-1252?Q?ocument_World=9617=96520;_Recor?= =?Windows-1252?Q?d_People_to_C?= =?Windows-1252?Q?anada's_History?="
    assert_equal "Foo - Document World–17–520; Record People to Canada's History", encoded.decode_q_p
  end

  def test_decoding_string_without_encoded_phrases_preserves_original
    encoded = "Contains no QP phrases"
    assert_equal encoded, encoded.decode_q_p
  end

  def test_unhandled_code_raises
    klass = QuotedPrintableDecode::UnhandledCodeError
    message = "Unhandled quoted printable code: 'Z'."
    encoded = "=?utf-8?Z?Unhandled code Z?="
    raised_error = assert_raises(klass) { encoded.decode_q_p }
    assert_equal message, raised_error.message
  end
end