Cleaning up 'smart' characters from Word in Ruby
I need to clean up various Word 'smart' characters in user input, including but not limited to the following:
– EN DASH
‘ LEFT SINGLE QUOTATION MARK
’ RIGHT SINGLE QUOTATION MARK
Are there any Ruby functions or libraries for mapping these into their ASCII (near-) equivalents, or do I really need to just do a bunch of manual gsubs?
The HTMLEntities gem will decode the entities to UTF-8.
You could use iconv to transliterate to the closest ASCII equivalents, or simple gsub or tr calls. James Gray has some blog posts about converting between various character sets that show how to do the transliterations.
require 'htmlentities'

# HTML entities for the Word 'smart' characters
chars = [
  '&ndash;', # EN DASH
  '&lsquo;', # LEFT SINGLE QUOTATION MARK
  '&rsquo;'  # RIGHT SINGLE QUOTATION MARK
]

decoder = HTMLEntities.new('expanded')
chars.each do |c|
  decoded = decoder.decode(c)
  # tr pads the replacement set with its last character, so both quotes map to '
  puts "#{c} => #{decoded} => #{decoded.tr('–‘’', "-'")} => #{decoded.encoding}"
end
# >> &ndash; => – => - => UTF-8
# >> &lsquo; => ‘ => ' => UTF-8
# >> &rsquo; => ’ => ' => UTF-8
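Iconv was removed from Ruby's standard library in 2.0 (it lives on as a separate gem), so on newer Rubies one possible stand-in is String#encode with a fallback. This is only a sketch: the ASCII_FALLBACKS table and the to_ascii helper are hypothetical names, and the choice of "?" for unmapped characters is an assumption you may want to change.

```ruby
# Hypothetical replacement table, not an exhaustive list.
ASCII_FALLBACKS = {
  "\u2013" => "-", # EN DASH
  "\u2018" => "'", # LEFT SINGLE QUOTATION MARK
  "\u2019" => "'"  # RIGHT SINGLE QUOTATION MARK
}.freeze

# Re-encode to US-ASCII; the fallback is called for every character that
# has no ASCII representation. Unknown characters become "?" (an assumption).
def to_ascii(text)
  text.encode("US-ASCII", fallback: ->(ch) { ASCII_FALLBACKS.fetch(ch, "?") })
end

result = to_ascii("It\u2019s 10\u201320 caf\u00e9")
puts result
puts result.encoding
```

Unlike tr or gsub, this approach also forces you to decide what happens to every other non-ASCII character, which may or may not be what you want for user input.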
A few gsubs sound like the best bet, especially if the alternative is loading an entire extra library to do basically the same thing.
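The "bunch of gsubs" can be collapsed into a single call, since String#gsub accepts a hash as its replacement argument. A minimal sketch, assuming the input is already UTF-8; the SMART_CHARS table and the clean_smart_chars name are made up for illustration, and the list of characters is not complete:

```ruby
# Hypothetical mapping table; extend it with whatever characters show up.
SMART_CHARS = {
  "\u2013" => "-",   # EN DASH
  "\u2014" => "--",  # EM DASH
  "\u2018" => "'",   # LEFT SINGLE QUOTATION MARK
  "\u2019" => "'",   # RIGHT SINGLE QUOTATION MARK
  "\u201C" => '"',   # LEFT DOUBLE QUOTATION MARK
  "\u201D" => '"',   # RIGHT DOUBLE QUOTATION MARK
  "\u2026" => "..."  # HORIZONTAL ELLIPSIS
}.freeze

# One regexp that matches any key of the table.
SMART_CHARS_RE = Regexp.union(SMART_CHARS.keys)

# gsub looks each match up in the hash and substitutes the value.
def clean_smart_chars(text)
  text.gsub(SMART_CHARS_RE, SMART_CHARS)
end

puts clean_smart_chars("\u201CIt\u2019s 1990\u20132000\u2026\u201D")
# >> "It's 1990-2000..."
```

Compared with tr, the hash form allows one-to-many replacements such as the ellipsis, at the cost of a slightly longer table.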