Cleaning up 'smart' characters from Word in Ruby

I need to clean up various Word 'smart' characters in user input, including but not limited to the following:

– EN DASH
‘ LEFT SINGLE QUOTATION MARK
’ RIGHT SINGLE QUOTATION MARK

Are there any Ruby functions or libraries for mapping these into their ASCII (near-) equivalents, or do I really need to just do a bunch of manual gsubs?


The HTMLEntities gem will decode the entities to UTF-8.

You could use iconv (a separate gem since Ruby 2.0) to transliterate to the closest ASCII equivalents, or use simple gsub or tr calls. James Gray has written some blog posts about converting between various character sets that show how to do the transliterations.

require 'htmlentities'

# HTML entities for the characters in question
chars = [
  '&ndash;', # EN DASH
  '&lsquo;', # LEFT SINGLE QUOTATION MARK
  '&rsquo;'  # RIGHT SINGLE QUOTATION MARK
]

decoder = HTMLEntities.new('expanded')
chars.each do |c|
  decoded = decoder.decode(c)
  # tr pads the shorter replacement set with its last character,
  # so both quotation marks map to '
  puts "#{c} => #{decoded} => #{decoded.tr('–‘’', "-'")} => #{decoded.encoding}"
end

# >> &ndash; => – => - => UTF-8
# >> &lsquo; => ‘ => ' => UTF-8
# >> &rsquo; => ’ => ' => UTF-8
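
If the goal is output that is guaranteed to be ASCII, one alternative to iconv is String#encode with a fallback. The following is only a sketch: the SMART_TO_ASCII table and the to_ascii helper are hypothetical names made up for illustration, the mapping is deliberately incomplete, and the input is assumed to be valid UTF-8.

```ruby
# Hypothetical mapping; extend it with whatever characters your input contains.
SMART_TO_ASCII = {
  "\u2013" => '-',   # EN DASH
  "\u2014" => '--',  # EM DASH
  "\u2018" => "'",   # LEFT SINGLE QUOTATION MARK
  "\u2019" => "'",   # RIGHT SINGLE QUOTATION MARK
  "\u201C" => '"',   # LEFT DOUBLE QUOTATION MARK
  "\u201D" => '"',   # RIGHT DOUBLE QUOTATION MARK
  "\u2026" => '...'  # HORIZONTAL ELLIPSIS
}.freeze

# Hypothetical helper: convert UTF-8 text to US-ASCII. Characters with no
# ASCII form are looked up in the table; anything unmapped becomes '?'.
def to_ascii(text)
  text.encode('US-ASCII', fallback: ->(ch) { SMART_TO_ASCII.fetch(ch, '?') })
end

puts to_ascii("It\u2019s on pages 1\u20132")
```

Because the fallback is only consulted for characters that have no ASCII representation, plain ASCII input passes through unchanged. Replacing unmapped characters with '?' is a design choice; raising an error instead may be preferable if silent data loss is a concern.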


A few gsubs sound like the best bet, especially if the alternative is loading an entire extra library to do basically the same thing.
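
The "bunch of gsubs" can also be collapsed into a single call, since gsub accepts a hash of replacements. A minimal sketch, covering only the three characters from the question; REPLACEMENTS, SMART_CHARS, and clean_smart_chars are hypothetical names:

```ruby
# Hypothetical table covering only the characters listed in the question.
REPLACEMENTS = {
  "\u2013" => '-',  # EN DASH
  "\u2018" => "'",  # LEFT SINGLE QUOTATION MARK
  "\u2019" => "'"   # RIGHT SINGLE QUOTATION MARK
}.freeze

# One pattern that matches any key in the table.
SMART_CHARS = Regexp.union(REPLACEMENTS.keys)

# Hypothetical helper: each match is replaced by its value in the hash.
def clean_smart_chars(text)
  text.gsub(SMART_CHARS, REPLACEMENTS)
end

puts clean_smart_chars("\u2018quoted\u2019 \u2013 done")
```

Unlike a conversion to ASCII, this leaves every other non-ASCII character in the input untouched.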
