Ruby: fixing documents with multiple encodings
I'm trying to retrieve a Web page and apply a simple regular expression to it. Some Web pages contain non-UTF-8 byte sequences even though the Content-Type header claims UTF-8 (example). In these cases I get:
ArgumentError (invalid byte sequence in UTF-8)
I've tried the following methods for sanitizing the bad characters, but none of them solved the issue:
content = Iconv.conv("UTF-8//IGNORE", "UTF-8", content)
content.encode!("UTF-8", :illegal => :replace, :undef => :replace, :replace => "?")
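(Note: on Ruby 1.9, encode to the same encoding is effectively a no-op, so the second call above never touches the bad bytes. One common workaround, assuming Ruby 1.9+, is to round-trip through a different encoding such as UTF-16:)

# round-trip through UTF-16 so Ruby actually transcodes every byte,
# replacing invalid sequences along the way
content = content.encode("UTF-16BE", :invalid => :replace, :undef => :replace, :replace => "?")
content = content.encode("UTF-8")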
Here's the complete code:
require 'net/http'
require 'iconv'

response = Net::HTTP.get_response(url)
@encoding = detect_encoding(response) # detects encoding from Content-Type or the meta charset HTML tag
if @encoding
  @content = response.body.force_encoding(@encoding)
  @content = Iconv.conv(@encoding + '//IGNORE', @encoding, @content)
else
  @content = response.body
end
@content.gsub!(/.../, "") # this is where the ArgumentError is raised
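(detect_encoding isn't shown above; a rough, hypothetical sketch of such a helper, checking the Content-Type header first and then a meta charset tag, might look like this:)

def detect_encoding(response)
  # prefer the charset given in the Content-Type header
  if response['content-type'] =~ /charset=([\w-]+)/i
    $1
  # otherwise look for a meta charset tag in the body
  elsif response.body =~ /<meta[^>]+charset=["']?([\w-]+)/i
    $1
  end
end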
Is there a way to deal with this issue? Basically, all I need is to set the base URL meta tag and inject some JavaScript into the retrieved Web page.
Thanks!
I had a similar problem importing emails with different encodings; I ended up with this:
require 'iconv'

def enforce_utf8(from = nil)
  # is_utf8? is a String extension defined elsewhere, not core Ruby
  self.is_utf8? ? self : Iconv.iconv('utf8', from, self).first
rescue
  # fallback: transliterate what iconv can, ignore what it can't,
  # then strip any remaining non-ASCII codepoints
  converter = Iconv.new('UTF-8//IGNORE//TRANSLIT', 'ASCII//IGNORE//TRANSLIT')
  converter.iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*')
end
First it tries to convert from *some_format* to UTF-8; if no source encoding is given, or if Iconv fails for some reason, it falls back to a strong conversion (ignore errors, transliterate characters, and strip unrecognized ones).
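For illustration, assuming the method is opened up on String (the use of self suggests it is), a call might look like this (filename and source encoding are made up):

raw   = File.read('message.eml')         # bytes in an unknown encoding
clean = raw.enforce_utf8('ISO-8859-1')   # try Latin-1 first, fall back to the lossy path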
Let me know if it works for you ;)
A.
Use the ASCII-8BIT encoding instead.
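A minimal sketch of that approach, assuming the regexp only needs to match at the byte level:

# reinterpret the body as raw bytes; no UTF-8 validity checks apply
@content = response.body.force_encoding("ASCII-8BIT")   # "BINARY" is an alias
@content.gsub!(/.../n, "")   # the /n flag makes the regexp ASCII-8BIT too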