Ruby: fixing documents with multiple encodings
I'm trying to retrieve a Web page and apply a simple regular expression to it. Some Web pages contain non-UTF-8 byte sequences even though the Content-Type header claims UTF-8 (example). In these cases I get:
ArgumentError (invalid byte sequence in UTF-8)
I've tried the following methods for sanitizing the bad characters, but none of them solved the issue:
content = Iconv.conv("UTF-8//IGNORE", "UTF-8", content)
content.encode!("UTF-8", :illegal => :replace, :undef => :replace, :replace => "?")
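(Note: on Ruby 1.9, encode to the same encoding is effectively a no-op, so the second call above never touches the bad bytes. One common workaround, assuming Ruby 1.9+, is to round-trip through a different encoding such as UTF-16:)

# round-trip through UTF-16 so Ruby actually transcodes every byte,
# replacing invalid sequences along the way
content = content.encode("UTF-16BE", :invalid => :replace, :undef => :replace, :replace => "?")
content = content.encode("UTF-8")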
Here's the complete code:
require 'net/http'
require 'iconv'

response = Net::HTTP.get_response(url)
@encoding = detect_encoding(response) # detects encoding from Content-Type or the meta charset HTML tag
if @encoding
  @content = response.body.force_encoding(@encoding)
  @content = Iconv.conv(@encoding + '//IGNORE', @encoding, @content)
else
  @content = response.body
end
@content.gsub!(/.../, "") # this is where the ArgumentError is raised
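(detect_encoding isn't shown above; a rough, hypothetical sketch of such a helper, checking the Content-Type header first and then a meta charset tag, might look like this:)

def detect_encoding(response)
  # prefer the charset given in the Content-Type header
  if response['content-type'] =~ /charset=([\w-]+)/i
    $1
  # otherwise look for a meta charset tag in the body
  elsif response.body =~ /<meta[^>]+charset=["']?([\w-]+)/i
    $1
  end
end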
Is there a way to deal with this issue? Basically, all I need is to set the base URL meta tag and inject some JavaScript into the retrieved Web page.
Thanks!
I had a similar problem importing emails with different encodings; I ended up with this:
require 'iconv'

def enforce_utf8(from = nil)
  # is_utf8? is a String extension defined elsewhere, not core Ruby
  self.is_utf8? ? self : Iconv.iconv('utf8', from, self).first
rescue
  # fallback: transliterate what iconv can, ignore what it can't,
  # then strip any remaining non-ASCII codepoints
  converter = Iconv.new('UTF-8//IGNORE//TRANSLIT', 'ASCII//IGNORE//TRANSLIT')
  converter.iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*')
end
First it tries to convert from *some_format* to UTF-8; if no source encoding is given, or if Iconv fails for some reason, it falls back to a strong conversion (ignore errors, transliterate characters, and strip unrecognized ones).
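For illustration, assuming the method is opened up on String (the use of self suggests it is), a call might look like this (filename and source encoding are made up):

raw   = File.read('message.eml')         # bytes in an unknown encoding
clean = raw.enforce_utf8('ISO-8859-1')   # try Latin-1 first, fall back to the lossy path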
Let me know if it works for you ;)
A.
Use the ASCII-8BIT encoding instead.
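A minimal sketch of that approach, assuming the regexp only needs to match at the byte level:

# reinterpret the body as raw bytes; no UTF-8 validity checks apply
@content = response.body.force_encoding("ASCII-8BIT")   # "BINARY" is an alias
@content.gsub!(/.../n, "")   # the /n flag makes the regexp ASCII-8BIT too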