
Ruby: fixing documents with mixed encodings

I'm trying to retrieve a web page and apply a simple regular expression to it. Some web pages contain non-UTF-8 bytes even though UTF-8 is claimed in the Content-Type header (example). In these cases I get:

ArgumentError (invalid byte sequence in UTF-8)
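The error surfaces as soon as a regexp touches a string whose bytes are invalid for its claimed encoding; a minimal reproduction (the byte `0xE9` is just a sample invalid byte, not taken from the original page):

```ruby
# A string tagged UTF-8 but containing the lone byte 0xE9 (invalid UTF-8)
bad = "caf\xE9".force_encoding("UTF-8")
bad.valid_encoding?   # => false

begin
  bad =~ /./          # any regexp match triggers the error
rescue ArgumentError => e
  puts e.message      # invalid byte sequence in UTF-8
end
```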

I've tried the following methods for sanitizing the bad characters, but neither of them solved the issue:

  1. content = Iconv.conv("UTF-8//IGNORE", "UTF-8", content)
  2. content.encode!("UTF-8", :illegal => :replace, :undef => :replace, :replace => "?")
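For reference: the documented option key for `encode` is `:invalid` rather than `:illegal`, and even with the right key, transcoding UTF-8 to UTF-8 is skipped when the string is already tagged UTF-8, so the invalid bytes survive. A round-trip through an intermediate encoding (or `String#scrub` on Ruby 2.1+) forces the replacement to actually happen; a sketch:

```ruby
bad = "caf\xE9".force_encoding("UTF-8")   # contains the invalid byte 0xE9

# Round-trip through UTF-16 so the converter actually runs
clean = bad.encode("UTF-16", :invalid => :replace, :undef => :replace, :replace => "?")
           .encode("UTF-8")
clean                  # => "caf?"
clean.valid_encoding?  # => true

# On Ruby 2.1+, the same thing in one call:
bad.scrub("?")         # => "caf?"
```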

Here's the complete code:

response = Net::HTTP.get_response(url)
@encoding = detect_encoding(response) # detects encoding from Content-Type or a meta charset HTML tag
if @encoding
  @content = response.body.force_encoding(@encoding)
  @content = Iconv.conv(@encoding + '//IGNORE', @encoding, @content)
else
  @content = response.body
end

@content.gsub!(/.../, "") # bang

Is there a way to deal with this issue? Basically, all I need is to set the base URL tag and inject some JavaScript into the retrieved web page.
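For the base-tag and script injection itself, once the body is valid UTF-8 a plain substitution works for simple cases (an HTML parser such as Nokogiri is more robust; the `base_url` value and script path below are made-up placeholders):

```ruby
content  = "<html><head><title>t</title></head><body></body></html>"
base_url = "http://example.com/"  # placeholder

# Insert a <base> tag right after the opening <head> tag
content = content.sub(/<head[^>]*>/i) { |head| %(#{head}<base href="#{base_url}">) }

# Inject a script just before </body>
content = content.sub(/<\/body>/i) { |tag| %(<script src="/injected.js"></script>#{tag}) }
```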

Thanks!


I had a similar problem importing emails with different encodings, and I ended up with this:

def enforce_utf8(from = nil)
  begin
    # Try a straight conversion to UTF-8 first
    self.is_utf8? ? self : Iconv.iconv('utf8', from, self).first
  rescue
    # Fall back to a lossy conversion: transliterate what it can,
    # ignore what it can't, then drop any remaining non-ASCII code points
    converter = Iconv.new('UTF-8//IGNORE//TRANSLIT', 'ASCII//IGNORE//TRANSLIT')
    converter.iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*')
  end
end

At first it tries to convert from *some_format* to UTF-8; in case there isn't any encoding, or Iconv fails for some reason, it applies a stronger conversion (ignore errors, transliterate characters, and strip unrecognized characters).
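Since Iconv was deprecated in Ruby 1.9 and removed from the standard library in 2.0, the same idea can be sketched with `String#encode`/`#scrub` (the helper name, signature, and `"?"` fallback replacement here are my own choices, not the answer's):

```ruby
# Hypothetical modern stand-in for the Iconv-based helper above
def enforce_utf8(str, from = nil)
  if from && from.casecmp("UTF-8") != 0
    # Known source encoding: reinterpret the bytes, then transcode
    str.dup.force_encoding(from)
       .encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
  else
    # Unknown or already-UTF-8 source: just replace invalid sequences
    str.dup.force_encoding("UTF-8").scrub("?")
  end
end

enforce_utf8("caf\xE9".b)                # => "caf?"
enforce_utf8("caf\xE9".b, "ISO-8859-1")  # => "café"
```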

Let me know if it works for you ;)

A.


Use the ASCII-8BIT encoding instead.
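That is, re-tag the body as binary so regexp matching works byte-wise and never raises on invalid UTF-8; a sketch (the sample markup is made up, and note that captured text keeps the binary encoding):

```ruby
raw = "caf\xE9 <title>Hi</title>".force_encoding("ASCII-8BIT")

# ASCII-only patterns match fine against a binary string; the /n flag
# forces the regexp itself to binary as well
title = raw[%r{<title>(.*?)</title>}n, 1]  # => "Hi"
```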

