Hpricot encodings in Ruby 1.9
I have a Rails 3 application running on Ruby 1.9, and I'm having trouble getting encodings to work.
My task is to open a remote HTML page and parse some information from it. All my code and my database are in UTF-8; I'm using the # code: UTF-8 comment, the MySQL fix, and so on.
The page I open is in charset ISO-8859-1, and when my parser finds strange characters it complains that they are not valid UTF-8.
I tried calling .force_encoding("UTF-8") on every string I parsed, but it still fails. When I try to convert the whole page, I get this:
a = open("someurl")
b = a.read.encode("UTF-8")
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8
from (irb):7:in `encode'
from (irb):7
from /Users/tscolari/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.0/lib/rails/commands/console.rb:44:in `start'
from /Users/tscolari/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.0/lib/rails/commands/console.rb:8:in `start'
from /Users/tscolari/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.0/lib/rails/commands.rb:23:in `<top (required)>'
from script/rails:6:in `require'
from script/rails:6:in `<main>'
How can I fix this? It seems it already went wrong when the ISO-8859-1 page was "converted" to ASCII.
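A minimal sketch of why the conversion above fails and one way around it, assuming the page really is ISO-8859-1. The string returned by read is tagged ASCII-8BIT (binary), so encode has no source charset to convert from, and force_encoding("UTF-8") only relabels the bytes without changing them. Tagging the bytes with their real charset first and then transcoding avoids both problems. The helper name latin1_to_utf8 and the sample bytes are made up for illustration.

```ruby
# Stand-in for a.read: the bytes of "café" in ISO-8859-1.
# Array#pack("C*") returns a string tagged ASCII-8BIT, like the one in the error.
raw = [0x63, 0x61, 0x66, 0xE9].pack("C*")

# Hypothetical helper: label the bytes with their real charset, then transcode.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

utf8 = latin1_to_utf8(raw)
puts utf8.encoding
puts utf8.valid_encoding?
```

The dup keeps the original binary string untouched, since force_encoding modifies its receiver in place.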
UPDATE
I tried opening the URL with 'r:iso-8859-1:utf-8', but apparently my problem now is with Hpricot, which I use for parsing.
>a = open(b, 'r:iso-8859-1:utf-8')
>a.read.encoding
=> #<Encoding:UTF-8>
> Hpricot(a).inner_html.encoding
=> #<Encoding:ASCII-8BIT>
And all the errors come back. This is probably an Hpricot issue, but if anyone knows a fix, please share it.
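One possible workaround for the ASCII-8BIT result, sketched under an assumption that is not confirmed here: that Hpricot passes the UTF-8 bytes through unchanged and only drops the encoding tag. If that holds, relabelling the output as UTF-8 is enough, and checking valid_encoding? guards against the case where it does not. The helper retag_as_utf8 is hypothetical, and the binary string below stands in for Hpricot(a).inner_html so the example runs without the gem.

```ruby
# Hypothetical helper: relabel a binary-tagged string as UTF-8 without
# changing its bytes, and fail loudly if the bytes are not valid UTF-8.
def retag_as_utf8(str)
  copy = str.dup.force_encoding("UTF-8")
  raise ArgumentError, "bytes are not valid UTF-8" unless copy.valid_encoding?
  copy
end

# Stand-in for Hpricot(a).inner_html: the UTF-8 bytes of "café", tagged ASCII-8BIT.
binary_html = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*")

html = retag_as_utf8(binary_html)
puts html.encoding
puts html == "caf\u00E9"
```

With the real library this would be called as retag_as_utf8(Hpricot(a).inner_html); if it raises, the bytes coming out of the parser are not UTF-8 and need to be transcoded from their actual charset instead.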
Hpricot - UTF-8 issues invalid byte sequence in UTF-8 (ArgumentError)
require 'hpricot'
require 'open-uri'
doc = open('http://www.amazon.co.jp/') { |f| Hpricot(f.read) }
puts doc.to_html
open('http://www.amazon.co.jp/') { |f| Hpricot(f.read.encode("UTF-8")) }
a = open("someurl", "r:iso-8859-1:utf-8")
See this other SO question for more details...