Issue with CGI::unescapeHTML
CGI::unescapeHTML("&#28195;打银")
=> "渣打\351\223\266"
CGI::unescapeHTML("&#28195;打银 ")
=> "渣打银 "
Adding a space at the end makes the difference; without it, the last character is lost and I get this strange byte sequence instead. I am facing this issue when I try to scrape data from websites that use UTF-8 character encoding. It happens even with normal English text.
This is not a problem with the CGI library that comes with Ruby 1.9.2 and above.
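For what it's worth, here is a minimal sketch of how this could be checked on a newer Ruby. The helper name `unescape_utf8` is made up for illustration, and the `force_encoding` step is an assumption about scraped input (bytes that arrive without a UTF-8 tag); it is not something the CGI library requires.

```ruby
# encoding: utf-8
require 'cgi'

# Hypothetical helper (not part of CGI): tag the input as UTF-8 before
# decoding, on the assumption that scraped bytes may arrive untagged.
def unescape_utf8(text)
  CGI.unescapeHTML(text.dup.force_encoding(Encoding::UTF_8))
end

decoded = unescape_utf8("&#28195;打银")
puts decoded                  # prints 渣打银
puts decoded.encoding         # prints UTF-8
puts decoded.valid_encoding?  # prints true
```

With a UTF-8 tagged string the numeric entity and the trailing character both come through, with or without a space at the end.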
Run your Ruby interpreter with -Ku (for example, ruby -Ku script.rb). On Ruby 1.8 this sets $KCODE to UTF-8.