
HTML encode UTF-8 string gets mangled into latin1

I'm parsing my nginx logs, and I want to discover some details from the HTTP_REFERER string, for example, the query string used to find the web site. One user typed in "México" which gets encoded in the log as "query=M%E9xico".

Passing this through Rack::Utils.parse_query('query=M%E9xico') you get a hash, {"query" => "M?xico"}

When you try to stuff "M?xico" into Postgres (but not the more forgiving SQLite), it pukes because the string isn't proper UTF-8. Looking at http://rack.rubyforge.org/doc/Rack/Utils.html#M000324, unescape is packing a hex string.
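To see why the string is rejected, here is a minimal sketch using only Ruby's standard `uri` library (not Rack itself, so it is an approximation of what the log parser sees). It assumes the referrer was percent-encoded as ISO-8859-1, where "é" is the single byte 0xE9:

```ruby
require "uri"

# %E9 is the single byte 0xE9: "é" in ISO-8859-1 (Latin-1), but an
# incomplete multi-byte sequence when the bytes are read as UTF-8.
decoded = URI.decode_www_form_component("M%E9xico") # result is tagged UTF-8

puts decoded.encoding         # UTF-8
puts decoded.valid_encoding?  # false
p decoded.bytes               # [77, 233, 120, 105, 99, 111]
```

The string carries a UTF-8 label but holds Latin-1 bytes, which is the kind of input a strict UTF-8 database refuses.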

How can I convert the string back to UTF-8, or can I get parse_query to return UTF-8 in the first place?


unescape will decode the URL encoding:

Rack::Utils.parse_query(URI.unescape('query=M%E9xico'))

Or

Rack::Utils.parse_query(Rack::Utils.unescape('query=M%E9xico'))


The problem here happens well before you get hold of the data. Fix it upstream if you can. If you can't, my suggestion is to find out the encoding and convert the string on input, using one of Ruby's conversion libraries (iconv, for example).

The problem is not in PostgreSQL, though.
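The "convert it on input" suggestion could look roughly like the sketch below. It uses `String#force_encoding` and `String#encode` instead of iconv, since iconv is no longer part of the standard library in Ruby 2.0 and later. `decode_query_value` is a hypothetical helper, not a Rack API, and the Latin-1 fallback is an assumption about where the bad bytes come from:

```ruby
require "uri"

# Hypothetical helper (not part of Rack): percent-decode one query value and
# return it as UTF-8. Assumes that any value which is not valid UTF-8 was
# encoded as ISO-8859-1 (Latin-1); change `fallback` if the logs say otherwise.
def decode_query_value(value, fallback: Encoding::ISO_8859_1)
  decoded = URI.decode_www_form_component(value) # bytes tagged as UTF-8
  return decoded if decoded.valid_encoding?

  # Relabel the raw bytes with the assumed source encoding, then transcode.
  decoded.force_encoding(fallback).encode(Encoding::UTF_8)
end

latin1 = decode_query_value("M%E9xico")    # Latin-1 input, converted
utf8   = decode_query_value("M%C3%A9xico") # already UTF-8, returned as-is
```

Both calls should yield the same valid UTF-8 string, which can then be stored in Postgres. Trying UTF-8 first and falling back only on invalid bytes is a heuristic, so a value in some third encoding would still be converted incorrectly.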
