
Unescaping characters in a string with Ruby

Given a string in the following format (the Posterous API returns posts in this format):

s="\\u003Cp\\u003E"

How can I convert it to the actual ascii characters such that s="<p>"?

On OSX, I successfully used Iconv.iconv('ascii', 'java', s) but once deployed to Heroku, I receive an Iconv::IllegalSequence exception. I'm guessing that the system Heroku deploys to doesn't support the java encoder.


I am using HTTParty to make a request to the Posterous API. If I use curl to make the same request, I do not get the doubled backslashes.

From HTTParty github page:

Automatic parsing of JSON and XML into ruby hashes based on response content-type

The Posterous API returns JSON (no doubled backslashes), and HTTParty's JSON parsing is inserting the extra backslash.


Here is a simple example of the way I am using HTTParty to make the request.

class Posterous
  include HTTParty
  base_uri "http://www.posterous.com/api/2"
  basic_auth "username", "password"
  format :json
  def get_posts
    response = Posterous.get("/users/me/sites/9876/posts?api_token=1234")
    # snip, see below...
  end
end

With the obvious information (username, password, site_id, api_token) replaced with valid values.

At the point of snip, response.body contains a Ruby string that is in JSON format and response.parsed_response contains a Ruby hash object which HTTParty created by parsing the JSON response from the Posterous API.

In both cases the unicode sequences such as \u003C have been changed to \\u003C.


I've found a solution to this problem. I ran across this gist. elskwid had the identical problem and ran the string through a JSON parser:

s = ::JSON.parse('["\\u003Cp\\u003E"]').first

Now, s = "<p>".
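
A bare escaped fragment is not a complete JSON document on its own, so one way to apply this idea is to wrap the text in a JSON array before parsing and then take the first element. The following is a minimal sketch rather than the gist's code; the helper name unescape_json_string is made up for illustration, and it assumes the input contains nothing that would be invalid inside a JSON string (such as an unescaped double quote).

```ruby
require "json"

# Hypothetical helper: decodes \uXXXX escapes by treating the text as the
# body of a JSON string literal wrapped in a one-element array.
# Assumption: the input has no unescaped double quotes or raw control characters.
def unescape_json_string(text)
  JSON.parse(%(["#{text}"])).first
end

escaped = "\\u003Cp\\u003E"          # a real backslash followed by u003C, and so on
puts unescape_json_string(escaped)   # prints <p>
```

Wrapping in an array keeps the sketch working with older JSON libraries that only accept an object or array at the top level.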


I ran into this exact problem the other day. There is a bug in the json parser that HTTParty uses (Crack gem) - basically it uses a case-sensitive regexp for the Unicode sequences, so because Posterous puts out A-F instead of a-f, Crack isn't unescaping them. I submitted a pull request to fix this.
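
To illustrate the kind of bug described above, here is a small sketch comparing a pattern that only accepts lowercase hex digits with one that accepts both cases. These patterns and the unescape_with helper are written for illustration only; they are not copied from Crack's source.

```ruby
# Illustrative patterns only, not Crack's actual implementation.
LOWERCASE_ONLY = /\\u([0-9a-f]{4})/      # misses \u003C because of the uppercase C
ANY_CASE       = /\\u([0-9a-fA-F]{4})/   # matches both \u003c and \u003C

# Replaces each matched escape with the character for that code point.
def unescape_with(pattern, text)
  text.gsub(pattern) { [$1.hex].pack("U") }
end

posterous_style = "\\u003Cp\\u003E"
puts unescape_with(LOWERCASE_ONLY, posterous_style)  # prints \u003Cp\u003E (unchanged)
puts unescape_with(ANY_CASE, posterous_style)        # prints <p>
```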

In the meantime HTTParty nicely lets you specify alternate parsers so you can do ::JSON.parse bypassing Crack entirely like this:

class JsonParser < HTTParty::Parser
  def json
    ::JSON.parse(body)
  end
end

class Posterous
   include HTTParty
   parser ::JsonParser

   #....
end


You can also use pack:

"a\\u00e4\\u3042".gsub(/\\u(....)/){[$1.hex].pack("U")} # "aäあ"

Or to do the reverse:

"aäあ".gsub(/[^ -~\n]/){"\\u%04x"%$&.ord} # "a\\u00e4\\u3042"


The doubled-backslashes almost look like a regular string being viewed in a debugger.

The string "\u003Cp\u003E" really is "<p>": \u003C is the Unicode escape for < and \u003E is the escape for >.

>> "\u003Cp\u003E"  #=> "<p>"

If you are truly getting the string with doubled backslashes then you could try stripping one of the pair.

As a test, see how long the string is:

>> "\\u003Cp\\u003E".size #=> 13
>> "\u003Cp\u003E".size #=> 3
>> "<p>".size #=> 3
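
A quick way to tell whether the doubled backslashes are really in the string or are only an artifact of how it is displayed is to compare puts with inspect. This sketch assumes Ruby 1.9 or later; the variable names are arbitrary.

```ruby
actual  = "\\u003Cp\\u003E"   # literal for a real backslash followed by u003C, and so on
decoded = "\u003Cp\u003E"     # Ruby decodes these escapes when it parses the literal

puts actual           # prints \u003Cp\u003E  (the real contents, one backslash each)
puts actual.inspect   # prints "\\u003Cp\\u003E"  (inspect escapes each backslash)
puts decoded          # prints <p>
```

If puts shows a single backslash, the string holds undecoded escape sequences and the doubling comes only from inspect or IRB's display.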

All the above was done using Ruby 1.9.2, which is Unicode aware. v1.8.7 wasn't. Here's what I get using 1.8.7's IRB for comparison:

>> "\u003Cp\u003E" #=> "u003Cpu003E"