Can a malformed JSON string be parsed successfully?
Here's a sample string:
String s = "{\"source\": \"another \"quote inside\" text\"}";
What's the best way to parse this? I've already tried four parsers: json-lib, json-simple, gson, and the Grails built-in JSON parser.
I'm using Java, and I want to know if there's a way to fix the string after catching a MalformedJsonException or a similar error.
Note: could this be a bug in the Twitter API? Here's a sample response string:
{
"coordinates": null,
"user": {
"is_translator": false,
"show_all_inline_media": false,
"following": null,
"geo_enabled": false,
"profile_background_color": "C0DEED",
"listed_count": 11,
"profile_background_image_url": "http://a3.twimg.com/a/1298064126/images/themes/theme1/bg.png",
"favourites_count": 4,
"followers_count": 66,
"contributors_enabled": false,
"statuses_count": 1078,
"time_zone": "Tokyo",
"profile_text_color": "333333",
"friends_count": 51,
"profile_sidebar_fill_color": "DDEEF6",
"id_str": "107723125",
"profile_background_tile": false,
"created_at": "Sat Jan 23 14:16:03 +0000 2010",
"profile_image_url": "http://a3.twimg.com/profile_images/652140488/--------------_normal.jpg",
"description": "Mu8ecdu56e3u306eu56e3u9577u3068u30eau30fcu30c0u30fcu3067u3059u3002u8da3u5473u306fu7af6u99acu306eu4e88u60f3u3068u30b0u30e9u30c3u30d7u30eau30f3u30b0u3068u6253u6483u3092u30e1u30a4u30f3u3068u3057u3066u3044u307eu3059u3063uff01",
"location": "u5bccu5c71u770c",
"notifications": null,
"profile_link_color": "0084B4",
"protected": false,
"screen_name": "mattsun0209",
"follow_request_sent": null,
"lang": "ja",
"profile_sidebar_border_color": "C0DEED",
"name": "u307eu3063u3064u3093",
"verified": false,
"id": 107723125,
"profile_use_background_image": true,
"utc_offset": 32400,
"url": null
},
"in_reply_to_screen_name": null,
"in_reply_to_status_id": null,
"in_reply_to_status_id_str": null,
"in_reply_to_user_id": null,
"text": "u3042u30fcu3001u7d50u819cu708eu306bu306au3063u3066u3057u307eu3063u305fu3002",
"contributors": null,
"retweeted": false,
"in_reply_to_user_id_str": null,
"retweet_count": 0,
"source": "u003Ca href="http: //twtr.jp" rel="nofollow"u003EKeitai Webu003C/au003E",
"id_str": "42128197566861312",
"created_at": "Mon Feb 28 07:45:19 +0000 2011",
"geo": null,
"entities": {
"hashtags": [],
"user_mentions": [],
"urls": []
},
"truncated": false,
"place": null,
"id": 42128197566861312,
"favorited": false
}
Take note of the source property:
"source": "u003Ca href="http: //twtr.jp" rel="nofollow"u003EKeitai Webu003C/au003E"
I'm afraid that's a classic "garbage in, garbage out" situation. The JSON is invalid, and so you can't parse it properly. You can only guess at what it's meant to be. Now, we humans can guess pretty well at what was intended (obviously), but that's much more difficult at a parser level.
If you know that you're consistently getting this invalid source property, you could pre-process the string before deserializing it, but the real fix has to be at the source of the invalid data: Twitter, or whatever twit (as it were) is providing it. I'm assuming that this is the actual string data you've received, and not a processed form of it.
Preprocess the data before parsing it.
For each line, find the first colon (assumption: no colons in property names), then escape every double quote on the line except for the first after the colon and last on the line.
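The per-line rule above could be sketched in Java roughly as follows. This is only a heuristic under the stated assumptions (one property per line, no colons in property names, and a string value whose opening quote directly follows the colon); the class and method names QuoteRepair, repairLine, and repair are made up for illustration and do not come from any library.

```java
final class QuoteRepair {

    // Escapes stray double quotes inside the value of a one-line "key": "value" pair.
    // Lines that do not look like a string property are returned unchanged.
    static String repairLine(String line) {
        int colon = line.indexOf(':');            // assumption: no colon in the property name
        if (colon < 0) return line;
        int open = line.indexOf('"', colon + 1);  // first quote after the colon
        int close = line.lastIndexOf('"');        // last quote on the line
        if (open < 0 || close <= open) return line;
        // Only handle values whose opening quote directly follows the colon.
        if (!line.substring(colon + 1, open).trim().isEmpty()) return line;

        StringBuilder sb = new StringBuilder(line.length() + 8);
        sb.append(line, 0, open + 1);
        for (int i = open + 1; i < close; i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < close) {
                // Keep an existing escape sequence (such as \" or \\) intact.
                sb.append(c).append(line.charAt(++i));
            } else if (c == '"') {
                sb.append("\\\"");                // escape an unescaped inner quote
            } else {
                sb.append(c);
            }
        }
        sb.append(line, close, line.length());
        return sb.toString();
    }

    // Applies repairLine to every line of the input.
    static String repair(String json) {
        String[] lines = json.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            lines[i] = repairLine(lines[i]);
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String s = "{\"source\": \"another \"quote inside\" text\"}";
        System.out.println(repair(s));
    }
}
```

The repaired text can then be handed to whichever JSON parser you are already using. Note that the sketch will mangle input that breaks the assumptions, for example several string properties on one line, so it is best applied only after a normal parse attempt has failed.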
According to the JSON grammar, the format is invalid. Unless you work at Twitter, the only viable choice is to preprocess the response before parsing it.