Need help with a .NET regular expression for parsing JSON strings
I am writing a JSON parser for .NET and it parses JSON objects near perfectly so far. One problem I'm having, though, is that it will parse simple strings, but will not parse complex strings. Here is an example:
It will parse \"Hi there!\" as a string.
It will not parse \"Hi !*\t\r\n,,{}]][] (.&^.)@!+=~`' there\"
The spec I am using for a JSON string is directly from the JSON website.
My .NET regex strings (as I interpreted from the site) are:
string json_char = @"(\\""|\\\\|\\/|\\b|\\f|\\n|\\r|\\u|[^(\""|\\)])";
string json_string = @"(\""" + json_char + @"*\"")";
The above are exactly as they appear in Visual Studio. Note that with the @ symbols, two double-quotes ("") are required to specify a single double-quote (") character in the actual string value.
The above regex strings match nothing in the second, complex string example I gave above. I've fiddled with the regex strings, but nothing seems to work.
What I want is a regex string that will parse a JSON string as specified by the website. Any help is appreciated.
If I were writing the parser, I might approach it a bit differently. Parsing is a different kind of operation than matching, and sometimes a regex will only get you halfway there. For example, I would probably match and capture all the quoted strings (names and string values) from the parent JSON document using a regex like this:
string pattern = @"(?:""[^""\\]*(?:\\.[^""\\]*)*"")+";
which will return everything between and including the opening and closing quotes of the string. Then I would check the captured string for the exceptional cases outlined in the JSON spec, like a backslash not being followed by a valid escape code, and throw an exception if I found any issues. I might also consider replacing any naked control characters, like a tab character, with the corresponding escape code, such as \t. Once I had the captured string sanitized and error-checked, I could run Regex.Unescape() to return the final string.
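A minimal sketch of that capture-then-validate idea, under some assumptions: the class and method names (JsonStringSketch, ReadFirstString) are hypothetical, the outer (?:...)+ is dropped so that each quoted string is a separate match, and the optional sanitizing step for raw control characters is left out. Validation here removes every valid escape and treats any leftover backslash as an error, which is only one possible way to do the check.

```csharp
using System;
using System.Text.RegularExpressions;

public static class JsonStringSketch
{
    // The pattern from the answer, minus the outer (?:...)+, so that each
    // quoted string comes back as its own match.
    private static readonly Regex Quoted =
        new Regex(@"""[^""\\]*(?:\\.[^""\\]*)*""");

    // One valid JSON escape sequence.
    private static readonly Regex ValidEscape =
        new Regex(@"\\(?:[""\\/bfnrt]|u[0-9A-Fa-f]{4})");

    // Hypothetical helper: decodes the first quoted string found in json,
    // or throws FormatException if there is none or an escape is invalid.
    public static string ReadFirstString(string json)
    {
        Match m = Quoted.Match(json);
        if (!m.Success)
            throw new FormatException("No quoted string found.");

        // Strip the surrounding quotes.
        string body = m.Value.Substring(1, m.Value.Length - 2);

        // Remove every valid escape; any backslash left over does not
        // start a valid escape sequence.
        if (ValidEscape.Replace(body, "").Contains("\\"))
            throw new FormatException("Invalid escape sequence in: " + m.Value);

        // Regex.Unescape understands all of the JSON escapes checked above.
        return Regex.Unescape(body);
    }
}
```

For instance, JsonStringSketch.ReadFirstString(@"""Hi\tthere""") should return Hi, a tab character, and there, while an input such as @"""bad \x""" should throw a FormatException.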
The first thing you should do is get rid of all those unneeded backslashes. Some of them should just be removed; for example, the backslash in \"" is simply ignored. The rest of the backslashes are pulling their weight, but you don't have to write them out every time. For example, this will match escaped quotes and backslashes plus the whitespace escape sequences (FYI, you left the t out of your regex):
@"\\[""\\/bfnrt]"
I left out the u for Unicode escapes because it has to be followed by four hexadecimal digits; you have to match them separately from the other escapes. Adding them to the above regex gives you
@"\\(?:[""\\/bfnrt]|u[0-9A-Fa-f]{4})"
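As a quick check of the escape pattern, here is a small sketch that anchors it with ^ and $ so a whole input must be exactly one escape sequence. The class and method names (EscapeDemo, IsJsonEscape) are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Anchored copy of the escape pattern, so IsMatch tests the whole input.
    private static readonly Regex JsonEscape =
        new Regex(@"^\\(?:[""\\/bfnrt]|u[0-9A-Fa-f]{4})$");

    // Hypothetical helper: true if s is exactly one valid JSON escape.
    public static bool IsJsonEscape(string s)
    {
        return JsonEscape.IsMatch(s);
    }
}
```

With this, EscapeDemo.IsJsonEscape(@"\n") and EscapeDemo.IsJsonEscape(@"\u00e9") should be true, while @"\u12" (too few hex digits) and @"\x" (not a JSON escape) should be false.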
Finally, you seem to be using [^(""|\\)] for the catch-all part, i.e., any Unicode character except a quotation mark, a backslash, or a control character. What that part actually matches is any character except (, ", |, ), or a backslash, because grouping and alternation have no special meaning inside a character class. The correct way to match anything but a quotation mark or a backslash would be [^""\\], but you also need to exclude control characters. For that you can use the Unicode property \p{Cc}. Here's the whole thing:
@"""(?:[^\p{Cc}""\\]+|\\(?:[""\\/bfnrt]|u[0-9A-Fa-f]{4}))*"""
Notice that I included the quote delimiters in this regex instead of adding them in a separate step as you did. I'm assuming the backslash in \" is not meant to be treated as a literal character; otherwise you would have used two of them.
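To try the complete pattern against the two examples from the question, something like the following sketch could be used. It assumes the \t, \r and \n in the complex example are escape sequences in the JSON text (a backslash followed by a letter), not raw control characters. The anchors and the names JsonStringMatcher and IsJsonString are additions for illustration only.

```csharp
using System.Text.RegularExpressions;

public static class JsonStringMatcher
{
    // The full pattern from above, anchored so the whole input must be
    // a single JSON string literal.
    private static readonly Regex JsonString = new Regex(
        @"^""(?:[^\p{Cc}""\\]+|\\(?:[""\\/bfnrt]|u[0-9A-Fa-f]{4}))*""$");

    // Hypothetical helper: true if text is one well-formed JSON string.
    public static bool IsJsonString(string text)
    {
        return JsonString.IsMatch(text);
    }
}
```

Both @"""Hi there!""" and @"""Hi !*\t\r\n,,{}]][] (.&^.)@!+=~`' there""" should be accepted. A string containing a raw tab, such as "\"Hi\tthere\"" written as a regular (non-verbatim) C# literal, should be rejected because \p{Cc} excludes control characters.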
Note that with the @ symbols, two double-quotes ("") are required to specify a single double-quote (") character in the actual string value.
In addition, in an @-prefixed (verbatim) string, a backslash character is a literal backslash. So if you write, say, @"\\t", the regex engine will look for a backslash followed by a "t", not a tab character.
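The difference is easy to see side by side. In this small sketch (the class and field names are made up for illustration), both patterns are written as verbatim literals, so the regex engine receives exactly the characters typed.

```csharp
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // The engine receives \\t: a literal backslash followed by the letter t.
    public static readonly Regex BackslashThenT = new Regex(@"\\t");

    // The engine receives \t: the regex escape for a tab character.
    public static readonly Regex Tab = new Regex(@"\t");
}
```

Here VerbatimDemo.BackslashThenT.IsMatch(@"a\tb") should be true and VerbatimDemo.BackslashThenT.IsMatch("a\tb") false, because the second input contains a real tab. For VerbatimDemo.Tab the results should be the other way around.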
I suspect that these superfluous backslashes are the source of your problem.