Normalizing whitespace in <script></script> blocks

I want to write regular expressions to read data inside <script></script> blocks in HTML. Since the content is script, I suppose there is flexibility in whitespace. To make my regex patterns robust, I would have to anticipate varying amounts of whitespace. Perhaps there is an easier way than putting many whitespace matchers in my patterns. For example, might there be a normalizer? (The normalizer would of course have to understand string literals in order not to ruin them.)

I'm using .NET and the Regex class. (Note: the Regex class has an ECMAScript option, which I thought might enable a feature that understands script whitespace, but judging by its description, it does not.)

Edit: The Regex class has an "IgnorePatternWhitespace" option, but note that it only grants flexibility in how regex patterns are written. It doesn't change the parsing/matching behaviour.
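To illustrate the point about IgnorePatternWhitespace, here is a minimal sketch. The class name and the sample pattern and inputs are made up for illustration. It shows that the option strips whitespace from the pattern only, so whitespace in the input still has to be matched explicitly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IgnorePatternWhitespaceDemo
{
    // Hypothetical helper: the spaces in the pattern are ignored,
    // so this is effectively the pattern "a:b".
    public static bool Matches(string input)
    {
        var regex = new Regex(@"a : b", RegexOptions.IgnorePatternWhitespace);
        return regex.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(Matches("a:b"));   // True: the pattern is really "a:b"
        Console.WriteLine(Matches("a : b")); // False: spaces in the input are not skipped
    }
}
```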

I am trying to avoid putting whitespace matchers in many locations in the following kind of patterns:

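// Each \xNN escape is a literal punctuation character: \x3a ':', \x2c ',', \x7b '{', \x7d '}', \x5b '[', \x5d ']'.
const string propertyKey = @""".+""";
const string propertyValue = @""".+""";
string property = propertyKey + @"\x3a" + propertyValue;
string actionProperties = property + @"(\x2c" + property + @")*";
string actionPattern = @"\x7b" + actionProperties + @"\x7d";
string contentPattern = actionPattern + @"(\x2c" + actionPattern + @")*";
// The square brackets are escaped so they match literally instead of opening a character class.
string corporateActionsPattern = @"corp_actions\s*:\s*""\s*\x5b" + contentPattern + @"\x5d\s*""";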


As kirilloid already noted in the comments, the JavaScript language is much too complex to be parsed with regular expressions. What you need is a fully fledged JavaScript parser, which is nontrivial to write.

What are you trying to achieve by this?

Maybe there is a better way, and people here could help you if they knew what it is you hope to get out of it :)


The imperfect solution was to normalize the script by removing all whitespace (not merely collapsing it down to a single space) while leaving string literals intact. Regex patterns then become easier to write. Note that the JavaScript itself is ruined, because reserved words and identifiers run into each other once the whitespace is removed, but the risk of problems is low if the goal is to parse only the "data" parts (i.e. string literals, numbers, and the punctuation that surrounds them).
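A whitespace stripper of this kind might look roughly like the sketch below. This is not the code actually used; the names ScriptWhitespace and Strip are hypothetical. It assumes string literals are delimited by single or double quotes with backslash escapes, and it makes no attempt to handle comments, regex literals, or template literals.

```csharp
using System;
using System.Text;

public static class ScriptWhitespace
{
    // Hypothetical helper: removes all whitespace outside of string literals.
    public static string Strip(string script)
    {
        var result = new StringBuilder(script.Length);
        char quote = '\0'; // the open quote character, or '\0' when outside a literal

        for (int i = 0; i < script.Length; i++)
        {
            char c = script[i];

            if (quote != '\0')
            {
                // Inside a string literal: copy everything verbatim.
                result.Append(c);
                if (c == '\\' && i + 1 < script.Length)
                {
                    // Copy the escaped character too, so \" does not close the literal.
                    result.Append(script[++i]);
                }
                else if (c == quote)
                {
                    quote = '\0';
                }
            }
            else if (c == '"' || c == '\'')
            {
                // Opening quote of a new literal.
                quote = c;
                result.Append(c);
            }
            else if (!char.IsWhiteSpace(c))
            {
                result.Append(c);
            }
        }

        return result.ToString();
    }

    public static void Main()
    {
        string script = "var x = { \"key one\" : \"value 1\" , \"k2\":  42 } ;";
        Console.WriteLine(Strip(script)); // varx={"key one":"value 1","k2":42};
    }
}
```

After stripping, patterns such as the ones in the question no longer need \s* between tokens, although whitespace inside string literals is still preserved and has to be matched as-is.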
