
How to search text surrounded by double-quotes with RegEx?

I have a string with some HTML code in, for example:

This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>

I need to strip out the id attribute from every HTML tag, but I have zero experience with regular expressions, so I searched here and there from the internet and I wrote this pattern: [\s]+id=\".*\"

Unfortunately it's not working as I would expect. In fact, I was hoping that the regular expression would catch the id=" followed by any character repeated any number of times and terminated by the nearest double quote. In this example I was expecting to catch id="c1-id-8" and id="c1-id-9". Instead the pattern returned the substring id="c1-id-8">some</strong> <em id="c1-id-9": it finds the first occurrence of id=" and the last occurrence of a double quote character.

Could you tell me what is wrong in my pattern and how to fix it, please? Thank you very much


The quantifier .* in your regex is greedy (meaning it matches as much as it can). In order to match the minimum required you could use something like /\s+id=\"[^\"]*\"/. The brackets [] indicate a character class, which matches any single character listed inside the brackets. The caret ^ at the beginning of the character class is a negation, meaning the class matches any character except the ones listed in the brackets.

An alternative would be to tell the .* quantifier to be lazy by changing it to .*? which will match as little as it can.
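To make the difference concrete, here is a small sketch that runs the three variants against the string from the question. The variable names are only illustrative, and it assumes an environment with console.log (Node or a browser console).

```javascript
const html = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

// Greedy: .* runs on to the LAST double quote in the string.
const greedy = html.match(/\s+id=".*"/g);

// Lazy: .*? stops at the first closing quote it can reach.
const lazy = html.match(/\s+id=".*?"/g);

// Negated class: [^"]* can never step over a quote.
const negated = html.match(/\s+id="[^"]*"/g);

console.log(greedy);  // one long match spanning both tags
console.log(lazy);    // two matches, one per id attribute
console.log(negated); // the same two matches

// Stripping the attributes, which is what the question is after.
const stripped = html.replace(/\s+id="[^"]*"/g, '');
console.log(stripped); // This is <strong>some</strong> <em>text</em>
```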


In .* the asterisk is a greedy quantifier and matches as many characters as it can, so it only stops at the last " it finds.

You can either use ".*?" to make it lazy, or (better IMO), use "[^"]*" to make the match explicit:

"      # match a quote
[^"]*  # match any number of characters except quotes
"      # match a quote

You might still need to escape the quotes if you're building the regex from a string; otherwise that's not necessary, since quotes are not special characters in a regex.
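As an illustration of that escaping point, the sketch below builds the same pattern twice in JavaScript, once as a regex literal and once from a double-quoted string. JavaScript is an assumption here, since the question doesn't say which language is in use; other languages have their own string-escaping rules.

```javascript
const input = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

// Regex literal: the quotes need no escaping at all.
const fromLiteral = /\s+id="[^"]*"/g;

// Built from a double-quoted string: the quotes are escaped for the
// STRING's sake, and the backslash in \s has to be doubled.
const fromString = new RegExp("\\s+id=\"[^\"]*\"", "g");

// Both end up as the same pattern.
console.log(fromLiteral.source === fromString.source); // true

console.log(input.replace(fromString, ''));
// This is <strong>some</strong> <em>text</em>
```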


A parser is the best solution in the general case, but parsers take time to write. There are cases where writing one would take more time than the parser would save; perhaps this is such a time.

What you want is either a non-greedy match or a more precise match. /[\s]+id=\".*?\"/ will do the trick, but /[\s]+id=\"[^"]*\"/ will generally be faster.

Note that a full regex that takes into account the possibility of escaped quotes characters, allows single quotes instead of double quotes, and allows for the absence of quotes entirely would be much more complex. You would really want a parser at that point.
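To give a feel for how quickly the pattern grows, here is a rough sketch that also accepts single-quoted and unquoted values. The idAttr name and the test string are made up for illustration. It still does not handle escaped quotes, and it has no idea whether it is looking inside a tag or at ordinary text, so it is not a substitute for a parser.

```javascript
// Three alternatives for the value: "double", 'single', or unquoted.
const idAttr = /\s+id\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+)/gi;

const mixed = `<p id="a">x</p> <p id='b'>y</p> <p id=c>z</p>`;

console.log(mixed.replace(idAttr, ''));
// <p>x</p> <p>y</p> <p>z</p>
```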


example with grep: (but the point is the expression)

kent$  echo 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>'|grep -oP '(?<= id=")[^"]*(?=">)'
c1-id-8
c1-id-9
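The same extraction can be sketched in JavaScript, using a capture group instead of the lookarounds. This assumes a runtime that supports String.prototype.matchAll; the variable names are illustrative.

```javascript
const text = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

// The group ([^"]*) captures just the value between the quotes.
const ids = Array.from(text.matchAll(/\sid="([^"]*)"/g), (m) => m[1]);

console.log(ids); // [ 'c1-id-8', 'c1-id-9' ]
```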


If you know that your id is always 7 characters, you could do this.

/\sid=".{7}"/g

So..

var a = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

var b = a.replace(/\sid=".{7}"/g, '');

document.write(b);

Example: http://jsfiddle.net/jasongennaro/XPMze/

Check the inspector to see the ids removed.
