How to search text surrounded by double-quotes with RegEx?

2023-04-07 02:20 Q&A author:

I have a string with some HTML code in, for example:

This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>

I need to strip out the id attribute from every HTML tag, but I have zero experience with regular expressions, so I searched around the internet and wrote this pattern: [\s]+id=\".*\"

Unfortunately it's not working as I would expect. I was hoping that the regular expression would catch id=" followed by any number of characters and terminated by the nearest double quote. In this example I was expecting to catch id="c1-id-8" and id="c1-id-9". Instead, the pattern returned the substring id="c1-id-8">some</strong> <em id="c1-id-9": it finds the first occurrence of id=" and the last occurrence of a double quote character.

Could you tell me what is wrong in my pattern and how to fix it, please? Thank you very much.

The quantifier .* in your regex is greedy (meaning it matches as much as it can). In order to match the minimum required you could use something like /\s+id=\"[^\"]*\"/. The brackets [] indicate a character class, which matches any single character listed inside the brackets. The caret ^ at the beginning of the character class is a negation, meaning the class matches any character except those listed in the brackets.

An alternative would be to tell the .* quantifier to be lazy by changing it to .*? which will match as little as it can.
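To make the difference concrete, here is a minimal sketch that runs the three variants against the sample string from the question. The variable names (html, greedy, negated, lazy, stripped) are made up for illustration.

```javascript
// Sample input from the question.
const html = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

// Greedy: .* runs to the end of the string, then backtracks to the LAST double quote.
const greedy = html.match(/\s+id=".*"/)[0];

// Negated character class: [^"]* cannot cross a quote, so each match ends at the nearest one.
const negated = html.match(/\s+id="[^"]*"/g);

// Lazy quantifier: .*? expands only as far as needed to reach the next quote.
const lazy = html.match(/\s+id=".*?"/g);

// Removing every match of the negated-class pattern strips the id attributes.
const stripped = html.replace(/\s+id="[^"]*"/g, '');

console.log(greedy);
console.log(negated);
console.log(lazy);
console.log(stripped);
```

On this input the negated-class and lazy patterns find the same two attributes, while the greedy pattern returns one long match that spans both tags.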

In .* the asterisk is a greedy quantifier and matches as many characters as it can, so it only stops at the last " it finds.

You can either use ".*?" to make it lazy, or (better IMO), use "[^"]*" to make the match explicit:

"      # match a quote
[^"]*  # match any number of characters except quotes
"      # match a quote

You might still need to escape the quotes if you're building the regex from a string; otherwise that's not necessary, since quotes are not special characters in a regex.
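As a small illustration of the escaping point, the sketch below builds the same pattern twice in JavaScript: once as a regex literal and once from a double-quoted string passed to the RegExp constructor. The names literal and fromString are only for this example.

```javascript
const html = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

// Regex literal: the double quotes need no escaping.
const literal = /\s+id="[^"]*"/g;

// Built from a double-quoted string: the quotes are escaped for the STRING,
// and the backslash in \s is doubled so that it survives string parsing.
const fromString = new RegExp("\\s+id=\"[^\"]*\"", "g");

// Both objects hold the same pattern text.
console.log(literal.source === fromString.source);

// Both strip the id attributes in the same way.
console.log(html.replace(fromString, ''));
```

The escaping here belongs to the string syntax of the host language, not to the regex itself.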

A parser is the best solution in the general case, but parsers do take time to write. There are cases where writing one would take more time than the parser would save; perhaps this is such a time.

What you want is either a non-greedy match or a more precise match. /[\s]+id=\".*?\"/ will do the trick, but [\s]+id=\"[^"]*\" will be faster.

Note that a full regex that takes into account the possibility of escaped quotes characters, allows single quotes instead of double quotes, and allows for the absence of quotes entirely would be much more complex. You would really want a parser at that point.
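To give a rough idea of how the pattern grows, here is a sketch that accepts double-quoted, single-quoted, and unquoted values, and tolerates whitespace around the equals sign. The pattern idAttr and the sample markup are assumptions made for this example. It still does not understand HTML: it would also match id=... appearing in text content or inside another attribute's value, which is exactly why a parser is the safer choice.

```javascript
// Hypothetical pattern: whitespace, "id", optional spaces around "=",
// then a double-quoted, single-quoted, or unquoted value.
const idAttr = /\s+id\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+)/gi;

// Made-up sample covering the three quoting styles.
const mixed = '<p id=\'a\' class="x"><b id=c>t</b><i id = "d">u</i></p>';

const cleaned = mixed.replace(idAttr, '');
console.log(cleaned);
```

Attributes such as data-id are left alone because the pattern requires whitespace immediately before id.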

An example with grep (the point is the expression, not the tool):

kent$  echo 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>'|grep -oP '(?<= id=")[^"]*(?=">)'
c1-id-8
c1-id-9
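For comparison, a rough JavaScript counterpart of the grep command is sketched below. It swaps the lookbehind and lookahead for a capturing group, which is a different technique that yields the same values on this input; the names html and ids are illustrative.

```javascript
const html = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

// matchAll yields one match object per id attribute;
// index 1 of each match is the capturing group, i.e. the text between the quotes.
const ids = Array.from(html.matchAll(/\sid="([^"]*)"/g), (m) => m[1]);

console.log(ids);
```

Unlike the grep expression, this version does not require the id to be the last attribute in the tag.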

If you know that your id is always exactly 7 characters long, you could do this:

/\sid=".{7}"/g

So:

var a = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>';

var b = a.replace(/\sid=".{7}"/g, '');

document.write(b);

Example: http://jsfiddle.net/jasongennaro/XPMze/

Check the inspector to see the ids removed.
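One caveat with the fixed-length pattern, sketched here with a made-up input: an id of any other length is silently left in place, whereas the negated character class does not depend on the length.

```javascript
// Hypothetical input where the second id is 8 characters long.
const sample = '<strong id="c1-id-8">some</strong> <em id="c1-id-10">text</em>';

// Fixed length: only the 7-character id is removed.
const fixedLength = sample.replace(/\sid=".{7}"/g, '');

// Negated character class: both ids are removed regardless of length.
const anyLength = sample.replace(/\sid="[^"]*"/g, '');

console.log(fixedLength);
console.log(anyLength);
```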
