Regex Enforcing match

OK, I have this regex:

^[\w\s]+=["']\w+['"]

Now the regex will match:

a href='google'

a href="google"

and also

a href='google"

How can I force the regex to close with the same kind of quote it opened with?

If the first quote is a single quote, how can I make the last quote also a single quote, not a double quote?


Read about backreferences.

^[\w\s]+=(["'])\w+?\1

Note that you want to put a ? after the second + or else it will be greedy. However, in general this is not the right way to parse HTML. Use Beautiful Soup.
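
To see the backreference at work, here is a minimal Python sketch. It assumes Python's re module, which supports \1 the same way; the helper name quotes_match is made up for illustration.

```python
import re

# Group 1 captures the opening quote; \1 demands the same character at the end.
QUOTED_ATTR = re.compile(r"""^[\w\s]+=(["'])\w+?\1""")

def quotes_match(text):
    """Return True when the attribute value opens and closes with the same quote."""
    return QUOTED_ATTR.search(text) is not None

for sample in ("a href='google'", 'a href="google"', "a href='google\""):
    print(repr(sample), quotes_match(sample))
```

Only the first two samples should be accepted; the mixed-quote one is rejected because \1 has to repeat whatever group 1 captured.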


I am afraid you will have to do it the long way:

^[\w\s]+=("\w+"|'\w+')

More technically, ensuring correct matching or nesting of quotes is not something a regular grammar can do in general, so for more complex problems you would have to use a proper parser (or Perl 6 style extended regular expressions, although technically those no longer count as regular expressions).
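
A quick sketch of the alternation approach, assuming Python's re module (the helper name quoted_value is hypothetical). Note that group 1 here includes the surrounding quotes, because the whole alternation sits inside the group.

```python
import re

# Each branch spells out one complete quoting style, so quotes cannot be mixed.
ALTERNATION = re.compile(r"""^[\w\s]+=("\w+"|'\w+')""")

def quoted_value(text):
    """Return the quoted value (quotes included), or None if there is no match."""
    m = ALTERNATION.search(text)
    return m.group(1) if m else None

print(quoted_value("a href='google'"))
print(quoted_value("a href='google\""))
```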


Put the opening ["'] in a capture group and replace the closing ['"] with \1, a backreference to that group:

^[\w\s]+=(["'])\w+\1


What exactly do you want to match? It sounds like you want to match:

  • word (tagname)
  • mandatory whitespace
  • word (attr name)
  • optional whitespace
  • =
  • optional whitespace
  • either single quoted or double quoted anything (attr value)

That would be: ^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")

This will allow matches like:

  • a href='' - empty attr
  • a href='Hello world' - spaces and other non-word characters in quoted part
  • a href="one 'n two" - quotes of different kind in quoted part
  • a href = 'google' - spaces on both sides of =

And disallow things like these that your original regexp allows:

  • a b c href='google' - extra words
  • ='google' - only spaces on the left
  • href='google' - only attr on the left

It still doesn't sound exactly right - you're trying to match a tag with exactly one attribute?

With this regexp, tag name will be in $1, attr name in $2, and attr value in either $3 or $4 (the other being nil - most languages distinguish group not taken with nil vs group taken but empty with "" if you need it).
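
Here is roughly how picking the value out of group 3 or 4 could look in Python, where a group that did not take part in the match comes back as None rather than nil. This is only a sketch; parse_tag is a made-up helper name.

```python
import re

# Groups: 1 = tag name, 2 = attr name, 3 = single-quoted value, 4 = double-quoted value.
TAG_ATTR = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")""")

def parse_tag(text):
    """Return (tag, attr, value), or None when the text does not match."""
    m = TAG_ATTR.search(text)
    if m is None:
        return None
    tag, attr, single, double = m.groups()
    # Exactly one of the two value groups took part in the match; the other is None.
    value = single if single is not None else double
    return tag, attr, value

print(parse_tag("a href = 'google'"))
print(parse_tag("a href=\"one 'n two\""))
print(parse_tag("a b c href='google'"))
```

Checking "is not None" instead of truthiness matters here, since an empty attr like a href='' gives "" in group 3.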

A regexp that ensures the attr value always lands in the same group gets messier if you want to allow single quotes inside a double-quoted attr value and vice versa. Something like ^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3 would do it ((?!) is a zero-width negative look-ahead, so (?:(?!\3).) means something like [^\3], except the latter isn't supported).
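
The look-ahead version can be tried in Python as well, assuming its re module, which accepts a backreference inside a negative look-ahead. The name parse_same_group is just for illustration.

```python
import re

# Group 3 is the quote character; group 4 is everything up to the next copy of it.
SAME_GROUP = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3""")

def parse_same_group(text):
    """Return (tag, attr, quote, value), or None when the text does not match."""
    m = SAME_GROUP.search(text)
    return m.groups() if m else None

print(parse_same_group("a href=\"one 'n two\""))
print(parse_same_group("a href='google\""))
```

The value is always in group 4 regardless of quote style, and the other kind of quote is allowed inside it.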

If you don't care about this, ^(\w+)\s+(\w+)\s*=\s*(['"])([^'"]*)\3 will do just fine (in both cases $3 will be the quote type and $4 the attr value).

By the way, regarding (["'])\w+?\1 above: \w doesn't match quotes, so the ? doesn't change anything.
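
A small check of that claim, sketched in Python (the sample strings and the helper name spans are my own): the lazy and greedy variants should find exactly the same spans, because \w can never run past a quote.

```python
import re

LAZY = re.compile(r"""(["'])\w+?\1""")
GREEDY = re.compile(r"""(["'])\w+\1""")

def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]

# Illustrative inputs only, not taken from the question.
samples = ["a='x' b='y'", 'href="google" title="home"', "a href='google\""]
for text in samples:
    print(repr(text), spans(LAZY, text) == spans(GREEDY, text))
```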

Having said all that, use a real HTML parser ;-)

These regexps will work in Perl and Ruby. Other languages usually copy Perl's regexp system, but often introduce minor changes so some adjustments might be necessary. Especially the one with negative look-aheads might be unsupported.
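
For the "use a real HTML parser" route, here is a minimal sketch with Python's standard-library html.parser. It assumes the input is actual markup with angle brackets (e.g. <a href='google'>), which the question's samples omit; AttrCollector and collect_attrs are hypothetical names.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collect (tag, attr name, attr value) triples from start tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        for name, value in attrs:
            self.found.append((tag, name, value))

def collect_attrs(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.found

print(collect_attrs("<a href='google'>link</a>"))
print(collect_attrs('<a href="one \'n two">link</a>'))
```

The parser deals with either quote style itself, so no quote-matching regex is needed.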


Try this:

^[\w\s]+="\w+"|^[\w\s]+='\w+'