How can I improve this regular expression?

I want a regular expression to match valid input into a Tags input field with the following properties:

  • 1-5 tags
  • Each tag is 1-30 characters long
  • Valid tag characters are [a-zA-Z0-9-]
  • input and tags can be separated by any amount of whitespace

For example:

Valid: tag1 tag2 tag3-with-dashes tag4-with-more-dashes tAaG5-with-MIXED-case

Here's what I have so far--it seems to work, but I'm interested in how it could be simplified or whether it has any major flaws:

\s*[a-zA-Z0-9-]{1,30}(\s+[a-zA-Z0-9-]{1,30}){0,4}\s*

// that is: 
\s*                          // match all beginning whitespace
[a-zA-Z0-9-]{1,30}           // match the first tag
(\s+[a-zA-Z0-9-]{1,30}){0,4} // match all subsequent tags
\s*                          // match all ending whitespace

Preprocessing the input to make the whitespace issue easier isn't an option (e.g. trimming or adding a space).

If it matters, this will be used in javascript. Any suggestions would be appreciated, thanks!
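
One thing worth checking if the pattern is run through plain JavaScript `test()` rather than a validator that anchors it implicitly: as written it has no `^` or `$`, so it can match a valid-looking substring of invalid input. A minimal sketch of the difference (the variable names here are made up for illustration):

```javascript
// The pattern exactly as posted, with no anchors.
const unanchored = /\s*[a-zA-Z0-9-]{1,30}(\s+[a-zA-Z0-9-]{1,30}){0,4}\s*/;
// The same pattern wrapped in ^...$ so it must cover the whole input.
const anchored = /^\s*[a-zA-Z0-9-]{1,30}(\s+[a-zA-Z0-9-]{1,30}){0,4}\s*$/;

const sixTags = "a b c d e f";

console.log(unanchored.test(sixTags));        // true: the first five tags match as a substring
console.log(anchored.test(sixTags));          // false: the sixth tag is left over
console.log(unanchored.test("bad_tag!"));     // true: "bad" alone satisfies the pattern
console.log(anchored.test("bad_tag!"));       // false
console.log(anchored.test("  tag1 tag2  "));  // true
```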


You can simplify it a bit like this:

^(?:(?:^|\s+)[a-zA-Z0-9-]{1,30}){1,5}\s*$

The (?: ) syntax is a noncapturing group, which I believe should improve performance when you don't need groups per se.

Then the trick is this statement:

(?:^|\s+)

Thanks to the caret, this will match the beginning of the line, or one or more characters of whitespace.
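
As a quick sanity check, the pattern can be tried in JavaScript like this. It is only a sketch; `tagsPattern` and `isValidTagInput` are names invented for the example:

```javascript
// (?:^|\s+) lets the first tag start at the beginning of the input,
// while every later tag must be preceded by at least one whitespace character.
const tagsPattern = /^(?:(?:^|\s+)[a-zA-Z0-9-]{1,30}){1,5}\s*$/;

function isValidTagInput(input) {
  return tagsPattern.test(input);
}

console.log(isValidTagInput("tag1 tag2 tag3-with-dashes")); // true
console.log(isValidTagInput("  tag1   tag2  "));            // true
console.log(isValidTagInput("a b c d e f"));                // false (six tags)
console.log(isValidTagInput("a".repeat(31)));               // false (tag too long)
console.log(isValidTagInput(""));                           // false (no tags)
```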

UPDATE: This works perfectly in my testing and there's certainly less redundant code. However, I just used the benchmarking in Regex Hero to find that your original regex is actually faster. That's probably because mine is causing more backtracking to occur.

UPDATE #2: I found another way that accomplishes the same thing, I think:

^(?:\s*[a-zA-Z0-9-]{1,30}){1,5}\s*$

I realized that I was trying too hard. \s* matches 0 or more whitespace characters, which means that it'll work for a single tag. But... it'll work for 2-5 tags as well because whitespace is not in your character class [ ]. And indeed it fails with 6 tags as it should. That means this is a much more forward-looking regex with less backtracking, better performance, and less redundancy.

UPDATE #3:

I see the error in my ways. This should work better.

^(?:\s*[a-zA-Z0-9-]{1,30}\b){1,5}\s*$

Putting the \b just before the last ) will assert a word boundary. That allows the 1-30 character length rule to work properly again.
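
To make the difference between UPDATE #2 and UPDATE #3 concrete, here is a small comparison sketch (the variable names are made up for illustration). It also shows two cases where \b may not behave as hoped, because the hyphen is not a word character:

```javascript
const update2 = /^(?:\s*[a-zA-Z0-9-]{1,30}){1,5}\s*$/;
const update3 = /^(?:\s*[a-zA-Z0-9-]{1,30}\b){1,5}\s*$/;

// A 31-character tag: UPDATE #2 splits it into a 30-char "tag" plus a 1-char "tag".
const tooLong = "a".repeat(31);
console.log(update2.test(tooLong)); // true  (the length rule is bypassed)
console.log(update3.test(tooLong)); // false (\b cannot match between two letters)

// A 32-character tag containing a hyphen: \b matches right before the "-",
// so UPDATE #3 still splits it into two pieces.
const longWithHyphen = "a".repeat(30) + "-b";
console.log(update3.test(longWithHyphen)); // true

// A tag ending in a hyphen is allowed by the stated character rules,
// but there is no word boundary after the final "-".
console.log(update3.test("tag-")); // false
```

So \b restores the length rule for purely alphanumeric tags, but tags containing hyphens still need care.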


Performance-wise, you can optimize (improve) it this way:

^(?:\s+[a-zA-Z0-9-]{1,30}){1,5}\s*$

Then prepend a whitespace character to the input before testing the regexp.

^
(?: // don't keep track of groups
\s+ // the required leading whitespace, or the whitespace between tags
  [a-zA-Z0-9-]{1,30} // unchanged
  ){1,5} // 1 to 5 tags
\s*$
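
In JavaScript that could look roughly like the sketch below, assuming the caller is free to prepend the space at test time. `spacedTagsPattern` and `isValidTagInput` are hypothetical names:

```javascript
// Every tag, including the first, must be preceded by whitespace.
const spacedTagsPattern = /^(?:\s+[a-zA-Z0-9-]{1,30}){1,5}\s*$/;

function isValidTagInput(input) {
  // Prepend one space so the first tag also satisfies the \s+ requirement.
  return spacedTagsPattern.test(" " + input);
}

console.log(isValidTagInput("tag1 tag2"));    // true
console.log(isValidTagInput("  tag1  "));     // true
console.log(isValidTagInput("a b c d e f"));  // false (six tags)
console.log(isValidTagInput("a".repeat(31))); // false (tag too long)
console.log(isValidTagInput(""));             // false (no tags)
```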


Your RE looks like it's doing pretty much exactly what you were asking for. I might recommend not using an RE at all though, in this case - just split the input on whitespace into an array, then validate each value in the array on its own.

REs are cool, but sometimes, they aren't the best way to get the job done :)
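
A minimal sketch of that approach, assuming the validation can run as ordinary JavaScript (the names `singleTag` and `isValidTagInput` are invented for the example). It still uses a tiny regex per tag, but the counting is done in code:

```javascript
// Validates one tag on its own: 1-30 characters from the allowed set.
const singleTag = /^[a-zA-Z0-9-]{1,30}$/;

function isValidTagInput(input) {
  // Collect the runs of non-whitespace; match() returns null when there are none.
  const tags = input.match(/\S+/g) || [];
  return tags.length >= 1 &&
         tags.length <= 5 &&
         tags.every((tag) => singleTag.test(tag));
}

console.log(isValidTagInput(" tag1  tag2 "));  // true
console.log(isValidTagInput("a b c d e f"));   // false (six tags)
console.log(isValidTagInput("bad_tag"));       // false (underscore not allowed)
console.log(isValidTagInput("a".repeat(31)));  // false (tag too long)
console.log(isValidTagInput("   "));           // false (no tags)
```

One advantage of this shape is that each rule (tag count, tag length, allowed characters) can report its own error message.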


\w could replace the a-zA-Z0-9 but it also contains _ if that's ok.

You may also be able to break it down a little more like this:

(\s*[a-zA-Z0-9-]{1,30}){0,5}

if you are always guaranteed to have whitespace separating your tags.


You could shorten it to something like

([a-zA-Z0-9-]{1,30}\s*){1,5}

I always like to make my regular expressions more concise (where it doesn't affect performance).


You're not going to improve on that. Anything you do to reduce the length will also make it harder to read, and regexes don't need any help in that regard. ;)

That said, your regex needs to be more complicated anyway. As written, it fails to ensure that tag names don't start or end with a hyphen, or contain two or more consecutive hyphens. The regex for a single tag would need to be structured like this:

[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*

Then the base regex to match up to five tags would be

[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:\s+[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*){0,4}

...but that doesn't enforce the maximum tag length. I think the simplest way to do that would be to put your original regex in a lookahead:

/^\s*
 (?=[A-Za-z0-9-]{1,30}(\s+[A-Za-z0-9-]{1,30}){0,4}\s*$)
 (?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\s*)+$
/

The lookahead enforces the tag lengths as well as the overall structure of up to five tags separated by whitespace. Then the main body only has to enforce the structure of the individual tags.
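
JavaScript has no free-spacing mode, so the multi-line form above has to be collapsed before use. One way to keep it readable is to assemble it from strings, as in this sketch (`tag`, `shape` and `strictTagsPattern` are names made up for the example):

```javascript
// Structure of one tag: alphanumeric runs joined by single hyphens.
const tag = "[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*";
// Overall shape checked by the lookahead: 1-5 tags of 1-30 characters each.
const shape = "[A-Za-z0-9-]{1,30}(?:\\s+[A-Za-z0-9-]{1,30}){0,4}\\s*$";

const strictTagsPattern = new RegExp("^\\s*(?=" + shape + ")(?:" + tag + "\\s*)+$");

console.log(strictTagsPattern.test("tag1 tag-2"));    // true
console.log(strictTagsPattern.test("  tag1  "));      // true
console.log(strictTagsPattern.test("-tag"));          // false (leading hyphen)
console.log(strictTagsPattern.test("tag-"));          // false (trailing hyphen)
console.log(strictTagsPattern.test("a--b"));          // false (consecutive hyphens)
console.log(strictTagsPattern.test("a".repeat(31)));  // false (tag too long)
console.log(strictTagsPattern.test("a b c d e f"));   // false (six tags)
```

Because the main body allows `\s*` (zero whitespace) between tags, it may backtrack heavily on near-miss input such as a long tag that ends in a hyphen; requiring `(?:\s+|$)` after each tag is one possible way to limit that.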

I could have shortened the regex a bit by leaving the a-z out of the character classes and adding the i modifier. I didn't do that because you talked about using the regex in an ASP.NET validator, and as far as I know, they don't let you use regex modifiers. And, since JavaScript doesn't support the (?i) inline modifier syntax, case-insensitive validator regexes aren't possible. If I'm mistaken about that, I hope someone will correct me.
