How can I improve this regular expression?

2022-12-24 09:19 Q&A author:

I want a regular expression to match valid input into a Tags input field with the following properties:

1-5 tags
Each tag is 1-30 characters long
Valid tag characters are [a-zA-Z0-9-]
Leading and trailing whitespace is allowed, and tags can be separated by any amount of whitespace

For example:

Valid: tag1 tag2 tag3-with-dashes tag4-with-more-dashes tAaG5-with-MIXED-case

Here's what I have so far--it seems to work but I'm interested in how it could be simplified or if it has any major flaws:

\s*[a-zA-Z0-9-]{1,30}(\s+[a-zA-Z0-9-]{1,30}){0,4}\s*

// that is: 
\s*                          // match all beginning whitespace
[a-zA-Z0-9-]{1,30}           // match the first tag
(\s+[a-zA-Z0-9-]{1,30}){0,4} // match all subsequent tags
\s*                          // match all ending whitespace

Preprocessing the input to make the whitespace issue easier isn't an option (e.g. trimming or adding a space).

If it matters, this will be used in JavaScript. Any suggestions would be appreciated, thanks!
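
A minimal sketch of how the pattern in the question might be exercised from JavaScript. The ^ and $ anchors are an addition made here: RegExp.prototype.test succeeds on any matching substring, so without them an over-long input would still pass. The names TAGS_RE and isValidTags are hypothetical.

```javascript
// The question's pattern, wrapped in ^...$ so the whole input must match.
// (The anchors are an assumption; some validators add them implicitly.)
const TAGS_RE = /^\s*[a-zA-Z0-9-]{1,30}(\s+[a-zA-Z0-9-]{1,30}){0,4}\s*$/;

// Hypothetical helper name, used only for illustration.
function isValidTags(input) {
  return TAGS_RE.test(input);
}

console.log(isValidTags("tag1 tag2 tag3-with-dashes")); // true
console.log(isValidTags("  one   two  "));              // true
console.log(isValidTags("a b c d e f"));                // false (six tags)
console.log(isValidTags("x".repeat(31)));               // false (tag too long)
console.log(isValidTags(""));                           // false (no tags)
```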

You can simplify it a bit like this:

^(?:(?:^|\s+)[a-zA-Z0-9-]{1,30}){1,5}\s*$

The (?: ) syntax is a non-capturing group, which I believe should improve performance when you don't need the captured groups themselves.

Then the trick is this statement:

(?:^|\s+)

Thanks to the caret, this will match the beginning of the line, or one or more characters of whitespace.
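
To see the alternation at work, here is a small sketch that runs this answer's pattern in JavaScript. The variable name firstAnswerRe is made up for illustration.

```javascript
// The first tag is preceded by ^ (start of input) or by leading whitespace;
// every later tag must be preceded by at least one whitespace character.
const firstAnswerRe = /^(?:(?:^|\s+)[a-zA-Z0-9-]{1,30}){1,5}\s*$/;

console.log(firstAnswerRe.test("tag1"));          // true: the ^ branch
console.log(firstAnswerRe.test("  tag1 tag2 "));  // true: the \s+ branch, twice
console.log(firstAnswerRe.test("a b c d e f"));   // false: six tags
console.log(firstAnswerRe.test("x".repeat(31)));  // false: no separator after 30 chars
```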

UPDATE: This works perfectly in my testing and there's certainly less redundant code. However, I just used the benchmarking in Regex Hero to find that your original regex is actually faster. That's probably because mine is causing more backtracking to occur.

UPDATE #2: I found another way that accomplishes the same thing, I think:

^(?:\s*[a-zA-Z0-9-]{1,30}){1,5}\s*$

I realized that I was trying too hard. \s* matches 0 or more whitespace characters, which means that it'll work for a single tag. But... it'll work for 2-5 tags as well because the space is not in your character class [ ]. And indeed it fails with 6 tags as it should. That means this is a much more forward-looking regex with less backtracking, better performance, and less redundancy.

UPDATE #3:

I see the error in my ways. This should work better.

^(?:\s*[a-zA-Z0-9-]{1,30}\b){1,5}\s*$

Putting the \b just before the last ) will assert a word boundary. That allows the 1-30 character length rule to work properly again.
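
The difference between the last two updates can be checked with a short sketch like the one below. The variable names are hypothetical. It also probes one caveat that seems worth knowing: a hyphen is not a word character, so \b behaves differently around it.

```javascript
const update2 = /^(?:\s*[a-zA-Z0-9-]{1,30}){1,5}\s*$/;
const update3 = /^(?:\s*[a-zA-Z0-9-]{1,30}\b){1,5}\s*$/;

const longTag = "a".repeat(31);
console.log(update2.test(longTag)); // true: read as a 30-char tag plus a 1-char tag
console.log(update3.test(longTag)); // false: no word boundary inside the run

// Caveat: "-" is not a word character, so \b treats hyphens specially.
console.log(update3.test("tag-"));  // false, although "-" is an allowed character
const hyphenated = "a".repeat(20) + "-" + "b".repeat(20); // one 41-character tag
console.log(update3.test(hyphenated)); // true: split at the boundary after "-"
```

So the \b version appears to restore the length rule only for tags without hyphens. A tag longer than 30 characters that contains a hyphen can still slip through, and a tag ending in a hyphen is rejected even though the original pattern allows it.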

Performance-wise, you can optimize it this way:

^(?:\s+[a-zA-Z0-9-]{1,30}){1,5}\s*$

Then prepend a whitespace character to the input before testing the regexp.

^
(?:                    // don't keep track of groups
  \s+                  // required whitespace: leading, or between tags
  [a-zA-Z0-9-]{1,30}   // unchanged
){1,5}                 // 1 to 5 tags
\s*$

Your RE looks like it's doing pretty much exactly what you were asking for. I might recommend not using an RE at all in this case, though: just split the input on whitespace into an array, then validate each value in the array on its own.

REs are cool, but sometimes, they aren't the best way to get the job done :)
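
A minimal sketch of that split-then-validate idea in JavaScript. The function name validateTags is hypothetical, and the per-tag check still uses a small regex for the character set and length.

```javascript
// Hypothetical helper: validate by splitting instead of one large regex.
function validateTags(input) {
  // Split on runs of whitespace, then drop the empty strings that
  // leading or trailing whitespace produces.
  const tags = input.split(/\s+/).filter((t) => t.length > 0);

  // 1-5 tags in total.
  if (tags.length < 1 || tags.length > 5) return false;

  // Each tag: 1-30 characters from [a-zA-Z0-9-].
  return tags.every((t) => /^[a-zA-Z0-9-]{1,30}$/.test(t));
}

console.log(validateTags("  tag1 tag2  "));  // true
console.log(validateTags("a b c d e f"));    // false (six tags)
console.log(validateTags("   "));            // false (no tags)
```

Each rule now lives on its own line, which may make it easier to report which rule failed.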

\w could replace a-zA-Z0-9, but it also matches _, if that's OK.
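
A quick sketch of the difference, using two hypothetical single-tag patterns:

```javascript
// \w is [A-Za-z0-9_] in JavaScript, so the shorter class also admits "_".
const withWordClass = /^[\w-]{1,30}$/;
const explicitClass = /^[a-zA-Z0-9-]{1,30}$/;

console.log(withWordClass.test("my_tag")); // true
console.log(explicitClass.test("my_tag")); // false
console.log(withWordClass.test("my-tag")); // true
console.log(explicitClass.test("my-tag")); // true
```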

You may also be able to break it down a little more like this:

(\s*[a-zA-Z0-9-]{1,30}){0,5}

if you are always guaranteed to have whitespace separating your tags.

You could shorten it to something like

([a-zA-Z0-9-]{1,30}\s*){1,5}

I always like to make my regular expressions more concise (where it doesn't affect performance).

You're not going to improve on that. Anything you do to reduce the length will also make it harder to read, and regexes don't need any help in that regard. ;)

That said, your regex needs to be more complicated anyway. As written, it fails to ensure that tag names don't start or end with a hyphen, or contain two or more consecutive hyphens. The regex for a single tag would need to be structured like this:

[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*
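
As a rough check of that structure, the sketch below anchors the single-tag pattern (the ^ and $ are added here so one tag is tested in isolation) and runs it on a few sample strings. The name singleTag is hypothetical.

```javascript
// Alphanumeric runs joined by single hyphens: no leading, trailing,
// or doubled hyphens.
const singleTag = /^[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*$/;

console.log(singleTag.test("tag"));              // true
console.log(singleTag.test("tag3-with-dashes")); // true
console.log(singleTag.test("-tag"));             // false: leading hyphen
console.log(singleTag.test("tag-"));             // false: trailing hyphen
console.log(singleTag.test("a--b"));             // false: consecutive hyphens
```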

Then the base regex to match up to five tags would be

[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:\s+[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*){0,4}

...but that doesn't enforce the maximum tag length. I think the simplest way to do that would be to put your original regex in a lookahead:

/^\s*
 (?=[A-Za-z0-9-]{1,30}(\s+[A-Za-z0-9-]{1,30}){0,4}\s*$)
 (?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\s*)+$
/

The lookahead enforces the tag lengths as well as the overall structure of up to five tags separated by whitespace. Then the main body only has to enforce the structure of the individual tags.
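
JavaScript regex literals have no free-spacing mode, so the multi-line layout above has to be collapsed before use. One possible way to keep it readable, sketched here with hypothetical names (TAG_CHARS, TAG_SHAPE, tagsRe), is to build the pattern from String.raw pieces:

```javascript
// Pieces of the pattern; String.raw keeps the backslashes as written.
const TAG_CHARS = String.raw`[A-Za-z0-9-]{1,30}`;             // length and charset
const TAG_SHAPE = String.raw`[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*`; // hyphen placement

const tagsRe = new RegExp(
  String.raw`^\s*` +
  // Lookahead: 1-5 whitespace-separated tags, each 1-30 characters.
  String.raw`(?=${TAG_CHARS}(\s+${TAG_CHARS}){0,4}\s*$)` +
  // Body: every tag is alphanumeric runs joined by single hyphens.
  String.raw`(?:${TAG_SHAPE}\s*)+$`
);

console.log(tagsRe.test("tag1 tag2 tag3-with-dashes")); // true
console.log(tagsRe.test("  tag1  "));                   // true
console.log(tagsRe.test("a--b"));                       // false
console.log(tagsRe.test("a b c d e f"));                // false
```

One thing to watch: because \s* in the body may match nothing, the outer + can divide a run of letters in many ways, so a long tag that passes the lookahead but fails the body (for example one ending in two hyphens) may backtrack heavily. Requiring (?:\s+|$) after each tag instead of \s* is one possible way to limit that.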

I could have shortened the regex a bit by leaving the a-z out of the character classes and adding the i modifier. I didn't do that because you talked about using the regex in an ASP.NET validator, and as far as I know, they don't let you use regex modifiers. And, since JavaScript doesn't support the (?i) inline modifier syntax, case-insensitive validator regexes aren't possible. If I'm mistaken about that, I hope someone will correct me.
