Regex not to match a set of Strings

2023-03-01 09:29 Q&A author:

How do I construct a regex that matches only strings which do not contain any of a given set of strings?

For this example, I want to validate the Address Line 1 text box so that it won't contain any secondary address parts such as 'Apt', 'Bldg', 'Ste', 'Unit', etc.

A regex can be used to verify that a string does not contain a set of words. Here is a tested Java code snippet with a commented regex which does precisely this:

if (s.matches("(?sxi)" +
    "# Match string containing no 'bad' words.\n" +
    "^                # Anchor to start of string.\n" +
    "(?:              # Step through string one char at a time.\n" +
    "  (?!            # Negative lookahead to exclude words.\n" +
    "    \\b          # All bad words begin on a word boundary\n" +
    "    (?:          # List of 'bad' words NOT to be matched.\n" +
    "      Apt        # Cannot be 'Apt',\n" +
    "    | Bldg       # or 'Bldg',\n" +
    "    | Ste        # or 'Ste',\n" +
    "    | Unit       # or 'Unit'.\n" +
    "    )            # End list of words NOT to be matched.\n" +
    "    \\b          # All bad words end on a word boundary\n" +
    "  )              # Not at the beginning of bad word.\n" +
    "  .              # Ok. Safe to match this character.\n" +
    ")*               # Zero or more 'not-start-of-bad-word' chars.\n" +
    "$                # Anchor to end of string.")
    ) {
    // String has no bad words.
    System.out.print("OK: String has no bad words.\n");
} else {
    // String has bad words.
    System.out.print("ERR: String has bad words.\n");
}

This assumes that the words must be "whole" words and that the "bad" words should be recognized regardless of case. Note also (as others have correctly stated) that this is not as efficient as simply checking for the presence of bad words and then taking the logical NOT.

Rather than trying to construct a regex to match strings that don't contain these substrings, why not construct a regex to match strings that do contain one or more of them? Then if that regex returns true, you know that you have an invalid string.
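A minimal sketch of that approach in Java, assuming the same four words as in the question and the same whole-word, case-insensitive rules. The class and method names (AddressLineCheck, isValid) are made up for illustration:

```java
import java.util.regex.Pattern;

class AddressLineCheck {
    // Matches any whole "bad" word, ignoring case.
    // The word list comes from the question; extend it as needed.
    private static final Pattern BAD_WORDS =
            Pattern.compile("\\b(?:Apt|Bldg|Ste|Unit)\\b", Pattern.CASE_INSENSITIVE);

    // Returns true when the input contains none of the bad words.
    static boolean isValid(String s) {
        // find() looks for a bad word anywhere in the string;
        // the logical NOT turns "contains a bad word" into "is valid".
        return !BAD_WORDS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123 Main Street"));       // true
        System.out.println(isValid("123 Main Street Apt 4")); // false
        System.out.println(isValid("9 Steward Road"));        // true: "Ste" is not a whole word here
    }
}
```

Because the pattern only has to find one bad word, there is no per-character lookahead, and the regex stays short and easy to extend.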

A more theoretical answer:

Deterministic Finite Automata (DFAs) and regular expressions are equivalent in expressive power: for every regular language, you can construct a DFA that accepts exactly the strings contained in that language, and for every regular language, you can construct a regular expression that matches only the strings in that language. Thus, for any regular expression, you can construct a DFA that accepts exactly the same strings, and vice versa.

A Non-Deterministic Finite Automaton (NFA) can be turned into a DFA by the subset construction: build a DFA state for every set of NFA states. (This is at most 2^|Q| states, which is a finite number.)

With that knowledge, we can complement a DFA A and produce a DFA A' which accepts every string that A rejects, and rejects every string that A accepts.

This can be done by first making the DFA complete (adding a non-accepting "dead" state as the target of every missing transition) and then turning every accepting state into a non-accepting state and vice versa. The start state and the transitions stay as they are. Note that swapping the start and end states and reversing the transitions would not do this; that gives an automaton for the reversed language. The swap only works on a deterministic automaton, so if you started from an NFA, turn it into a DFA first, as we know we can do.
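To make the complement step concrete, here is a small Java sketch. It assumes a complete DFA over the two-letter alphabet {a, b} stored as a transition table; the class name DfaComplement and the example automaton (strings containing "ab") are invented for illustration. Complementing then amounts to flipping which states are accepting:

```java
import java.util.HashSet;
import java.util.Set;

class DfaComplement {
    // next[state][symbol], where symbol 0 is 'a' and symbol 1 is 'b'.
    // The table must be complete: every state has a target for both symbols.
    final int[][] next;
    final Set<Integer> accepting;
    final int start;

    DfaComplement(int[][] next, Set<Integer> accepting, int start) {
        this.next = next;
        this.accepting = accepting;
        this.start = start;
    }

    // Runs the DFA on the input; only the characters 'a' and 'b' are supported.
    boolean accepts(String input) {
        int state = start;
        for (char c : input.toCharArray()) {
            state = next[state][c - 'a'];
        }
        return accepting.contains(state);
    }

    // Same states, same transitions, same start state; accepting set flipped.
    DfaComplement complement() {
        Set<Integer> flipped = new HashSet<>();
        for (int q = 0; q < next.length; q++) {
            if (!accepting.contains(q)) {
                flipped.add(q);
            }
        }
        return new DfaComplement(next, flipped, start);
    }

    // Example DFA: accepts strings over {a, b} that contain "ab".
    static DfaComplement containsAb() {
        int[][] table = {
            {1, 0}, // state 0: no progress yet
            {1, 2}, // state 1: last character was 'a'
            {2, 2}  // state 2: "ab" has been seen, stay here
        };
        return new DfaComplement(table, Set.of(2), 0);
    }
}
```

With this, containsAb().accepts("bab") is true, while containsAb().complement().accepts("bab") is false and the complement accepts "bba" and the empty string.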

The only remaining step is to turn our new DFA into a regular expression. The standard algorithm for this is state elimination: remove the states one at a time, relabeling the remaining transitions with regular expressions that use | (or) for alternative paths, concatenation for states in series, and * (Kleene closure) for loops. Be aware that the resulting expression can be far larger than the one you started with.

Negate a match against the strings that you don't want, e.g. with a negated-match operator such as !~:

"ten" !~ /one|two|three/

This gives you:

print "one" !~ /one|two|three/ --> false
print "two" !~ /one|two|three/ --> false
print "ten" !~ /one|two|three/ --> true
