Regex not to match a set of Strings

How can I construct a regex that matches only strings that do not contain any of a given set of words?

For this example, I want to validate the Address Line 1 text box so that it won't contain any secondary address parts such as 'Apt', 'Bldg', 'Ste', 'Unit', etc.


A regex can be used to verify that a string does not contain a set of words. Here is a tested Java code snippet with a commented regex which does precisely this:

if (s.matches("(?sxi)" +
    "# Match string containing no 'bad' words.\n" +
    "^                # Anchor to start of string.\n" +
    "(?:              # Step through string one char at a time.\n" +
    "  (?!            # Negative lookahead to exclude words.\n" +
    "    \\b          # All bad words begin on a word boundary\n" +
    "    (?:          # List of 'bad' words NOT to be matched.\n" +
    "      Apt        # Cannot be 'Apt',\n" +
    "    | Bldg       # or 'Bldg',\n" +
    "    | Ste        # or 'Ste',\n" +
    "    | Unit       # or 'Unit'.\n" +
    "    )            # End list of words NOT to be matched.\n" +
    "    \\b          # All bad words end on a word boundary\n" +
    "  )              # Not at the beginning of bad word.\n" +
    "  .              # Ok. Safe to match this character.\n" +
    ")*               # Zero or more 'not-start-of-bad-word' chars.\n" +
    "$                # Anchor to end of string.")
    ) {
    // String has no bad words.
    System.out.print("OK: String has no bad words.\n");
} else {
    // String has bad words.
    System.out.print("ERR: String has bad words.\n");
} 

This assumes that the "bad" words must be "whole" words and that they should be recognized regardless of case. Note also (as others have correctly stated) that this is not as efficient as simply checking for the presence of bad words and then taking the logical NOT.


Rather than trying to construct a regex to match strings that don't contain these substrings, why not construct a regex to match strings that do contain one or more of them? Then if that regex returns true, you know that you have an invalid string.
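A minimal Java sketch of that approach, using the word list from the question. The class and method names (AddressLineCheck, isValidAddressLine1) are made up for illustration, and whole-word, case-insensitive matching is an assumption carried over from the answer above:

```java
import java.util.regex.Pattern;

public class AddressLineCheck {
    // Matches any of the secondary-address words as a whole word, ignoring case.
    private static final Pattern SECONDARY_PART =
        Pattern.compile("\\b(?:Apt|Bldg|Ste|Unit)\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true when the string contains none of the listed words.
    public static boolean isValidAddressLine1(String s) {
        // find() looks for the pattern anywhere in the string;
        // the logical NOT turns "contains a bad word" into "is valid".
        return !SECONDARY_PART.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isValidAddressLine1("123 Main Street"));       // no listed word
        System.out.println(isValidAddressLine1("123 Main Street Apt 4")); // contains "Apt"
        System.out.println(isValidAddressLine1("12 Steel Road"));         // "Ste" is only a prefix here
    }
}
```

Because of the word boundaries, "Steel" or "United" would not be rejected; drop the \\b anchors if substrings should count as well.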


A more theoretical answer:

Deterministic Finite Automata (DFAs) and regular expressions are equivalent in expressive power: for every regular language, you can construct a DFA that accepts exactly the strings contained in that language, and you can construct a regular expression that matches only the strings in that language. Thus, for any regular expression, you can construct a DFA that accepts exactly the same strings, and vice versa.

A Non-Deterministic Finite Automaton (NFA) can be turned into a Deterministic Finite Automaton (DFA) by constructing a DFA state for every subset of the NFA's states. (This gives at most 2^|Q| states, where Q is the NFA's state set, which is still a finite number.)

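With that knowledge, we can complement a DFA A and produce a DFA A' which accepts every string that A rejects, and rejects every string that A accepts.

This can be done by first making sure the DFA is complete (every state has a transition for every input symbol, adding a non-accepting "dead" state if necessary), and then turning every accepting state into a non-accepting state and every non-accepting state into an accepting one. If we start from an NFA (or epsilon-NFA, if you want to nitpick), we first turn it into a DFA as we know we can do, because flipping the accepting states of an NFA does not in general give the complement. (Swapping the start and end states and reversing the transitions is a different operation: it yields an automaton for the reversed strings, not for the complement.)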
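As a small illustration of complementing a complete DFA, here is a Java sketch that flips the accepting states. The Dfa class, its fields, and the example automaton (strings over {a, b} that contain "ab") are all assumptions made up for this example, not part of the original answer:

```java
public class Dfa {
    // Complete DFA over the alphabet {a, b}: transitions[state][symbol index],
    // where 'a' has index 0 and 'b' has index 1.
    private final int[][] transitions;
    private final boolean[] accepting;
    private final int start;

    public Dfa(int[][] transitions, boolean[] accepting, int start) {
        this.transitions = transitions;
        this.accepting = accepting;
        this.start = start;
    }

    // Runs the automaton; the input is assumed to contain only 'a' and 'b'.
    public boolean accepts(String input) {
        int state = start;
        for (char c : input.toCharArray()) {
            state = transitions[state][c - 'a'];
        }
        return accepting[state];
    }

    // Complement: same states and transitions, accepting flags flipped.
    // This is only correct because the DFA is deterministic and complete.
    public Dfa complement() {
        boolean[] flipped = new boolean[accepting.length];
        for (int i = 0; i < accepting.length; i++) {
            flipped[i] = !accepting[i];
        }
        return new Dfa(transitions, flipped, start);
    }

    // Hypothetical example automaton: accepts strings that contain "ab".
    public static Dfa containsAb() {
        int[][] t = {
            {1, 0}, // state 0: nothing useful seen yet
            {1, 2}, // state 1: last symbol was 'a'
            {2, 2}  // state 2: "ab" already seen, stay here
        };
        return new Dfa(t, new boolean[] {false, false, true}, 0);
    }

    public static void main(String[] args) {
        Dfa a = containsAb();
        Dfa notA = a.complement();
        System.out.println(a.accepts("bab") + " " + notA.accepts("bab"));
        System.out.println(a.accepts("bba") + " " + notA.accepts("bba"));
    }
}
```

For every input, exactly one of the two automata accepts, which is the property the argument relies on.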

The only remaining step is to turn our new DFA into a regular expression. The algorithm for this is conceptually simple: for every path from the start state to an accepting state, we include that path in the regular expression by using | (or) for every branch, concatenation for states in sequence, and * (Kleene closure) for every loop. Be aware that the resulting expression can be far larger than the DFA it came from.


Match against the strings that you don't want and negate the result, e.g.

"ten" !~ /one|two|three/

This gives you:

print "one" !~ /one|two|three/ --> false
print "two" !~ /one|two|three/ --> false
print "ten" !~ /one|two|three/ --> true