Regex to match a sequence of non-letter characters with at least n digits

I'm looking for a regex to use in Java (java.util.regex.Pattern) that will match a generalised form of a telephone number. I've specified this as being:

a sequence of at least 8 non-letter characters with at least 8 characters being digits.

An example of a string literal with a positive match would be:

"Tel: (011) 1234-1234 blah blah blah"

however the following string literal would not match:

"Fot 3 ..... a 3 blah blah blah"

I've got as far as matching a sequence of at least 8 non-letter characters:

Pattern.compile("[^\\p{L}]{8,}");

How can I add an "and" / "conjunctive restriction" onto that regex specifying [\d]{8,}?

I saw this post on stackoverflow:

Regular Expressions: Is there an AND operator?

It is about "anding" regular expressions, but I can't seem to get it to work.

Any help or suggestions are very welcome.

Simon


If you are searching for phone numbers in unstructured documents, i.e. where the phone numbers could be expressed in any number of ways (with or without international prefixes, brackets around area codes, dashes, a variable number of digits, randomly split with white space, etc.), and where you might well get lots of numbers that naively look like phone numbers but aren't (e.g. on the web), forget using a regex, seriously.

You are much better off writing your own parser. Basically, it steps through your text one character at a time, and you can add any rules you like to it. This approach also makes it much easier to match against actual real phone numbers (e.g. valid international or area codes, or other rules local or national exchanges may have) and so cut down on false positives. I know this from matching UK numbers across over a million business websites: a general regex for 10 or 11 digits plus some other basic rules matches an unbelievable number of non-phone numbers.

Edit: if you're matching against web documents, you've also got the problem of phone numbers not being contiguous free text but containing HTML markup. It happens :)
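A minimal sketch of that character-by-character approach, assuming the question's definition of a candidate (a run of non-letter characters containing at least 8 digits). The names `PhoneScanner`, `candidates` and `rule` are hypothetical, and the 10-or-11-digit rule in `main` is only a placeholder for real checks such as valid area codes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class PhoneScanner {
    // Walks the text once, collects the ASCII digits of every run of
    // non-letter characters, and keeps the runs that the rule accepts.
    public static List<String> candidates(String text, int minDigits,
                                          Predicate<String> rule) {
        List<String> found = new ArrayList<>();
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            // A letter, or the end of the text, closes the current run.
            boolean endOfRun = i == text.length()
                    || Character.isLetter(text.charAt(i));
            if (endOfRun) {
                String d = digits.toString();
                if (d.length() >= minDigits && rule.test(d)) {
                    found.add(d);
                }
                digits.setLength(0);
            } else {
                char c = text.charAt(i);
                if (c >= '0' && c <= '9') {
                    digits.append(c);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Placeholder rule (an assumption): accept 10 or 11 digits only.
        Predicate<String> rule = d -> d.length() == 10 || d.length() == 11;
        System.out.println(candidates("Tel: (011) 1234-1234 blah blah blah", 8, rule));
        System.out.println(candidates("Fot 3 ..... a 3 blah blah blah", 8, rule));
    }
}
```

Because the rule is passed in as a `Predicate`, stricter checks can be swapped in without touching the scanning loop.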


^(?=(?:.*[^\\p{L}\\d]){8,})(?=(?:.*\\d){8,}) if non-letter can't be a digit

^(?=(?:.*\\P{L}){8,})(?=(?:.*\\d){8,}) if non-letter can be a digit

Edit: commented versions, for use with the ignore-whitespace modifier /x (Pattern.COMMENTS in Java)

if non-letter can't be a digit

^                          # beginning of string
     (?=                         # Start look ahead assertion (consumes no characters)
          (?:                       # Start non-capture group
              .*                        # 0 or more anychar (will backtrack to match next char)
              [^\pL\d]                  # character: not a unicode letter nor a digit
          ){8,}                     # End group, do group 8 or more times
     )                           # End of look ahead assertion
     (?=                         # Start new look ahead (from beginning of string)
          (?:                        # Start grouping
              .*                         # 0 or more anychar (backtracks to match next char)
              \d                         # a digit
          ){8,}                      # End group, do 8 or more times (can be {8,}? to minimize match)
     )                           # End of look ahead

if non-letter can be a digit

^                       # Same form as above (except where noted)
    (?=                 #  ""
         (?:            #  ""
             .*         
             \PL        # character: not a unicode letter
         ){8,}
    )
    (?=
         (?:
             .*
             \d
         ){8,}
    )
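Note that the patterns above test the string as a whole: 8 non-letters anywhere and 8 digits anywhere. If the two conditions must hold for one contiguous run, as the question describes, the same lookahead idea can be anchored at the start of the run instead of at `^`. The following is a sketch under that assumption, not part of the original answer; `RunWithDigits` and `containsRun` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunWithDigits {
    // Lookahead: from the current position, 8 digits can be reached while
    // crossing only characters that are neither letters nor digits.
    // Body: the run of non-letter characters itself, 8 or more long.
    static final Pattern RUN = Pattern.compile(
            "(?=(?:[^\\p{L}\\d]*\\d){8})\\P{L}{8,}");

    public static boolean containsRun(String s) {
        return RUN.matcher(s).find();
    }

    public static void main(String[] args) {
        Matcher m = RUN.matcher("Tel: (011) 1234-1234 blah blah blah");
        if (m.find()) {
            System.out.println("[" + m.group() + "]");
        }
        System.out.println(containsRun("Fot 3 ..... a 3 blah blah blah"));
    }
}
```

The lookahead consumes nothing, so both conditions are checked against the same stretch of text, which is the "and" the question asks for.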


I would do it without using regular expressions. The non-regex code would be simple enough.
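For illustration, the non-regex check could look roughly like this. It is a sketch that assumes the rule from the question (a run of non-letter characters holding at least 8 digits); `DigitRunCheck` and `hasDigitRun` are made-up names.

```java
public class DigitRunCheck {
    // Returns true if some run of non-letter characters contains at
    // least minDigits ASCII digits.
    public static boolean hasDigitRun(String text, int minDigits) {
        int digits = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                digits = 0;                      // a letter ends the run
            } else if (c >= '0' && c <= '9') {
                digits++;
                if (digits >= minDigits) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasDigitRun("Tel: (011) 1234-1234 blah blah blah", 8));
        System.out.println(hasDigitRun("Fot 3 ..... a 3 blah blah blah", 8));
    }
}
```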


How about something like this:

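import java.util.regex.Pattern;

class Test {
    // Matches the whole string only if it contains at least 8 digits:
    // each repetition of the group consumes one or more digits plus any
    // surrounding non-digits. Compiled once rather than on every call.
    private static final Pattern EIGHT_DIGITS =
            Pattern.compile("^(\\D*(\\d+?)\\D*){8,}$");

    public static void main(String[] args) {
        for (String tel : new String[]{
            "Tel: (011) 1234-1234 blah blah blah",
            "Tel: (011) 123-1 blah blah blah"
        }) {
            System.err.println(tel + " " + (test(tel) ?
                "matches" : "doesn't match"));
        }
    }

    public static boolean test(String tel) {
        return EIGHT_DIGITS.matcher(tel).matches();
    }
}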

will produce:

Tel: (011) 1234-1234 blah blah blah matches
Tel: (011) 123-1 blah blah blah doesn't match