Regular Expression for finding phone numbers [duplicate]

This question already has answers here: Closed 12 years ago.

Possible Duplicates:

A comprehensive regex for phone number validation

grep with regex for phone number

Hello Everyone,

I am new to Stack Overflow and I have a quick question. Let's assume we are given a large number of HTML files (large as in theoretically infinite). How can I use regular expressions to extract the list of phone numbers from all those files?

An explanation or expression would be really appreciated. The phone numbers can be in any of the following formats:

  • (123) 456 7899
  • (123).456.7899
  • (123)-456-7899
  • 123-456-7899
  • 123 456 7899
  • 1234567899

Thanks a lot for all your help and have a good one!


/^[-.() ]*([0-9]{3})[-.() ]*([0-9]{3})[-.() ]*([0-9]{4})$/

Should accomplish what you are trying to do.

The first part, ^, means "start of the line"; together with the $ at the end, it forces the expression to account for the whole string.

The [-.() ]* parts mean "any period, hyphen, parenthesis, or space, appearing 0 or more times". The hyphen is listed first so that it is taken literally instead of defining a range.

The ([0-9]{3}) clusters each match a group of 3 digits (the last one is set to match 4).
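
As a rough illustration, here is how the anchored pattern might be used in Python (my choice of language, since the question does not name one). The hyphen is placed first in each character class so that it is read as a literal rather than as a range. The name parse_phone is made up for this sketch.

```python
import re

# Anchored pattern: optional punctuation, then 3 + 3 + 4 digits.
# The hyphen comes first in each class so it is a literal, not a range.
PHONE = re.compile(r"^[-.() ]*([0-9]{3})[-.() ]*([0-9]{3})[-.() ]*([0-9]{4})$")


def parse_phone(text):
    """Return (area, prefix, line) if the whole string is a phone number, else None."""
    m = PHONE.match(text)
    return m.groups() if m else None


samples = ["(123) 456 7899", "(123).456.7899", "123-456-7899", "1234567899"]
for s in samples:
    print(s, "->", parse_phone(s))

# Because of ^ and $, a number embedded in other text is not matched.
print(parse_phone("call 123-456-7899 now"))
```

Since the anchors reject anything around the number, this form suits validating a single field more than scanning a whole HTML page.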

Hope that helps!


Without knowing what language you're using I am unsure whether or not the syntax is correct.

This should match all of your groups with very few false positives:

/\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})/

The groups you will be interested in after the match are groups 1, 3, and 4. Group 2 exists only to make sure the first and second separator characters (space, ., or -) are the same.
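
A small Python sketch of how the groups and the \2 back-reference behave, assuming Python's re module (the sample HTML string is invented for the example):

```python
import re

# Group 2 captures the first separator; \2 requires the second to be identical.
PHONE = re.compile(r"\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})")

html = "<p>Office: (123) 456 7899, fax 123.456.7899, bad 123-456.7899</p>"

# Groups 1, 3 and 4 hold the digits; group 2 is only the separator.
numbers = [m.group(1) + m.group(3) + m.group(4) for m in PHONE.finditer(html)]
print(numbers)
```

The third number mixes - and . as separators, so the back-reference rejects it and only the first two are collected.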

For example, a sed command to strip the separator characters and leave phone numbers in the form 1234567899:

sed "s/(\{0,1\}\([0-9]\{3\}\))\{0,1\}\([ .-]\{0,1\}\)\([0-9]\{3\}\)\2\([0-9]\{4\}\)/\1\3\4/"
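
If sed is not available, roughly the same substitution can be sketched in Python with re.sub. The function name normalize is hypothetical. One difference to be aware of: the sed command above has no g flag, so it replaces only the first match on each line, while re.sub replaces every match.

```python
import re

PHONE = re.compile(r"\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})")


def normalize(line):
    """Replace each matched phone number with its ten bare digits."""
    return PHONE.sub(r"\1\3\4", line)


print(normalize("Call (123)-456-7899 or 123 456 7899."))
```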

Here are the false positives of my expression:

  • (123)4567899
  • (1234567899
  • (123 456 7899
  • (123.456.7899
  • (123-456-7899
  • 123)4567899
  • 123) 456 7899
  • 123).456.7899
  • 123)-456-7899

Breaking the expression into two parts, one that matches with parentheses and one that matches without, eliminates all of these false positives except the first one:

/\(([0-9]{3})\)([ .-]?)([0-9]{3})\2([0-9]{4})|([0-9]{3})([ .-]?)([0-9]{3})\6([0-9]{4})/

Groups 1, 3, and 4 or 5, 7, and 8 would matter in this case.
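
A sketch of picking the right set of groups in Python. In the second alternative the separator is group 6, so that is the group the sketch back-references. The helper name digits is made up for illustration.

```python
import re

# Alternative 1: area code in parentheses (groups 1-4).
# Alternative 2: no parentheses (groups 5-8); its separator is group 6.
PHONE = re.compile(
    r"\(([0-9]{3})\)([ .-]?)([0-9]{3})\2([0-9]{4})"
    r"|([0-9]{3})([ .-]?)([0-9]{3})\6([0-9]{4})"
)


def digits(match):
    """Join whichever set of digit groups took part in the match."""
    if match.group(1) is not None:
        return match.group(1) + match.group(3) + match.group(4)
    return match.group(5) + match.group(7) + match.group(8)


text = "(123) 456 7899 / 123-456-7899 / 123) 456 7899"
results = [(m.group(0), digits(m)) for m in PHONE.finditer(text)]
print(results)
```

The last sample has a closing parenthesis without an opening one, and neither alternative accepts it.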


This will help you catch the ones with an area code in parentheses

([0-9]\{3\})[ .-][0-9]\{3\}[ .-][0-9]\{4\}

The others are:

[0-9]\{3\}[ -][0-9]\{3\}[ -][0-9]\{4\}
[0-9]\{10\}

I separated the first one and the second one because putting them together without backtracking could get you into accepting (123 456 7890 or 123) 456 7890

Note also that on my terminal using grep, I had to escape the { } for the repetition. You may not have to, or you may have to escape other characters depending on where you intend to use this.
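
To illustrate the escaping point, here are the same three patterns rewritten for Python's re syntax, where it is the other way around: literal parentheses need a backslash and the repetition braces do not. This is only a sketch, and find_numbers is a made-up name.

```python
import re

# The three grep patterns, rewritten for Python's syntax:
# literal parentheses are escaped, repetition braces are not.
PATTERNS = [
    re.compile(r"\([0-9]{3}\)[ .-][0-9]{3}[ .-][0-9]{4}"),  # (123) 456 7899
    re.compile(r"[0-9]{3}[ -][0-9]{3}[ -][0-9]{4}"),        # 123-456-7899
    re.compile(r"[0-9]{10}"),                               # 1234567899
]


def find_numbers(text):
    """Collect the matches of each pattern in turn."""
    found = []
    for pattern in PATTERNS:
        found.extend(pattern.findall(text))
    return found


print(find_numbers("a (123) 456 7899 b 123-456-7899 c 1234567899"))
```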


^(\(?\d{3}\)?)([ .-])(\d{3})([ .-])(\d{4})$

This should match all except the last pattern. For the last one you could use a separate pattern, ^\d{10}$

There is one flaw: it will also match (123 456 7899

  1. ^(\(?\d{3}\)?): the first character (^) matches the beginning of the text. \(? and \)? each make a parenthesis optional, which is the source of the flaw: to do this properly you would have to check whether there was an opening parenthesis and, if so, require the closing one. I don't know if that is possible using regex only. \d{3} matches three digits.

  2. ([ .-]) will match any of those, but only one and only once.

  3. (\d{3}) will match three numbers

  4. Same as 2

  5. (\d{4})$ four numbers followed by the end of the text ($)

Since you want to extract from an HTML page, you would have to drop ^ and $ so the pattern can match anywhere in the text, and set the global flag; in JavaScript, /exp/g
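
For the "many HTML files" part of the question, a minimal Python sketch, assuming the files sit under one directory and end in .html. The function name extract_phone_numbers and the sample files are invented for the example. finditer plays the role of the global flag here.

```python
import re
import tempfile
from pathlib import Path

# The pattern from this answer with ^ and $ removed, plus the separate
# ten-digit alternative. finditer returns every match, like /g in JavaScript.
PHONE = re.compile(r"\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}|\d{10}")


def extract_phone_numbers(directory):
    """Yield (file name, matched text) for every .html file under directory."""
    for path in sorted(Path(directory).rglob("*.html")):
        # Files are read one at a time, so only one file is in memory at once.
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in PHONE.finditer(text):
            yield path.name, match.group(0)


# Small demonstration on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.html").write_text("<p>Call (123) 456 7899</p>", encoding="utf-8")
    Path(tmp, "b.html").write_text(
        "<td>1234567899</td><td>123-456-7899</td>", encoding="utf-8"
    )
    results = list(extract_phone_numbers(tmp))

print(results)
```

Because the pattern is no longer anchored, it will also pick up ten-digit runs inside longer numbers, so some filtering of the results may still be needed.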

You can test Regex here
