Edit regex in Python script
The following Python script lets me scrape email addresses from a given file using regular expressions.
I'm also trying to add phone numbers to the regular expression. I created this regex, and it seems to work on 7- and 10-digit numbers:
(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})
Can this just be added to my existing regular expression? I figure I need to edit the place where I call re.compile, but I'm not completely sure how to do this in Python. Any help would be appreciated.
import os
import re

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
    with open(filename, 'r') as data:
        bulkemails = data.read()
else:
    print("File not found.")
    raise SystemExit

# regex = something@whatever.xxx
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
    emails += str(x) + "\n"

# function to write file
def writefile():
    f = open(newfilename, 'w')
    f.write(emails)
    f.close()
    print("File written.")
EDIT: When I run the phone regex on http://en.wikipedia.org/wiki/Telephone_number, it produces the following output:
2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
8790468
9664261
555-1212
555-9225
555-1212
869-1234
555-5555
555-1212
867-5309
867-5309
867-5309
(267) 867-5309
(212) 736-5000
243-3460
2977743
1000000
2048000
2048000
8790468
9070412
9664261
9664261
9664261
I would not advise combining the two regexes. It's possible, but it will make for code which is harder to understand and maintain down the road.
(Also, leaving the regexes separate will let you handle emails and phone numbers differently down the line, which you're likely to want to do.)
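As a rough sketch of what keeping them separate could look like: the two patterns below are the ones suggested further down in this answer, while the helper name `scrape` and the sample string are made up for illustration, so adapt them to your script.

```python
import re

# Patterns taken from later in this answer; swap in your own if you prefer.
PHONE_RE = re.compile(r'(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b')
EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b',
                      re.IGNORECASE)

def scrape(text):
    """Hypothetical helper: return e-mails and phone numbers as separate lists."""
    return {
        'emails': EMAIL_RE.findall(text),
        'phones': PHONE_RE.findall(text),
    }

found = scrape("Mail bob@example.com or call (212) 736-5000.")

# Each kind of hit can now be written out or post-processed on its own.
print("\n".join(found['emails']))
print("\n".join(found['phones']))
```

In your script you would pass `bulkemails` to the helper instead of the sample string and write each list to whatever file you like.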
First, I would simplify your phone regex:
(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b
This will match the same correct numbers as before and have fewer false hits.
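A quick way to try the simplified pattern, assuming a made-up sample string that mixes the 7-digit, parenthesized, and dot-separated formats:

```python
import re

# Simplified phone pattern from this answer. It has no capturing groups,
# so findall() returns each whole match as a string.
PHONE_RE = re.compile(r'(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b')

sample = "Call 555-1212, (212) 736-5000 or 267.867.5309."
phones = PHONE_RE.findall(sample)
print(phones)  # ['555-1212', '(212) 736-5000', '267.867.5309']
```

Bare 7-digit runs such as 2678400 still match, so you may want to filter the results further depending on your data.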
Second, your e-mail regex will miss a lot of valid e-mail addresses and produce many false positives, too (it would match aaaa@@@@aaaa, for example). While you can never match e-mail addresses with 100% reliability using a regex, the following one is better:
\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b
(Use the case-insensitive option, re.IGNORECASE, when compiling it.)
To restrict matches to a specific set of TLDs, you can use:
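Here is a small comparison of the two e-mail patterns, compiled the way described above. The variable names and the sample addresses are invented for the example.

```python
import re

# Original pattern from the question.
OLD_EMAIL_RE = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')

# Pattern from this answer; re.IGNORECASE lets [A-Z] match lowercase too.
NEW_EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b',
                          re.IGNORECASE)

# The old pattern accepts the bogus string, the new one rejects it.
print(OLD_EMAIL_RE.findall("aaaa@@@@aaaa"))  # ['aaaa@@@@aaaa']
print(NEW_EMAIL_RE.findall("aaaa@@@@aaaa"))  # []

valid = NEW_EMAIL_RE.findall(
    "Write to Bob.Smith@Example.com or admin@mail.example.org")
print(valid)  # ['Bob.Smith@Example.com', 'admin@mail.example.org']
```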
\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+(?:asia|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[A-Z]{2})\b