Regex, how to remove all non-alphanumeric except colon in a 12/24 hour timestamp?
I have a string like:
Today, 3:30pm - Group Meeting to discuss "big idea"
How do you construct a regex such that after parsing it would return:
Today 3:30pm Group Meeting to discuss big idea
I would like it to remove all non-alphanumeric characters except for those that appear in a 12 or 24 hour time stamp.
# this: D:DD, DD:DDam/pm 12/24 hr
re = r':(?=..(?<!\d:\d\d))|[^a-zA-Z0-9 ](?<!:)'
A colon must be preceded by at least one digit and followed by at least two digits: then it's a time. All other colons will be considered textual colons.
How it works
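: // match a colon
(?=.. // lookahead: peek at the next two chars without consuming them
(?<! // negative lookbehind, checked just after those two chars (if it matches, this alternative fails)
\d:\d\d // the four chars ending there are digit, colon, digit, digit: a time stamp
) // end negative lookbehind
) // end lookahead
| // or
[^a-zA-Z0-9 ] // match any char that is not a letter, a digit, or a space
(?<!:) // as long as the char just matched isn't a colon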
Then when applied to this silly text:
Today, 3:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 "big idea" on 03:33pm or 16:47 is also good
...changes it into:
Today 3:30pm  Group 123 Meeting to di4scuss3 234 big idea on 03:33pm or 16:47 is also good
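A minimal sketch of running this pattern from Python with the standard `re` module. The names `PATTERN` and `clean` are made up for illustration, and the final step that collapses repeated spaces is an addition, not part of the pattern above.

```python
import re

# Pattern from the answer above; PATTERN and clean are illustrative names.
PATTERN = re.compile(r':(?=..(?<!\d:\d\d))|[^a-zA-Z0-9 ](?<!:)')

def clean(text):
    # Delete every match, then collapse the runs of spaces left behind.
    stripped = PATTERN.sub('', text)
    return re.sub(r' +', ' ', stripped).strip()

print(clean('Today, 3:30pm - Group Meeting to discuss "big idea"'))
# prints: Today 3:30pm Group Meeting to discuss big idea
```

One thing to watch: a colon with fewer than two characters after it (for example at the very end of the string) is not removed, because the lookahead needs two characters to peek at.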
Python.
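import string
import time

punct = string.punctuation
s = 'Today, 3:30pm - Group Meeting:am to discuss "big idea" by our madam'
words = []
for item in s.split():
    try:
        # Leave the token alone if it parses as a time such as 3:30pm.
        time.strptime(item, "%H:%M%p")
    except ValueError:
        # Otherwise strip every punctuation character from it.
        item = ''.join(c for c in item if c not in punct)
    if item:
        words.append(item)
print(' '.join(words))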
Output:
$ ./python.py
Today 3:30pm Group Meetingam to discuss big idea by our madam
# change to s='Today, 15:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 "big idea" on 03:33pm or 16:47 is also good'
$ ./python.py
Today 15:30pm Group 123 Meeting to di4scuss3 234 big idea on 03:33pm or 1647 is also good
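NB: The format string requires an am/pm suffix, so a plain 24-hour time such as 16:47 is not recognized and loses its colon. The method should be improved to check for a valid time only when necessary (by imposing conditions), but I will leave it as is for now.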
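One way to impose such a condition, sketched under the assumption that a token counts as a time when it looks like D:DD or DD:DD with an optional am/pm suffix. `TIME_RE` and `strip_punct` are hypothetical names, and this swaps the `strptime` check for a regex test rather than reproducing it.

```python
import re
import string

# Assumed token shape: one or two digits, a colon, two digits, optional am/pm.
TIME_RE = re.compile(r'\d{1,2}:\d{2}(?:am|pm)?', re.IGNORECASE)

def strip_punct(text):
    words = []
    for token in text.split():
        # Peel surrounding punctuation first so "3:30pm," is still seen as a time.
        core = token.strip(string.punctuation)
        if not TIME_RE.fullmatch(core):
            # Not a time: drop every punctuation character inside the token.
            core = ''.join(c for c in core if c not in string.punctuation)
        if core:
            words.append(core)
    return ' '.join(words)

print(strip_punct('Today, 3:30pm - Group Meeting on 03:33pm or 16:47: ok'))
# prints: Today 3:30pm Group Meeting on 03:33pm or 16:47 ok
```

This only checks the shape of the token, so something like 99:99 would still be kept; a range check on the hour and minute could be added if that matters.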
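I assume you'd like to keep spaces as well. This implementation is in Python, but the pattern is a plain character class, so it should be portable to other regex flavors. Note that it keeps every colon, not only the ones inside a time stamp.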
import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
re.sub(r'[^a-zA-Z0-9: ]', '', x)
Output: 'Today 3:30pm  Group Meeting to discuss big idea'
For a slightly cleaner answer (no double spaces):
import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
tmp = re.sub(r'[^a-zA-Z0-9: ]', '', x)
re.sub(r'[ ]+', ' ', tmp)
Output: 'Today 3:30pm Group Meeting to discuss big idea'
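You can try this in JavaScript (it only protects colons followed by two digits and am/pm, so 24-hour times without a suffix lose their colon):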
var re = /(\W+(?!\d{2}[ap]m))/gi;
var input = 'Today, 3:30pm - Group Meeting to discuss "big idea"';
alert(input.replace(re, " "))
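The correct regexp to do that would be the following, which removes a colon that is not preceded by a digit, a colon that is not followed by two digits, and any character that is not a letter, digit, space, or colon: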
'(?<!\d):|:(?!\d\d)|[^a-zA-Z0-9 :]'
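A short sketch of this pattern in use, assuming Python's standard `re` module. `TIME_SAFE` is an illustrative name, the sample sentence is invented to include stray colons, and the space-collapsing step is an extra.

```python
import re

# Pattern quoted above; TIME_SAFE is an illustrative name.
TIME_SAFE = re.compile(r'(?<!\d):|:(?!\d\d)|[^a-zA-Z0-9 :]')

text = 'Today, 3:30pm - Group Meeting to discuss "big idea" at 16:47: done:'
# Remove the matches, then squeeze repeated spaces into one.
cleaned = re.sub(r' +', ' ', TIME_SAFE.sub('', text))
print(cleaned)
# prints: Today 3:30pm Group Meeting to discuss big idea at 16:47 done
```

The colons inside 3:30pm and 16:47 survive because each has a digit before it and two digits after it; the colon after 16:47 and the trailing one are dropped.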
import re
s = "Call me, my dear, at 3:30"
re.sub(r'[^\w :]', '', s)
'Call me my dear at 3:30'