How can I parse multiple (unknown) date formats in python?
I have a bunch of excel documents I am extracting dates from. I 开发者_C百科am trying to convert these to a standard format so I can put them in a database. Is there a function I can throw these strings at and get a standard format back? Here is a small sample of my data:
The good thing is I know it is always Month/Day
10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010
I'd like to get them all into MM/DD/YYYY format. Is there a way I can do this without trying each pattern against the string?
The third-party module dateutil has a function parse
that operates similarly to PHP's strtotime
: you don't need to specify a particular date format, it just tries a bunch of its own.
>>> from dateutil.parser import parse
>>> parse("10/02/09", fuzzy=True)
datetime.datetime(2009, 10, 2, 0, 0) # default to be in American date format
It also allows you to specify different assumptions:
- dayfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or month (False). If yearfirst is set to True, this distinguishes between YDM and YMD. If set to None, this value is retrieved from the current parserinfo object (which itself defaults to False).
- yearfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the year. If True, the first number is taken to be the year, otherwise the last number is taken to be the year. If this is set to None, the value is retrieved from the current parserinfo object (which itself defaults to False).
import re
ss = '''10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010'''
regx = re.compile('[-/]')
for xd in ss.splitlines():
m,d,y = regx.split(xd)
print xd,' ','/'.join((m.zfill(2),d.zfill(2),'20'+y.zfill(2) if len(y)==2 else y))
result
10/02/09 10/02/2009
07/22/09 07/22/2009
09-08-2008 09/08/2008
9/9/2008 09/09/2008
11/4/2010 11/04/2010
03-07-2009 03/07/2009
09/01/2010 09/01/2010
Edit 1
And Edit 2 : taking account of the information on '{0:0>2}'.format(day)
from JBernardo, I added a 4th solution, that appears to be the fastest
import re
from time import clock
iterat = 100
from datetime import datetime
dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
' 03-07-2009', '09/01/2010']
reobj = re.compile(
r"""\s* # optional whitespace
(\d+) # Month
[-/] # separator
(\d+) # Day
[-/] # separator
(?:20)? # century (optional)
(\d+) # years (YY)
\s* # optional whitespace""",
re.VERBOSE)
te = clock()
for i in xrange(iterat):
ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
fdates1 = [datetime.strftime(datetime.strptime(date,"%m/%d/%Y"), "%m/%d/%Y")
for date in ndates]
print "Tim's method ",clock()-te,'seconds'
regx = re.compile('[-/]')
te = clock()
for i in xrange(iterat):
ndates = (reobj.match(date).groups() for date in dates)
fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
print "mixing solution",clock()-te,'seconds'
te = clock()
for i in xrange(iterat):
ndates = (regx.split(date.strip()) for date in dates)
fdates3 = ['/'.join((m.zfill(2),d.zfill(2),('20'+y.zfill(2) if len(y)==2 else y)))
for m,d,y in ndates]
print "eyquem's method",clock()-te,'seconds'
te = clock()
for i in xrange(iterat):
fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
print "Tim + format ",clock()-te,'seconds'
print fdates1==fdates2==fdates3==fdates4
result
number of iteration's turns : 100
Tim's method 0.295053700959 seconds
mixing solution 0.0459111423379 seconds
eyquem's method 0.0192239516475 seconds
Tim + format 0.0153756971906 seconds
True
The mixing solution is interesting because it combines the speed of my solution and the ability of the regex of Tim Pietzcker to detect dates in a string.
That's still more true for the solution combining Tim's one and the formating with {:0>2}
. I cant' combine {:0>2}
with mine because regx.split(date.strip())
produces year with 2 OR 4 digits
If you don't want to install a third-party module like dateutil:
import re
from datetime import datetime
dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010', ' 03-07-2009', '09/01/2010']
reobj = re.compile(
r"""\s* # optional whitespace
(\d+) # Month
[-/] # separator
(\d+) # Day
[-/] # separator
(?:20)? # century (optional)
(\d+) # years (YY)
\s* # optional whitespace""",
re.VERBOSE)
ndates = [reobj.sub(r"\1/\2/20\3", date) for date in dates]
fdates = [datetime.strftime(datetime.strptime(date,"%m/%d/%Y"), "%m/%d/%Y")
for date in ndates]
Result:
['10/02/2009', '07/22/2009', '09/08/2008', '09/09/2008', '11/04/2010', '03/07/2009', '09/01/2010']
You can use a regex like r'(\d+)\D(\d+)\D(\d+)'
to get the month, day and year in a tuple with the re.findall
function.
then just concatenate the 2-digit years with the number 20
or 19
and use the separator you want to join then back:
'/'.join(the_list)
As pointed by Tim:
To normalize days, just do '{0:0>2}'.format(day)
and the same to months.
精彩评论