How can I parse multiple (unknown) date formats in python?

2023-03-27 16:29 问答作者：

I have a bunch of excel documents I am extracting dates from. I 开发者_C百科am trying to convert these to a standard format so I can put them in a database. Is there a function I can throw these strings at and get a standard format back? Here is a small sample of my data:

The good thing is I know it is always Month/Day

I'd like to get them all into MM/DD/YYYY format. Is there a way I can do this without trying each pattern against the string?

The third-party module dateutil has a function parse that operates similarly to PHP's strtotime: you don't need to specify a particular date format, it just tries a bunch of its own.

>>> from dateutil.parser import parse
>>> parse("10/02/09", fuzzy=True)
datetime.datetime(2009, 10, 2, 0, 0)  # default to be in American date format

It also allows you to specify different assumptions:

dayfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or month (False). If yearfirst is set to True, this distinguishes between YDM and YMD. If set to None, this value is retrieved from the current parserinfo object (which itself defaults to False).

yearfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the year. If True, the first number is taken to be the year, otherwise the last number is taken to be the year. If this is set to None, the value is retrieved from the current parserinfo object (which itself defaults to False).

import re

ss = '''10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010'''


regx = re.compile('[-/]')
for xd in ss.splitlines():
    m,d,y = regx.split(xd)
    print xd,'   ','/'.join((m.zfill(2),d.zfill(2),'20'+y.zfill(2) if len(y)==2 else y))

result

10/02/09     10/02/2009
07/22/09     07/22/2009
09-08-2008     09/08/2008
9/9/2008     09/09/2008
11/4/2010     11/04/2010
03-07-2009     03/07/2009
09/01/2010     09/01/2010

Edit 1

And Edit 2 : taking account of the information on '{0:0>2}'.format(day) from JBernardo, I added a 4th solution, that appears to be the fastest

import re
from time import clock
iterat = 100

from datetime import datetime
dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']

reobj = re.compile(
r"""\s*  # optional whitespace
(\d+)    # Month
[-/]     # separator
(\d+)    # Day
[-/]     # separator
(?:20)?  # century (optional)
(\d+)    # years (YY)
\s*      # optional whitespace""",
re.VERBOSE)

te = clock()
for i in xrange(iterat):
    ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
    fdates1 = [datetime.strftime(datetime.strptime(date,"%m/%d/%Y"), "%m/%d/%Y")
               for date in ndates]
print "Tim's method   ",clock()-te,'seconds'



regx = re.compile('[-/]')


te = clock()
for i in xrange(iterat):
    ndates = (reobj.match(date).groups() for date in dates)
    fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
print "mixing solution",clock()-te,'seconds'


te = clock()
for i in xrange(iterat):
    ndates = (regx.split(date.strip()) for date in dates)
    fdates3 = ['/'.join((m.zfill(2),d.zfill(2),('20'+y.zfill(2) if len(y)==2 else y)))
              for m,d,y in ndates]
print "eyquem's method",clock()-te,'seconds'



te = clock()
for i in xrange(iterat):
    fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
print "Tim + format   ",clock()-te,'seconds'


print fdates1==fdates2==fdates3==fdates4

result

number of iteration's turns : 100
Tim's method    0.295053700959 seconds
mixing solution 0.0459111423379 seconds
eyquem's method 0.0192239516475 seconds
Tim + format    0.0153756971906 seconds 
True

The mixing solution is interesting because it combines the speed of my solution and the ability of the regex of Tim Pietzcker to detect dates in a string.

That's still more true for the solution combining Tim's one and the formating with {:0>2}. I cant' combine {:0>2} with mine because regx.split(date.strip()) produces year with 2 OR 4 digits

If you don't want to install a third-party module like dateutil:

import re
from datetime import datetime
dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010', ' 03-07-2009', '09/01/2010']
reobj = re.compile(
    r"""\s*  # optional whitespace
    (\d+)    # Month
    [-/]     # separator
    (\d+)    # Day
    [-/]     # separator
    (?:20)?  # century (optional)
    (\d+)    # years (YY)
    \s*      # optional whitespace""", 
    re.VERBOSE)
ndates = [reobj.sub(r"\1/\2/20\3", date) for date in dates]
fdates = [datetime.strftime(datetime.strptime(date,"%m/%d/%Y"), "%m/%d/%Y")
          for date in ndates]

Result:

['10/02/2009', '07/22/2009', '09/08/2008', '09/09/2008', '11/04/2010', '03/07/2009', '09/01/2010']

You can use a regex like r'(\d+)\D(\d+)\D(\d+)' to get the month, day and year in a tuple with the re.findall function.

then just concatenate the 2-digit years with the number 20 or 19 and use the separator you want to join then back:

'/'.join(the_list)

As pointed by Tim:

To normalize days, just do '{0:0>2}'.format(day) and the same to months.

继续阅读：date parsing python

How can I parse multiple (unknown) date formats in python?

Edit 1

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Edit 1

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？