Extracting date from a string in Python
How can I extract the date from a string like "monkey 2010-07-10 love banana"? Thanks!
Using python-dateutil:
In [1]: import dateutil.parser as dparser
In [18]: dparser.parse("monkey 2010-07-10 love banana",fuzzy=True)
Out[18]: datetime.datetime(2010, 7, 10, 0, 0)
Invalid dates raise a ValueError:
In [19]: dparser.parse("monkey 2010-07-32 love banana",fuzzy=True)
# ValueError: day is out of range for month
It can recognize dates in many formats:
In [20]: dparser.parse("monkey 20/01/1980 love banana",fuzzy=True)
Out[20]: datetime.datetime(1980, 1, 20, 0, 0)
Note that it makes a guess if the date is ambiguous:
In [23]: dparser.parse("monkey 10/01/1980 love banana",fuzzy=True)
Out[23]: datetime.datetime(1980, 10, 1, 0, 0)
But the way it parses ambiguous dates is customizable:
In [21]: dparser.parse("monkey 10/01/1980 love banana",fuzzy=True, dayfirst=True)
Out[21]: datetime.datetime(1980, 1, 10, 0, 0)
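To see why "10/01/1980" is ambiguous, here is a minimal standard-library sketch that parses the same text under both conventions. It does not use dateutil; the variable names are only for illustration.

```python
from datetime import datetime

raw = "10/01/1980"

# Month-first reading (the default guess shown above).
month_first = datetime.strptime(raw, "%m/%d/%Y")
# Day-first reading (the interpretation dayfirst=True selects).
day_first = datetime.strptime(raw, "%d/%m/%Y")

print(month_first.date())  # 1980-10-01
print(day_first.date())    # 1980-01-10
```

If you already know which convention your data uses, an explicit format like this avoids any guessing.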
If the date is given in a fixed form, you can simply use a regular expression to extract the date and "datetime.datetime.strptime" to parse the date:
import re
from datetime import datetime
text = 'monkey 2010-07-10 love banana'
match = re.search(r'\d{4}-\d{2}-\d{2}', text)
date = datetime.strptime(match.group(), '%Y-%m-%d').date()
Otherwise, if the date is given in an arbitrary form, you can't extract it easily.
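The regex approach above fails with an AttributeError when nothing matches, and with a ValueError when the digits do not form a real date. A possible way to handle both cases is sketched below; the helper name extract_iso_date is made up for this example, and it assumes the fixed YYYY-MM-DD form.

```python
import re
from datetime import date, datetime
from typing import Optional

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_iso_date(text: str) -> Optional[date]:
    """Return the first valid YYYY-MM-DD date found in text, or None."""
    for match in ISO_DATE.finditer(text):
        try:
            return datetime.strptime(match.group(), "%Y-%m-%d").date()
        except ValueError:
            # Looks like a date but is not one (e.g. 2010-07-32); keep looking.
            continue
    return None

print(extract_iso_date("monkey 2010-07-10 love banana"))  # 2010-07-10
print(extract_iso_date("monkey 2010-07-32 love banana"))  # None
```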
For extracting a date from a string in Python, the datefinder module is a strong choice.
You can use it in your Python project by following the steps below.
Step 1: Install datefinder Package
pip install datefinder
Step 2: Use It In Your Project
import datefinder
input_string = "monkey 2010-07-10 love banana"
# find_dates returns a generator; it is converted to a list here. See the note of caution below.
matches = list(datefinder.find_dates(input_string))
if matches:
    # Each match is a datetime.datetime object; only the first one is used here.
    date = matches[0]
    print(date)
else:
    print('No dates found')
Note: if you expect a large number of matches, converting the generator to a list is not recommended, because it adds significant performance overhead.
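If only the first match is needed, one way to avoid building the whole list is to pull a single item from the generator with next(). The sketch below uses a hypothetical stand-in generator (fake_find_dates) so that it runs without datefinder installed; in real use you would pass datefinder.find_dates(input_string) instead.

```python
from datetime import datetime
from typing import Iterable, Iterator, Optional

def first_date(matches: Iterable[datetime]) -> Optional[datetime]:
    """Return the first item of a (possibly lazy) iterable, or None if empty."""
    return next(iter(matches), None)

def fake_find_dates(text: str) -> Iterator[datetime]:
    # Hypothetical stand-in for datefinder.find_dates: it ignores the text
    # and yields fixed values, purely so this sketch is self-contained.
    yield datetime(2010, 7, 10)
    yield datetime(2011, 1, 1)

print(first_date(fake_find_dates("monkey 2010-07-10 love banana")))  # 2010-07-10 00:00:00
print(first_date([]))  # None
```

Because next() stops after the first item, the remaining matches are never computed.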
Using Pygrok, you can define abstracted extensions to the regular expression syntax.
The custom patterns can be included in your regex in the format %{PATTERN_NAME}.
You can also create a label for that pattern by separating it with a colon: %{PATTERN_NAME:matched_string}. If the pattern matches, the value will be returned as part of the resulting dictionary (e.g. result.get('matched_string')).
For example:
from pygrok import Grok
input_string = 'monkey 2010-07-10 love banana'
date_pattern = '%{YEAR:year}-%{MONTHNUM:month}-%{MONTHDAY:day}'
grok = Grok(date_pattern)
print(grok.match(input_string))
The resulting value will be a dictionary:
{'month': '07', 'day': '10', 'year': '2010'}
If the date_pattern does not match anything in the input_string, the return value will be None. By contrast, if your pattern does not have any labels, it will return an empty dictionary {}.
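The dictionary values are strings, so one more step is needed to get an actual date object. A minimal sketch, assuming the 'year', 'month' and 'day' labels from the example pattern (the helper name fields_to_date is made up here, and pygrok itself is not imported):

```python
from datetime import date

def fields_to_date(fields):
    """Build a date from string-valued 'year', 'month' and 'day' keys.

    Returns None when fields is None (the no-match case described above).
    """
    if fields is None:
        return None
    return date(int(fields['year']), int(fields['month']), int(fields['day']))

# The dictionary shown above for the example input.
result = {'month': '07', 'day': '10', 'year': '2010'}
print(fields_to_date(result))  # 2010-07-10
print(fields_to_date(None))    # None
```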
References:
- pygrok (Github)
- pygrok Preinstalled Definitions (Github)
Hands Down The Best Ways
Two good modules, available on PyPI and GitHub, make this task easier:
- DATEFINDER module, useful for finding dates in strings of text.
Installation
pip install datefinder
EXAMPLE
import datefinder
input_string = "monkey 2010-07-10 love banana"
# find_dates returns a generator; it is converted to a list here.
matches = list(datefinder.find_dates(input_string))
if matches:
    # Each match is a datetime.datetime object; only the first one is used here.
    date = matches[0]
    print(date)
else:
    print('No dates found')
SOURCE: Finny Abraham
- DATEPARSER, extremely useful for scraping dates from HTML files in many languages. It also supports the Hijri and Jalali calendars, and handles more than 200 language locales in different formats.
Features
Generic parsing of dates in over 200 language locales plus numerous formats in a language agnostic fashion.
Generic parsing of relative dates like: '1 min ago', '2 weeks ago', '3 months, 1 week and 1 day ago', 'in 2 days', 'tomorrow'.
Advanced Features
Generic parsing of dates with time zones abbreviations or UTC offsets like: 'August 14, 2015 EST', 'July 4, 2013 PST', '21 July 2013 10:15 pm +0500'.
Date lookup in longer texts.
Support for non-Gregorian calendar systems. See Supported Calendars.
Extensive test coverage.
SOURCE CODE [Example]
>>> from dateparser import parse
>>> parse('1 hour ago')
datetime.datetime(2015, 5, 31, 23, 0)
>>> parse('Il ya 2 heures') # French (2 hours ago)
datetime.datetime(2015, 5, 31, 22, 0)
>>> parse('1 anno 2 mesi') # Italian (1 year 2 months)
datetime.datetime(2014, 4, 1, 0, 0)
>>> parse('yaklaşık 23 saat önce') # Turkish (23 hours ago)
datetime.datetime(2015, 5, 31, 1, 0)
>>> parse('Hace una semana') # Spanish (a week ago)
datetime.datetime(2015, 5, 25, 0, 0)
>>> parse('2小时前') # Chinese (2 hours ago)
datetime.datetime(2015, 5, 31, 22, 0)
You could also try the dateparser module, which may be slower than datefinder on free text but which should cover more potential cases and date formats, as well as a significant number of languages.
HARD MODE:
If your dates are not separated by whitespace from surrounding text, combining datefinder with wordninja will solve this problem:
>>> import datefinder
>>> import wordninja
>>> example = '04.02.22ILeftMyHeartInSF ---> I Left My Heart In Sf - blah blah blah'
>>> list(datefinder.find_dates(' '.join(wordninja.split(example))))
[datetime.datetime(2022, 4, 22, 0, 0)]
Well sorta. That date was actually February 2004 not April 2022, but any tool would have to guess.
Just to be clear, this is what wordninja does to squishedtogethertext:
>>> wordninja.split(example)
['04', '02', '22', 'I', 'Left', 'My', 'Heart', 'In', 'SF', 'I', 'Left', 'My', 'Heart', 'In', 'Sf', 'blah', 'blah', 'blah']
If you know the position of the date in the string (for example, in a log file), you can use .split()[index] to extract it without fully knowing the format.
For example:
>>> string = 'monkey 2010-07-10 love banana'
>>> date = string.split()[1]
>>> date
'2010-07-10'
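Note that .split()[index] only returns a string. If you also want to check that the token really is a date, something like the sketch below could work. It assumes the token is in ISO YYYY-MM-DD form (which the split approach itself does not require), and the helper name date_at_index is made up for this example.

```python
from datetime import date
from typing import Optional

def date_at_index(line: str, index: int) -> Optional[date]:
    """Return the whitespace-separated token at index as a date, or None."""
    parts = line.split()
    try:
        # date.fromisoformat (Python 3.7+) rejects tokens that are not valid dates.
        return date.fromisoformat(parts[index])
    except (IndexError, ValueError):
        # Either the line has too few tokens or the token is not a valid date.
        return None

print(date_at_index('monkey 2010-07-10 love banana', 1))  # 2010-07-10
print(date_at_index('monkey 2010-07-10 love banana', 0))  # None
```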