Extract inconsistently formatted date from string (date parsing, NLP)
I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that, some filenames have unrelated numbers that look somewhat like dates, e.g. "20202010".
In short, the dates are normally incomplete, sometimes not there, are inconsistently formatted and are embedded in a string with other information, e.g. "Report Aug06.xls".
Are there any Perl modules available which will do a decent job of guessing the date from such a string? It doesn't have to be 100% correct, as it will be verified by a human manually, but I'm trying to make things as easy as possible for that person and there are thousands of entries to check :)
Date::Parse is definitely going to be part of your answer - it's the bit that takes a randomly formatted date-like string and makes an actual usable date out of it.
The other part of your problem - the rest of the characters in your filenames - is unusual enough that you're unlikely to find someone else has packaged up a module for you.
Without seeing more of your sample data, it's really only possible to guess, but I'd start by identifying possible or likely "date section" candidates.
Here's a nasty brute-force example using Date::Parse (a smarter approach would use a list of regexes to try to identify the date-like bits - I'm happy to burn CPU cycles to not think quite so hard though!):
#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse;

my @files = ("Report Aug06.xls", "ReportAug2006", "Report 11th September 2006.xls",
             "Annual Report-08-06", "End-of-month Report01-08-06.xls", "Report2006");

# assumption - longest likely date string is something like '11th September 2006' - 19 chars
# shortest is "2006" - 4 chars.
# brute force all strings from 19 down to 4 chars long at the end of the filename (less extension)
# return the longest thing that Date::Parse recognises as a date
foreach my $file (@files) {
    # chop the final extension if there is one (work on a copy so @files is untouched)
    (my $name = $file) =~ s/\.[^.]*$//;
    for my $len (-19 .. -4) {
        my $string = substr($name, $len);
        my $time = str2time($string);
        next unless defined $time;    # not recognised as a date, try a shorter tail
        print "$string is a date: $time = ", scalar(localtime($time)), "\n";
        last;
    }
}
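For comparison, the "list of regexes" idea mentioned above could look roughly like the sketch below. It is core Perl only and is not part of Date::Parse. The name guess_date, the pattern list, the day-month-year reading of "01-08-06" and the two-digit-year pivot of 70 are all my own assumptions for illustration, so adjust them to your data. It returns a partial ISO-style string (year, year-month or year-month-day) so the human checker can see how much was actually found.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map three-letter month prefixes to month numbers.
my %MONTH_NUM;
@MONTH_NUM{qw(jan feb mar apr may jun jul aug sep oct nov dec)} = (1 .. 12);

# Full month names or three-letter abbreviations only, so that the "mar"
# in "Summary 2006" is not mistaken for March.
my $MON = qr/(?<mon> january | february | march | april | may | june | july
           | august | september | october | november | december | sept
           | jan | feb | mar | apr | jun | jul | aug | sep | oct | nov | dec )/ix;
my $YEAR = qr/(?<year>\d{4}|\d{2})(?!\d)/;    # 4- or 2-digit year, not followed by a digit
my $SEP  = qr/[\s_.-]*/;                      # optional separators

# Hypothetical pattern list, most specific first.
my @PATTERNS = (
    qr/(?<!\d)(?<day>\d{1,2})(?:st|nd|rd|th)?$SEP$MON$SEP$YEAR/i,  # 11th September 2006
    qr/$MON$SEP$YEAR/,                                             # Aug06, August 2006
    qr/(?<!\d)(?<day>\d{1,2})[-.](?<mon>\d{1,2})[-.]$YEAR/,        # 01-08-06, assumed day-month-year
    qr/(?<!\d)(?<mon>\d{1,2})[-.]$YEAR/,                           # 08-06, assumed month-year
    qr/(?<!\d)(?<year>(?:19|20)\d{2})(?!\d)/,                      # bare 2006
);

# Return "YYYY", "YYYY-MM" or "YYYY-MM-DD", or undef if nothing looks like a date.
sub guess_date {
    my ($name) = @_;
    for my $re (@PATTERNS) {
        next unless $name =~ $re;
        my %m = %+;    # copy the named captures before another match resets them

        my ($month, $day) = @m{qw(mon day)};
        $month = $MONTH_NUM{ lc substr($month, 0, 3) }
            if defined $month && $month !~ /^\d+$/;

        # Reject impossible values and fall through to the next pattern.
        next if defined $month && ($month < 1 || $month > 12);
        next if defined $day   && ($day < 1   || $day > 31);

        # Assumption: two-digit years below 70 mean 20xx, the rest 19xx.
        my $year = length($m{year}) == 4
            ? $m{year}
            : ($m{year} < 70 ? 2000 : 1900) + $m{year};

        return join '-', $year,
            map { sprintf '%02d', $_ } grep { defined } $month, $day;
    }
    return undef;
}

for my $file ("Report Aug06.xls", "ReportAug2006", "Report 11th September 2006.xls",
              "Annual Report-08-06", "End-of-month Report01-08-06.xls",
              "Report2006", "20202010") {
    printf "%-35s %s\n", $file, guess_date($file) // '(no guess)';
}
```

Because every numeric pattern refuses to match in the middle of a longer run of digits, something like "20202010" or "011004" gets no guess at all rather than a wrong one. For a human-verified list that seemed the safer default to me, but you could add a six-digit pattern if your files use that form a lot.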
Date::Parse does what you want.
DateTime::Format::Natural looks like a candidate for this job. I can't vouch for it personally but it has good reviews.