开发者

Extract inconsistently formatted date from string (date parsing, NLP)

I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that, some filenames have unrelate开发者_StackOverflow社区d numbers that look somewhat like dates, e.g. "20202010".

In short, the dates are normally incomplete, sometimes not there, are inconsistently formatted and are embedded in a string with other information, e.g. "Report Aug06.xls".

Are there any Perl modules available which will do a decent job of guessing the date from such a string? It doesn't have to be 100% correct, as it will be verified by a human manually, but I'm trying to make things as easy as possible for that person and there are thousands of entries to check :)


Date::Parse is definitely going to be part of your answer - the bit that works out a randomly formatted date-like string and make an actual useable date out of it.

The other part of your problem - the rest of the characters in your filenames - is unusual enough that you're unlikely to find someone else has packaged up a module for you.

Without seeing more of your sample data, it's really only possible to guess, but I'd start by identifying possible or likely "date section" candidates.

Here's a nasty brute-force example using Date::Parse (a smarter approach would use a list of regex-en to try and identify dates-bits - I'm happy to burn cpu cycles to not think quite so hard though!)

!/usr/bin/perl
use strict;
use warnings;
use Date::Parse;

my @files=("Report Aug06.xls", "ReportAug2006", "Report 11th September 2006.xls", 
           "Annual Report-08-06", "End-of-month Report01-08-06.xls", "Report2006");

# assumption - longest likely date string is something like '11th September 2006' - 19 chars
# shortest is "2006" - 4 chars.
# brute force all strings from 19-4 chars long at the end of the filename (less extension)
# return the longest thing that Date::Parse recognises as a date



foreach my $file (@files){
  #chop extension if there is one
  $file=~s/\..*//;
  for my $len (-19..-4){
    my $string = substr($file, $len);
    my $time = str2time($string);
    print "$string is a date: $time = ",scalar(localtime($time)),"\n" if $time;
    last if $time;
    }
  }


Date::Parse does what you want.


DateTime::Format::Natural looks like a candidate for this job. I can't vouch for it personally but it has good reviews.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜