Strategies for finding dates or date/times in a text document?

Problem: Given an unstructured text document, find any date or date/time substrings.

My current thought is to search for known formats with a bunch of regexes, which feels grossly kludgy, expensive, and error-prone :-)

This is the sort of doc I'm talking about:

Bacon ipsum dolor sit amet sirloin reprehenderit spare ribs aute. Ullamco consequat shank swine chuck, laboris do pastrami January 10th 1980 est venison shankle short 1-20-1980 loin bresaola corned beef. Beef ribs 28/2/2001 tri-tip est cupidatat shank, excepteur qui non pastrami.

I suspect I'm not the first person to address this problem, and I'm hoping that the resultant code is buried in some open source project I don't know about…

Thoughts?


This is a bit of an ad-hoc heuristic, but maybe tokenize first?

You could recognize the following tokens:

  • "junk" (the default, anything not like a date part)
  • dddd (4 digits - usually a year)
  • dd (2 digits - day month or year)
  • d (1 digit - day or month)
  • dd_st
  • dd_th (and variations on number of digits)
  • dd_rd
  • dd_nd
  • monthname

and so on.
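A minimal sketch of that token classification in Python might look like the following. The names `classify`, `tokenize` and `MONTH_NAMES` are made up for illustration, only full English month names are recognized, and separators such as `-` and `/` are simply dropped, which is an assumption rather than part of the idea itself.

```python
import re

# Assumption: full English month names only; abbreviations are left out.
MONTH_NAMES = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}

def classify(token: str) -> str:
    """Map one raw token to a token kind from the list above."""
    t = token.lower()
    if t in MONTH_NAMES:
        return "monthname"
    if re.fullmatch(r"\d{4}", t):
        return "dddd"
    if re.fullmatch(r"\d{2}", t):
        return "dd"
    if re.fullmatch(r"\d", t):
        return "d"
    # Ordinals such as 1st, 2nd, 3rd, 10th (one or two digits).
    m = re.fullmatch(r"\d{1,2}(st|nd|rd|th)", t)
    if m:
        return "dd_" + m.group(1)
    return "junk"

def tokenize(text: str) -> list[tuple[str, str]]:
    """Split on anything that is not a letter or digit, then classify."""
    return [(w, classify(w)) for w in re.findall(r"[A-Za-z0-9]+", text)]
```

For example, `tokenize("1-20-1980")` yields the kinds `d`, `dd`, `dddd`, and `January 10th 1980` yields `monthname`, `dd_th`, `dddd`.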

Each token can have several interpretations (e.g., d can be a month or a day), and a date is any sequence of 3 tokens where you can select one of each from year, month, and day (in any order you wish to allow).
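As a rough sketch of that matching step in Python: each token gets a set of possible roles, and a window of 3 adjacent tokens counts as a date when one role assignment covers year, month and day and forms a valid calendar date. The helper names (`interpretations`, `candidates`, `find_dates`) are hypothetical, and two-digit years are left out to keep the example short, so this covers only part of what the token list allows.

```python
import re
from datetime import date
from itertools import product

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

def interpretations(token):
    """Return the set of (role, value) pairs this token could stand for."""
    t = token.lower()
    if t in MONTHS:
        return {("month", MONTHS[t])}
    m = re.fullmatch(r"(\d{1,2})(st|nd|rd|th)", t)
    if m:
        return {("day", int(m.group(1)))}
    if re.fullmatch(r"\d{4}", t):
        return {("year", int(t))}
    out = set()
    if re.fullmatch(r"\d{1,2}", t):
        n = int(t)
        if 1 <= n <= 12:
            out.add(("month", n))
        if 1 <= n <= 31:
            out.add(("day", n))
    return out  # empty set means "junk"

def candidates(window):
    """All valid dates obtainable by picking one role per token."""
    results = set()
    for combo in product(*(interpretations(t) for t in window)):
        if {role for role, _ in combo} == {"year", "month", "day"}:
            parts = dict(combo)
            try:
                results.add(date(parts["year"], parts["month"], parts["day"]))
            except ValueError:
                pass  # e.g. 31 February
    return sorted(results)

def find_dates(text):
    """Slide a 3-token window over the text; keep windows that form a date."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    found, i = [], 0
    while i + 3 <= len(tokens):
        window = tokens[i:i + 3]
        dates = candidates(window)
        if dates:
            found.append((" ".join(window), dates))
            i += 3  # skip past the tokens just consumed
        else:
            i += 1
    return found
```

On the sample text this picks out `January 10th 1980`, `1-20-1980` and `28/2/2001`. An ambiguous input such as `1-2-1980` returns both 2 January and 1 February, so the caller still has to choose a day/month convention.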

The idea here is to accept many more syntaxes than you would get with a fixed set of regexes, if that is your intention.
