Extracting dates from text in Java

Is it possible to extract dates from a string in Java?

I have 500+ strings with different data. In them, there can be:

"... period from 08.23.2011 - 09.05.2011..."

and also:

"...period ends 06.09.2011...".

It's not certain that the above strings are there, but they can be.

Is it possible to extract the 3 dates and get them in Date format?


In essence, regex is the answer for recognition, but there are many ways to express dates and time periods, so if you want a good solution, you probably want to use an existing, well-tuned set of regexes. There's then a second phase of interpretation, which needs more flexibility than what Joda Time will parse out of the box. So for a robust solution, you probably want to use one of the systems that have been built in the natural language processing community, such as SUTime, HeidelTime or GUTime.


You can extract them with regex first: \d{2}\.\d{2}\.\d{4} and then parse each match with SimpleDateFormat - new SimpleDateFormat("dd.MM.yyyy").parse(dateString)
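A minimal sketch of that two-step approach follows. The class and method names are made up for illustration. One assumption to check against the real data: the question's sample "08.23.2011" only makes sense month-first, so the sketch passes "MM.dd.yyyy" rather than "dd.MM.yyyy". Lenient parsing is switched off so that an impossible date is skipped instead of silently rolling over into another month.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class RegexDateExtractor {
    // Two digits, a dot, two digits, a dot, four digits.
    private static final Pattern CANDIDATE = Pattern.compile("\\d{2}\\.\\d{2}\\.\\d{4}");

    static List<Date> extract(String text, String datePattern) {
        SimpleDateFormat format = new SimpleDateFormat(datePattern);
        // Reject impossible values (e.g. month 23) instead of rolling them over.
        format.setLenient(false);
        List<Date> dates = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            try {
                dates.add(format.parse(matcher.group()));
            } catch (ParseException e) {
                // Looked like a date but is not valid for this pattern; skip it.
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "... period from 08.23.2011 - 09.05.2011 ... period ends 06.09.2011 ...";
        // Assumption: month comes first in the sample data.
        for (Date date : extract(text, "MM.dd.yyyy")) {
            System.out.println(date);
        }
    }
}
```

With "dd.MM.yyyy" the same call would drop "08.23.2011", because there is no month 23. Which field order the data really uses has to be decided from the source of the strings, not from the regex.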


I would use a simple regex to get "likely" dates out first, and then parse them more carefully (ideally with Joda Time, IMO). I'd start off with a regex of \b\d{2}\.\d{2}\.\d{4}\b (plus escaping for the Java string of course).

(The \b bit matches a word boundary, so 12345.45.12345 won't match.)

You can make your regex more selective, of course, but it would be very hard to make it do all the validation required (imagine trying to encode all the rules for leap years in a regex) - so if you're going to need to validate as you parse anyway, there's not a lot of point in making the regex complicated.
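As a rough sketch of "simple regex first, careful parse second", the example below uses java.time (Java 8+) in place of Joda Time, which is a substitution and not what the answer names. The class name is hypothetical, and the day-first pattern is an assumption; swap the "dd" and "MM" fields if the data is month-first. The strict resolver does the leap-year validation that would be painful to encode in the regex.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class StrictDateScanner {
    // Cheap filter for "likely" dates; \b keeps 12345.45.12345 from matching.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\b\\d{2}\\.\\d{2}\\.\\d{4}\\b");

    // "uuuu" instead of "yyyy": under STRICT resolution, "yyyy" (year-of-era)
    // cannot be resolved to a date without an era field.
    private static final DateTimeFormatter STRICT =
            DateTimeFormatter.ofPattern("dd.MM.uuuu").withResolverStyle(ResolverStyle.STRICT);

    static List<LocalDate> scan(String text) {
        List<LocalDate> dates = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            try {
                dates.add(LocalDate.parse(matcher.group(), STRICT));
            } catch (DateTimeParseException e) {
                // Passed the regex but is not a real calendar date; skip it.
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        // 29.02.2012 is a real leap day; 29.02.2011 is not and gets dropped.
        System.out.println(scan("a 29.02.2012 b 29.02.2011 c 12345.45.12345 d 06.09.2011"));
    }
}
```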


You mean String and not text (this is Java).

Create a String object to represent the text and then parse it with the SimpleDateFormat class:

Date date = new SimpleDateFormat("dd.MM.yyyy").parse(yourString);


This is a date pattern recognition algorithm that not only identifies date patterns but also returns the probable date in Java date format. The algorithm is fast and lightweight: processing time is linear and all dates are identified in a single pass. It resolves dates by traversing a tree; custom tree data structures are built for the supported date, time and month patterns.

The algorithm also tolerates multiple space characters between date literals. E.g. "DD DD DD" and "DD   DD   DD" are both considered valid dates.

The following date patterns are considered valid and are identifiable using this algorithm:

dd MM(MM) yy(yy)
yy(yy) MM(MM) dd
MM(MM) dd yy(yy)

where M is a month literal in alphabetic format, like Jan or January.

Allowed delimiters between dates are '/', '\', ' ', ',', '|', '-', ' '

It also recognizes a trailing time pattern in the following formats:

hh(24):mm:ss.SSS am / pm
hh(24):mm:ss am / pm

Resolution time is linear; no pattern matching or brute force is used. The algorithm is based on tree traversal and returns a list of dates, each with the following three components:

- the date string identified in the text
- the converted and formatted date string
- the SimpleDateFormat pattern

Using date string and the format string, users are free to convert the string into objects based on their requirements.

The library is available on Maven Central.

<dependency>
    <groupId>net.rationalminds</groupId>
    <artifactId>DateParser</artifactId>
    <version>0.3.0</version>
</dependency>

The sample code to use this is below.

import java.util.List;
import net.rationalminds.LocalDateModel;
import net.rationalminds.Parser;

public class Test {
    public static void main(String[] args) throws Exception {
        Parser parser = new Parser();
        List<LocalDateModel> dates = parser.parse("Identified date :'2015-January-10 18:00:01.704', converted");
        System.out.println(dates);
    }
}

Output: [LocalDateModel{originalText=2015-january-10 18:00:01.704, dateTimeString=2015-1-10 18:00:01.704, conDateFormat=yyyy-MM-dd HH:mm:ss.SSS, start=18, end=46}]
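To turn such a result into a java.util.Date, the two strings can be fed to SimpleDateFormat. The sketch below copies the dateTimeString and conDateFormat values from the output above as literals, because the accessor names on LocalDateModel are not shown here; the class and method names are made up for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper, not part of the DateParser library.
final class ConvertExtractedDate {
    static Date toDate(String dateTimeString, String formatString) throws ParseException {
        // When parsing, SimpleDateFormat accepts "1" for "MM" because the
        // month field is delimited by '-' on both sides.
        return new SimpleDateFormat(formatString).parse(dateTimeString);
    }

    public static void main(String[] args) throws ParseException {
        // Values copied from the sample output above.
        Date date = toDate("2015-1-10 18:00:01.704", "yyyy-MM-dd HH:mm:ss.SSS");
        System.out.println(date);
    }
}
```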

Detailed blog at http://coffeefromme.blogspot.com/2015/10/how-to-extract-date-object-from-given.html

The complete source is available on GitHub at https://github.com/vbhavsingh/DateParser
