Distinguishing and Parsing Dates in Java
I know this topic isn't new, but I have to dig it up again. I have already searched the web numerous times (including some threads here on Stack Overflow) but haven't found a satisfying answer so far.
(Among others, I checked "Parsing Ambiguous Dates in Java" and http://www.coderanch.com/t/375367/java/java/Handling-Multiple-Date-Formats-Elegantly.)
I am currently writing a date parser in Java which takes a date and generates a format string that SimpleDateFormat can use to parse the date.
The dates are extracted via regex (yes, it's an ugly one xD) from logfiles (IBM WebSphere, Tomcat, Microsoft Exchange, ...). Because we have customers in (at least two) different locales, there is no way to simply "throw" the string at the parse method of SimpleDateFormat and expect it to work properly.
Furthermore, there is the problem of the position of day and month (i.e. the formats "dd/MM/yyyy" or "MM/dd/yyyy"), which cannot be solved unless I have at least two records in which the day digit has changed.
So my current approach would be to store the date formats for a specific piece of software installed on a specific customer's systems in a database (MySQL / XML / ...) and to force the user to specify at least the customer name and software name, so that there is enough context to narrow down the number of possible formats.
This "subset" would then be used to try to parse the logfiles of the specified software. (The subset is stored in a HashMap nested inside a HashMap: the Integer key of the outer map is the length of the format string, and the String key of the inner map is a date signature containing only the separating characters, e.g. ".. ::." for a date with the format "dd.MM.yyyy 11:11:11.111".)
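A minimal sketch of how such a length-plus-signature lookup might look. The class names DateSignature and FormatIndex are made up for illustration, and the inner map's value type (a list of candidate patterns) is an assumption, since the post does not say what the inner map holds. It also assumes fixed-width patterns without quoted literals, so that a pattern and a matching date string have the same length and the same separators.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: keeps only the separator characters of a string. */
final class DateSignature {
    static String of(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            // Drop digits and pattern letters, keep everything else.
            if (!Character.isLetterOrDigit(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}

/** Hypothetical index: length -> separator signature -> candidate patterns. */
final class FormatIndex {
    private final Map<Integer, Map<String, List<String>>> byLength = new HashMap<>();

    /** Registers a fixed-width SimpleDateFormat pattern such as "dd/MM/yyyy". */
    void register(String pattern) {
        byLength.computeIfAbsent(pattern.length(), k -> new HashMap<>())
                .computeIfAbsent(DateSignature.of(pattern), k -> new ArrayList<>())
                .add(pattern);
    }

    /** Returns every registered pattern with the same length and separators. */
    List<String> candidates(String date) {
        return byLength.getOrDefault(date.length(), Collections.emptyMap())
                       .getOrDefault(DateSignature.of(date), Collections.emptyList());
    }
}
```

After registering "dd/MM/yyyy" and "MM/dd/yyyy", a call like candidates("03/05/2011") returns both patterns, which shows that the lookup narrows the search but cannot settle the day/month question on its own.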
I also take the value of the digits into account, i.e. a number > 12 has to be a day because there is no 13th month. But this only works reliably for date strings later than the 12th of a month.
Is there any chance to avoid building in prior knowledge about the environment the logfile came from, so that the parser can reliably parse a single date without having to refer to a second date string for comparison?
I've been stuck on this for almost 3 months now -.-
Any suggestions would be very welcome =)
Edit:
Okay guys, this thread can be closed. I have now come up with a different solution for my specific problem. For those who are interested: I am writing a log reader in Java. As we have regular maintenance, I have to read many logfiles. But it's not just the plain text information written in the file that matters. Imagine a server has just crashed, it's Sunday night, and the next person to notice is the head of the customer's IT department. Then on the following day I have to do maintenance and check the logfiles. Judging by content, everything seemed okay, nothing unusual. Half an hour after sending the maintenance report, I receive a mail from the above-mentioned head of IT, ranting that the server had crashed and that it seemed to have gone unnoticed.
The point is, you can't keep track of both the content and the timestamps in logfiles with several thousand lines. So I developed a component which reads a logfile and calculates the time between two consecutive log entries. Each log line's date is parsed into a java.util.Date so that I can later get it as a timestamp, which gives a high resolution for the log intervals. I then plot the differences on a line graph, which makes a long pause between two log lines visible as a big spike relative to the rest of the file.
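The interval calculation described above could be sketched roughly as follows. The class and method names (LogGaps, gapsMillis, indexOfLargestGap) are hypothetical, the pattern is passed in as already known, and pinning the time zone to UTC is an assumption made here so that daylight-saving switches do not show up as artificial gaps.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper: time differences between consecutive log timestamps. */
final class LogGaps {

    /** Returns the gap in milliseconds between each timestamp and its predecessor. */
    static List<Long> gapsMillis(List<String> timestamps, String pattern) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
        fmt.setLenient(false);                        // reject dates like 32.01.2011
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: ignore DST shifts
        List<Long> gaps = new ArrayList<>();
        Date previous = null;
        for (String ts : timestamps) {
            Date current;
            try {
                current = fmt.parse(ts);
            } catch (ParseException e) {
                throw new IllegalArgumentException("Unparseable timestamp: " + ts, e);
            }
            if (previous != null) {
                gaps.add(current.getTime() - previous.getTime());
            }
            previous = current;
        }
        return gaps;
    }

    /** Index of the largest gap, i.e. the biggest spike on the graph; -1 if empty. */
    static int indexOfLargestGap(List<Long> gaps) {
        int best = -1;
        for (int i = 0; i < gaps.size(); i++) {
            if (best < 0 || gaps.get(i) > gaps.get(best)) {
                best = i;
            }
        }
        return best;
    }
}
```

The returned list can be fed directly into whatever charting component draws the line graph; the index of the largest gap points at the pair of log lines worth looking at first.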
My solution now will be to completely throw away the date-half of the String and insert a dummy-Date with a predefined format. The date only has to change if the Hour and minute approach 23:59. The original date later is presented on the graph with the "fake-data" lying beneath.
I thank all of you for your suggestions and feedback =) (And I hope my English has been understandable so far ;) )
My suggestion is to store all dates as 'ambiguous' until such time as the ambiguity can be resolved. (This assumes that a particular customer will always supply data in the same format.) As soon as you get a log from a customer for which you can unambiguously identify the date format, you would then be able to retrospectively apply this format to previously received files.
To do this, you would need a table mapping each customer to their date format, with some marker (e.g. NULL) to indicate that the format is not yet established. You will probably also need to create your own date representation so that you can model these ambiguous dates.
So, as an example, if the possible date formats are:
dd/mm/yyyy
mm/dd/yyyy
yyyy/mm/dd
yyyy/dd/mm
Given dates, you should always be able to identify the year (permitting two digit years would make this problem considerably harder). So you should be able to map dates as follows:
25/01/2011 -> UNAMBIGUOUS_DD_MM_YYYY
12/01/2011 -> AMBIGUOUS_XX_XX_YYYY
2011/03/03 -> AMBIGUOUS_YYYY_XX_XX
03/30/2011 -> UNAMBIGUOUS_MM_DD_YYYY
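The mapping above could be sketched along these lines. This is only an illustration: DateClassifier and its Kind enum are hypothetical names, the YYYY-first UNAMBIGUOUS constants and INVALID are added here by analogy, "/" is assumed to be the only separator, the year is recognised purely by being four characters long, and day numbers are not checked against the length of the month.

```java
/** Hypothetical classifier for the four slash-separated formats listed above. */
final class DateClassifier {

    enum Kind {
        UNAMBIGUOUS_DD_MM_YYYY, UNAMBIGUOUS_MM_DD_YYYY, AMBIGUOUS_XX_XX_YYYY,
        UNAMBIGUOUS_YYYY_MM_DD, UNAMBIGUOUS_YYYY_DD_MM, AMBIGUOUS_YYYY_XX_XX,
        INVALID
    }

    static Kind classify(String date) {
        String[] p = date.split("/");
        if (p.length != 3) {
            return Kind.INVALID;
        }
        // The year is assumed to be the only four-character field.
        boolean yearFirst = p[0].length() == 4;
        boolean yearLast = p[2].length() == 4;
        if (yearFirst == yearLast) {
            return Kind.INVALID;
        }
        int a;
        int b;
        try {
            // a and b are the two non-year fields, in the order they appear.
            a = Integer.parseInt(yearFirst ? p[1] : p[0]);
            b = Integer.parseInt(yearFirst ? p[2] : p[1]);
        } catch (NumberFormatException e) {
            return Kind.INVALID;
        }
        boolean dayThenMonth = isDay(a) && isMonth(b);
        boolean monthThenDay = isMonth(a) && isDay(b);
        if (dayThenMonth && monthThenDay) {
            return yearFirst ? Kind.AMBIGUOUS_YYYY_XX_XX : Kind.AMBIGUOUS_XX_XX_YYYY;
        }
        if (dayThenMonth) {
            return yearFirst ? Kind.UNAMBIGUOUS_YYYY_DD_MM : Kind.UNAMBIGUOUS_DD_MM_YYYY;
        }
        if (monthThenDay) {
            return yearFirst ? Kind.UNAMBIGUOUS_YYYY_MM_DD : Kind.UNAMBIGUOUS_MM_DD_YYYY;
        }
        return Kind.INVALID;
    }

    private static boolean isMonth(int n) {
        return n >= 1 && n <= 12;
    }

    private static boolean isDay(int n) {
        return n >= 1 && n <= 31;
    }
}
```

Once a customer's file yields any UNAMBIGUOUS result, that format can be written into the customer table and applied to the dates that were stored as ambiguous.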
If possible, you can ask the customers to pass the date format string along with their actual date strings.
i.e. in their log files, they would need to have one more column
..... , '03/11/2011' , 'MM/DD/YYYY' , ...
I think the strategy you are going for (i.e. analysing a bigger set of data) is the best you can get. From a single line of a logfile you will never know whether 3/5/11 is the 3rd of May 2011 or the 5th of March 2011. (I guess there might also be locales that would interpret this as the 11th of May 2003...) I had these problems myself some time ago, and I also could only try to introduce some sort of context by either looking at numbers > 12 or at what changes quickest (that must be the day). But you already stated that yourself...
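For what it's worth, the two heuristics mentioned here (a number > 12, and the field that changes quickest) might be combined over a batch of log dates roughly like this. DayMonthOrder is a made-up name, the input is assumed to be slash-separated with the day and month in the first two fields, and "changes quickest" is approximated by counting distinct values, which is a simplification rather than a proven rule.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical heuristic: guess day/month order from many dates of one logfile. */
final class DayMonthOrder {

    enum Order { DAY_FIRST, MONTH_FIRST, UNKNOWN }

    static Order infer(List<String> dates) {
        Set<Integer> firstValues = new HashSet<>();
        Set<Integer> secondValues = new HashSet<>();
        for (String date : dates) {
            String[] p = date.split("/");
            int first = Integer.parseInt(p[0]);
            int second = Integer.parseInt(p[1]);
            // A value above 12 cannot be a month, so it settles the question.
            if (first > 12) {
                return Order.DAY_FIRST;
            }
            if (second > 12) {
                return Order.MONTH_FIRST;
            }
            firstValues.add(first);
            secondValues.add(second);
        }
        // Fallback: the field that takes more distinct values is assumed to be the day.
        if (firstValues.size() > secondValues.size()) {
            return Order.DAY_FIRST;
        }
        if (secondValues.size() > firstValues.size()) {
            return Order.MONTH_FIRST;
        }
        return Order.UNKNOWN;
    }
}
```

With a single date such as "03/05/2011" this returns UNKNOWN, which matches the point above: one line alone is not enough.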