fast auto-guessing of date strings
For a huge number of huge CSV files (100M+ lines) from different sources I need a fast snippet or library to auto-guess the date format and convert it to broken-down time or a Unix timestamp. Once the format has been guessed successfully, the snippet must be able to check subsequent occurrences of the date field for validity, because the date format is likely to change within a file.
The set of date formats to test against must be configurable, but compiling an optimal decision tree (or something similar) from a given list of date formats is fine.
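One cheap way to approximate such a compiled lookup is to index the candidate formats by the "shape" of the strings they produce, so that each incoming field is only tried against formats that could match at all. The following is a minimal sketch, not a tuned implementation: `CANDIDATE_FORMATS`, `shape`, `build_dispatch` and `candidates` are hypothetical names, and it assumes fixed-width formats only (variable-width fields such as full month names would need one table entry per possible shape).

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical candidate list; a real one would hold the formats observed so far.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d.%m.%Y", "%Y%m%d"]


def shape(text):
    """Collapse a string to its shape: digits become 'd', letters 'a', the rest is kept."""
    return "".join("d" if c.isdigit() else "a" if c.isalpha() else c for c in text)


def build_dispatch(formats, sample=datetime(2011, 12, 13)):
    """Group formats by the shape of a sample date rendered with each format."""
    table = defaultdict(list)
    for fmt in formats:
        table[shape(sample.strftime(fmt))].append(fmt)
    return dict(table)


def candidates(text, table):
    """Return the formats that parse text, trying only those whose shape matches."""
    hits = []
    for fmt in table.get(shape(text), []):
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue
        hits.append(fmt)
    return hits
```

With this table, `candidates("13/12/2011", table)` leaves only `%d/%m/%Y`, while an ambiguous string such as `10/04/2005` keeps both slash formats alive until a later line rules one out.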
I've come to the conclusion that nothing of the kind exists, but I still have to do some "market research", hence my question.
My first attempt was to mimic getdate() for the 23 different date formats I've observed so far, and to replace the number parsers with optimised versions that take date-specific characteristics into account (no '4' to '9' in the tens digit of the day, no '2' to '9' in the tens digit of the month, etc.).
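To illustrate the idea of a date-aware number parser: a two-digit field can be rejected on its first character alone, since the tens digit of a day is at most 3 and the tens digit of a month is at most 1. This is only a sketch of the technique in Python (the original attempt was modelled on C's getdate()); the function names and the `-1` failure value are assumptions made for the example.

```python
def parse_2digits(text, pos, max_tens, lo, hi):
    """Parse two ASCII digits at text[pos:pos+2]; return the value, or -1 on failure."""
    if pos + 2 > len(text):
        return -1
    tens = ord(text[pos]) - 48       # 48 == ord('0')
    ones = ord(text[pos + 1]) - 48
    # Reject non-digits and tens digits that cannot occur in this field.
    if not (0 <= tens <= max_tens and 0 <= ones <= 9):
        return -1
    value = tens * 10 + ones
    return value if lo <= value <= hi else -1


def parse_day(text, pos=0):
    """Day of month: tens digit 0-3, value 1-31."""
    return parse_2digits(text, pos, 3, 1, 31)


def parse_month(text, pos=0):
    """Month: tens digit 0-1, value 1-12."""
    return parse_2digits(text, pos, 1, 1, 12)
```

Note that this checks only the range of each field; whether a day is valid for a given month and year still has to be checked once all fields are known.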
Has anyone faced a similar problem, or even written code of this kind?
I dealt with timestamped sensor data (structurally CSV) in over fifty formats from numerous sources, using a Perl script. I was never constrained for functionality, and although it is script-based it was reasonably quick (>10K lines/sec, where a line was ~60-100 chars). I implemented:
a) Analyse the first couple of hundred lines, rewind, and then do the run, to build up context for the decision logic.
b) Emit erroneous lines with line number and context, so that at the end of the run I could edit the offending lines and set them to be re-inserted on a subsequent run. That run could then pass "patched" and error-free, i.e. every line would have matched a format.
c) Check the time difference between lines; only increasing timestamps were allowed.
d) Reformat other things as well, such as changing units, e.g. imperial to SI.
I come from the C camp, but simple Perl is not too alien, and it made this so much easier.
Note: this method could deal with ambiguities like 10/04/05 (DD/MM/YY or MM/DD/YY) if there was enough information in the file.
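The sample-then-validate approach described above might look roughly like the following. This is a Python sketch under stated assumptions, not the Perl script itself: `FORMATS`, `guess_format` and `validate` are hypothetical names, the date is assumed to be the first comma-separated field, and equal consecutive timestamps are accepted (only a step backwards is flagged).

```python
from datetime import datetime

# Hypothetical candidate list; a real one would cover all formats seen so far.
FORMATS = ["%d/%m/%y", "%m/%d/%y", "%Y-%m-%d"]


def _parses(text, fmt):
    """Return True if text parses under fmt."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def guess_format(samples, formats=FORMATS):
    """Return the formats that parse every sample; several may survive if ambiguous."""
    alive = list(formats)
    for s in samples:
        alive = [f for f in alive if _parses(s, f)]
    return alive


def validate(lines, fmt, column=0, sep=","):
    """Yield (line_number, line, reason) for each line that fails a check."""
    previous = None
    for number, line in enumerate(lines, start=1):
        field = line.rstrip("\n").split(sep)[column]
        try:
            stamp = datetime.strptime(field, fmt)
        except ValueError:
            yield number, line, "format mismatch"
            continue
        if previous is not None and stamp < previous:
            yield number, line, "timestamp went backwards"
            continue
        previous = stamp
```

For example, sampling `10/04/05` alone leaves both `%d/%m/%y` and `%m/%d/%y` as candidates, but adding `13/04/05` to the sample eliminates the month-first reading, which is the kind of in-file context the note about 10/04/05 refers to.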
After two weeks of excessive googl^Wweb browsing I came to the conclusion that I have to write this one myself. FTW, my first go at it: http://github.com/hroptatyr/glod