开发者

Dealing with Messy Dates

I hope you didn't think I was asking for relationship advice.

Infrequently, I have to offer survey respondents the ability to specify when an event occurred. What results is a horribly messy string that I honestly just don't know what to do with. Beyond recoding by hand.

Here it is a short sample, of thousands:

c("May2/ 12 noon", "9:45 am", "11:00 AM AST", "April 27 / 12:00 AST", 
"11:40 AM AST", "April 25 2011", "April 12th 2011 / 8:44", "April 12 2011 / 8:36am", 
"April 12 2011 / 8:30am", "April 12th 2011 / 8:18", "April 12 2011 / 8:12am", 
"April 11th 2011 / 5:57pm", "April 11th 2011 / 5:49pm", "April 11th 2011 / 5:42pm", 
"April 11th 2011 / 5:36pm", "April 11th 2011 / 5:27", "April 5 @ 11:26am", 
"8:50", "April 4th 12:45pm", "April 4th around 10am", "April 4th around 10am", 
"Mar 18, 2011 9:33am", "Mar 18, 2011 9:27am", "df", "fg", "12:16", 
"9:50", "Feb 8, 2011 / 12:20pm", "8:34 am  2/4/11", "Jan 31, 2011 2:50pm", 
"Jan 31, 2011 2:45pm", "Jan 31, 2011 2:38pm", "Jan 31, 2011 2:26pm", 
"11h09", "11:00 am", "1h02 pm", "10h03", "2h10", "Jan 13, 2011 9:50am Van", 
"Jan 12, 2011", "Jan 12, 2011 3:59pm", "Jan 12     14:19PM", 
"Jan 12, 2011 1:35pm", "Jan 12,2011 1:28pm", "1h36", "9h15", 
"9h09", "8h51", "8h45", "8h35", "1h12 pm", "12h59", "11h52 am", 
"10h45", "15h55", "Dec 31, 10 11:11am", "Dec 31,10 10:15am", 
"Dec 30, 2010 12:32pm", "Dec 30, 2010 12:18pm", "9:16 am", "11h16 am", 
"11h12", "9h29 am", "11h38", "Dec 16, 2010", "December 16, 2010", 
"December 16, 2010", "Dec 15,2010", "DEC 14 2010", "Dec 14 11:38", 
"Dec 14 11:35", "Dec 14 11:25", "December 13, 2010", "Dec 10, 1:38 pm", 
"Dec 10, 1:26 pm", "Dec 10, 1:20 pm", "Dec 10, 1:12 pm", "December 9 2010", 
"11h10 am", "10h59 am", "10:50 am", "Tues Dec 7th, 9:45 Van time", 
"Dec 3, 2010 12:30pm", "Dec 3, 2010 12:20pm", "Dec 3, 2010 12:10 pm", 
"November 30, 2010 4.02pm", "November 30, 2010", "november 29 120pm", 
"November 29 2010 11:27", "10:12am November 29, 2010", "Nov 26/10 1:18pm", 
"10:56 am", "Nov 24", "nov 24/ 4:20 PM AST", "Nov 24/4:00 PM AST", 
"NOVEMBER 24/10  2:10 pm", "November 24/10  11:00 a.m.", "12:05 MST", 
"3.55PM", "Nov. 17/10 12:45 pm", "Nov. 16/10  12:00 noon", "Nov. 16/10 11;50 a.m.", 
"nov 16/10  11:30 a.m.", "November 12, 2010 @ 12:23pm", "november 11 2010  2:20pm", 
"November 11 2010  2:15pm", "November 11 2:00pm", "Nov. 10/10:22am", 
"nov. 8/10...3:19 pm", "Nov 8/10  1;50 p.m.", "November 8/10...12 noon", 
"November 8/10..10: am", "Nov 5, 2010  1:10 pm", "11:32 am CST", 
"Nov 4  11:10", "nov 3 10am", "9:30 am", "11/02/2010 1:50PM", 
"Oct 29/10 2:50PM", "Oct 28 @ 11:20am", "27Oct10 10:40am", "10/26/2010 11:18", 
"Oct 26/10 11am", "Oct 26/10 10:30 am", "Oct 26 10:50", "10/25/2010 13:50", 
"10/22/2010  10:15", "Oct 22/10 10AM", "Oct 21, 2010 3:00 pm", 
"Oct 21, 2010 2:59", "10/21/2010 11:50", "10/21/2010 11:45", 
"10/21/2010 11:40", "10/21/2010 11:30", "11:30", "Oct 20 approx 1pm", 
"Oct 20/10 4:50PM", "13:48", "13:45", "Oct 20, 2010 11:45 am", 
"October 19th 3:05pm", "Oct 18,2010 2:15pm", "Oct 18/10 3:10PM", 
"10:30 am", "Oct 15/10 11:50am", "oct 14 @ 11:05am", "Oct 14/ 11:06", 
"4:40 oct 13 atlantic", "oct 13 4:05 pm atlantic", "oct 13 1:45 atlantic time", 
"Oct 13 / 10:37", "OCT 12 3:33", "Oct 12,2010 1:10pm", "Oct 12 / 11:45", 
"Oct 12 / 9:45", "Oct 8. 2010/ 2:00", "Oct 8/10- 1145am", "2 Sept 2010 3.52pm", 
"2 Sept 2010 10.21am", "1 Sept 2010 2.05pm", "1 Sept 2010", "31 Aug 2010 - 11.52am", 
"31 aug 10:40am", "31 aug 2010 - 10am")

Generally, these events occur near to the date which the respondent fills out the survey, but not always. The survey date 开发者_如何学Gois recorded automatically and in a consistent format and is easily to translate into POSIX using as.Date so, elements that only contain the time can be ignored and merged with the date that they filled out the survey.

Your thoughts are much appreciated.

Note1: Some of you may say, you should have done X, Y, or Z in terms of validating the responses. To you, I say - hell yes - next time. I didn't design it! I just have to deal with it.

A few facts that can assist in a workaround:

  • The times will always be business day hours, 9am-6pm (hence am/pm doesn't matter)
  • The years don't matter as I can pull them from another field (it will always only ever be 2011/2010, which is thankfully outside of the possible timeframe in any notation)
  • I don't care about timezones, as I have their geographic location

What I've done so far:

mos <- strsplit('
jan
feb
mar
apr
may
jun
jul
aug
sep
oct
nov
dec
january
february
march
april
may
june
july
august
september
october
november
december
', '\n')[[1]][-1]

days <- strsplit('
mon
tue
wed
thu
fri
sat
sun
monday
tuesday
wednesday
thursday
friday
saturday
sunday
', '\n')[[1]][-1]
## Messy Date Wrangling
x <- ## that hot ghetto mess above
# minimize
x <- tolower(x)
# remove unnecessary crap
x <- sub("2011"," ",x)
x <- sub("2010"," ",x)
x <- sub("am"," ",x)
x <- sub("pm"," ",x)
x <- sub("[p][.][m]"," ",x)
x <- sub("[a][.][m]"," ",x)
x <- sub("[.]{3}"," ",x)
x <- str_trim(x, side="both")
# divide
x <- strsplit(x,c(" "))
# conquer?

lapply(x, function(x) pmatch(x,mos))
lapply(x, function(x) pmatch(x,days))


My sympathy that your date didn't turn out as pretty as expected. ;-)

I have constructed a (still partial) solution along the lines suggested by @Rguy.

(Please note that this code still has a bug: It does't always return the correct time. For some reason, it doesn't always do a greedy match on the digits before the colon, thus sometimes returning 1:00 when the time is 11:00.)

First, construct a helper function that wraps around gsub and grep. This function takes a character vector as one of its arguments and collapses this into a single string separated by |. The effect of this is to allow you to easily pass multiple patterns to be matched by a regex:

find.pattern <- function(x, pattern_list){
  pattern <- paste(pattern_list, collapse="|")
  ret <- gsub(paste("^.*(", pattern, ").*", sep=""), "\\1", x, ignore.case=TRUE)
  ret[ret==x] <- NA 
  ret2 <- grepl(paste("^(", pattern, ")$", sep=""), x, ignore.case=TRUE)
  ret[ret2] <- x[ret2] 
  ret
}

Next, use some built-in variable names to construct a vector of months and abbreviations:

all.month <- c(month.name, month.abb)

Finally, construct a data frame with different extracts:

ret <- data.frame(
    data = dat, 
    date1 = find.pattern(dat, "\\d+/\\d+/\\d+"),
    date2 = find.pattern(dat, 
      paste(all.month, "\\s*\\d+[(th)|,]*\\s{0,3}[(2010)|(2011)]*", collapse="|", sep="")),
    year = find.pattern(dat, c(2010, 2011)),
    month = find.pattern(dat, month.abb), #Use base R variable called month.abb for month names
    hour = find.pattern(dat, c("\\d+[\\.:h]\\d+", "12 noon")),
    ampm = find.pattern(dat, c("am", "pm"))
)

The results:

head(ret, 50)
                      data  date1        date2 year month  hour ampm
20   April 4th around 10am   <NA>   April 4th  <NA>   Apr  <NA>   am
21   April 4th around 10am   <NA>   April 4th  <NA>   Apr  <NA>   am
22     Mar 18, 2011 9:33am   <NA> Mar 18, 2011 2011   Mar  9:33   am
23     Mar 18, 2011 9:27am   <NA> Mar 18, 2011 2011   Mar  9:27   am
24                      df   <NA>         <NA> <NA>  <NA>  <NA> <NA>
25                      fg   <NA>         <NA> <NA>  <NA>  <NA> <NA>
26                   12:16   <NA>         <NA> <NA>  <NA> 12:16 <NA>
27                    9:50   <NA>         <NA> <NA>  <NA>  9:50 <NA>
28   Feb 8, 2011 / 12:20pm   <NA>  Feb 8, 2011 2011   Feb  2:20   pm
29         8:34 am  2/4/11 2/4/11         <NA> <NA>  <NA>  8:34   am
30     Jan 31, 2011 2:50pm   <NA> Jan 31, 2011 2011   Jan  2:50   pm
31     Jan 31, 2011 2:45pm   <NA> Jan 31, 2011 2011   Jan  2:45   pm
32     Jan 31, 2011 2:38pm   <NA> Jan 31, 2011 2011   Jan  2:38   pm
33     Jan 31, 2011 2:26pm   <NA> Jan 31, 2011 2011   Jan  2:26   pm
34                   11h09   <NA>         <NA> <NA>  <NA> 11h09 <NA>
35                11:00 am   <NA>         <NA> <NA>  <NA>  1:00   am
36                 1h02 pm   <NA>         <NA> <NA>  <NA>  1h02   pm
37                   10h03   <NA>         <NA> <NA>  <NA> 10h03 <NA>
38                    2h10   <NA>         <NA> <NA>  <NA>  2h10 <NA>
39 Jan 13, 2011 9:50am Van   <NA> Jan 13, 2011 2011   Jan  9:50   am
40            Jan 12, 2011   <NA> Jan 12, 2011 2011   Jan  <NA> <NA>


This may be one of the few cases where another tool other than R is the best to use. I know that there are some modules for Perl that have already been developed to parse messy looking dates, on module DateTime::Format::Natural::Lang::EN can parse strings like: "1st tuesday last november". I seem to remember another module that could understand things like "the second tuesday after the first Monday in February".

There is also a tool at http://www.datasciencetoolkit.org/ that grabs what looks like dates in text and converts them to a standard format.


I'm not going to try to write the function right now, but I have an idea that might work.

Search each string for a 4-digit number to call the year.

Use grep to search each string for the first 3 letters of the abbreviation for the months. It seems almost all of your data (at least above) has an identifier like that. I'd store the value which is found in a "months" vector, and put blanks wherever no value is found. Here's a really ugly version of the code (i'll make this more efficient later, and add the case when the month isn't capitalized!)

mos <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")   
blah <- lapply(1:12, function(i) grepl(mos[i], test))   
lapply(blah, function(i) which(i))   
months <- 0*(1:length(test))   
for (i in 1:12) {   
  months[blah[[i]]] <- i   
}  


   months
  [1]  5  0  0  4  0  4  4  4  4  4  4  4  4  4  4  4  4  0  4  4  4  3  3  0  0  0  0  2  0  1
 [31]  1  1  1  0  0  0  0  0  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0 12 12 12 12  0
 [61]  0  0  0  0 12 12 12 12  0 12 12 12 12 12 12 12 12 12  0  0  0 12 12 12 12 11 11  0 11 11
 [91] 11  0 11  0 11  0 11  0  0 11 11 11  0 11  0 11 11 11  0 11 11 11 11  0 11  0  0  0 10 10
[121] 10  0 10 10 10  0  0 10 10 10  0  0  0  0  0 10 10  0  0 10 10 10 10  0 10  0 10  0  0  0
[151] 10  0 10 10 10 10 10  9  9  9  9  8  0  0 

The "day" most commonly follows the word used for the month immediately. So if there is a one or 2 digit number after the month(which is character), extract that number and call it the day.

Times most commonly have the ":" or "." symbol in them, and so search each string for that character. If found in a string, create a "Time" vector with all of the digits immediately before and after that character (in theory, including 2 before and 2 after should not cause a problem). Put blanks whenever the symbol is not present. It would be nice if all of the data were definitely confined to a <12 hour period, because then you won't have to worry about AM and PM. If not, Maybe search the string for "AM" and "PM" as well.

Then, try to convert the strings which have all four of the above to POSIXct. The ones that don't convert, you'll have to manually enter of course. I think it would take me a few hours to code the function described above, and depending on the variability and size of your dataset it may or may not be worth the effort. Also, there is some risk for incorrect outputs, so adding an acceptable time range would help to avoid that.

In summary, it sounds like you're going to have to code a function with a whole lot of exceptions and then end up hand-coding a good portion of the times anyway. I hope someone can provide a better solution for you, though.

Good Luck!


The wolfram alpha http://www.wolframalpha.com/ is definitely a great tool to do that work.

At least, it successfully interpret some messy input in your data. It would be worth trying.

I'm not sure if the site is suitable for extremely large dataset, but if the data is not so large, it will be useful.

It is not difficult to write a automatized script that send query, get data and parse it, although I'm not sure if the site allows such usage.


Others have already addressed standard approaches and packages. I'll take a different perspective. Using regular expressions and fixed formats will get you most of the way. For the rest, I'd simply approach it as I would any problem in "pattern matching": statistical methods or machine learning. You've already specified the date and time ranges, and the timestamp of the logs is also informative. By extracting a lot of text features (this is where regular expressions would prove useful), you could then try to map to times of interest.

There are only three things to do for getting this working:

  1. Feature extraction
  2. Training set generation
  3. Build & deploy models

Build and deploy models? Let me introduce you to my friend R and the machine learning task view. :) The basic models to explore include multinomial models (take a look at glmnet), decision trees, and support vector machines. You might use decision trees and SVMs as inputs for a multinomial model (and the SVMs might not be necessary after all). To be honest, this part is nebulous: one could do this modeling as disconnected date components or as a process of refinements, e.g. get the year, if possible, then the minutes (because the range is much larger than for hours, days, months), then day of month, and finally hours and months. Essentially, I'd aim for trying to identify "parts of time" (analogous to parts of speech) for the numerical/string components.

Feature extraction: I'd try splits with colons, commas, slashes, dashes, periods, etc. Anything that is not a numeric value. I would then create data sets based on the features in order and in any order (i.e. an indicator value of features seen, ignoring the positions).

Training data: Amazon's Mechanical Turk.

Or, you know what, just ignore all of that programming and statistical mumbo jumbo and send everything to Mechanical Turk. :)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜