Speeding up date pattern matching

I am writing some simple code that tries to deduce whether or not a specific String is actually a Java date and, if yes, identify its format (pattern).

Because there are many possible date formats, establishing which one applies to a string requires trying the patterns one after another, which costs a lot of time and CPU, especially since the input string may not be a date at all.

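So, what I have ended up doing, for a String variable called input, is something like:

String datePattern = null; // stays null when input is not recognised as a date

if (isLikeDate(input))
{
    // Only run the expensive pattern search on plausible candidates.
    datePattern = matchAnyOfThePredefinedDatePatterns(input);
}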

where the isLike... method rejects obvious non-date strings and the match... method goes over about 40-50 predefined patterns, trying to construct a valid SimpleDateFormat object. The parse call throws an exception if the input string is not a valid date for the pattern being examined.

The exception handling slows things down dramatically, but there seems to be no avoiding it. The Apache Commons Date packages exhibit similar performance.

Is there any faster way of implementing this date pattern matching?


Depending on the complexity of the patterns, you might want to match each potential pattern with a regex (or hand-written code) before trying to parse it properly as a date. For example, if the pattern is "yyyyMMdd'T'HH:mm:ss" you could check for the length, the position of the T, the position of the colons, and that everything else is a digit before passing it on to the date parsing code.
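As a rough illustration of the hand-written check described above, here is a minimal sketch for that one pattern (in SimpleDateFormat syntax the T has to be quoted, so the pattern is yyyyMMdd'T'HH:mm:ss). The class and method names are made up for the example.

```java
final class IsoLikeShapeCheck {
    private IsoLikeShapeCheck() {}

    // Cheap structural test for yyyyMMdd'T'HH:mm:ss (always 17 characters).
    // It only rules out strings that cannot possibly match; range checks
    // (month 13, hour 99, ...) are left to the real date parser.
    static boolean looksLikeCompactIso(String s) {
        if (s == null || s.length() != 17) {
            return false;
        }
        for (int i = 0; i < 17; i++) {
            char c = s.charAt(i);
            if (i == 8) {                       // separator between date and time
                if (c != 'T') return false;
            } else if (i == 11 || i == 14) {    // colons between HH, mm and ss
                if (c != ':') return false;
            } else if (c < '0' || c > '9') {    // every other position is a digit
                return false;
            }
        }
        return true;
    }
}
```

Note that "20241399T99:99:99" passes this check on purpose: the filter is allowed to be too generous, it just must never reject a valid value.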

This level of pattern matching can be very liberal - it's only trying to rule out definite violations of the pattern. The important thing is that it doesn't reject any values which are actually valid.

The downside is that for any pattern which does match, you're doing work twice - but that may well still be easily balanced by significantly reducing the number of exceptions you throw.

EDIT: Just to clarify, you're currently testing whether it looks like it could match any of the patterns, and then testing all of them. I'm suggesting that you have a regex for each pattern, and only try parsing against patterns which have already matched the corresponding regex.
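To make the "one regex per pattern" idea concrete, something along these lines might work. It is only a sketch: the three patterns, their regexes and the names PrefilteredMatcher and matchPattern are assumptions for illustration, and the formatter is created inline purely to keep the example short.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TimeZone;
import java.util.regex.Pattern;

final class PrefilteredMatcher {
    // Hypothetical table: date pattern -> liberal regex describing its shape.
    private static final Map<String, Pattern> SHAPES = new LinkedHashMap<>();
    static {
        SHAPES.put("yyyyMMdd'T'HH:mm:ss", Pattern.compile("\\d{8}T\\d{2}:\\d{2}:\\d{2}"));
        SHAPES.put("yyyy-MM-dd", Pattern.compile("\\d{4}-\\d{1,2}-\\d{1,2}"));
        SHAPES.put("dd/MM/yyyy", Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{4}"));
    }

    private PrefilteredMatcher() {}

    // Returns the first pattern whose regex AND strict parse both accept
    // the input, or null if none does.
    static String matchPattern(String input) {
        for (Map.Entry<String, Pattern> e : SHAPES.entrySet()) {
            if (!e.getValue().matcher(input).matches()) {
                continue; // wrong shape: skip the parse, so no exception is thrown
            }
            SimpleDateFormat format = new SimpleDateFormat(e.getKey());
            format.setLenient(false);                       // reject e.g. 30 February
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            try {
                format.parse(input);
                return e.getKey();
            } catch (ParseException ex) {
                // Right shape but not a real date; try the next pattern.
            }
        }
        return null;
    }
}
```

With this layout an exception is only thrown for inputs that have the right shape but are not real dates (such as "2024-02-30"), which should be much rarer than arbitrary non-matching strings.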

I'd also suggest trying Joda Time - not only is it a generally better API, but its formatters (DateTimeFormatter) are immutable and thread-safe, so you can build them once and reuse them. Presumably you're currently creating new SimpleDateFormat objects each time you have something to parse.


"match... method goes over about 40-50 predefined patterns, trying to construct a valid SimpleDateFormat object."

Does this mean you are constructing new SimpleDateFormat objects in every call to match? That is quite expensive; don't do that.

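Keep the format objects you have already constructed and reuse them. If I remember right, SimpleDateFormat.parse() is not thread-safe, so some extra work will be required if several threads share them (synchronization, or one set of formats per thread).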
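One way to do that extra work is to keep a separate set of formatters per thread. The following is a sketch under that assumption; the pattern list and the names CachedFormats, formatsForCurrentThread and matchPattern are invented for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;

final class CachedFormats {
    // Hypothetical pattern list; the real one would hold the 40-50 patterns.
    // Longer patterns come first because parse(String) ignores trailing text.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd", "dd/MM/yyyy"
    };

    // Each thread lazily builds its own formatters once and then reuses them,
    // so no SimpleDateFormat instance is ever shared between threads.
    private static final ThreadLocal<List<SimpleDateFormat>> FORMATS =
        ThreadLocal.withInitial(() -> {
            List<SimpleDateFormat> list = new ArrayList<>();
            for (String pattern : PATTERNS) {
                SimpleDateFormat format = new SimpleDateFormat(pattern);
                format.setLenient(false);
                list.add(format);
            }
            return list;
        });

    private CachedFormats() {}

    static List<SimpleDateFormat> formatsForCurrentThread() {
        return FORMATS.get();
    }

    // Returns the first cached pattern that parses the input, or null.
    static String matchPattern(String input) {
        for (SimpleDateFormat format : formatsForCurrentThread()) {
            try {
                format.parse(input);
                return format.toPattern();
            } catch (ParseException e) {
                // Not this pattern; keep going.
            }
        }
        return null;
    }
}
```

This removes the construction cost but not the exceptions, so it is a complement to filtering out non-matching inputs up front, not a replacement for it.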

Of course, you want to try the formats with higher chances of succeeding first, but I don't know if you have that insight into data patterns.
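If you don't know the distribution in advance, the ordering could be learned at run time. Below is a minimal move-to-front sketch: whichever pattern matched last is tried first next time. The class name and the BiPredicate used to plug in the actual parse attempt are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

final class MoveToFrontMatcher {
    private final List<String> patterns;
    // (pattern, input) -> true if the input parses with that pattern.
    private final BiPredicate<String, String> parses;

    MoveToFrontMatcher(List<String> initialOrder, BiPredicate<String, String> parses) {
        this.patterns = new ArrayList<>(initialOrder);
        this.parses = parses;
    }

    // Tries patterns in the current order; a successful pattern is moved
    // to the front so that frequently seen formats are tested first.
    String match(String input) {
        for (int i = 0; i < patterns.size(); i++) {
            String pattern = patterns.get(i);
            if (parses.test(pattern, input)) {
                if (i > 0) {
                    patterns.remove(i);       // remove by index
                    patterns.add(0, pattern); // re-insert at the head
                }
                return pattern;
            }
        }
        return null;
    }

    List<String> currentOrder() {
        return new ArrayList<>(patterns);
    }
}
```

Two caveats: this class is not thread-safe as written, and reordering changes which pattern wins for ambiguous inputs (for example dd/MM/yyyy versus MM/dd/yyyy), so it only suits pattern sets where at most one pattern can match a given string.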


You might consider building a trie-like state machine, sort of like playing pachinko with the incoming string. This would fail relatively quickly on non-dates; it is basically a date grammar parser.

I'm not sure it would always be faster, or enough faster to be worth the effort.
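A very reduced version of that idea could look like the sketch below. It assumes every pattern letter stands for exactly one digit (so fixed-width numeric patterns only, no month names or time zones) and that quoted text contains no escaped quotes; ShapeTrie and its methods are hypothetical names. The trie is walked one input character at a time and gives up at the first character that no pattern allows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ShapeTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> patterns = new ArrayList<>(); // patterns ending here
    }

    private final Node root = new Node();

    // Adds a pattern. Unquoted letters become the digit class '0';
    // quoted text and punctuation are stored as literal characters.
    void add(String pattern) {
        Node node = root;
        boolean quoted = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\'') {
                quoted = !quoted;
                continue;
            }
            char key = (!quoted && Character.isLetter(c)) ? '0' : c;
            node = node.next.computeIfAbsent(key, k -> new Node());
        }
        node.patterns.add(pattern);
    }

    // Returns the patterns whose shape matches the whole input. The result
    // is empty as soon as one character has no transition in the trie.
    List<String> candidates(String input) {
        Node node = root;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            char key = (c >= '0' && c <= '9') ? '0' : c; // all digits share one edge
            node = node.next.get(key);
            if (node == null) {
                return Collections.emptyList();
            }
        }
        return Collections.unmodifiableList(node.patterns);
    }
}
```

The candidates still have to go through the real date parser, since the trie only checks shape; it cannot tell dd/MM/yyyy from MM/dd/yyyy or spot an impossible day of the month.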
