Freely-available, well-debugged regular expressions
I was reading ICU documentation and came across this fine advice:
For common tasks like this there are libraries of freely available regular expressions that have been well debugged. It's worth making a quick search before writing a new expression.
To which libraries of well-debugged regular expressions do you commonly refer?
I'm not much taken with http://regexlib.com where the expressions don't seem all that well debugged. It appears to have no QA process besides user comments and ratings.
The problem with regular expression libraries, even those that are well-tested, is that they haven't been tested on your data or for your purposes. A regex that worked fine on somebody else's data for their purposes may not work at all for you.
The screen shot at http://www.regexbuddy.com/library.html indeed shows a regex that matches invalid dates such as February 30th. The comment with the regular expression explains this. The comment is not fully visible in the screen shot though.
This is a perfect example of why you have to be careful with regex libraries and copy-and-paste programming in general. The regex \d\d/\d\d/\d\d\d\d may be perfectly acceptable for extracting dates from a file if you know that the file never contains something like 99/99/9999. If a file only contains valid dates and other data that doesn't look like dates at all, then the simple regex is perfectly adequate for extracting the dates. And even if the data can contain invalid dates, you may choose to let the regex match them and filter the invalid dates out in the procedural code that processes the regex matches.
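As a rough sketch of that last approach, the Python below lets the simple regex over-match and then discards impossible dates in ordinary code. The function name `extract_valid_dates` and the month/day/year ordering are assumptions made for illustration; the original regex does not say which field comes first.

```python
import re
from datetime import datetime

# The simple pattern from the answer: it matches anything shaped like NN/NN/NNNN,
# including impossible dates such as 02/30/2024 or 99/99/9999.
DATE_RE = re.compile(r"\d\d/\d\d/\d\d\d\d")


def extract_valid_dates(text, fmt="%m/%d/%Y"):
    """Return real calendar dates found in text (hypothetical helper).

    Assumes month/day/year ordering; pass a different fmt if your data differs.
    """
    valid = []
    for candidate in DATE_RE.findall(text):
        try:
            # strptime raises ValueError for dates that do not exist.
            valid.append(datetime.strptime(candidate, fmt).date())
        except ValueError:
            continue  # shaped like a date, but not a real one
    return valid
```

Here the regex only finds candidates; the date library decides which ones are real, so leap years never have to be encoded in the pattern.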
As for email addresses, the only way to determine whether one is valid is to send an email to it and get a response. Even the lack of a bounce message doesn't mean that the email was saved in somebody's mailbox or that it will be read by anyone. A regex can be useful to filter out things that are obviously not email addresses so you can skip the much more expensive step of sending a verification email. A regex can also be useful to extract email addresses from documents or archives. But it can't tell you whether invalid@regexbuddy.com is a valid email address or not. It looks like it is, but it isn't. Email sent to this address is saved to /dev/null.
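A minimal sketch of that kind of cheap pre-filter, in Python. The pattern and the name `looks_like_email` are illustrative assumptions, not a vetted expression from any library. It is deliberately loose: it only rejects strings that have no "@" with something on each side, and it says nothing about whether mail can be delivered.

```python
import re

# Deliberately permissive: at least one character, an "@", at least one character.
# It errs on the side of accepting invalid addresses over rejecting valid ones.
LOOSE_EMAIL_RE = re.compile(r".+@.+", re.DOTALL)


def looks_like_email(candidate):
    """Cheap screen to run before the expensive verification email (hypothetical helper)."""
    return LOOSE_EMAIL_RE.fullmatch(candidate) is not None
```

With this check, invalid@regexbuddy.com passes, just as the answer says it would: the string looks like an address, and only an actual verification email can show that nobody reads it.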
I can't say enough good things about RegexBuddy. It comes with a large built-in library: http://www.regexbuddy.com/library.html
It's not free, but if you're on a Windows box it's well worth the investment.
The interactive mode lets you debug your own expressions in real time, and it supports many engines (.NET, Perl, etc.). So it'd let you find that particular leap year bug pretty quickly :).
I disagree with Mark.
He is technically right, but whether using a regex is an acceptable risk depends on the exact context you're using it in.
Don't let the "good enough" solution be killed because you're trying for perfection.
If you take the time to learn regular expressions you won't need a library of expressions. I remember consciously deciding to learn regular expressions (years ago -- measured in decades, sigh) and it has paid off countless times since.
Regular expressions aren't hard. They are just a little mini programming language. If you can write code you can learn regular expressions. One solid day of study should be plenty of time for anyone with a knack for programming.
Then, once you know them you can make an educated decision as to when they are an appropriate solution. Otherwise you're just throwing ideas against a wall in the hopes that one of them sticks. Plus, writing a regular expression from scratch will likely always be quicker and easier than trying to look up a pattern in a library and deciding whether it's good or not.
No - do not use regular expressions to parse emails, even if they have been "well debugged". Chances are they still don't work. Definitely use a library that is designed to parse emails, but stay away from libraries of regular expressions. I've seen one regular expression for emails that claimed to exactly follow the standards and it was several pages long and came with a warning that before applying it you had to first strip comments from the email (which would require a second regular expression).
If you insist on using a regular expression to parse emails then please make it accept invalid addresses rather than rejecting valid addresses.
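As one example of leaning on a purpose-built parser instead of a regex, here is a small sketch using `email.utils.parseaddr` from Python's standard library. Using Python at all is an assumption here, since the answer names no particular library, and the wrapper `split_address` is hypothetical. Note that `parseaddr` only splits a header-style address into a display name and an address; it does not prove the address exists.

```python
from email.utils import parseaddr


def split_address(raw):
    """Split 'Display Name <user@host>' into (name, address) (hypothetical wrapper).

    Delegates to the standard library parser instead of a hand-written regex.
    """
    return parseaddr(raw)


name, address = split_address("Jane Doe <jane@example.com>")
print(name)     # Jane Doe
print(address)  # jane@example.com
```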