Fuzzy Regular Expressions
In my work I have used approximate string matching algorithms such as Damerau–Levenshtein distance with great results, to make my code less vulnerable to spelling mistakes.
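For readers who have not seen it, the Damerau–Levenshtein distance mentioned above can be sketched like this (the optimal-string-alignment variant with unit costs; a minimal illustration, not production code):

```python
def damerau_levenshtein(a: str, b: str) -> int:
    # Full DP table: d[i][j] = distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            # The extra case over plain Levenshtein: adjacent transposition.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

The only difference from plain Levenshtein is the transposition case, which treats a swap of two adjacent characters as a single edit.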
Now I need to match strings against simple regular expressions such as TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0, while T Schedule for 10. Jan should return 2.
This could be done by generating every string the regex can produce (100×12 in this case) and finding the best match, but that doesn't seem practical.
Do you have any ideas how to do this effectively?
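To make the brute-force idea above concrete, here is a minimal sketch (simplified to three months; the helper names are illustrative):

```python
from itertools import product

def levenshtein(a: str, b: str) -> int:
    # Classic row-by-row DP edit distance (insert/delete/substitute, unit costs).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Enumerate every string the toy pattern can produce: 100 day numbers x 3 months.
MONTHS = ["Jan", "Feb", "Mar"]
CANDIDATES = [f"TV Schedule for {d:02d} {m}"
              for d, m in product(range(100), MONTHS)]

def best_distance(s: str) -> int:
    # Distance from s to the nearest string the pattern can generate.
    return min(levenshtein(s, c) for c in CANDIDATES)
```

This works for toy patterns, but the candidate set grows multiplicatively with each alternation, which is exactly why it stops being practical.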
I found the TRE library, which seems to do exactly this: fuzzy matching of regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/ It only supports insertion, deletion and substitution, though, no transposition. But I guess that works OK.
I tried the accompanying agrep tool with the regexp on the following file:
TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March
and got
$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)$' filename
1:TV Schedule for 10Jan
8:TVSchedule for Jan 10
7:T Schedule for 10 Jan 2010
3:TV Schedule for 10 March
15:Tv plan for March
Thanks a lot for all your suggestions.
Also see: the Python regex module (newer version, Oct '14); search for "fuzzy" inside the documentation.
If you're not a Python guy (neither am I), you could compile your code to C (exe/dll). Then you would be able to use your dll even from good old VB6 (and the like).
Other libraries to choose from:
- TRE/agrep (classic, good, old and fast; search for 'agrep performance'), but you need to write POSIX-compatible regexes (search for 'regular expressions info posix'). Of course, all libraries/examples using TRE share this limitation (search for 'hackerboss approximate regex matching in python'). For massive data: search for 'A fast CUDA implementation of agrep algorithm'.
- FREJ (Java) - some (more) limitations (for instance, no lookahead/lookbehind)
- fuzzy-wuzzy (Python-based) - worth looking at, not tested...
Search also for:
- 'Comparison_of_regular_expression_engines'
- 'regular-expressions.info tools'
(sorry for not being able to post real links)
I just use the regex module: 'Alternative regular expression module, to replace re.' It provides the familiarity of re but includes options for fuzzy matching, along with several other improvements on re.
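As a quick illustration of the fuzzy syntax on the question's example (the error budget {e<=3} and the BESTMATCH flag are choices I made for this sketch; regex is a third-party package, pip install regex):

```python
import regex  # third-party: pip install regex

# {e<=3} allows up to 3 errors (insertions, deletions, substitutions)
# over the enclosed group; BESTMATCH asks for the match with the fewest
# errors rather than the first acceptable one.
pattern = r'(?:TV Schedule for \d\d (?:Jan|Feb|Mar)){e<=3}'

exact = regex.fullmatch(pattern, 'TV Schedule for 10 Jan', flags=regex.BESTMATCH)
fuzzy = regex.fullmatch(pattern, 'T Schedule for 10. Jan', flags=regex.BESTMATCH)

# fuzzy_counts is a (substitutions, insertions, deletions) tuple.
print(sum(exact.fuzzy_counts))
print(sum(fuzzy.fuzzy_counts))
```

Summing fuzzy_counts gives the edit cost the question asks for: 0 for the exact string, 2 for the one missing a 'V' and carrying an extra '.'.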
For Windows binaries, see this resource.
Here is a resource on the question you are asking; it is a bit of a teaser for a company. More useful might be this paper. I've seen an implementation inspired by the paper that could do a fuzzy search, tuned for a particular language (e.g. Arabic vs. English), on a large dataset.
In general, you won't be able to do what you asked about. You can make a regexp search fuzzy by replacing characters with equivalence classes, or you can search a database for near-matches defined by Levenshtein distance. Trying to expand the (n)DFA behind a regexp to include near-matches by distance would rapidly become impossibly complex.
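The equivalence-class idea mentioned above can be sketched like this (the confusion classes below are made-up examples, not a standard table):

```python
import re

# Hypothetical equivalence classes: characters that are easily confused
# (e.g. by OCR or typists). Each member maps to the same character class.
CLASSES = {
    "0": "[0O]", "O": "[0O]",
    "1": "[1lI]", "l": "[1lI]", "I": "[1lI]",
}

def fuzzify(literal: str) -> str:
    # Replace each character with its class, escaping everything else.
    return "".join(CLASSES.get(ch, re.escape(ch)) for ch in literal)

pattern = re.compile(fuzzify("TV Schedule for 10 Jan"))
```

This tolerates the listed confusions at zero cost but, unlike an edit-distance approach, it cannot absorb arbitrary insertions or deletions.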
Have you considered using a lexer?
I've never actually used one so I can't be much help, but it sounds like it fits!
I started to implement a Java tool called prex for approximate regular expression matching. The tool determines how far a string s is from matching a regular expression r, i.e. how many insertions, deletions and substitutions on s are at least required (minimum cost) such that the resulting string s' is accepted by r. If you are interested, you can check out the code at https://github.com/julianthome/prex. I would be very happy to get some feedback. Note that the approach is still a bit slow, but I am currently incorporating some heuristics to improve its performance.