regular expression search engine [closed]
Closed 9 years ago.
Is there a search engine that would allow me to search by a regular expression?
Google Code Search allows you to search using a regular expression.
As far as I am aware no such search engine exists for general searches.
There are a few problems with regular expressions that currently prohibit employing them in real-world scenarios. The most pressing is that the entire cached Internet would have to be matched against your regex, which would take significant computing resources; indexes seem to be pretty much useless in a regex context, because regexes can be unbounded (/fo*bar/).
I don't have a specific engine to suggest.
However, if you could live with a subset of regex syntax, a search engine could store additional tokens to efficiently match rather complex expressions. Solr/Lucene allows for custom tokenization, where the same word can generate multiple tokens under various rule sets.
I'll use my name as an example: "Mark marks the spot."
Case insensitive with stemming: (mark, mark, spot)
Case sensitive with no stemming: (Mark, marks, spot)
Case sensitive with NLP thesaurus expansion: ( [Mark, Marc], [mark, indicate, to-point], [spot, position, location, beacon, coordinate] )
And now evolving towards your question, case insensitive, stemming, dedupe, autocomplete prefix matching: ( [m, ma, mar, mark], [s, sp, spo, spot] )
And if you wanted "substring" style matching it would be: ( [m, ma, mar, mark, a, ar, ark, r, rk, k], [s, sp, spo, spot, p, po, pot, o, ot, t] )
A single search index can contain all of these different forms of tokens, and the application chooses which ones to use for each type of search.
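To make the prefix and substring token lists above concrete, here is a minimal Python sketch. It is not Solr/Lucene code; the function names, the crude "strip a trailing s" stemmer, and the one-word stopword list are all assumptions made up for this illustration.

```python
import re


def prefix_tokens(word):
    """All non-empty prefixes of a word, shortest first (autocomplete style)."""
    return [word[:i] for i in range(1, len(word) + 1)]


def substring_tokens(word):
    """All non-empty substrings, grouped by start position, without repeats."""
    seen = set()
    tokens = []
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece not in seen:
                seen.add(piece)
                tokens.append(piece)
    return tokens


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: drop one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def index_tokens(text, expand, stopwords=frozenset({"the"})):
    """Lowercase, stem, dedupe, then expand each remaining word into tokens."""
    words = []
    for raw in re.findall(r"[a-z]+", text.lower()):
        stem = crude_stem(raw)
        if stem not in stopwords and stem not in words:
            words.append(stem)
    return [expand(word) for word in words]


if __name__ == "__main__":
    print(index_tokens("Mark marks the spot.", prefix_tokens))
    print(index_tokens("Mark marks the spot.", substring_tokens))
```

A word of length n produces n prefixes but up to n(n+1)/2 substrings, which is why the substring-style index grows so much faster.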
Let's try the word "Mississippi" in a regex style with literal tokens: [ m, m?, m+, i, i?, i+, s, ss, s+, ss+ ... ] etc.
The actual rules would depend on the regex subset, but hopefully the pattern is becoming clearer. You would extend even further to match other regex fragments, and then use a form of phrase searching to locate matches.
Of course the index would be quite large, BUT it might be worth it, depending on the project's requirements. And you'd also need a query parser and application logic.
I realize if you're looking for a canned engine this doesn't do it, but in terms of theory this is how I'd approach it (assuming it's really a requirement!). If all somebody wanted was substring matching and flexible wildcard matching, you could get away with far fewer tokens in the index.
In terms of canned apps, you might check out OpenGrok, used for source code indexing, which is not full regex, but understands source code pretty well.
If regex takes up too many resources, why not charge for its use by CPU time instead of making it completely unavailable? I'm sure some people would pay to use it (and of course the charge should come with an explanation, in terms of carbon footprint and CPU resources). Google does support an expansive * wildcard in its searches:
*go
or go*
or intitle:"*go"
here it is: http://www.hackcollege.com/blog/2011/11/23/infographic-get-more-out-of-google.html
A very good article by Russ Cox on regex search using a trigram index:
http://swtch.com/~rsc/regexp/regexp4.html
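The general idea of a trigram index can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the method from the article: the article derives the trigram query from the structure of the regex itself, while here the caller passes in by hand a literal string that every match must contain. All function names and sample documents are made up for the example.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Set of all 3-character substrings of text (empty if text is shorter)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def candidate_ids(required_literal, docs, index):
    """Documents containing every trigram of the literal.

    Assumption: the caller supplies required_literal by hand. A literal
    shorter than 3 characters has no trigrams, so every document stays
    a candidate and the search degrades to a full scan.
    """
    candidates = set(docs)
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    return candidates


def regex_search(pattern, required_literal, docs, index):
    """Use the index to narrow the candidates, then confirm with the real regex."""
    regex = re.compile(pattern)
    hits = candidate_ids(required_literal, docs, index)
    return sorted(doc_id for doc_id in hits if regex.search(docs[doc_id]))


if __name__ == "__main__":
    docs = {
        1: "Google Code Search",
        2: "fooobar baz",
        3: "a foobar here",
        4: "rebar stock",
    }
    index = build_index(docs)
    # /fo*bar/ is unbounded, but every match must still contain "bar".
    print(regex_search(r"fo*bar", "bar", docs, index))
```

The index only filters: document 4 contains "bar" and survives the trigram lookup, and it is the final regex pass that rejects it. The expensive regex match therefore runs over a handful of candidates instead of the whole corpus.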
http://www.google.com/codesearch has been shut down...
Regular expression search takes a lot of resources and thus is not affordable for popular search engines.
Globalogiq has an HTML Source Code Search where you can search with regular expressions. It's not free though.