regular expression search engine [closed]

2023-02-01 19:19

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.

Closed 9 years ago.


Is there a search engine that would allow me to search by a regular expression?

Google Code Search allows you to search using a regular expression.

As far as I am aware, no such search engine exists for general searches.

There are a few problems with regular expressions that currently prohibit employing them in real-world scenarios. The most pressing is that the entire cached Internet would have to be matched against your regex, which would take significant computing resources. Indexes seem to be pretty much useless in a regex context, because a regex can be unbounded (/fo*bar/) and so cannot be looked up as a fixed token.
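To make the cost argument concrete, here is a minimal sketch. It assumes a toy three-document corpus and a plain word-level inverted index, both invented for illustration. An exact word can be answered from the index alone, while /fo*bar/ has no fixed token to look up, so every document has to be scanned.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus; a real engine would hold billions of pages.
docs = {
    1: "fbar and foobar",
    2: "nothing here",
    3: "a fooooobar sighting",
}

# Word-level inverted index: token -> ids of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)


def exact_lookup(word):
    """Answer an exact-word query from the index without reading any document."""
    return sorted(index.get(word, set()))


def regex_scan(pattern):
    """Answer a regex query by matching the pattern against every document."""
    rx = re.compile(pattern)
    return sorted(doc_id for doc_id, text in docs.items() if rx.search(text))


print(exact_lookup("foobar"))  # one dictionary lookup
print(regex_scan(r"fo*bar"))   # touches every document
```

The scan cost grows with the size of the corpus, whereas the exact lookup does not, which is the asymmetry the answer describes.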

I don't have a specific engine to suggest.

However, if you could live with a subset of regex syntax, a search engine could store additional tokens to efficiently match rather complex expressions. Solr/Lucene allows for custom tokenization, where the same word can generate multiple tokens and with various rule sets.

I'll use my name as an example: "Mark marks the spot."

Case insensitive with stemming: (mark, mark, spot)

Case sensitive with no stemming: (Mark, marks, spot)

Case sensitive with NLP thesaurus expansion: ( [Mark, Marc], [mark, indicate, to-point], [spot, position, location, beacon, coordinate] )

And now evolving towards your question, case insensitive, stemming, dedupe, autocomplete prefix matching: ( [m, ma, mar, mark], [s, sp, spo, spot] )

And if you wanted "substring" style matching it would be: ( [m, ma, mar, mark, a, ar, ark, r, rk, k], [s, sp, spo, spot, p, po, pot, o, ot, t] )

A single search index can contain all of these different forms of tokens, and you choose which ones to use for each type of search.
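The prefix and substring token lists above can be generated with a few lines of code. This is only a sketch in plain Python, not Solr/Lucene's actual tokenizer API; the function names `prefix_tokens` and `substring_tokens` are made up for illustration.

```python
def prefix_tokens(word):
    """Autocomplete-style tokens: every prefix of the lowercased word."""
    w = word.lower()
    return [w[:end] for end in range(1, len(w) + 1)]


def substring_tokens(word):
    """Substring-style tokens: every contiguous substring, deduplicated.

    dict.fromkeys drops repeats while keeping first-seen order.
    """
    w = word.lower()
    subs = (w[start:end]
            for start in range(len(w))
            for end in range(start + 1, len(w) + 1))
    return list(dict.fromkeys(subs))


print(prefix_tokens("Mark"))
# ['m', 'ma', 'mar', 'mark']
print(substring_tokens("Mark"))
# ['m', 'ma', 'mar', 'mark', 'a', 'ar', 'ark', 'r', 'rk', 'k']
```

A word of length n yields n prefixes but up to n(n+1)/2 substrings, which is why the substring form makes the index so much larger.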

Let's try the word "Mississippi" with regex-style literal tokens: [ m, m?, m+, i, i?, i+, s, ss, s+, ss+ ... ] etc.

The actual rules would depend on the regex subset, but hopefully the pattern is becoming clearer. You would extend even further to match other regex fragments, and then use a form of phrase searching to locate matches.

Of course the index would be quite large, BUT it might be worth it, depending on the project's requirements. And you'd also need a query parser and application logic.

I realize if you're looking for a canned engine this doesn't do it, but in terms of theory this is how I'd approach it (assuming it's really a requirement!). If all somebody wanted was substring matching and flexible wildcard matching, you could get away with far fewer tokens in the index.

In terms of canned apps, you might check out OpenGrok, used for source code indexing, which is not full regex, but understands source code pretty well.

If regex takes up too many resources, why not charge for its use by CPU time instead of making it completely unavailable? I'm sure some people would pay to use it (and of course you could offer an explanation for the charge, in terms of carbon footprint and CPU resources). Google does support an expansive * in its searches, for example *go, go* or intitle:"*go". Here it is: http://www.hackcollege.com/blog/2011/11/23/infographic-get-more-out-of-google.html

A very good article by Russ Cox on regex search over a trigram index:

http://swtch.com/~rsc/regexp/regexp4.html
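As a rough illustration of the trigram idea, here is a simplified sketch. It assumes the caller supplies a literal string that every match must contain (the `required_literal` parameter is invented here), whereas the article derives the trigram query from the regex itself. The index narrows the candidate set, and the real regex then runs only on those candidates.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(pattern, required_literal, docs, index):
    """Filter candidates by trigram, then confirm them with the regex.

    required_literal is an assumption of this sketch: a string the caller
    knows must appear in every match of pattern.
    """
    grams = trigrams(required_literal)
    if grams:
        # A matching document must contain every trigram of the literal.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Literal shorter than 3 characters: the index cannot help, scan all.
        candidates = set(docs)
    rx = re.compile(pattern)
    return sorted(d for d in candidates if rx.search(docs[d]))


# Hypothetical file contents, for illustration only.
docs = {
    "a.py": "def foobar(): pass",
    "b.py": "print('hello world')",
    "c.py": "foo = bar",
}
index = build_index(docs)
print(search(r"fo+bar\(", "bar(", docs, index))  # ['a.py']
```

Here only `a.py` contains both trigrams of `bar(`, so the regex is run against one file instead of three.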

http://www.google.com/codesearch has been shut down...

Regular expression search takes a lot of resources and thus is not affordable for popular search engines.

Globalogiq has an HTML Source Code Search where you can search with regular expressions. It's not free though.
