AppEngine Approximate Partial String Matching Algorithm

2023-04-13 09:19 问答作者：

So, I realize that this covers a wide array of topics and pieces of them have been covered before on StackOverflow, such as this question. Similarly, Partial String Matching and Approximate String Matching are popular algorithmic discussions, it seems. However, using these ideas in conjunction to suit a problems where both need to be discussed seems highly inefficient. I'm looking for a way to combine the two problems in to one solution, efficiently.

Right now, I'm using AppEngine with Java and the Persistent DataStore. This is somewhat annoying, since it doesn't seem to have any arithmetic usage in the queries to make things easier, so I'm currently considering doing some precalculation and storing it as an extra field in the database. Essentially, this is the idea that a friend and I were having on how to possibly implement a system for matching and I was more or less hoping for suggestions on how to make it more efficient. If it needs to be scrapped in favor of something better that already exists, I can handle that, as well.

Let's start off with a basic example of what I'd look to do a search for. Consider the following nonsense sentence:

The isolating layer rackets the principal beneath your hypocritical rubbish.

If a user does a search for

isalatig pri

I would think that this would be a fairly good starting match for the string, and the value should be returned. The current method that we are considering using basically assigns a value to test divisibility. Essentially, there is a table with the following data

A: 2        B: 3        C: 5
D: 7        E: 11       F: 13
...

with each character being mapped to a prime number (multiple characters don't make a difference, only one character is needed). And if the query string divides the string in the database, then the value is returned as a possible match.

After this, keywords that aren't listed as stopwords are compared from the search string to see if they are starting substrings of words in the possible match under a given threshold of an edit distance (currently using the Levenshtein distance).

distance("isalatig", "isolating") == 2
distance("pri", "principal开发者_StackOverflow中文版") == 0 // since principal has a starting 
                                  // substring of pri it passes

The total distance for each query is then ranked in ascending order and the top n values are then returned back to the person doing the querying.

This is the basic idea behind the algorithm, though since this is my first time dealing with such a scenario, I realize that I'm probably missing something very important (or my entire idea may be wrong). What is the best way to handle the current situation that I'm trying to implement. Similarly, if there are any utilities that AppEngine currently offers to combat what I'm trying to do, please let me know.

First off, a clarification: App Engine doesn't allow arithmetic in queries because there's no efficient way to query on the result of an arbitrary arithmetic expression. When you do this in an SQL database, the planner is forced to select an inefficient query plan, which usually involves scanning all the candidate records one by one.

Your scheme will not work for the same reason: There's no way to index an integer such that you can efficiently query for all numbers that are divisible by your target number. Other potential issues include words that translate into numbers that are too large to store in a fixed length integer, and being unable to distinguish between 'rental', 'learnt' and 'antler'.

If we discard for the moment your requirement for matching arbitrary prefixes of strings, what you are searching for is full-text indexing, which is typically implemented using an inverted index and stemming. Support for fulltext search is on the App Engine roadmap but hasn't been released yet; in the meantime your best option appears to be SearchableModel, or using an external search engine such as Google Site Search.

继续阅读：algorithm google-app-engine string-search

AppEngine Approximate Partial String Matching Algorithm

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？