Creating a character variation algorithm for a synonym table

2023-03-20 11:15 问答作者：

I have a need to create a variation/synonym table for a client who needs to make sure if someone enters an incorrect variable, we c开发者_Python百科an return the correct part.

Example, if we have a part ID of GRX7-00C. When the client enters this into a part table, they would like to automatically create a variation table that will store variations that this product could be. Like GBX7-OOC (letter O instead of number 0). Or if they have the number 1, to be able to use L or I.

So if we have part GRL8-OOI we could have the following associated to it in the variation table:

GRI8-OOI
GRL8-0OI
GRL8-O0I
GRL8-OOI
etc....

I currently have a manual entry for this, but there could be a ton of variations of these parts. So, would anyone have a good idea at how I can create a automatic process for this?

How can I do this in C# and/or SQL?

I'm not a C# programmer, but for other .NET languages it would make more sense to me to create a list of CHARACTERS that are similar, and group those together, and use RegEx to evaluate if it matches.

i.e. for your example:

Original:

GRL8-001

Regex-ploded:

GR(l|L|1)(8|b|B)-(0|o|O)(0|o|O)(1|l|L)

You could accomplish this by having a table of interchangeable characters and running a replace function to sub the RegEx for the character automatically.

Lookex function psuedocode (works like soundex but for look alike instead of sound alike)

string input
for each char c
  if c in "O0Q" c = 'O'
  else if c in "IL1" c = 'I'
  etc.

compute a single Lookex code and store that with each product id. If user's entry doesn't match a product id, compute the Lookex code on their entry and search for all products having that code (there could be more than 1). This would consume minimal space, and be quite fast with a single index, and inexpensive to compute as well.

Given your input above, what I would do is not store a table of synonyms, but instead, have a set of rules checked against a master dictionary. So for example, if the user types in a value that is not found in the dictionary, change O to 0, and check for that existing in the dictionary. Change GR to GB and check for that. Etc. All the variations they want to allow described above can be explained as rules that you can apply one at a time or in combination and check if the resulting entry exists. That way you do not have to have a massive dictionary of synonyms to maintain and update.

I wouldn't go the synonym route at all.

I would cleanse all values in the database using a standard rule set.

For every value that exists, replace all '0's with 'O's, strip out dashes etc, so that for each real value you have only one modified value and store that in a seperate field\table.

Then I would cleanse the input the same way, and do a two-part match. Check the actual input string against the actual database values(this will get you exact matches), and secondly check the cleansed input against the cleansed values. Then order the output against the actual database values using a distance calc such as Levenshtein Distance to get the most likely match.

Now for the input: GRL8-OO1

With parts: GRL8-00I & GRL8-OOI

These would all normalize to the same value GRL8OOI, though the distance match would be closer for GRL8-OOI, so that would be your closest bet.

Granted this dramatically reduces the "uniqueness" of your part numbers, but the combo of the two-part match and the Levenshtein should get you what you are looking for.

There are several T-SQL implementations of Levenshtein available

继续阅读：algorithm sql

Creating a character variation algorithm for a synonym table

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？