开发者

Creating a character variation algorithm for a synonym table

I have a need to create a variation/synonym table for a client who needs to make sure if someone enters an incorrect variable, we c开发者_Python百科an return the correct part.

Example, if we have a part ID of GRX7-00C. When the client enters this into a part table, they would like to automatically create a variation table that will store variations that this product could be. Like GBX7-OOC (letter O instead of number 0). Or if they have the number 1, to be able to use L or I.

So if we have part GRL8-OOI we could have the following associated to it in the variation table:

  • GRI8-OOI
  • GRL8-0OI
  • GRL8-O0I
  • GRL8-OOI
  • etc....

I currently have a manual entry for this, but there could be a ton of variations of these parts. So, would anyone have a good idea at how I can create a automatic process for this?

How can I do this in C# and/or SQL?


I'm not a C# programmer, but for other .NET languages it would make more sense to me to create a list of CHARACTERS that are similar, and group those together, and use RegEx to evaluate if it matches.

i.e. for your example:

Original:

GRL8-001

Regex-ploded:

GR(l|L|1)(8|b|B)-(0|o|O)(0|o|O)(1|l|L)

You could accomplish this by having a table of interchangeable characters and running a replace function to sub the RegEx for the character automatically.


Lookex function psuedocode (works like soundex but for look alike instead of sound alike)

string input
for each char c
  if c in "O0Q" c = 'O'
  else if c in "IL1" c = 'I'
  etc.

compute a single Lookex code and store that with each product id. If user's entry doesn't match a product id, compute the Lookex code on their entry and search for all products having that code (there could be more than 1). This would consume minimal space, and be quite fast with a single index, and inexpensive to compute as well.


Given your input above, what I would do is not store a table of synonyms, but instead, have a set of rules checked against a master dictionary. So for example, if the user types in a value that is not found in the dictionary, change O to 0, and check for that existing in the dictionary. Change GR to GB and check for that. Etc. All the variations they want to allow described above can be explained as rules that you can apply one at a time or in combination and check if the resulting entry exists. That way you do not have to have a massive dictionary of synonyms to maintain and update.


I wouldn't go the synonym route at all.

I would cleanse all values in the database using a standard rule set.

For every value that exists, replace all '0's with 'O's, strip out dashes etc, so that for each real value you have only one modified value and store that in a seperate field\table.

Then I would cleanse the input the same way, and do a two-part match. Check the actual input string against the actual database values(this will get you exact matches), and secondly check the cleansed input against the cleansed values. Then order the output against the actual database values using a distance calc such as Levenshtein Distance to get the most likely match.

Now for the input: GRL8-OO1

With parts: GRL8-00I & GRL8-OOI

These would all normalize to the same value GRL8OOI, though the distance match would be closer for GRL8-OOI, so that would be your closest bet.

Granted this dramatically reduces the "uniqueness" of your part numbers, but the combo of the two-part match and the Levenshtein should get you what you are looking for.

There are several T-SQL implementations of Levenshtein available

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜