String Pattern Matching for Limited, Ltd, Incorporated, Inc, Etc

We're doing a lot of work to reconcile about 1,000 duplicate manufacturer names and 1,000,000 duplicate part numbers. One thing that has come up is how to "match" things like "Limited" vs. "Ltd." vs. "Ltd".

The purpose is for the application to reconcile these matched items into a standard format. So:

ACME Ltd.
ACME Limited
ACME Ltd

Should all be reconciled into ACME Ltd.

This will also be used to prevent entering additional duplicates in the future.

Any suggestions on how to accomplish this pattern matching in SQL Server? Are there any known algorithms for finding items with mapped equivalencies?

Thanks!

Eric.


How about a table that lists what you want in one column and variations in the next?

Ltd   Limited 
Ltd   Ltd.
St    Street
St    Str.

Then, if you find a match on the second column, you change it to the first. It may take several iterations, as you find other alternatives.
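To make the lookup idea concrete, here is a minimal sketch in Python rather than T-SQL, just to show the logic. The `SUFFIX_MAP` dictionary and the `normalize_name` function are hypothetical names, and the "Inc." rows are my own additions for illustration. In SQL Server the dictionary would be the two-column table, and the preferred form here is "Ltd." because that is what the question asks for.

```python
# Hypothetical mapping of variation -> preferred form.
# In the database this would be the two-column lookup table.
SUFFIX_MAP = {
    "limited": "Ltd.",
    "ltd": "Ltd.",
    "ltd.": "Ltd.",
    "incorporated": "Inc.",  # assumed extra rows, not from the thread
    "inc": "Inc.",
    "inc.": "Inc.",
}


def normalize_name(name: str) -> str:
    """Replace every whitespace-separated token that is a known variation."""
    tokens = name.split()
    # Compare case-insensitively; tokens that are not in the map pass through.
    return " ".join(SUFFIX_MAP.get(token.lower(), token) for token in tokens)


if __name__ == "__main__":
    for raw in ("ACME Ltd.", "ACME Limited", "ACME Ltd"):
        print(raw, "->", normalize_name(raw))
```

Because the match is on whole tokens, a name like "Unlimited Parts" is left alone. New variations only need a new row in the map, which is the "several iterations" part.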


Using SQL Server Full Text Search you can use synonyms:

For each full-text language, SQL Server also provides a file in which you can optionally define language-specific synonyms to extend the scope of search queries (a thesaurus file).

In your case you could add a section like the following:

 <expansion>
         <sub>Limited</sub>
         <sub>Ltd</sub>
         <sub>Ltd.</sub>
 </expansion>

Here is a link that goes into more detail on how to modify the thesaurus file. This may work for what you are trying to do...

SQL Server also offers some limited pattern matching by using LIKE. I would recommend looking over the options it offers to see if they will be sufficient for your needs.

If LIKE is insufficient, you can always look at creating a CLR stored procedure or UDFs that will allow you to use regular expressions. This will allow you to match MUCH more complex patterns...
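As a rough idea of what such a regular expression could look like, here is a small sketch. It is written in Python only to illustrate the pattern; a CLR function would use the .NET regex classes instead, and the pattern itself is my assumption about what counts as a match (a trailing "Limited", "Ltd" or "Ltd.", in any letter case). `LTD_PATTERN` and `canonical_ltd` are hypothetical names.

```python
import re

# Assumed rule: "Limited", "Ltd" or "Ltd." as a whole word at the end of the name.
# The leading \b keeps words such as "Unlimited" from matching.
LTD_PATTERN = re.compile(r"\b(?:limited|ltd)\b\.?\s*$", re.IGNORECASE)


def canonical_ltd(name: str) -> str:
    """Rewrite a trailing Limited/Ltd/Ltd. to the standard form 'Ltd.'."""
    return LTD_PATTERN.sub("Ltd.", name.strip())


if __name__ == "__main__":
    for raw in ("ACME Ltd.", "ACME Limited", "ACME Ltd", "Limited Brands"):
        print(raw, "->", canonical_ltd(raw))
```

Anchoring at the end of the string means "Limited Brands" is not touched, which a plain find-and-replace on "Limited" would get wrong.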
