Find Similar Rows in Database

2023-01-18 08:52 问答作者：

I try to design my app to find database entries which are similar.

Let's for example take the table car (Everything in one table to keep the example simple):

CarID  |  Car Name  | Brand | Year | Top Speed | Performan开发者_高级运维ce | Displacement | Price
1         Z3          BMW     1990    250          5.4           123           23456
2         3er         BMW     2000    256          5.4           123           23000
3         Mustang     Ford    2000    190          9.8           120           23000

Now i want to do Queries like that:

"Search for Cars similar to Z3 (all brands)" (ignore "Car Name")

Similar in this context means that the row where the most columns are exactly the same is the most similar.

In this example it would be "3er BMW" since 2 columns(Performance and Displacement are the same)

Can you give me hints how to design database queries/application like that. The application gonna be really big with a lot of entries.

Also I would really appreciate useful links or books. (No problem for me to investigate further if i know where to search or what to read)

You could try to give each record a 'score' depending on its fields

You could weigh a column's score depending on how important the property is for the comparison (for instance top speed could be more important than brand)

You'll end up with a score for each record, and you will be able to find similar records by comparing scores and finding the records that are +/- 5% (for example) of the record you're looking at

The methods of finding relationships and similarities in data is called Data Mining, in your case you could already try clustering and classify your data in order to see what are the different groups that show up.

I think this book is a good start for an introduction to data mining. Hope this helps.

To solve your problem, you have to use a cluster algorithm. First, you need define a similarity metric, than you need to count the similarity between your input tuples (all Z3) and the rest of the database. You can speed up the process using algorithms, such as k-means. Please take a look on this question, there you will find a discussion on similar problem as yours - Finding groups of similar strings in a large set of strings.

This link is very helpful as well: http://matpalm.com/resemblance/.

Regarding the implementation if you have a lot of tuples (and more than several machines) you can use http://mahout.apache.org/. It is machine learning framework based on hadoop. You will need a lot of computation power, because cluster algorithms are complex.

Have a look at one of the existing search engines like Lucene. They implement a lot of things like that.

This paper might also be useful: Supporting developers with natural language queries

Not really an answer to your question, but you say you have lot of entries, you should consider normalizing your car table, move Brand to a separate table and "Car name"/model to a separate table. This will reduce the amount of data to compare during the lookups.

继续阅读：algorithm database sql web-applications

Find Similar Rows in Database

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？