Implementing complex algorithms on database stored information

2023-01-06 16:58

I'm trying to figure out the best practice for implementing a complex algorithm on stored information in a relational DB.

Specifically: I want to implement a variation of the k-means algorithm (a document clustering algorithm) on a large MS SQL Server database containing TFxIDF vectors of many documents (these vectors are used as input for the algorithm).

My first thought was doing the entire thing in SQL using stored procedures, functions, views and all the other basic SQL Server tools, but then I thought maybe I should write managed code (I'm fluent in C#) that will be executed on the SQL Server.

Performance is an issue here, so I need to take that into consideration as well.

I would appreciate any advice on the path I should take.

Thank you!

"Performance is an issue here"

It always is. When looking at this kind of code, there are two opposing trends that you have to consider:

Thanks to indexing, caching, and other optimization techniques, the database server is often best positioned to make these calculations quickly. You seem to understand this.

On the other hand:

These calculations seldom happen in isolation. You have to take the whole server's performance into account, and your database is typically the most loaded server in your data center. It's also the hardest to scale, from both a technical and a business perspective. Technical because you have to balance several different components, including disk, RAM, and CPU, and it's not always easy to know where your bottlenecks are. Also, these tend to be "big" machines that not many in your organization will have experience tuning. Finally, they often don't scale out very well: you can't add another database server to share the load as easily as you could add an application server. From a business standpoint, all that technical mumbo jumbo adds up to cost. More than that, the database license itself often costs several thousand per CPU.

Take these two points together, and the best course for performance is typically to use the querying capabilities in the database to pull down just the subset of records that you really need, and maybe do some of the easier pre-processing — the low-hanging fruit, if you will. Then finish the heavy lifting on an application server, in parallel if possible.
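To make that split concrete, here is a minimal sketch rather than a definitive implementation. It is written in Python for brevity (in your case the application side would more likely be C#), and it uses an in-memory SQLite database purely as a stand-in for SQL Server. The table name `tfidf`, its columns, the `min_weight` cutoff, and the helper names are all assumptions made up for illustration. The query does the cheap filtering, and a plain Lloyd-style k-means loop runs on the application side over sparse vectors.

```python
import sqlite3
from collections import defaultdict


def load_vectors(conn, min_weight):
    """Let the database do the easy part: return only weights above a cutoff.

    Assumes a hypothetical table tfidf(doc_id, term_id, weight).
    """
    rows = conn.execute(
        "SELECT doc_id, term_id, weight FROM tfidf WHERE weight >= ?",
        (min_weight,),
    )
    vectors = defaultdict(dict)
    for doc_id, term_id, weight in rows:
        vectors[doc_id][term_id] = weight
    return dict(vectors)


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in a.keys() | b.keys())


def mean(vecs):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vecs:
        for k, w in v.items():
            total[k] += w
    return {k: w / len(vecs) for k, w in total.items()}


def kmeans(vectors, initial_centroids, max_iter=20):
    """Plain Lloyd iterations, run on the application side."""
    centroids = [dict(c) for c in initial_centroids]
    assignment = {}
    for _ in range(max_iter):
        # Assignment step: each document is independent of the others,
        # so this is the part that could be spread across workers.
        new_assignment = {
            doc_id: min(range(len(centroids)),
                        key=lambda i: sq_dist(vec, centroids[i]))
            for doc_id, vec in vectors.items()
        }
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [vectors[d] for d, c in assignment.items() if c == i]
            if members:
                centroids[i] = mean(members)
    return assignment, centroids


# Toy data standing in for the real TFxIDF table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tfidf (doc_id INTEGER, term_id INTEGER, weight REAL)")
conn.executemany("INSERT INTO tfidf VALUES (?, ?, ?)", [
    (1, 10, 0.9), (1, 11, 0.8), (1, 99, 0.01),
    (2, 10, 0.8), (2, 11, 0.9),
    (3, 20, 0.9), (3, 21, 0.7),
    (4, 20, 0.8), (4, 21, 0.9), (4, 99, 0.02),
])
vectors = load_vectors(conn, min_weight=0.05)
assignment, centroids = kmeans(vectors, [vectors[1], vectors[3]])
```

Seeding the centroids with two of the documents keeps the toy run deterministic; a real variation of k-means would choose its own initialization. Whether a weight cutoff like this is acceptable depends on your data, so treat it only as an example of the kind of pre-processing the database can do cheaply.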
