implementing complex algorithms on database stored information
I'm trying to figure out the best practice for implementing a complex algorithm on stored information in a relational DB.
Specifically: I want to implement a variation of the k-means algorithm (a document clustering algorithm) on a large MS SQL Server database containing TFxIDF vectors 开发者_高级运维of many documents (these vectors are used as input for the algorithm).
My first thought was doing the entire thing in SQL using stored procedures, functions, views and all the other basic SQL Server tools, but then I thought maybe I should write managed code (I'm fluent in C#) that will be executed on the SQL Server.
Performance is an issue here, so I need to take that in consideration also.
I would appreciate any advice on the path I should take.
Thank you!
Performance is an issue here
It always is. When looking at this kind of code, there are two opposing trends that you have to consider:
- Thanks to indexing, caching, and other optimization techniques, the database server is often best positioned to make these calculation quickly. You seem to understand this.
On the other hand:
- These calculations seldom happen in isolation. You have to take the whole server performance into account, and your database is typically the most loaded server in your data center. It's also the hardest to scale, both from a technical and business perspective. Technical because you have to balance several different components, including disk, RAM, and cpu, and it's not always easy to know where your bottlenecks are. Also, these tend to be "big" machines that not many in your organization will have experience tuning. Finally, they don't often scale out very well. You can't add another database server to share the load as easily as you could an application server. From a business standpoint, all that technical mumbo jumbo adds up to cost. More than that, the database license is itself often several thousands per cpu.
Take these two points together, and the best course for performance is typically to use the querying capabilities in the database to pull down just the subset of records that you really need, and maybe do some of the easier pre-processing — the low-hanging fruit, if you will. Then finish the heavy lifting on an application server, in parallel if possible.
精彩评论