Best architecture for a 30+ hour query
I have an interesting problem to solve. One of my clients has me developing a stock analysis program with close to 50 years of stock data for almost a thousand symbols. I've developed a series of filters that are applied on any given day to see if anything falls out for a trade.
We want to run this filter for each day of data we have for each stock. Basically your begin-and-end-date type of report. However, it takes 6 minutes to filter each week for each symbol, so we are figuring about 40 hours or so to run the report on our entire data set.
The overriding requirement is that my client is able to do anything in the application from any computer anywhere (he travels a lot), so we are browser-based.
To solve this issue, I wrote an asynchronous method that runs this report, but the application pool inactivity timer kills the job. I don't want to have to start adjusting timeouts for the entire application to support this one report (we are going to do a lot of these, as every stock scenario will need to be run against our entire dataset for analysis before it gets used for active trading).
Does anyone have any general ideas or experiences with a web architecture that will support ultra-long asynchronous processes?
Thanks
As a general suggestion, I would recommend a standalone Windows Service, console app, or similar with very careful lifetime controls and logging, which would run constantly and check (poll) for 'jobs to process' in a database, then update the database with results and progress information.
It may not be the best way, but I've used it many times before and it's reliable, scalable, and performs well.
Best to keep web requests to a minute or two maximum; they were never designed for heavy processing times. This way you can 'check in' on the job status every minute or so (using a web service).
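For illustration, here is a minimal sketch of that polling pattern (Python and SQLite stand in for whatever service and database you actually use; the jobs table, its columns, and run_filter are hypothetical names, not anything from the question):

```python
# Minimal sketch of a standalone worker that polls a jobs table, runs the
# report, and writes progress/results back. Table and column names are
# hypothetical; swap sqlite3 for your real database driver and replace
# run_filter() with the actual stock-filter logic.
import sqlite3
import time

POLL_SECONDS = 30

def run_filter(symbol, start_date, end_date):
    """Placeholder for the real stock-filter computation."""
    time.sleep(1)  # simulate work
    return f"results for {symbol} {start_date}..{end_date}"

def main():
    db = sqlite3.connect("jobs.db")
    db.execute("""CREATE TABLE IF NOT EXISTS jobs (
                    id INTEGER PRIMARY KEY,
                    symbol TEXT, start_date TEXT, end_date TEXT,
                    status TEXT DEFAULT 'queued',
                    progress INTEGER DEFAULT 0,
                    result TEXT)""")
    db.commit()
    while True:
        row = db.execute(
            "SELECT id, symbol, start_date, end_date FROM jobs "
            "WHERE status = 'queued' ORDER BY id LIMIT 1").fetchone()
        if row is None:
            time.sleep(POLL_SECONDS)  # nothing queued; poll again later
            continue
        job_id, symbol, start, end = row
        db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
        db.commit()
        try:
            result = run_filter(symbol, start, end)
            db.execute("UPDATE jobs SET status = 'done', progress = 100, "
                       "result = ? WHERE id = ?", (result, job_id))
        except Exception as exc:
            db.execute("UPDATE jobs SET status = 'failed', result = ? "
                       "WHERE id = ?", (str(exc), job_id))
        db.commit()

if __name__ == "__main__":
    main()
```

The web tier then only inserts a row to queue a report and periodically reads back status and progress, so no HTTP request ever has to outlive the app pool timeouts.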
If you have any questions about the idea, please post a comment and I will be happy to help or elaborate.
Hope that helps!
(Additional: I believe Windows Services are underused! All it takes is a quick base class or collection of reusable helper methods and you've got a logged, reliable, automatic, configurable, quick-to-implement process running under your control. Quick to prototype with too!)
Is there any reason not to simply run a service in the background and archive individual result sets to a read-only results table as they are requested? Do you need to run the query in real time? The app could retrieve pages of results as they get generated by the service.
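As a rough sketch of the read-only side (the archived_results table and its columns are assumptions, not the poster's schema), the app would just page through whatever the background service has archived so far:

```python
# Hypothetical sketch: the browser asks for page N of the results a
# background service has already written to a read-only results table.
import sqlite3

PAGE_SIZE = 100

def fetch_results_page(db_path, job_id, page):
    """Return one page of archived results for a given job."""
    db = sqlite3.connect(db_path)
    rows = db.execute(
        "SELECT symbol, trade_date, signal FROM archived_results "
        "WHERE job_id = ? ORDER BY trade_date LIMIT ? OFFSET ?",
        (job_id, PAGE_SIZE, page * PAGE_SIZE)).fetchall()
    db.close()
    return rows
```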
It sounds like you are doing SQL queries directly against these data. Have you considered loading the data to e.g. SQL Server Analysis Services and setting up a cube with (for starters) time, stock and symbol dimensions? Depending on the nature of your queries, you may get into quite reasonable response times. Relational databases are good for online transaction processing (within certain load and response time parameters), but analytical work sometimes requires the methods and technologies of data warehouses instead. (Or, perhaps, associative databases... there are alternatives.)
However, considering Murphy, you'll probably have some long-running queries. Do the data vary for different end users? If not, why not precompute the answers? Nothing HTTP-based should take more than a minute to process, if even that -- at least not by design!
Depending on the specifics of your filter, this sounds like a task that could benefit from parallelization: split the query across multiple compute nodes that run the filter on a subset (shard) of the data. If your filter is focused on analyzing one stock across a lot of time data, you could split the work on the stock symbol and have multiple compute nodes processing different stock symbols concurrently. If you need to study relationships between stock symbols over time, it might make more sense to split the work by time intervals and coalesce the results after the operation (mapreduce). This is a case where throwing more hardware at the problem can actually improve response time considerably. Consider the Google search engine as one example.
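As a hedged illustration of the split-by-symbol idea (local cores alone, before reaching for a cluster), something along these lines would work; run_filter_for_symbol and the symbol list are placeholders, not the poster's actual filter:

```python
# Sketch of sharding the work by stock symbol across local cores.
# The same split applies if the shards are farmed out to separate machines.
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_filter_for_symbol(symbol):
    """Placeholder: apply the day-by-day filter to 50 years of one symbol."""
    return symbol, []  # e.g. list of (date, signal) hits

def run_all(symbols, max_workers=8):
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_filter_for_symbol, s): s for s in symbols}
        for fut in as_completed(futures):
            symbol, hits = fut.result()
            results[symbol] = hits  # coalesce the per-symbol results
    return results

if __name__ == "__main__":
    print(run_all(["AAPL", "MSFT", "IBM"]))
```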
The usual caveats apply: review your current filter implementation for performance bottlenecks first. Make sure the tables you're hitting are appropriately indexed, etc. Precalculate relationships and digests of frequently needed computations. Storage is cheap if it will save time.
Your web request could kick off a scatter/gather query operation distributing the query to available compute nodes in the cloud (Windows Azure, Google Apps, Amazon). Given sufficient compute nodes and appropriate distribution of work, you can probably get the response back in near real time.
Generally, ultra-long asynchronous processes don't belong on the web.
The client's request should be queued, and another process should run the job and store the data in the format the user will consume it in.
Six minutes to filter a week of data? Seems like your db needs proper index fine-tuning.
I recommend you read this article about Asynchronous Procedure Execution. If your logic is database-bound (which it most certainly is), then this gives a completely reliable way to launch computation tasks in an asynchronous fashion that is resilient to failover. Given that your load is highly parallelizable, you can launch several tasks, e.g. one for each ticker; see the follow-up article Passing Parameters to a Background Procedure.
As a side note, this technique of leveraging the built-in SQL Server asynchronous activation is used by at least two major financial corporations I know of, for exactly the same scenario as yours.
I don't know why every answer here wants all the work to be done in the database; the job control should be outside. It's like putting the ignition key for your car on the outside. Actually, I do know why: you tagged this question with wcf.
And the suggestion to use "compute nodes" just completed my buzzword bingo, thanks dthorpe! You don't need "compute nodes", just cores. Most RDBMSs have parallel execution (PX) built right in. Why pay for cloud computing for something you'll run every day? Just buy a server with enough CPUs and you'll be fine. No need for "scatter-gather" queries, just turn on PX.
Pontus is pointing you in the right direction. Being satisfied with 6-minute performance and then worrying about how to schedule it is your problem. There are lots of strategies for managing your data into formats that promote speed: indexes, partitioning, cubes, IOTs. You may be doing two-pass sorts instead of in-memory sorts. Your statistics could be out of date, causing a bad plan.
From the tenor of this question, I'm assuming you've not done a whole lot of db tuning. You really should post a database tuning question and tell us which RDBMS you're using and how far you've already tuned.
Mike,
There are many ways to answer this question, but the more important question you should be asking is: why does it take 6 minutes to filter for one stock?
Yes, I know you have 50 years of data and lots of stocks, BUT it should not take 6 minutes. So, more importantly, I would be looking at that particular table structure, the indexes on it, the query, and what it is doing.
I used to work for a similar company, with tables that were almost 100 GB each. Yes, that's the size of a single table, not the entire db. After some fine-tuning, we got queries that used to take 15+ minutes down to 3 seconds.
I would love to help you especially if you are running on SQL Server. Email me ryk99[at]hotmail[dot]com and we'll see what we can do from there.
Have you thought of using an ETL solution such as SSIS to pre-populate your data?