Is SQL Azure suitable for ad-hoc, intensive SQL processing?
I'm looking for some guidance on whether SQL Azure is a suitable platform for one-off, short-term intensive processing of a batch of SQL data (i.e. "crunching" of data).
My scenario is as follows:
I have a 32 GB database containing a single table of data. The table contains spatial data defined using the geometry datatype, together with various columns of associated attributes. I need to perform some one-off processing of this data, which involves executing a series of computationally expensive queries (as most spatial queries seem to be!).
When I tested these queries on a subset of the data on my own server, they took several hours to complete. I expect that, if I tried to execute them against the whole dataset locally, it would lock up my SQL Server for days (or perhaps it would die trying), which is a situation I'm trying to avoid.
So I'm looking for a short-term alternative where I can set those queries to be executed somewhere else, and just retrieve the processed table when they are complete.
I understand that the SQL Azure platform is designed to offer flexible storage capacity, and also scales to accommodate e.g. an increased number of transactions. The typical example application cited seems to be a DB backend for a web application/store that experiences rapidly increasing or fluctuating demand. However, what I haven't been able to find much detail on is whether SQL Azure is suitable for individual long-running queries, executed serially.
Just to be clear -
- I expect this to be a one-off operation, or at most one repeated perhaps once a year.
- Once the processing is complete, I have no intention of continuing to host the data "in the cloud" - I'd like to retrieve the processed dataset and host it on-site again.
- The ease of getting data to/back from the platform is obviously important, since I'm not looking to permanently "migrate" anything. If I understand correctly, you can't back up/restore databases to Azure, and scripting the data is going to be very painful.
- I am comfortable with Management Studio, and any platform that allowed me to use this as an interface to run the queries and perform spot checks of the results would be a benefit.
If anybody has any experiences with using SQL Azure for this kind of activity, or could perhaps suggest an alternative, I'd be very grateful!
I'm genuinely not sure if SQL Azure is suitable for this task - there are no problems in terms of storage, but I don't know how good its architecture is for long-running tasks. In particular, see:
SQL Azure Database provides a large-scale multi-tenant database service on shared resources. In order to provide a good experience to all SQL Azure Database customers, your connection to the service may be closed due to the following conditions:
- Excessive resource usage
- Long-running queries
- Long-running single transactions, between the BEGIN TRAN and END TRAN statements
- Idle connections
This is different from how an on-premises instance of SQL Server works.
from: http://msdn.microsoft.com/en-us/library/ee730903.aspx
So I would be worried that SQL Azure might not work for your long queries - unless you can break them up into lots of short queries.
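For what it's worth, here's a minimal sketch of that "break it up" approach, driven from client-side .NET. The table (dbo.Parcels), geometry column (Shape) and derived column (AreaSqM) are made-up names - the shape of the loop is the point: each round trip is one short statement in its own implicit transaction, so nothing runs long enough to be throttled.

```csharp
using System;
using System.Data.SqlClient;

class BatchedUpdate
{
    static void Main()
    {
        // Hypothetical SQL Azure connection string.
        const string connStr =
            "Server=tcp:yourserver.database.windows.net;Database=SpatialDb;" +
            "User ID=user@yourserver;Password=...;Encrypt=True;";

        // Each execution updates at most 1,000 rows, so no single
        // statement should trip the "long-running query" throttling.
        const string batchSql =
            "UPDATE TOP (1000) dbo.Parcels " +
            "SET AreaSqM = Shape.STArea() " +
            "WHERE AreaSqM IS NULL;";

        int affected;
        do
        {
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(batchSql, conn))
            {
                cmd.CommandTimeout = 60; // keep each batch short
                conn.Open();
                affected = cmd.ExecuteNonQuery();
            }
            Console.WriteLine("Updated {0} rows", affected);
        } while (affected > 0);
    }
}
```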
If SQL Azure can't work for you, then it might be better to deploy a separate SQL Server instance somewhere (perhaps on an AWS instance?) to do these one-off calculations.
It depends on the nature of the workload. You mention "executing a series of computationally-expensive queries"; however, it is not clear to me whether you have a lot of small but repeated commands, or one big job that needs to run for the entire duration of the batch. The former might work in SQL Azure with a form of connection retry logic, while the latter may not. In either case, you can also consider refactoring the processing logic in .NET.
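By connection retry logic I mean something along these lines - a sketch only, and only safe if the command being retried is idempotent:

```csharp
using System;
using System.Data.SqlClient;
using System.Threading;

static class RetryHelper
{
    // Re-runs a short command after SQL Azure closes the connection,
    // with exponential back-off between attempts.
    public static void ExecuteWithRetry(string connStr, string sql,
                                        int maxAttempts = 5)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (var conn = new SqlConnection(connStr))
                using (var cmd = new SqlCommand(sql, conn))
                {
                    conn.Open();
                    cmd.ExecuteNonQuery();
                    return;
                }
            }
            catch (SqlException)
            {
                if (attempt >= maxAttempts) throw;
                // Back off before reconnecting; the service may have
                // dropped the connection due to throttling.
                Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }
}
```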
Indeed, most batch processing activities are redesigned in the cloud as worker processes because of the SQL Azure throttling mechanism: basically, .NET code runs in Windows Azure, reads the data it needs from SQL Azure, performs the needed calculations in memory, and saves the results back to SQL Azure. Depending on the type of workload, this is probably the best approach, since you may be able to design it in a way that scales out nicely, hence potentially reducing the total execution time significantly (assuming you can break the data-crunching logic into smaller pieces and execute it in .NET instead of SQL Azure).
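As a rough illustration of that read/compute/write loop - table, column and helper names below are all hypothetical - a worker role's processing loop might look like this:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class CrunchWorker
{
    // Placeholder for the real spatial computation (hypothetical).
    static double ComputeArea(string wellKnownText) { return 0.0; }

    static void Main()
    {
        const string connStr = "..."; // SQL Azure connection string
        long lastId = 0;

        while (true)
        {
            // 1. Read a small chunk, keyed on the clustered primary key.
            var ids = new List<long>();
            var shapes = new List<string>();
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                "SELECT TOP (500) Id, Shape.STAsText() " +
                "FROM dbo.Parcels WHERE Id > @lastId ORDER BY Id;", conn))
            {
                cmd.Parameters.AddWithValue("@lastId", lastId);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        ids.Add(reader.GetInt64(0));
                        shapes.Add(reader.GetString(1));
                    }
                }
            }
            if (ids.Count == 0) break; // all rows processed

            // 2. Do the expensive part in memory, outside SQL Azure.
            var areas = new List<double>();
            foreach (var wkt in shapes) areas.Add(ComputeArea(wkt));

            // 3. Write the results back with short, individual commands.
            using (var conn = new SqlConnection(connStr))
            {
                conn.Open();
                for (int i = 0; i < ids.Count; i++)
                {
                    var update = new SqlCommand(
                        "UPDATE dbo.Parcels SET AreaSqM = @a WHERE Id = @id;",
                        conn);
                    update.Parameters.AddWithValue("@a", areas[i]);
                    update.Parameters.AddWithValue("@id", ids[i]);
                    update.ExecuteNonQuery();
                }
            }

            lastId = ids[ids.Count - 1];
        }
    }
}
```

Because each chunk is independent, several worker instances could process disjoint key ranges in parallel, which is where the scale-out gain comes from.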
Regarding backing up/restoring the data to an on-premises server, you have a few options that do not involve data scripting. If you decide to try to refactor in .NET, we can discuss these options further.
A few points/questions:
- Is the code you're executing written in T-SQL or in some other programming language?
- Can the processing be executed in parallel, or must it be sequential?
- Where are the current bottlenecks? Is it in computation or data retrieval/storage?
Given what you've said so far, and problems I've seen in the past with large databases, I would question whether SQL Server is an appropriate storage technology at all. True, it's intended for transactional queries, but you only have a single database table. That means the entire "relational database" aspect sort of goes out the window unless the table is self-referencing (which would create a world of other problems, so I'll ignore that for now and assume it's not the case). There are certainly ways to avoid race conditions when processing the data in NoSQL storage, and I can't imagine that transactions would be absolutely necessary: when doing computations, if storage of the result fails, you retry. Worst case scenario, you redo the computation.
32 GB of data in a single SQL Server table is a LOT of data, and I'm guessing there are probably indices of some kind in there. If you don't have your SQL Server configured properly (using a lot of physical spindles and striping the data across them), you can easily run into major performance issues due to disk I/O.
Chances are really good that Microsoft will be able to scale SQL Azure quite a bit better than the average SQL developer, because they know how it's supposed to be done. However, that doesn't mean there aren't limits on throughput or on how quickly you can query/add data - there are.
My advice would be to look into using Azure Tables (which is basically a NoSQL table store), because it would allow you to partition the data across multiple nodes. This partitioning lets a table scale to 100 TB of data without compromising query speed.
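To give a flavour of what this looks like with the Windows Azure SDK's StorageClient library - the entity and key names below are invented, and the exact API may differ between SDK versions - the key design decision is the PartitionKey, since a partition is the unit of scale-out:

```csharp
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Hypothetical entity: one row per spatial feature. Rows sharing a
// PartitionKey are stored together; separate partitions can be served
// by separate nodes, which is what lets the table scale out.
public class ParcelEntity : TableServiceEntity
{
    public string WellKnownText { get; set; } // geometry stored as WKT
    public double AreaSqM { get; set; }

    public ParcelEntity() { }
    public ParcelEntity(string gridCell, string parcelId)
        : base(gridCell, parcelId) // PartitionKey, RowKey
    {
    }
}

class TableDemo
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
        var tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("Parcels");

        var ctx = tables.GetDataServiceContext();
        ctx.AddObject("Parcels", new ParcelEntity("grid-042", "parcel-0001")
        {
            WellKnownText = "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
            AreaSqM = 0 // to be filled in by a worker
        });
        ctx.SaveChangesWithRetries();
    }
}
```

Note there is no spatial datatype here: you'd have to pick a PartitionKey scheme (e.g. a spatial grid cell, as assumed above) that keeps each query within a small number of partitions.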
In addition, a 32 GB SQL Azure database will cost you $400/month, while 40 GB of Azure Table storage with 5 million storage transactions would only cost you $11/month. You'll have to add in the "cost" of the worker nodes, but those should theoretically be equivalent in either design. So the Tables option is cheaper per month - though if a business is backing the project, that saving is probably far smaller than the cost of the development time going into it.
You'll also need to account for the time to pipe 32 GB of data into the cloud. Loading a SQL database can take quite a while, and you're somehow going to have to get the data there. A lot depends on how quickly you can push the data up, and on whether you can start processing before all of it has arrived.
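If the data does end up in SQL Azure, SqlBulkCopy (or the bcp utility) is the usual way to avoid row-by-row inserts. A rough sketch, assuming the same hypothetical dbo.Parcels table exists on both ends (spatial columns may need extra care when bulk copying):

```csharp
using System.Data.SqlClient;

class BulkLoader
{
    static void Main()
    {
        const string localConnStr = "...";  // on-site SQL Server
        const string azureConnStr = "...";  // SQL Azure target

        using (var src = new SqlConnection(localConnStr))
        using (var dst = new SqlConnection(azureConnStr))
        {
            src.Open();
            dst.Open();

            var read = new SqlCommand("SELECT * FROM dbo.Parcels", src);
            using (var reader = read.ExecuteReader())
            using (var bulk = new SqlBulkCopy(dst))
            {
                bulk.DestinationTableName = "dbo.Parcels";
                bulk.BatchSize = 5000;      // commit in chunks
                bulk.BulkCopyTimeout = 0;   // no overall timeout
                bulk.WriteToServer(reader); // stream rows up to the cloud
            }
        }
    }
}
```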
The problem I think you'll run into is that in order to use Azure Tables instead of SQL Azure, you're going to need to make some trade-offs: you'll likely need to convert the data into table entities, then write the processing code, and so on. At the end of the day, it may not be worth it.
However, I don't think there's enough information here to make that call just yet. The really big question is whether or not there's an opportunity to parallelize the processing and how long you estimate processing would take on a single machine. The next questions to answer are how long it will take to build vs how much time you have to do it.
Judging from your comments about locking the database for days, I don't think it's out of line to assume that you're probably experiencing database issues right now. Depending on the additional processing you expect in the future, you might not have a choice but to evaluate a NoSQL option.
I hate to give an "it depends" answer here, but if you give some additional details, I'd be happy to update this and give you a better sense of where to go and what to do.