solution to perform lots of calculations on 3 million data points and make charts
i have an excel spreadsheet that is about 300,000 rows and about 100 columns
i need to perform various functions on this spreadsheet and out of this spreadsheet i need to create about 3000 other spreadsheets which are SIGNIFICANTLY smaller
for every created spreadsheet i will need to have a separate powerpoint file that will have an automatically generated graph
i've done lots of VBA programming, but i am a little lost with this project
- if i dump the data into a mysql file would it be easier for me to handle my task?
- is it feasible to do this all in VBA excel?
- is it possible to easily add graphs from excel into powerpoint programmatically? or perhaps should i use a different solution for graphs?
It depends strongly on how you plan to process the data. If you plan to write code in Excel, it makes much more sense to leave it in Excel. Having said that, I would dump the data to CSV (comma-delimited) for further processing with a different tool, like Python.
Everything is always feasible given enough time and money. If you're like most other programmers, you don't have too much of either, so you want the most efficient solution, or close to it. If it were me, I would write code in Python to read the data from a CSV file, perform all required operations, and save the 3000 separate output sets as individual CSV files which can be imported back into Excel.
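As a rough illustration of the CSV-in, many-CSVs-out step, here is a minimal sketch using only the Python standard library. The function name `split_csv_by_key` and the idea that one column decides which output file a row belongs to are assumptions for the example; the question does not say how the 3000 subsets are defined, and the actual calculations would go wherever the rows are grouped.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_csv_by_key(source_path, key_column, output_dir):
    """Write one CSV per distinct value of key_column; return {key: row_count}.

    Hypothetical helper: assumes the grouping is defined by a single column.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Group all rows in memory; 300,000 rows x 100 columns is usually fine.
    groups = defaultdict(list)
    with open(source_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        for row in reader:
            groups[row[key_column]].append(row)

    counts = {}
    for key, rows in groups.items():
        # Assumes key values are safe to use as file names.
        out_path = output_dir / f"{key}.csv"
        with open(out_path, "w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        counts[key] = len(rows)
    return counts
```

Something like `split_csv_by_key("export.csv", "region", "out")` would then leave one small CSV per region in `out/`, each of which opens directly in Excel. The file and column names here are placeholders.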
Charts can be tricky to create and manipulate from VBA. I would use a Python library like Matplotlib to produce all graphical output, save it to disk as PNG images, and then insert those images into the PowerPoint presentation(s).
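A minimal sketch of the chart step, assuming Matplotlib is installed and that each small CSV has one column for the x axis and one numeric column for the y axis. The function names and the choice of a line chart are assumptions for illustration, not something the question specifies.

```python
import csv
from pathlib import Path


def read_xy(csv_path, x_column, y_column):
    """Read two columns from a CSV file; y values are parsed as floats."""
    xs, ys = [], []
    with open(csv_path, newline="") as src:
        for row in csv.DictReader(src):
            xs.append(row[x_column])
            ys.append(float(row[y_column]))
    return xs, ys


def save_chart(csv_path, x_column, y_column, png_path):
    """Render a simple line chart of y against x and save it as a PNG."""
    import matplotlib
    matplotlib.use("Agg")  # off-screen backend, so no display is needed
    import matplotlib.pyplot as plt

    xs, ys = read_xy(csv_path, x_column, y_column)
    fig, ax = plt.subplots(figsize=(8, 4.5))
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel(x_column)
    ax.set_ylabel(y_column)
    ax.set_title(Path(csv_path).stem)
    fig.savefig(png_path, dpi=150)
    plt.close(fig)  # free the figure; matters when looping over thousands of files
```

Calling `save_chart` once per output CSV in a loop would produce one PNG per spreadsheet. Closing each figure after saving keeps memory use flat across a run of 3000 charts.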
Python is mentioned here only as an example. You should use a tool that you feel most familiar with; however, the concepts of processing the data programmatically (not via interconnected cell references and formulas with a little VBA thrown in to copy sheets and so on) should still apply, and will be your best way forward here. I have done a ton of the kind of work you describe. Get the data into CSV and process the data with code.
This is certainly feasible in all respects, but VBA may be too much overhead for this because of its heavy-handed nature in opening and closing the Excel and PowerPoint instances for 3000 spreadsheets and presentations. If it's a one-time solution and you'll only ever need to do it this once, though, VBA is certainly fast to develop for, so you could save a lot upfront just by using the object model. One other option is to do this from an Interop app in C# or VB.NET, where you may have more control over your environment, like garbage collection.
However, if you're working with Excel 2007/2010 (I assume you are because of the 300k rows), I would do something different. I'd do the calc routines on the main XLSX in VBA and then use Open XML to process and create the 3000 spreadsheets and presentations with charts. (Note: I wouldn't use Open XML on the main XLSX because it doesn't actually render built-in calculations - you would still need to open the XLSX to "hydrate" the spreadsheet - so VBA would be better in this instance).
If you're new to Open XML, there's a lot to learn upfront, so the juice may not be worth the squeeze. But articles like this are very helpful if you want to learn Open XML or already know it, and are a great starting point (as they deal with charts as well). You could also use a wrapper on the Open XML SDK like Simple OOXML, which is quite good for starting out.
Take a look at the open-source statistical system called "R". It's quite good at programmatically generating graphs and charts from real-world datasets.
http://www.r-project.org/
I can't answer 2. and 3. for you, but regarding 1: I'd definitely recommend against that, based on your question... of course, you didn't explain exactly what kind of operations you need to perform on the data, so chances are I'm wrong here.
Your situation reminds me of the saying about regexes: "Some people, when they encounter a problem, will immediately try to solve it using a regular expression. Now they have two problems". You don't want an additional problem.
If you must use a database to do this (simply because doing it in Excel isn't performant enough), I'd stick with something from Microsoft like Access or SQL Server, which will probably save you some trouble. (never thought I'd be saying this)