What's the best language/technique to perform advanced data cleansing and formatting on a SQL/MySQL/PostgreSQL table?
We have a bunch of utility scripts in Visual FoxPro, which we use to interactively cleanse/format data. We'd like to start migrating this code to make use of other database platforms, like MySQL or SQLite.
For instance we have a script that we run which converts the name and/or address lines to proper upper/lower case. This code goes through an entire table and analyzes/fixes each row. There are others, that do things like parse and standardize the address and even duplicate detection...
We're thinking of migrating the code to Python and possibly using something like SQLAlchemy as a "middleman".
In Visual FoxPro the database/tables are integrated so we can just open the table and run commands. MySQL is different in that we need to extract data from it, then work on that extracted data, then update the table.
What would be the best approach?
I see several possibilities:
1) Extract the the entire data set to be worked on, say all the address fields, if that's what we're going to be working with, then updating it all and writing it all back...
2) Extract the data set in chunks, so as to not potentially consume vast amounts of system memory... then update and write back
3) Generate SQL code, perhaps with the help of a tool like SQLAlchemy,开发者_开发技巧 that gets sent to and executed by the server...
4) ??? Anything else I didn't think of?
It seems like you're trying to do several things all at once. Could you take a step-by-step approach? Perhaps cleansing the data as they are right now using your normal, usual scripts. Then migrate the database to MySQL.
It is easy to migrate the database if VisualFoxPro offers a way to export the database to, say, CSV. You can then import that CSV into MySQL directly, with very little trouble. That gives you two databases that should be functionally identical. Of course, you have to prove that they are indeed identical, which isn't too hard but is time-consuming. You might be able to use SQLAlchemy to help.
When the MySQL database is right, that's the time to port your cleansing scripts to Python or something and get those working.
That's how I would approach this problem: break it into pieces and not try to do too much in any single step.
HTH
精彩评论