Excel Regex, or export to Python? ; "Vlookup" in Python?
We have an Excel file with a worksheet containing people records.
1. Phone Number Sanitation
One of the fields is a phone number field, which contains phone numbers in the format e.g.:
+XX(Y)ZZZZ-ZZZZ
(where X, Y and Z are integers).
There are also some records which have less digits, e.g.:
+XX(Y)ZZZ-ZZZZ
And others with really screwed up formats:
+XX(Y)ZZZZ-ZZZZ / ZZZZ
or:
ZZZZZZZZ
We need to sanitise these all into the format:
0YZZZZZZZZ (or OYZZZZZZ with those with less digits).
2. Fill in Supervisor Details
Each person also has a supervisor, given as an numeric ID. We need to do a lookup to get the name and email address of that supervisor, and add it to the line.
This lookup will be firstly on the same worksheet (i.e. searching itself), and it can then fallback to another workbook with more people.
3. Approach?
For the first issue, I was thinking of using regex in Excel/VBA somehow, to do the parsing. My Excel-fu isn't the best, but I suppose I can learn...lol. Any particular points on this one?
However, would I be better off exporting the XLS to a CSV (e.g. using xlrd), then using Python to fix up the phone numbers?
For the second approach, I was thinking of just using vlookups in开发者_StackOverflow社区 Excel, to pull in the data, and somehow, having it fall through, first on searching itself, then on the external workbook, then just putting in error text. Not sure how to do that last part.
However, if I do happen to choose to export to CSV and do it in Python, what's an efficient way of doing the vlookup? (Should I convert to a dict, or just iterate? Or is there a better, or more idiomatic way?)
Cheers, Victor
In general, avoid Excel formulas; use xlrd to extract the data that you need, then forget it came from Excel and manipulate the data using Python. E.g. addressing the xlrd / vlookup question: the best way would be to create a dictionary ONCE from the relevant parts of the 2 columns containing the keys and values.
Using xlrd to export to CSV and then reading it back is a waste of time AND loses valuable information (like what is the actual type of the data in the Excel cell). If your data was in a database would you export it to CSV and read it back??
If you go the VBA route, it may pay to take a look at Tushar Mehta's documentation. If you go the Python route, you could try parsing to CSV or, alternatively, just manipulating things in memory and writing via XLWT (which would be my preferred technique). You may also consider just modifying the Excel data directly using COM calls, based on something like this. Finally, if you're committed to doing this outside of Excel, you might take a look at Jython and Apache POI. Not the most lightweight solution, but POI is the most feature-complete library I know of that does not depend on running on Windows.
As other's have observed in comments, it's hard to be concrete with such a broad question. Hopefully something here gets you started...
精彩评论