Parsing CSV with commas in fields
I have a CSV file with commas inside of fields that are non-enclosed. I unfortunately must parse this file and cannot get it replaced with a properly formatted one.
I really don't even know where to begin.
OK. What I'm seeing is the following: You have about 8,000 rows that essentially have a CSV syntax error in them. You can manually figure out which they are, but manually fixing 8,000 entries is a bit much.
The obvious first approach would be to work out how it is that you can manually tell which columns have this issue. If it is something you can define rules for, you are in business. If it's simple enough, you can write a small text editor macro to go through the file and do it for you. If your text editor doesn't support macros, use awk. If you are on Windows and don't have awk, then go get it.
If it is too complicated for that, fix your real problem. Go fix whatever generated this CSV file to generate it right. If it was someone else's code you don't have access to, tell them to fix it. "You are generating 8,000 unparsable entries" seems like a pretty good argument in my book. Sooner or later they will probably generate a new revision of this file for you to process, so this is really the Right Thing to do.
There's probably nothing you can do with it short of analyzing the records manually in a text editor. The comma delimiters are essentially useless if there is no discernable way to distinguish them from valid commas in the data.
If you can get a cleaner file from whoever created the bad one, that's probably far less trouble than trying to fix up the one you've got.
You could run an Excel macro to replace the commas inside the data with some other character (let's say $, or anything else that does not appear in your file) for the time being. Then, once you've parsed the file, you could run the results through some code to turn that character back into the original commas.
EDIT: I am assuming that you have access to the original file seeing as you've tagged excel here?
I think the best you can hope for is 80% automatic, which means you'll be doing over 1,000 manually best case. You just need to be clever about the data that's there. Read each line in and count the commas. If it's the right amount, write it out to a new file. If it's too many, send it to the exception handler.
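As a rough sketch of that first pass (Python here, purely as an illustration), something like the following splits the lines into "right number of commas" and "needs a closer look". The function name `triage_lines` and the sample rows are made up for the example, and it assumes one record per line with a known expected comma count.

```python
def triage_lines(lines, expected_commas):
    """Split lines into (good, suspect) based on the raw comma count."""
    good, suspect = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.count(",") == expected_commas:
            good.append(line)      # delimiter count is right; write these out as-is
        else:
            suspect.append(line)   # too many (or too few) commas; handle separately
    return good, suspect


if __name__ == "__main__":
    # Hypothetical sample: three columns, so two commas expected per row.
    sample = [
        "2010-01-01,Plant A,42",
        "2010-01-02,Plant B, North,17",
    ]
    good, suspect = triage_lines(sample, expected_commas=2)
    print(len(good), "good;", len(suspect), "need attention")
```

For a real file you would pass the open file object as `lines` and write the two lists to separate output files.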
Start with what you absolutely know about the data. Is the first column a TimeStamp? If you know that, you can go from "20 commas when there should be 18" to "19 commas when there should be 17". I know that doesn't exactly lift your spirits but it's progress. Is there a location, like a plant name, somewhere in there? Maybe you can develop a list from the good data and search for it in the bad data. If column 7 should be the plant name, go through your list of plant names and see if one of them exists. If so, count the commas between that and the start and between that and the end (or another good comma location that you've established).
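The plant-name idea could look roughly like this in Python. It is only a sketch: `locate_known_value` is a made-up helper, the plant names and sample row are invented, and it assumes the known value appears as a whole field (bounded by commas or the ends of the line).

```python
def locate_known_value(line, known_values):
    """Find the first known value that appears as a whole field.

    Returns (value, commas_before, commas_after), or None if nothing matches.
    """
    padded = "," + line + ","
    # Try longer names first so a longer name wins over a shorter one it contains.
    for value in sorted(known_values, key=len, reverse=True):
        pos = padded.find("," + value + ",")
        if pos != -1:
            start = pos  # index of the value in the unpadded line
            before = line[:start].count(",")
            after = line[start + len(value):].count(",")
            return value, before, after
    return None


if __name__ == "__main__":
    # Hypothetical list built from the rows that parsed cleanly.
    plants = ["Plant X", "Plant Y"]
    bad_row = "2010-01-02,A,B, extra,C,D,E,Plant X,9,9"
    print(locate_known_value(bad_row, plants))
```

If the plant name belongs in column 7, there should be 6 commas in front of it. A count of 7 tells you the stray comma is somewhere in the first six fields, which narrows the search considerably.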
If you have some unique data, you can use a regex to find its location in the string and, again, count the commas before and after it to see if it's where it should be. For example, a Lat/Long reading or a part number in the format 99A99-999.
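A minimal Python sketch of the regex approach, assuming the 99A99-999 part-number format means two digits, one uppercase letter, two digits, a hyphen, and three digits. The pattern, the helper name `commas_around_pattern`, and the sample row are all assumptions for illustration; adjust the pattern to whatever your data really looks like.

```python
import re

# Assumed reading of "99A99-999": 2 digits, 1 uppercase letter, 2 digits, hyphen, 3 digits.
PART_NUMBER = re.compile(r"\b\d{2}[A-Z]\d{2}-\d{3}\b")


def commas_around_pattern(line, pattern=PART_NUMBER):
    """Return (commas_before, commas_after) for the first match, or None if no match."""
    match = pattern.search(line)
    if match is None:
        return None
    before = line[:match.start()].count(",")
    after = line[match.end():].count(",")
    return before, after


if __name__ == "__main__":
    # Hypothetical row where the part number should be the third column.
    row = "a,b, c,12A34-567,d"
    print(commas_around_pattern(row))
```

Comparing the two counts against what a good row gives you shows which side of the anchor the extra comma is on.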
If you can post five or ten rows of good data, maybe someone can suggest more specific ways to identify columns and their locations.
Good luck.