
What is an easy way to clean an unparsable csv file

The CSV file was created correctly, but the name and address fields contain every piece of punctuation there is available. So when you try to import into MySQL you get parsing errors. For example, the name field could look like this: "john ""," doe". I have no control over the data I receive, so I'm unable to stop people from entering garbage data. From the example above you can see that if you treat the outside quotes as the enclosing quotes, the field is right, but of course MySQL, Excel, LibreOffice, etc. see a whole new field. Is there a way to fix this problem? Some fields I found even have a backslash before the last enclosing quote. I'm at a loss, as I have 17 million records to import.

I have Windows and Linux, so whatever solution you can think of, please let me know.


This may not be a usable answer, but someone needs to say it: you shouldn't have to do this. CSV is a file format with an expected data encoding. If someone is supplying you a CSV file, it should be delimited and escaped properly; otherwise it's a corrupted file and you should reject it. Make the supplier re-export the file properly from whatever data store it came from.

If you asked someone to send you a JPG and they sent what had been a proper JPG file with every 5th byte omitted or junk bytes inserted, you wouldn't accept it and say "oh, I'll reconstruct it for you".


You don't say whether you have control over the creation of the CSV file. I am assuming you do, because if not, the CSV file is corrupt and cannot be recovered without human intervention, or some very clever algorithms to "guess" the correct delimiters versus the user-entered ones.

Convert user-entered tabs (assuming there are any) to spaces and then export the data using a tab separator.

If the above is not possible, you need to implement an escape sequence to ensure that user-entered data is not treated as a delimiter (see the sketch below).
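
A minimal Python sketch of the tab-separator approach, assuming you can re-export the data yourself from the original store; the function name, sample row, and output file name are purely illustrative:

import csv

def export_tab_separated(rows, path):
    # Replace user-entered tabs with spaces so they cannot be mistaken for
    # delimiters, then write the rows out tab-separated.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for row in rows:
            writer.writerow([field.replace("\t", " ") for field in row])

export_tab_separated([["john doe", "123 Main St, Apt 4"]], "export.tsv")

With a tab as the separator, commas in the address field no longer need any quoting at all.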


Your title asks: What is an easy way to clean an unparsable csv file

If it is unparseable, that means that you can't correctly break it up into fields. So you can't clean it.

Your first sentence states: The csv file was created correctly but the name and address fields contain every piece of punctuation there is available.

If the csv file was created correctly, then you can split it into fields correctly. So you can clean it.

Only punctuation? You are lucky. Unvalidated text fields in databases commonly contain nasties like tab, carriage return, line feed, and even Ctrl-Z.

Who says it's "unparsable"? On what grounds? What is their definition of "parsable"?

Who says it was "created correctly"? On what grounds? What is their definition of "correct"?

Could you perhaps show us the relevant parts of, say, 5 or so lines that are causing you grief? Edit your question and format the examples as code, to make them easier to read. Make it obvious where the previous/next fields stop/start, e.g.

...,"john ""," doe",...

By the way, the above is NOT "right" under any interpretation; it can't be right, with an ODD number of quote characters none of which is escaped.

My definition of correct: here is how to emit a CSV field that can be parsed no matter what is in the database [caveat: Python's csv module barfs on `\x00`]:

if '"' in field:
    output = '"' + field.replace('"', '""') + '"'
elif any(c in field for c in ',\n\r'):  # comma, line feed, carriage return
    output = '"' + field + '"'
else:
    output = field
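
For what it's worth, Python's own csv writer applies essentially the same rule by default (QUOTE_MINIMAL plus doubled quotes); a small sketch, writing to stdout purely for illustration:

import csv
import sys

# QUOTE_MINIMAL quotes a field only if it contains the delimiter, the quote
# character, or a line-break character, and doubles any embedded quotes.
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_MINIMAL, doublequote=True)
writer.writerow(['john "," doe', 'plain field', 'has, comma'])
# prints: "john "","" doe",plain field,"has, comma"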


That's a really tough issue. I don't know of any real way to solve it, but maybe you could try splitting on ",", cleaning up the items in the resulting array (unicorns :) ) and then re-joining the row?


MySQL import has many parameters, including escape characters. Given the example, I think the quotes are escaped by putting a quote in front of them, so an import with fields escaped by '"' should work.
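
To see that convention in isolation, here is a small Python check (the sample line is made up, not taken from the real data). Python's csv reader, by default, treats a doubled quote inside a quoted field as a single literal quote, which is the same escaping this answer suggests configuring on the MySQL side:

import csv
import io

# Made-up sample line in which the embedded quotes are escaped by doubling.
sample = 'jane,"john "","" doe",ny\r\n'

for row in csv.reader(io.StringIO(sample)):
    print(row)  # ['jane', 'john "," doe', 'ny']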


First of all, find all the kinds of mistakes, and then just replace them with empty strings. Just do it! If you need this corrupted data, only you can recover it.
