Upload and parse a CSV file with "universal newline" in Python on Google App Engine
I'm uploading a CSV/TSV file from a form in GAE, and I'm trying to parse the file with Python's csv module.
As described here, uploaded files in GAE are strings.
So I treat my uploaded string as a file-like object:
file = self.request.get('catalog')
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
But the newlines in my files are not necessarily '\n' (thanks to Excel), and this raises an error:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Does anyone know how to use StringIO.StringIO to treat a string like a file opened in universal-newline mode?
How about:
file = self.request.get('catalog')
file = '\n'.join(file.splitlines())
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
Or, as pointed out in the comments, csv.reader() supports input from a list, so:
file = self.request.get('catalog')
catalog = csv.reader(file.splitlines(),dialect=csv.excel_tab)
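To make the splitlines() approach concrete, here is a minimal standalone sketch. The helper name parse_tsv and the sample string are made up for illustration; in the handler, the string would come from self.request.get('catalog') instead.

```python
import csv

def parse_tsv(data):
    # splitlines() breaks on "\n", "\r" and "\r\n" alike, so csv.reader
    # only ever sees individual lines with no newline characters in them.
    return list(csv.reader(data.splitlines(), dialect=csv.excel_tab))

# Hypothetical upload mixing old Mac ("\r"), Windows ("\r\n") and Unix ("\n") endings.
sample = "id\tname\r1\tfoo\r\n2\tbar\n3\tbaz"
rows = parse_tsv(sample)
for row in rows:
    print(row)
```

One caveat: a quoted field that itself contains a line break gets cut in two by splitlines(), so this is probably only safe when fields never contain embedded newlines.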
Or, if request.get ever supports read modes in the future:
file = self.request.get('catalog', 'rU')
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
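A related option that does not depend on request.get: the newer io.StringIO class (as opposed to StringIO.StringIO) takes a newline argument, and newline=None turns on universal-newline translation. The sketch below assumes Python 3, where the string is text rather than bytes; the helper name parse_tsv_universal and the sample data are illustrative only.

```python
import csv
import io

def parse_tsv_universal(data):
    # newline=None enables universal newlines: "\r" and "\r\n" in the
    # initial value are translated to "\n" before csv.reader reads them.
    stream = io.StringIO(data, newline=None)
    return list(csv.reader(stream, dialect=csv.excel_tab))

# Hypothetical upload with mixed line endings.
sample = "id\tname\r1\tfoo\r\n2\tbar\n"
rows = parse_tsv_universal(sample)
for row in rows:
    print(row)
```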
The solution described here should work. It defines an iterator class that loads the blob 1MB at a time, splits each block into lines with .splitlines(), and feeds those lines to the CSV reader one at a time, so the newlines are handled without loading the whole file into memory.
class BlobIterator:
    """Because the python csv module doesn't like strange newline chars and
    the google blob reader cannot be told to open in universal mode, then
    we need to read blocks of the blob and 'fix' the newlines as we go"""

    def __init__(self, blob_reader):
        self.blob_reader = blob_reader
        self.last_line = ""
        self.line_num = 0
        self.lines = []
        self.buffer = None

    def __iter__(self):
        return self

    def next(self):
        if not self.buffer or len(self.lines) == self.line_num + 1:
            self.buffer = self.blob_reader.read(1048576)  # 1MB buffer
            self.lines = self.buffer.splitlines()
            self.line_num = 0

            # Handle special case where our block just happens to end on a new line
            if self.buffer[-1:] == "\n" or self.buffer[-1:] == "\r":
                self.lines.append("")

        if not self.buffer:
            raise StopIteration

        if self.line_num == 0 and len(self.last_line) > 0:
            result = self.last_line + self.lines[self.line_num] + "\n"
        else:
            result = self.lines[self.line_num] + "\n"

        self.last_line = self.lines[self.line_num + 1]
        self.line_num += 1

        return result
Then call this like so:
import csv
from google.appengine.ext import blobstore

# blob_key identifies the uploaded blob
blob_reader = blobstore.BlobReader(blob_key)
blob_iterator = BlobIterator(blob_reader)
reader = csv.reader(blob_iterator)
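Two edge cases seem worth guarding against when reading block by block: a final line with no trailing newline, and a "\r\n" pair that straddles two blocks. The generator below is a rough sketch of the same idea that tries to cover both. It is not the code from the answer above: the name iter_lines and the block_size parameter are made up, it assumes the reader's read() returns text strings, and io.StringIO stands in for the blob reader so the example runs outside App Engine.

```python
import csv
import io

def iter_lines(reader, block_size=1024 * 1024):
    """Yield "\\n"-terminated lines from reader.read(), one block at a time."""
    pending = ""  # text not yet known to be a complete line
    while True:
        block = reader.read(block_size)
        if not block:
            break
        pending += block
        lines = pending.splitlines(True)  # True keeps the line endings
        # Hold back the last piece unless it ends in "\n": it may be an
        # unfinished line, or end in a "\r" whose "\n" is in the next block.
        if lines and not lines[-1].endswith("\n"):
            pending = lines.pop()
        else:
            pending = ""
        for line in lines:
            yield line.rstrip("\r\n") + "\n"
    if pending:
        # Last line of the input, possibly without a trailing newline.
        yield pending.rstrip("\r\n") + "\n"

# Tiny block size so that the "\r\n" pair is split across two reads.
data = "a\tb\r\nc\td\re\tf\ng\th"
rows = list(csv.reader(iter_lines(io.StringIO(data), block_size=4),
                       dialect=csv.excel_tab))
print(rows)
```

With a real blob, the call would presumably look like csv.reader(iter_lines(blobstore.BlobReader(blob_key))), but that has not been tested on App Engine.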