Upload and parse a CSV file with "universal newline" in Python on Google App Engine

I'm uploading a CSV/TSV file from a form in GAE, and I'm trying to parse it with Python's csv module.

As described here, uploaded files in GAE are strings.

So I treat my uploaded string as a file-like object:

file = self.request.get('catalog')
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)

But the newlines in my files are not necessarily '\n' (thanks to Excel), and parsing fails with this error:

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Does anyone know how to make StringIO.StringIO treat a string like a file opened in universal-newline mode?


How about:

file = self.request.get('catalog')
file = '\n'.join(file.splitlines())
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)

Or, as pointed out in the comments, csv.reader() accepts any iterable of lines, including a list, so:

file = self.request.get('catalog')
catalog = csv.reader(file.splitlines(),dialect=csv.excel_tab)
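
A quick way to check the splitlines() variant against mixed line endings is sketched below. The helper name parse_tsv and the sample data are made up for illustration; the string stands in for what self.request.get('catalog') would return.

```python
import csv


def parse_tsv(data):
    """Parse tab-separated text whose lines may end in \\r, \\n or \\r\\n."""
    # splitlines() breaks on every newline style and drops the terminator,
    # so csv.reader only ever sees clean lines
    return list(csv.reader(data.splitlines(), dialect=csv.excel_tab))


# Hypothetical upload mixing old Mac (\r), Windows (\r\n) and Unix (\n) endings
sample = "sku\tname\r1\tapple\r\n2\tpear\n"
rows = parse_tsv(sample)
print(rows)
```

One caveat: because the text is split before the csv module sees it, a line break inside a quoted field is not preserved by this approach.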

Or, if request.get ever supports read modes in the future (hypothetical; it does not today):

file = self.request.get('catalog', 'rU')
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
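
For readers on a Python 3 runtime (an assumption; the snippets above target the Python 2 StringIO module, which has no such option), io.StringIO itself can do the universal-newline translation when it is created with newline=None. A minimal sketch, with a made-up helper name and sample data:

```python
import csv
import io


def read_tsv_universal(text):
    """Parse tab-separated text through a universal-newline StringIO."""
    # newline=None makes StringIO translate \r and \r\n to \n,
    # like a file opened in universal-newline mode
    stream = io.StringIO(text, newline=None)
    return list(csv.reader(stream, dialect=csv.excel_tab))


# Hypothetical upload mixing \r, \r\n and \n line endings
rows = read_tsv_universal("sku\tname\r1\tapple\r\n2\tpear\n")
print(rows)
```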


The solution described here should work. By defining an iterator class as follows, which loads the blob 1MB at a time, splits the lines using .splitlines() and then feeds lines to the CSV reader one at a time, the newlines can be handled without having to load the whole file into memory.

class BlobIterator:
    """Because the python csv module doesn't like strange newline chars and
    the google blob reader cannot be told to open in universal mode, then
    we need to read blocks of the blob and 'fix' the newlines as we go"""

    def __init__(self, blob_reader):
        self.blob_reader = blob_reader
        self.last_line = ""  # tail of the previous block, possibly incomplete
        self.line_num = 0
        self.lines = []
        self.eof = False

    def __iter__(self):
        return self

    def next(self):
        # Refill once every line of the current block has been handed out
        while self.line_num >= len(self.lines):
            if self.eof:
                raise StopIteration
            block = self.blob_reader.read(1048576)  # 1MB buffer
            if block:
                self.lines = (self.last_line + block).splitlines(True)
                # Hold back the final piece: it may be an unfinished line,
                # or end in "\r" whose "\n" starts the next block
                self.last_line = self.lines.pop()
            else:
                # End of blob: flush whatever is still held back, so a last
                # line without a trailing newline is not lost
                self.eof = True
                self.lines = [self.last_line] if self.last_line else []
                self.last_line = ""
            self.line_num = 0

        # Replace whatever terminator the line had with a plain "\n"
        result = self.lines[self.line_num].rstrip("\r\n") + "\n"
        self.line_num += 1
        return result

Then call this like so:

blob_reader = blobstore.BlobReader(blob_key)
blob_iterator = BlobIterator(blob_reader)
reader = csv.reader(blob_iterator)
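
The same block-wise idea can also be sketched as a generator, which is easy to try without a blobstore. This version assumes Python 3 and text input; the name iter_normalized_lines, the chunk_size parameter and the io.StringIO object standing in for the blob reader are all illustrative. It holds back the last piece of every block, so a "\r\n" pair split across two blocks stays together and a final line without a terminator is still emitted.

```python
import csv
import io


def iter_normalized_lines(stream, chunk_size=1048576):
    """Yield lines from stream, read chunk_size at a time, each ending in \\n."""
    pending = ""  # tail of the previous chunk, possibly an unfinished line
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines = (pending + chunk).splitlines(True)
        # Keep the last piece back until the next chunk (or EOF) completes it
        pending = lines.pop()
        for line in lines:
            yield line.rstrip("\r\n") + "\n"
    if pending:
        # Flush the final line, which may have had no terminator at all
        yield pending.rstrip("\r\n") + "\n"


# Stand-in for a blob reader; a tiny chunk size forces splits mid-line
fake_blob = io.StringIO("a\tb\r\nc\td\re\tf")
rows = list(csv.reader(iter_normalized_lines(fake_blob, chunk_size=4),
                       dialect=csv.excel_tab))
print(rows)
```

Note that in Python 3, str.splitlines() also breaks on a few less common separators such as form feed, so data containing those characters may be split in more places than expected.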