Processing a Django UploadedFile as UTF-8 with universal newlines
In my django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and be encoded in a variety of encodings (ASCII, UTF-8).
For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):
- Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which as I understand is a flag meaning 'this file is UTF-8').
- Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
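For reference, the \xef\xbb\xbf prefix is the UTF-8 byte order mark (BOM). Here is a minimal Python 3 sketch, not the code from this question, of how both points above could be handled on a generic binary file object. It assumes an io.BytesIO stand-in for the upload; the helper name read_text is made up for illustration.

```python
import codecs
import io


def read_text(binary_file, encoding="utf-8-sig"):
    """Decode a binary file to text (hypothetical helper).

    "utf-8-sig" drops a leading BOM if one is present, and newline=None
    turns \r\n and bare \r into \n (universal newlines).
    """
    wrapper = io.TextIOWrapper(binary_file, encoding=encoding, newline=None)
    text = wrapper.read()
    # Detach so the wrapper does not close the underlying file
    # when it is garbage collected.
    wrapper.detach()
    return text


# Stand-in for an uploaded file: BOM, then mixed Windows / old Mac / Unix newlines.
raw = codecs.BOM_UTF8 + b"name,city\r\nZo\xc3\xab,Z\xc3\xbcrich\rBob,Paris\n"
text = read_text(io.BytesIO(raw))
```

When the result is going to the csv module, newline="" is the usual choice instead, so that csv can deal with line endings itself.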
Complicating the issue is that I wish to pass the file in to the Python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get django playing nice with UTF-8 I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)
I have tried using StringIO, mmap, codecs, and a wide variety of ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded different errors, and none so far has been perfect. This shows some of the code that I feel came the closest:
import csv
import codecs

class CSVParser:
    def __init__(self, file):
        # 'file' is assumed to be an InMemoryUploadedFile object.
        dialect = csv.Sniffer().sniff(codecs.EncodedFile(file, "utf-8").read(1024))
        file.open()  # seek to 0
        self.reader = csv.reader(codecs.EncodedFile(file, "utf-8"),
                                 dialect=dialect)
        try:
            self.field_names = self.reader.next()
        except StopIteration:
            # The file was empty - this is not allowed.
            raise ValueError('Unrecognized format (empty file)')
        if len(self.field_names) <= 1:
            # This probably isn't a CSV file at all.
            # Note that the csv module will (incorrectly) parse ALL files, even
            # binary data. This will catch most such files.
            raise ValueError('Unrecognized format (too few columns)')

    # Additional methods snipped, unrelated to issue
Please note that I haven't spent much time on the actual parsing algorithm, so it may be wildly inefficient; right now I'm more concerned with getting encoding to work as expected.
The problem is that the results are also not encoded, despite being wrapped in the Unicode codecs.EncodedFile file wrapper.
EDIT: It turns out the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. The reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!
As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with python encoding.
If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.
I also noticed that, at least for InMemoryUploadedFile objects, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the file descriptor. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.
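The shared seek position is easy to reproduce outside Django. This is only a small Python 3 sketch, assuming an io.BytesIO object as a stand-in for the uploaded file and using seek(0) where the view code above calls open():

```python
import codecs
import io

data = b"name,city\r\nBob,Paris\r\n"
upload = io.BytesIO(data)  # stand-in for request.FILES['file_field']

# Reading a sample through the wrapper advances the underlying file.
sample = codecs.EncodedFile(upload, "utf-8").read(8)
position_after_sample = upload.tell()  # no longer 0

# Creating a second wrapper does not rewind anything, so rewind explicitly.
upload.seek(0)
full = codecs.EncodedFile(upload, "utf-8").read()  # the whole content again
```

Without the seek(0), the second wrapper would start wherever the first read stopped and the header row could be lost.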
I use the csv.DictReader and it appears to be working well. I attached my code snippet, but it is basically the same as another answer here.
import csv as csv_mod
import codecs
file = request.FILES['file']
dialect = csv_mod.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
file.open()
csv = csv_mod.DictReader( codecs.EncodedFile(file,"utf-8"), dialect=dialect )
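The snippets in this thread are Python 2. On Python 3 the csv module expects text rather than bytes, so a rough equivalent might look like the sketch below. It is an assumption-laden illustration, not tested against Django: the helper name dict_reader_from_upload is made up, the delimiter candidates passed to the sniffer are my own choice, and an io.BytesIO object stands in for the upload (with a real UploadedFile you may need to pass its underlying .file object instead).

```python
import csv
import io


def dict_reader_from_upload(binary_file, encoding="utf-8-sig"):
    """Build a csv.DictReader over a binary file object (hypothetical helper)."""
    # Sniff the dialect from the first 1024 bytes, then rewind.
    sample = binary_file.read(1024).decode(encoding, errors="ignore")
    binary_file.seek(0)
    # Restricting the candidate delimiters is an assumption for this sketch.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
    # newline="" leaves line-ending handling to the csv module.
    text_file = io.TextIOWrapper(binary_file, encoding=encoding, newline="")
    return csv.DictReader(text_file, dialect=dialect)


# Stand-in for an uploaded file that starts with a UTF-8 BOM.
upload = io.BytesIO(
    b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,Z\xc3\xbcrich\r\nBob,Paris\r\n"
)
rows = list(dict_reader_from_upload(upload))
```

Each entry in rows is a mapping keyed by the header row, and the "utf-8-sig" codec keeps the BOM out of the first field name.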
For CSV and Excel upload to django, this site may help.