parse.unquote_plus TypeError
I'm trying to format a file so that it can be inserted into a database. The file is originally compressed and around 1.3 MB in size. Each line looks something like this:
398,%7EAnoniem+001%7E,543,480,7525010,1775,0
This is the code that parses the file:
Village = gzip.open(Root+'\\data'+'\\' +str(Newest_Date[0])+'\\' +str(Newest_Date[1])+'\\' +str(Newest_Date[2])\
+'\\'+str(Newest_Date[3])+' village.gz');
Village_Parsed = str
for line in Village:
    Village_Parsed = Village_Parsed + urllib.parse.unquote_plus(line);
print(Village.readline());
When I run the program I get this error:
    Village_Parsed = Village_Parsed + urllib.parse.unquote_plus(line);
  File "C:\Python31\lib\urllib\parse.py", line 404, in unquote_plus
    string = string.replace('+', ' ')
TypeError: expected an object with the buffer interface
Any idea what is wrong here? Thanks in advance for any help :)
PROBLEM 1 is that urllib.parse.unquote_plus doesn't like the line that you have fed it. The message should be "Please supply a str object" :-) I suggest that you fix problem 2 below, and insert:
print('line', type(line), repr(line))
immediately after your for statement so that you can see what you are getting in line.
You will find that iterating over the gzip file yields bytes objects:
>>> [line for line in gzip.open('test.gz')]
[b'nudge nudge\n', b'wink wink\n']
Using a mode of 'r' has scant effect:
>>> [line for line in gzip.open('test.gz', 'r')]
[b'nudge nudge\n', b'wink wink\n']
I suggest that instead of passing line to the parsing routine you pass line.decode('UTF-8') ... or whatever encoding was used when the gz file was written.
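As a minimal sketch of that suggestion, the snippet below builds a small gzip stream in memory (a stand-in for the real village.gz, using the sample line from the question), decodes each bytes line, and only then unquotes it. The 'ascii' encoding is an assumption here; substitute the encoding the file was actually written with.

```python
import gzip
import io
import urllib.parse

# Hypothetical sample data standing in for the real village.gz file.
raw = b"398,%7EAnoniem+001%7E,543,480,7525010,1775,0\n"
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(raw)
buffer.seek(0)

decoded_lines = []
with gzip.GzipFile(fileobj=buffer, mode="rb") as gz:
    for line in gz:
        # Each line is a bytes object; decode it to str before unquoting.
        text = line.decode("ascii")  # assumed encoding
        decoded_lines.append(urllib.parse.unquote_plus(text))

print(decoded_lines)
```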
PROBLEM 2 is in this line:
Village_Parsed = str
str is a type. You need an empty str object. To get that, you could call the type, i.e. str(), which is formally correct but impractical/unusual/scoffable/weird when compared to using a string constant '' ... so do this:
Village_Parsed = ''
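To illustrate the difference, here is a small sketch with hypothetical variable names: concatenating onto the type str raises a TypeError, while starting from the empty string '' accumulates text as intended.

```python
# Assigning the type itself, as in the question:
accumulator = str
try:
    accumulator = accumulator + "abc"
    error = None
except TypeError as exc:
    # A type object cannot be concatenated with a str.
    error = exc

# Starting from an empty string works as intended:
parsed = ''
for chunk in ["abc", "def"]:
    parsed = parsed + chunk

print(type(error).__name__)
print(parsed)
```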
You also have PROBLEM 3: your last statement is trying to read the gz file after EOF.
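A rough illustration of what happens at that point, using an in-memory io.BytesIO stream as a stand-in for the gzip file (an assumption for the sake of a self-contained example): once the loop has consumed every line, readline() has nothing left and returns an empty bytes object.

```python
import io

# Stand-in for the opened gzip file; the contents are made up.
stream = io.BytesIO(b"nudge nudge\nwink wink\n")

lines = [line for line in stream]  # the loop consumes the whole stream
leftover = stream.readline()       # reading again after EOF

print(lines)
print(repr(leftover))
```

If the intent was to inspect the data, print the line inside the loop (or print the accumulated result afterwards) instead of reading from the exhausted file.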
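import gzip, os, urllib.parse

# Build the path from the first four components of Newest_Date
# (Root and Newest_Date are the variables from the question).
archive_relpath = os.sep.join(map(str, Newest_Date[:4])) + ' village.gz'
archive_path = os.path.join(Root, 'data', archive_relpath)
with gzip.open(archive_path) as Village:
    # Decode each bytes line to str, unquote it, and join the results.
    Village_Parsed = ''.join(urllib.parse.unquote_plus(line.decode('ascii'))
                             for line in Village)
print(Village_Parsed)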
Output:
398,~Anoniem 001~,543,480,7525010,1775,0
NOTE: RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax says:
This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.
Therefore 'ascii' in the line.decode('ascii') fragment should be replaced by whatever character encoding you've used to encode your text.
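The same consideration applies to the percent-escapes themselves: urllib.parse.unquote_plus takes an encoding argument (default 'utf-8') that controls how the escaped octets are turned into characters. The sketch below uses made-up sample strings to show the same text escaped under two different encodings.

```python
import urllib.parse

# "é" escaped as UTF-8 octets (the default encoding for unquote_plus).
utf8_text = urllib.parse.unquote_plus("caf%C3%A9+au+lait")

# "é" escaped as a single Latin-1 octet; the encoding must be stated.
latin1_text = urllib.parse.unquote_plus("caf%E9+au+lait", encoding="latin-1")

print(utf8_text)
print(latin1_text)
```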