
Python - problem with accented chars when scraping data from website

I'm Nicola, new to Python and without a real background in computer programming, so I could really use some help with a problem I have. I wrote a script to scrape data from this webpage:

http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02

Basically, the goal of my code is to scrape the data from all the tables on the page and write them to a txt file. Here is my code:

#!/usr/bin/env python


from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os


def extract(soup):
    table = soup.findAll("table")[1]
    for row in table.findAll('tr')[1:19]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[2]
    for row in table.findAll('tr')[1:21]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[3]
    for row in table.findAll('tr')[1:44]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[4]
    for row in table.findAll('tr')[1:18]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[5]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[6]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)


outfile = open("modena_quadro02.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()

Everything would work fine, except that the first column of some tables on that page contains words with accented characters. When I run the code, I get the following:

Traceback (most recent call last):
  File "modena2.py", line 158, in <module>
    extract(soup1)
  File "modena2.py", line 98, in extract
    print >> outfile, "|".join(record)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 32: ordinal not in range(128)

I know that the problem is with the encoding of the accented characters. I tried to find a solution, but it really goes beyond my knowledge. I want to thank in advance everybody who is going to help me. I really appreciate it! And sorry if the question is too basic, but, as I said, I'm just getting started with Python and learning everything by myself.

Thanks! Nicola


I'm going to try again based on feedback. Since you are using the print statement to produce the output, your output must be bytes, not characters (that's the reality of present-day operating systems). By default, Python's sys.stdout (what the print statement writes to) uses the 'ascii' character encoding. Only byte values 0 to 127 are defined by ASCII, so those are the only values you can print. Hence the error for the character u'\xe0', which falls outside that range.

You can change the character encoding of sys.stdout to UTF-8 by doing this:

import codecs, sys

# Wrap stdout so unicode strings are encoded to UTF-8 on the way out
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print u'|'.join([u'abc', u'\u0100'])

The print statement above will not complain about printing a Unicode string that cannot be represented in ASCII. However, the code below, which prints bytes, not characters, produces a UnicodeDecodeError exception, so beware:

import codecs, sys

sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
# '\xe0' here is a byte string; the writer tries to decode it as ASCII first
print '|'.join(['abc', '\xe0'])

You may find that your code is trying to print characters, and that setting the character encoding of sys.stdout to UTF-8 (or ISO-8859-1) fixes it. But you might find that the code is trying to print bytes (obtained from the BeautifulSoup API), in which case the fix might be something like this:

import codecs, sys

sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
# Decode the bytes to unicode first, naming the encoding they are actually in
print '|'.join(['abc', '\xe0']).decode('ISO-8859-1')

I'm not familiar with the BeautifulSoup package, but I advise testing it with various documents to see whether its detection of character encoding is correct. Your code does not explicitly provide an encoding, so BeautifulSoup is clearly deciding on one on its own. If that decision comes from the meta encoding tag, then great.
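
One way to check is to look at the encoding BeautifulSoup thinks it found. A minimal sketch, assuming BeautifulSoup 3, where the detected encoding is exposed as soup.originalEncoding (the sample HTML here is made up for illustration):

from BeautifulSoup import BeautifulSoup

# A made-up document that declares its charset in a meta tag
html = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=ISO-8859-1"></head>'
        '<body>citt\xe0</body></html>')

soup = BeautifulSoup(html)
# BeautifulSoup 3 records the encoding it settled on (via UnicodeDammit)
print soup.originalEncoding

If the guess is wrong, BeautifulSoup 3 also accepts an explicit override: BeautifulSoup(html, fromEncoding='ISO-8859-1').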


Edit: I just tried it, and since I assume you want a table in the end, here is a solution that writes a CSV instead.

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os
import csv


def extract(soup):
    table = soup.findAll("table")[1]
    for row in table.findAll('tr')[1:19]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        # csv rows must be byte strings in Python 2, so encode unicode values as UTF-8
        outfile.writerow([s.encode('utf8') if isinstance(s, unicode) else s for s in record])

    # swap the print statement for outfile.writerow() in all the other blocks as well
    # ...

outfile = csv.writer(open(r'modena_quadro02.csv','wb'))
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
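
Since the six per-table blocks differ only in the table index and the row slice, the repetition could also be collapsed into a single loop. A minimal sketch under the same assumptions, with the index/slice pairs copied from the original blocks:

def extract(soup, writer):
    # (table index, row slice) pairs taken from the six blocks above
    pieces = [(1, slice(1, 19)), (2, slice(1, 21)), (3, slice(1, 44)),
              (4, slice(1, 18)), (5, slice(1, None)), (6, slice(1, None))]
    for index, rows in pieces:
        table = soup.findAll("table")[index]
        for row in table.findAll('tr')[rows]:
            col = row.findAll('td')
            record = [c.string for c in col[:4]]
            writer.writerow([s.encode('utf8') if isinstance(s, unicode) else s
                             for s in record])

With this version you would call extract(soup1, outfile) instead of extract(soup1).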


I had a similar issue last week. It was easy to fix in my IDE (PyCharm).

Here was my fix:

Starting from the PyCharm menu bar: File -> Settings... -> Editor -> File Encodings, set "IDE Encoding", "Project Encoding" and "Default encoding for properties files" all to UTF-8, and it now works like a charm.

Hope this helps!


The issue is with printing Unicode text to a binary file:

>>> print >>open('e0.txt', 'wb'), u'\xe0'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 0: ordinal not in range(128)

To fix it, either encode the Unicode text into bytes (u'\xe0'.encode('utf-8')) or open the file in text mode:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('e0.utf8.txt', 'w', encoding='utf-8') as file:
    print(u'\xe0', file=file)
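
For completeness, the first option (encoding the text yourself) would look like this; a minimal sketch that writes the same character through a binary-mode file:

# Encode the unicode string to UTF-8 bytes explicitly, then write the bytes
with open('e0.txt', 'wb') as file:
    file.write(u'\xe0'.encode('utf-8'))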


Try changing this line:

html1 = page1.read()

To this:

html1 = page1.read().decode(encoding)

where encoding would be, for example, 'UTF-8', 'ISO-8859-1', etc. I'm not familiar with the mechanize package, but hopefully there is a way to discover the encoding of the document returned by the read() method. It seems that read() gives you a byte string, not a character string, so the join call later on has to assume ASCII as the encoding.
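
One place to discover the encoding is the HTTP Content-Type header. A minimal sketch, assuming mechanize's response object exposes the standard urllib2-style info() message of Python 2 (the ISO-8859-1 fallback is my assumption, not something the server promises):

from mechanize import Browser

url = ("http://finanzalocale.interno.it/sitophp/showQuadro.php?"
       "codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&"
       "cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02")

br = Browser()
br.set_handle_robots(False)
page1 = br.open(url)

# info() returns a mimetools.Message; getparam('charset') pulls the charset
# parameter out of the Content-Type header, if the server sent one.
encoding = page1.info().getparam('charset') or 'ISO-8859-1'
html1 = page1.read().decode(encoding)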
