How to: remove part of a Unicode string in Python following a special character
First, a short summary:
Python version: 3.1; system: Linux (Ubuntu)
I am trying to do some data retrieval through Python and BeautifulSoup.
Unfortunately, some of the tables I am trying to process contain cells with text like the following:
789.82 ± 10.28
For this to work I need two things:
How do I handle "weird" symbols such as ±, and how do I remove the part of the string containing ± and everything to the right of it?
Currently I get an error like: SyntaxError: Non-ASCII character '\xc2' in file ......
Thank you for your help
[edit]:
# data retrieval from HTML files from DETHERM
# -*- coding: utf8 -*-
import sys,os,re
from BeautifulSoup import BeautifulSoup
sys.path.insert(0, os.getcwd())
raw_data = open('download.php.html','r')
soup = BeautifulSoup(raw_data)
for numdiv in soup.findAll('div', {"id" : "sec"}):
    currenttable = numdiv.find('table',{"class" : "data"})
    if currenttable:
        numrow=0
        for row in currenttable.findAll('td', {"class" : "dataHead"}):
            numrow=numrow+1
        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
            if col2.index('±'):
                col2=col2[:col2.index('±')]
            print(col)
        print(numrow)
        ref=numdiv.find('a')
        niceref=''.join(ref.findAll(text=True))
        print(niceref)
Now this code fails with the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Where did the ASCII reference come from?
You need to have your Python file encoded in utf-8. Otherwise, it's quite trivial:
>>> s = '789.82 ± 10.28'
>>> s[:s.index('±')]
'789.82 '
>>> s.partition('±')
('789.82 ', '±', ' 10.28')
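As a rough sketch of how the same idea could be wrapped up for table cells, here is a small helper built on partition(). The name strip_uncertainty is hypothetical and not from the original code. It writes the symbol as the escape '\u00b1' so the example does not depend on how the source file is saved, and it assumes Python 3 strings. The last few lines show what is most likely behind the 'ascii' message in the question: ± is the two bytes 0xc2 0xb1 in UTF-8, and under Python 2 mixing a byte-string literal with the unicode text from BeautifulSoup triggers an implicit ASCII decode of those bytes. That diagnosis is an inference from the byte 0xc2 in the error, not something confirmed in the post.

```python
# Sketch only: strip_uncertainty is a hypothetical helper, not part of the
# original post. Assumes Python 3, where str is Unicode text.

def strip_uncertainty(cell_text, marker='\u00b1'):
    """Return the text before the first marker, stripped of whitespace."""
    # partition() never raises: if the marker is absent it returns
    # (cell_text, '', ''), so the whole string is kept.
    value, _, _ = cell_text.partition(marker)
    return value.strip()


print(strip_uncertainty('789.82 \u00b1 10.28'))  # 789.82
print(strip_uncertainty('42.0'))                 # 42.0

# Likely origin of the 'ascii' codec error (an assumption based on the
# reported byte 0xc2): the UTF-8 form of the marker is 0xc2 0xb1, and
# decoding those bytes as ASCII fails on the first one.
marker_bytes = '\u00b1'.encode('utf-8')
try:
    marker_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)
```

Unlike index(), which raises ValueError when the symbol is missing and returns a falsy 0 when the symbol comes first, partition() handles both cases without a separate check.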