Python xml.dom.minidom.parse and a UTF-8 XML file with Cyrillic text
Sorry for my English :)
I have a problem parsing a UTF-8 XML file that has Cyrillic text in its content.
Some rows from the XML:
............
<programme start="20110405022000 +0300"
stop="20110405031000 +0300" channel="4000"> <title
lang="bul">Модерно</title> <sub-title
lang="bul"></sub-title> <desc
lang="bul">Тоук шоу. Модерно е токшоу
с водещ и продуцент Радост Драганова.
Предаването разисква всички теми,
които интересуват жените, като им
помага да изглеждат по-добре и да се
чувстват по-добре</desc> <category
lang="bul">0</category> </programme>
<programme start="20110405031000 +0300"
stop="20110405050000 +0300" channel="4000"> <title
lang="bul">Клонинг</title> <sub-title
lang="bul"></sub-title> <desc
lang="bul">Еп. 89 и 90, сериал.
Любовта между Хаде и Лукас се ражда в
Мароко, където двамата се запознават.
Но мюсюлманските обичаи разделят
влюбените. Хаде е родена и израснала в
САЩ, но след смъртта на майка си
заминава за Мароко при чичо си
Али</desc> <category
lang="bul">0</category> </programme>
............
I use DOMTree = xml.dom.minidom.parse("text.xml") and get an error:
Traceback (most recent call last):
File "t3.py", line 9, in <module>
DOMTree = parse(datasource)
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 10, column 18
Line 10, column 18 is the first Cyrillic character. The first row of the XML file is
<?xml version="1.0" encoding="utf-8"?>
Any ideas?
Your XML file has to be well formed, i.e. it must have exactly one root element. If yours does not, try adding an opening root tag at the beginning of your input file (after the XML declaration) and the matching closing tag at the end.
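A minimal sketch of the root-element suggestion, assuming the elements are available as text without an XML declaration. The helper name `parse_with_root` and the wrapper tag `tv` are hypothetical, chosen only for illustration; the sample elements are shortened versions of the ones in the question.

```python
# -*- coding: utf-8 -*-
from xml.dom import minidom


def parse_with_root(fragment_text, root_tag="tv"):
    """Wrap sibling elements in a single root element and parse the result.

    fragment_text: text holding the elements, without an XML declaration.
    root_tag: hypothetical wrapper name, used only in this sketch.
    """
    wrapped = u"<{0}>{1}</{0}>".format(root_tag, fragment_text)
    # With no declaration present, the parser treats the bytes as UTF-8.
    return minidom.parseString(wrapped.encode("utf-8"))


sample = (
    u'<programme channel="4000"><title lang="bul">Модерно</title></programme>'
    u'<programme channel="4000"><title lang="bul">Клонинг</title></programme>'
)

dom = parse_with_root(sample)
titles = [t.firstChild.data for t in dom.getElementsByTagName("title")]
```

This only addresses the "one root element" requirement. It does not help if the bytes in the file do not match the encoding named in the declaration.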
You say """If i change coding in first row to koi8-r it works. But i want to work with utf-8."""
I presume that you mean that it works if the XML file starts with
<?xml version="1.0" encoding="KOI8-R" ?>
If that is true, then your file is encoded in KOI8-R.
If you want to work with UTF-8 input files, then you should NOT encode your files in KOI8-R, or you should transcode the file(s) from KOI8-R to UTF-8.
If "i want to work with utf-8" means something else, please explain.
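To illustrate the transcoding option, here is a rough sketch that assumes the file really is KOI8-R but declares UTF-8. The helper `transcode_to_utf8` is a hypothetical name, and the one-element document is made up to keep the example short. It first reproduces the mismatch, then re-encodes the bytes and rewrites the declaration so the two agree.

```python
# -*- coding: utf-8 -*-
import re
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def transcode_to_utf8(raw, source_encoding="koi8-r"):
    """Decode raw bytes from source_encoding and return UTF-8 bytes
    that start with a matching XML declaration."""
    text = raw.decode(source_encoding)
    # Drop any existing declaration; a fresh UTF-8 one is prepended below.
    text = re.sub(r'^\s*<\?xml[^>]*\?>', u'', text)
    return (u'<?xml version="1.0" encoding="utf-8"?>' + text).encode("utf-8")


# Simulate the situation: the declaration says utf-8, the bytes are KOI8-R.
mislabeled = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<title lang="bul">Модерно</title>'
).encode("koi8-r")

try:
    minidom.parseString(mislabeled)
    failed = False
except ExpatError:
    # The KOI8-R bytes are not valid UTF-8 sequences, so the parser rejects them.
    failed = True

fixed = transcode_to_utf8(mislabeled)
title = minidom.parseString(fixed).documentElement.firstChild.data
```

For a real file you would read the bytes with `open(path, "rb").read()`, pass them through the helper, and write the result back in binary mode.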
I would suggest using chardet; the following code might help. My XML data was in GB2312, and I used chardet simply to find the encoding so that I could convert my source to UTF-8. I hope this helps.
import chardet
# xml_data_source holds the raw bytes of the XML file
xml_data_type = chardet.detect(xml_data_source)['encoding']
print(xml_data_type)  # check your encoding
xml_data = xml_data_source.decode(xml_data_type)
xml_data = xml_data.encode("utf-8")
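If chardet is not installed, a cruder standard-library alternative is to try a short list of candidate encodings in order. This is a different technique from chardet's statistical detection, not a replacement for it. The helper name `to_utf8` and the candidate list are assumptions for this sketch. Single-byte encodings such as KOI8-R accept almost any byte sequence, so they should come last.

```python
# -*- coding: utf-8 -*-
def to_utf8(raw, candidates=("utf-8", "koi8-r")):
    """Return (utf8_bytes, encoding_used), trying each candidate in order."""
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next
        return text.encode("utf-8"), name
    raise ValueError("none of the candidate encodings matched")


koi8_bytes = u"Клонинг".encode("koi8-r")
utf8_bytes, used = to_utf8(koi8_bytes)
```

As with the chardet approach, the `encoding` attribute in the XML declaration still has to be changed to match the converted bytes before parsing.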