python xml.dom.minidom.parse and utf-8 xml file with cyrillic

2023-02-23 01:56

sorry for my english :)

I have a problem parsing a UTF-8 XML file that contains Cyrillic text in its content.

Some rows from the XML:

............

<programme start="20110405022000 +0300"
stop="20110405031000 +0300" channel="4000"> <title
lang="bul">Модерно</title> <sub-title
lang="bul"></sub-title> <desc
lang="bul">Тоук шоу. Модерно е токшоу
с водещ и продуцент Радост Драганова.
Предаването разисква всички теми,
които интересуват жените, като им
помага да изглеждат по-добре и да се
чувстват по-добре</desc> <category
lang="bul">0</category> </programme>
<programme start="20110405031000 +0300"
stop="20110405050000 +0300" channel="4000"> <title
lang="bul">Клонинг</title> <sub-title
lang="bul"></sub-title> <desc
lang="bul">Еп. 89 и 90, сериал.
Любовта между Хаде и Лукас се ражда в
Мароко, където двамата се запознават.
Но мюсюлманските обичаи разделят
влюбените. Хаде е родена и израснала в
САЩ, но след смъртта на майка си
заминава за Мароко при чичо си
Али</desc> <category
lang="bul">0</category> </programme>

............

I use DOMTree = xml.dom.minidom.parse("text.xml") and get an error:

Traceback (most recent call last):
  File "t3.py", line 9, in <module>
    DOMTree = parse(datasource)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 10, column 18

Line 10, column 18 is the first Cyrillic character. The first row of the XML file is

<?xml version="1.0" encoding="utf-8"?>

Any ideas?

Your XML file has to be well formed; among other things, it must have exactly one root element. Try adding root tags at the beginning and the end of your input file.
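A small sketch of what that check looks like in practice, using xml.dom.minidom.parseString on in-memory strings. The sample fragment, the is_well_formed helper and the <tv> wrapper name are assumptions made for illustration; the real file may already have a root element.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Two top-level elements: not well formed, because XML allows only one root.
fragment = (
    '<programme channel="4000"><title lang="bul">A</title></programme>'
    '<programme channel="4000"><title lang="bul">B</title></programme>'
)


def is_well_formed(text):
    """Return True if expat can parse the text, False otherwise."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


# Wrapping the fragment in a single (hypothetical) root element fixes that.
wrapped = "<tv>" + fragment + "</tv>"

print(is_well_formed(fragment))
print(is_well_formed(wrapped))
```

Note that expat typically reports a second root element as "junk after document element", so if the error is "invalid token" at the first non-ASCII character, a missing root is probably not the only thing to check.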

You say """If i change coding in first row to koi8-r it works. But i want to work with utf-8."""

I presume that you mean that it works if the XML file starts with

<?xml version="1.0" encoding="KOI8-R" ?>

If that is true, then your file is encoded in KOI8-R.

If you want to work with UTF-8 input files, then you should NOT encode your files in KOI8-R, or you should transcode the file(s) from KOI8-R to UTF-8.

If "i want to work with utf-8" means something else, please explain.
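To make the transcoding step concrete, here is a minimal sketch, written for Python 3, that checks whether a byte string is valid UTF-8 and converts KOI8-R bytes to UTF-8. The helper names is_valid_utf8 and koi8r_to_utf8 and the sample document are hypothetical; for a real file you would first read its bytes in binary mode.

```python
from xml.dom import minidom


def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def koi8r_to_utf8(raw):
    """Transcode KOI8-R bytes to UTF-8 and update the XML declaration."""
    text = raw.decode("koi8-r")
    # The declaration must name the encoding the bytes are really in.
    text = text.replace('encoding="KOI8-R"', 'encoding="utf-8"', 1)
    return text.encode("utf-8")


# Sample document (an assumption for illustration), stored as KOI8-R bytes.
sample = '<?xml version="1.0" encoding="KOI8-R"?><title lang="bul">Модерно</title>'
koi8_bytes = sample.encode("koi8-r")

utf8_bytes = koi8r_to_utf8(koi8_bytes)
dom = minidom.parseString(utf8_bytes)
title = dom.documentElement.firstChild.data

print(is_valid_utf8(koi8_bytes))
print(is_valid_utf8(utf8_bytes))
print(title)
```

If is_valid_utf8 returns False for the original file, the declaration in its first row does not match the bytes on disk, which would be consistent with the "invalid token" error appearing at the first Cyrillic character.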

I would suggest using chardet. The following code might help. My XML data was encoded as GB2312, and I used chardet to detect the encoding so that I could convert the source to UTF-8. I hope this helps.

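import chardet

# xml_data_source holds the raw bytes read from the XML file
xml_data_encoding = chardet.detect(xml_data_source)['encoding']

print(xml_data_encoding)  # check the detected encoding

# decode with the detected encoding, then re-encode as UTF-8
xml_data = xml_data_source.decode(xml_data_encoding)

xml_data = xml_data.encode("utf-8")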
