XML read error because of bad UTF8 encoding

I'm trying to create a script to export my comments to Disqus and, in order to do that, I need to make a huge XML file.

I have a problem with UTF-8 encoding. The file is supposed to be in UTF-8, but I have to call utf8_decode for my Spanish characters to be shown properly.

The generated file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:dsq="http://www.disqus.com/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.0/"
>
<channel>
    <wp:comment>
        <wp:comment_id>26</wp:comment_id>
        <wp:comment_author>KA_DIE</wp:comment_author>
        <wp:comment_author_email> </wp:comment_author_email>
        <wp:comment_author_url></wp:comment_author_url>
        <wp:comment_author_IP> </wp:comment_author_IP>
        <wp:comment_date_gmt>2009-07-16 18:53:19</wp:comment_date_gmt>
        <wp:comment_content><![CDATA[WTF TEH Gladios en español <br />tnx tnx <br />me usta mucho esa web estoy pendiente mucho se su actualziacion es buen saber ke esta en español <br />x que solo entendia el 80, 90% de la paguina jiji]]></wp:comment_content>
        <wp:comment_approved>1</wp:comment_approved>
        <wp:comment_parent>0</wp:comment_parent>
    </wp:comment>
</channel>
</rss>

I removed some data, such as the IP and email, for security reasons. As you can see, the content contains the letter "ñ". But the XML shown throws an error:

XML read error: bad composed

I don't know the exact translation of the message, but it fails on the content line. The XML is generated with this code:

public function generateXmlElement (){
            $xml = "<wp:comment>
                        <wp:comment_id>$this->id</wp:comment_id>
                        <wp:comment_author>$this->author</wp:comment_author>
                        <wp:comment_author_email>$this->author_email</wp:comment_author_email>
                        <wp:comment_author_url>$this->author_url</wp:comment_author_url>
                        <wp:comment_author_IP>$this->author_ip</wp:comment_author_IP>
                        <wp:comment_date_gmt>$this->date</wp:comment_date_gmt>
                        <wp:comment_content><![CDATA[$this->content]]></wp:comment_content>
                        <wp:comment_approved>$this->approved</wp:comment_approved>
                        <wp:comment_parent>0</wp:comment_parent>
            </wp:comment>";
            return $xml;
        }

The result is then written to a file with fwrite.

Do you know what the problem could be?


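The problem is most likely that your XML isn't actually UTF-8 encoded, but is something else (ISO-8859-1?). The character 'ñ' (U+00F1) is encoded in UTF-8 as two octets, 0xC3 0xB1. In both the Windows-1252 code page and ISO-8859-1, 'ñ' is the single octet 0xF1. A lone 0xF1 followed by an ASCII letter is not a valid UTF-8 sequence, so a parser that trusts the UTF-8 declaration rejects the document at that byte.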
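
To make the difference concrete, here is a minimal PHP sketch. It writes the "ñ" as explicit bytes, so it doesn't depend on how the script file itself is saved. The helper isValidUtf8 is a hypothetical name of mine, not something from your code. It relies on the fact that preg_match with the /u modifier fails on a subject that isn't valid UTF-8.

```php
<?php
// "ñ" (U+00F1) spelled out as raw bytes.
$utf8   = "espa\xC3\xB1ol"; // UTF-8: two octets, C3 B1
$latin1 = "espa\xF1ol";     // ISO-8859-1 / Windows-1252: one octet, F1

// Hypothetical helper: the empty pattern with /u only matches
// when the subject string is valid UTF-8.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Print the bytes of each string and whether they form valid UTF-8.
echo bin2hex($utf8), ' ', var_export(isValidUtf8($utf8), true), "\n";
echo bin2hex($latin1), ' ', var_export(isValidUtf8($latin1), true), "\n";
```

Running the same check on the comment text right before it goes into the XML should tell you whether the data is already ISO-8859-1 at that point.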

Does your XML file have a Unicode BOM (U+FEFF) at the beginning of the file? The BOM, if present, indicates the encoding and byte order.

  • 0xEFBBBF: UTF-8. Byte order isn't significant.
  • Byte order is significant for UTF-16 and UTF-32:
    • 0xFFFE: UTF-16, little-endian
    • 0xFEFF: UTF-16, big-endian
    • 0xFFFE0000: UTF-32, little-endian
    • 0x0000FEFF: UTF-32, big-endian
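
The signatures listed above can be checked with a few lines of PHP. This is only a sketch: detectBom is a hypothetical helper name, and it looks at nothing but the leading bytes. The longer UTF-32 signatures are tested first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```php
<?php
// Hypothetical helper: guess the encoding from a leading byte order mark.
// Returns null when the data doesn't start with a known BOM.
function detectBom(string $bytes): ?string
{
    // Longest signatures first, so FF FE 00 00 isn't mistaken for UTF-16LE.
    $signatures = [
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFF\xFE"         => 'UTF-16LE',
        "\xFE\xFF"         => 'UTF-16BE',
    ];
    foreach ($signatures as $bom => $name) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }
    return null;
}

// Example inputs: one with a UTF-8 BOM, one with no BOM at all.
var_dump(detectBom("\xEF\xBB\xBF<?xml version=\"1.0\"?>"));
var_dump(detectBom("<?xml version=\"1.0\"?>"));
```

For a real file you would pass in its first four bytes. A null result only means there is no BOM, which is normal for UTF-8 and for ISO-8859-1 files alike.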

The XML standard says that if no BOM is present and no XML declaration indicating the encoding is present, the document shall be interpreted as UTF-8 encoded by default. I believe it's left fuzzy as to what happens if there is a discrepancy between the BOM (if present) and the encoding specified in the XML declaration.

It may be that your file has an incorrect XML declaration (e.g., rather than saying UTF-8, the XML declaration should say something like ISO-8859-1).


You should be using a proper XML library to generate XML. LibXML2 comes bundled with PHP and is accessible from PHP's DOM API. That will handle the encoding issues, among other things. As is usually the case with such things, it's an upfront learning investment the benefit of which will not immediately be clear. But a benefit there is.
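
As a rough illustration of the DOM approach, here is a minimal sketch. It is not a drop-in replacement for your generateXmlElement: buildComment, the array keys, and the sample text are hypothetical, and only a few of the fields are filled in. It also assumes the stored comments are ISO-8859-1 and converts them with mb_convert_encoding (mbstring extension), because the DOM methods expect UTF-8 strings.

```php
<?php
const WP_NS = 'http://wordpress.org/export/1.0/';

// Hypothetical stand-in for generateXmlElement(): builds one <wp:comment>.
function buildComment(DOMDocument $doc, array $c): DOMElement
{
    $comment = $doc->createElementNS(WP_NS, 'wp:comment');
    $fields = ['comment_id' => $c['id'], 'comment_author' => $c['author']];
    foreach ($fields as $name => $value) {
        $el = $doc->createElementNS(WP_NS, "wp:$name");
        // Text nodes get their special characters escaped when serialized.
        $el->appendChild($doc->createTextNode((string) $value));
        $comment->appendChild($el);
    }
    $content = $doc->createElementNS(WP_NS, 'wp:comment_content');
    $content->appendChild($doc->createCDATASection($c['content']));
    $comment->appendChild($content);
    return $comment;
}

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

$rss = $doc->createElement('rss');
$rss->setAttribute('version', '2.0');
// Declare the wp prefix once, on the root element.
$rss->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns:wp', WP_NS);
$doc->appendChild($rss);
$channel = $doc->createElement('channel');
$rss->appendChild($channel);

// Assumption: the stored text is ISO-8859-1, so convert it to UTF-8 first.
$stored = "Gladios en espa\xF1ol";
$channel->appendChild(buildComment($doc, [
    'id'      => 26,
    'author'  => 'KA_DIE',
    'content' => mb_convert_encoding($stored, 'UTF-8', 'ISO-8859-1'),
]));

$xml = $doc->saveXML();
echo $xml;
```

The string returned by saveXML() can be written out with fwrite or file_put_contents as before. If your data turns out to be UTF-8 already, drop the mb_convert_encoding call, otherwise the text would be encoded twice.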
