Parsing XML and output encoding in PHP

I generate a lot of posts in WordPress from an XML file. The problem: accented characters.

The XML declaration of the feed is:

<?xml version="1.0" encoding="ISO-8859-15"?>

Here is the complete feed: http://flux.netaffiliation.com/rsscp.php?maff=177053821BA2E13E910D54

My site uses UTF-8.

So I use the utf8_encode() function, but that does not solve the problem: the accented characters still come out garbled.

Does anyone have an idea?

EDIT 04-10-2011 18:02 (French time):

Here is my code :

/**
 * Parse an RSS feed from NetAffiliation and convert each item to a post.
 * @param string $flux URL of the external feed
 * @return bool
 */
private function parseFluxNetAffiliation($flux)
{
    $content = file_get_contents($flux);
    $content = iconv("iso-8859-15", "utf-8", $content);

    $xml = new DOMDocument;
    $xml->loadXML($content);

    //get the first link : http://www.netaffiliation.com
    $link = $xml->getElementsByTagName('link')->item(0);
    //echo $link->textContent;

    //we get all items and create a multidimensional array
    $items = $xml->getElementsByTagName('item');

    $offers = array();
    //we walk items
    foreach($items as $item)
    {
        $childs = $item->childNodes;

        //we walk childs
        foreach($childs as $child)
        {
            $offers[$child->nodeName][] = $child->nodeValue;
        }

    }
    unset($offers['#text']);

    //we create one article foreach offer
    $nbrPosts = count($offers['title']);

    if($nbrPosts <= 0) 
    {
        echo self::getFeedback("Le flux ne contient aucune offre",'error');
        return false;
    }

    $i = 0;
    while($i < $nbrPosts)
    {
        // Create post object
        $description = '<p>'.$offers['description'][$i].'</p><p><a href="'.$offers['link'][$i].'" target="_blank">'.$offers['link'][$i].'</a></p>';

        $my_post = array(
            'post_title' => $offers['title'][$i],
            'post_content' => $description,
            'post_status' => 'publish',
            'post_author' => 1,
            'post_category' => array(self::getCatAffiliation())
        );

        // Insert the post into the database
        wp_insert_post($my_post);

        $i++;
    }

    echo self::getFeedback("Le flux a généré {$nbrPosts} article(s) depuis le flux NetAffiliation dans la catégorie affiliation",'updated');
    return false;

}

All the posts are generated, but the accented characters are garbled. You can see the result here: http://monsieur-mode.com/test/


There are plenty of difficulties to master when switching between different encodings. Encodings that use more than one byte per character (so-called multibyte encodings), such as UTF-8, which WordPress uses, deserve special attention in PHP.

  • First, make sure that all the files you create are saved with the same encoding in which they will be served. For example, the encoding you choose in the "Save as..." dialog should match the one you declare in the HTTP Content-Type header.
  • Second, verify that the input has the same encoding as the output you want to deliver. In your case, the input file is encoded in ISO-8859-15, so you'll need to convert it to UTF-8 using iconv().
  • Third, be aware that PHP's core string functions are byte-oriented and do not understand multibyte encodings such as UTF-8. Functions such as htmlentities() can produce strange characters when they assume a single-byte charset. Many of these functions have multibyte alternatives prefixed with mb_, and others accept a charset argument. If your encoding is UTF-8, check your files for such functions and replace or adjust them where necessary.

For more information about these topics, see the Wikipedia article on variable-width encodings and the relevant page in the PHP manual.
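
As a rough illustration of the third point, the sketch below compares a byte-oriented function with its mb_ counterpart and passes the charset to htmlentities() explicitly. It assumes the mbstring extension is loaded; the sample string is made up for the example.

```php
<?php
// "é" encoded as UTF-8 occupies two bytes (0xC3 0xA9).
$word = "caf\xC3\xA9";

// strlen() counts bytes, mb_strlen() counts characters.
$byteLength = strlen($word);
$charLength = mb_strlen($word, 'UTF-8');

// Naming the charset keeps htmlentities() from treating each
// byte of a multibyte character as a separate character.
$escaped = htmlentities($word, ENT_QUOTES, 'UTF-8');

echo $byteLength, ' ', $charLength, ' ', $escaped, "\n"; // 5 4 caf&eacute;
```

The difference between the two lengths is a quick way to spot code that still counts bytes where it should count characters.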


By default, most applications work with UTF-8 data and output UTF-8 content. WordPress is no exception and almost certainly works on a UTF-8 basis.

I would simply not convert anything when printing, and instead change your header to UTF-8 instead of ISO-8859-15.


If your incoming XML data is ISO-8859-15, use iconv() to convert it:

$stream = file_get_contents("stream.xml");
$stream = iconv("iso-8859-15", "utf-8", $stream);
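
The source charset matters here. utf8_encode() always assumes ISO-8859-1, which differs from ISO-8859-15 in a handful of positions, the euro sign among them. A minimal sketch of the difference, using mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode() and a made-up sample string (assumes the iconv and mbstring extensions):

```php
<?php
// Byte 0xA4 is the euro sign in ISO-8859-15,
// but the generic currency sign in ISO-8859-1.
$raw = "10 \xA4";

// Wrong source charset: yields U+00A4 (bytes C2 A4).
$wrong = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Correct source charset: yields U+20AC (bytes E2 82 AC).
$right = iconv('ISO-8859-15', 'UTF-8', $raw);

echo bin2hex($wrong), "\n"; // 3130 20 c2a4   -> "313020c2a4"
echo bin2hex($right), "\n"; // 3130 20 e282ac -> "313020e282ac"
```

Plain accented letters such as é sit at the same position in both charsets, so this mismatch only shows up on the few characters that differ.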


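mb_convert_encoding() saved my life.

Here is my solution:

    // Drop the stale encoding declaration so the parser no longer expects ISO-8859-15
    $content = preg_replace('/ encoding="ISO-8859-15"/is','',$content);
    // Name the source charset explicitly instead of relying on the internal encoding
    $content = mb_convert_encoding($content,"UTF-8","ISO-8859-15");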
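
A likely reason the preg_replace() step matters: once the bytes are UTF-8, a declaration that still says ISO-8859-15 makes the parser decode them a second time. The sketch below compares both cases on a made-up one-item document. The firstTitle() helper exists only for this example, and it assumes the underlying libxml build can decode ISO-8859-15.

```php
<?php
// Hypothetical sample: bytes already converted to UTF-8,
// XML declaration left untouched.
$converted = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?>\n"
           . "<rss><title>Cr\xC3\xA9ation</title></rss>";

// Helper for this sketch only: text of the first <title> element.
function firstTitle(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    return $doc->getElementsByTagName('title')->item(0)->nodeValue;
}

// Stale declaration: each UTF-8 byte of "é" is read as its own
// ISO-8859-15 character, so the title comes back double-encoded.
$stale = firstTitle($converted);

// Declaration removed: the parser falls back to UTF-8.
$fixed = firstTitle(preg_replace('/ encoding="ISO-8859-15"/i', '', $converted));

echo bin2hex($stale), "\n";
echo bin2hex($fixed), "\n";
```

Comparing the two hex dumps shows the extra bytes introduced by the second decoding pass.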