parsing xml and output encoding in php
I generate a lot of posts in Wordpress from an XML file. The worry: accented ch开发者_JAVA百科aracters.
The header of the stream is:
<? Xml version = "1.0" encoding = "ISO-8859-15"?>
Here is the complete flux : http://flux.netaffiliation.com/rsscp.php?maff=177053821BA2E13E910D54
My site is in utf8.
So I use the function utf8_encode ... but that does not solve the problem, the accents are always misunderstood.
Does anyone have an idea?
EDIT 04-10-2011 18:02 (french hour) :
Here is the complete flux : http://flux.netaffiliation.com/rsscp.php?maff=177053821BA2E13E910D54
Here is my code :
/**
* parse an rss flux from netaffiliation and convert each item to posts
* @var $flux = external link
* @return bool
*/
private function parseFluxNetAffiliation($flux)
{
$content = file_get_contents($flux);
$content = iconv("iso-8859-15", "utf-8", $content);
$xml = new DOMDocument;
$xml->loadXML($content);
//get the first link : http://www.netaffiliation.com
$link = $xml->getElementsByTagName('link')->item(0);
//echo $link->textContent;
//we get all items and create a multidimentionnal array
$items = $xml->getElementsByTagName('item');
$offers = array();
//we walk items
foreach($items as $item)
{
$childs = $item->childNodes;
//we walk childs
foreach($childs as $child)
{
$offers[$child->nodeName][] = $child->nodeValue;
}
}
unset($offers['#text']);
//we create one article foreach offer
$nbrPosts = count($offers['title']);
if($nbrPosts <= 0)
{
echo self::getFeedback("Le flux ne continent aucune offre",'error');
return false;
}
$i = 0;
while($i < $nbrPosts)
{
// Create post object
$description = '<p>'.$offers['description'][$i].'</p><p><a href="'.$offers['link'][$i].'" target="_blank">'.$offers['link'][$i].'</a></p>';
$my_post = array(
'post_title' => $offers['title'][$i],
'post_content' => $description,
'post_status' => 'publish',
'post_author' => 1,
'post_category' => array(self::getCatAffiliation())
);
// Insert the post into the database
if(!wp_insert_post($my_post));;
$i++;
}
echo self::getFeedback("Le flux a généré {$nbrPosts} article(s) depuis le flux NetAffiliation dans la catégorie affiliation",'updated');
return false;
}
All the posts are generated but... the accented chars are ugly. You can see the result here: http://monsieur-mode.com/test/
There are plenty difficulties which you have to master when swapping between different encodings. Also, encodings which use more than one byte to encode characters (so-called multibyte-encodings) like UTF-8, which is used by WordPress, deserve special attention in PHP.
- First, make sure that all the files you create are saved with the same encoding as they will be served. For example, make sure you set the same encoding as in the "Save as..."-dialog as you use in the HTTP
Content-Type
header. - Second, you need to verify that the input has the same encoding as the file you want to deliver. In your case, the input file has the encoding
ISO-8859-15
, so you'll need to convert it toUTF-8
usingiconv()
. - Third, you must know that PHP doesn't natively support multibyte-encodings such as
UTF-8
. Functions such ashtmlentities()
will produce strange characters. For many of these functions, there are multibyte-alternatives, which are prefixed withmb_
. If your encoding isUTF-8
, check your files for such functions and replace them if necessary.
For more information about these topics, see Wikipedia about variable-width encodings, and the page in the PHP-Manual.
By default, most application work with UTF-8 data and output UTF-8 content. Wordpress should definitely not be apart and surely works on a UTF-8 basis.
I would simply not convert at all any information when printing, but instead change your header to UTF-8 instead of ISO-8859-15.
If your incoming XML data is ISO-8859-15, use iconv()
to convert it:
$stream = file_get_contents("stream.xml");
$stream = iconv("iso-8859-15", "utf-8", $stream);
mb_convert_encoding()
saves my life.
Here is my solution :
$content = preg_replace('/ encoding="ISO-8859-15"/is','',$content);
$content = mb_convert_encoding($content,"UTF-8");
精彩评论