PHP: SimpleXML with different code pages, and getting the data out correctly
I am working on a project where I receive XML files from different sources. My PHP script should read them, parse them, and store them in a MySQL database.
To parse the XML files, I use the SimpleXMLElement class in PHP. I receive files from Belgium in UTF-8 encoding, from Germany in iso-8859-1 encoding, from the Czech Republic in cp1250, and so on...
When I pass the XML data to SimpleXMLElement and print the result of asXML() on the object, I see the XML exactly as it was in the original file. But when I assign a field to a PHP variable and print that variable on the screen, the text looks corrupted, and it is of course also corrupted when inserted into the MySQL database.
Example:
The XML:
<?xml version="1.0" encoding="cp1250"?>
...
<name>Labe Dìèín - Rozb 741,85km ; Dìèín - Rozb 741,85km </name>
...
The PHP code:
$sxml = file_get_contents("test.xml");
$xml = new SimpleXMLElement($sxml);
//echo $xml->asXML() . "\n"; // content will show up correctly in the shell
$name = (string)$xml->ftm->fairway_section->geo_object->name;
echo $name . "\n";
Running this code in a bash shell on Linux moves the cursor upwards and then prints: bÃn - Rozb 741,85km ; DÄ (the cursor movement is presumably caused by the incorrect characters that PHP prints)
I think PHP converts the data to UTF-8 when it stores it in a string variable, so I presumed that using mb_convert_encoding to convert from UTF-8 to cp1250 would give the correct result, but it doesn't. I also need to store the data in a format that can be combined with all the other sources.
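For what it's worth, the presumption about UTF-8 can be checked directly: SimpleXML hands back UTF-8 strings no matter which encoding the XML declaration names. The sketch below builds three small in-memory documents as stand-ins for the real files. The sample texts, the `<root>` wrapper, and the `parseName()` helper are made up for illustration, and it assumes the local iconv and libxml builds know the `WINDOWS-1250` name.

```php
<?php
// Hypothetical sample values, one per source encoding. "\u{...}" escapes are
// used so the example does not depend on how this script file itself is saved.
$samples = [
    'UTF-8'        => "Li\u{00E8}ge",                 // Liège
    'ISO-8859-1'   => "K\u{00F6}ln",                  // Köln
    'WINDOWS-1250' => "D\u{011B}\u{010D}\u{00ED}n",   // Děčín
];

// Hypothetical helper: parse raw XML bytes and return the <name> text.
function parseName(string $rawXml): string
{
    $xml = new SimpleXMLElement($rawXml);
    return (string) $xml->name; // SimpleXML returns UTF-8 here
}

$names = [];
foreach ($samples as $encoding => $text) {
    // Build the document as UTF-8, then re-encode it the way the source would send it.
    $doc = '<?xml version="1.0" encoding="' . $encoding . '"?>'
         . '<root><name>' . $text . '</name></root>';
    $raw = iconv('UTF-8', $encoding, $doc);

    $names[$encoding] = parseName($raw);
}

foreach ($names as $encoding => $name) {
    // Every value is now UTF-8, so the three sources can be stored together.
    echo $encoding, ' => ', bin2hex($name), "\n";
}
```

If that holds for the real files, the strings can stay in UTF-8 for storage, and a conversion is only needed at the point where something that is not UTF-8 (such as a terminal) has to display them.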
I don't know much about encodings/codepages, which is probably why I can't get it to work, but I do know that if I copy/paste the texts from the different languages into, for example, a new UltraEdit file, all of them show up correctly. How does UltraEdit handle this? Does it use UTF-8 (which I presume can represent anything)?
How can I convert my data so that it always displays correctly, whatever the source encoding?
Try iconv instead:
$str = iconv('UTF-8', 'WINDOWS-1250', $str);
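A minimal sketch of how that call might be wrapped, assuming the string coming out of SimpleXML is UTF-8 and the terminal expects Windows-1250. The `forTerminal()` helper and the sample value are hypothetical. The hex dumps are only there to make the byte-level change visible.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to the encoding a terminal expects.
// Returns null when iconv cannot perform the conversion.
function forTerminal(string $utf8, string $terminalEncoding): ?string
{
    $converted = @iconv('UTF-8', $terminalEncoding, $utf8);
    return $converted === false ? null : $converted;
}

// Stand-in for the value read from SimpleXML: "Děčín" as UTF-8.
$name = "D\u{011B}\u{010D}\u{00ED}n";

$cp1250 = forTerminal($name, 'WINDOWS-1250');

echo bin2hex($name), "\n";   // 44c49bc48dc3ad6e (UTF-8, two bytes per accented letter)
echo bin2hex($cp1250), "\n"; // 44ece8ed6e       (Windows-1250, one byte per letter)
```

The conversion is probably best kept at the output boundary only, so that the value stored in the database stays in one encoding for all sources.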
The problem is that your input file is malformed. There is no character ì (LATIN SMALL LETTER I WITH GRAVE) in Windows-1250. See here.
The closest character is U+00ED (LATIN SMALL LETTER I WITH ACUTE).
The fact that such a character displays correctly in the shell is likely coincidental.
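One way to check what the file really contains is to look at the bytes rather than at how an editor or browser renders them. The sketch below rests on an assumption the thread does not confirm: that the character shown as ì in the question is the single byte 0xEC. Under that assumption the same byte is ě in Windows-1250 and ì in ISO-8859-1, so the text may simply have been displayed with the wrong code page. The byte sequence is a hypothetical stand-in for the real file content.

```php
<?php
// Assumed raw bytes of the word as stored in the file (not taken from the real file).
$rawBytes = "D\xEC\xE8\xEDn";

// Decode the same bytes under two different code pages.
$asCp1250 = iconv('WINDOWS-1250', 'UTF-8', $rawBytes);
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $rawBytes);

echo bin2hex($asCp1250), "\n"; // 44c49bc48dc3ad6e => D ě č í n
echo bin2hex($asLatin1), "\n"; // 44c3acc3a8c3ad6e => D ì è í n
```

Running `bin2hex()` over the relevant part of the actual file would show whether this assumption holds for it.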