PHP: Simple XML and different codepages and getting the data correctly

2023-02-09 14:14

I am working on a project where I receive different XML files from different sources. My PHP script should read them, parse them, and store them in a MySQL database.

To parse the XML files, I use the SimpleXMLElement class in PHP. I receive files from Belgium in UTF-8 encoding, from Germany in ISO-8859-1 encoding, from the Czech Republic in cp1250, and so on...

When I pass the XML data to SimpleXMLElement and print the result of asXML() on this object, I see the XML data correctly, as it was in the original XML file. When I assign a field to a PHP variable and print this variable on the screen, the text looks corrupted, and it is of course also corrupted when inserted into the MySQL database.

Example:

The XML:

<?xml version="1.0" encoding="cp1250"?>
...
<name>Labe Dìèín - Rozb 741,85km  ;  Dìèín - Rozb 741,85km </name>
...

The PHP code:

$sxml = file_get_contents("test.xml");
$xml = new SimpleXMLElement($sxml);
//echo $xml->asXML() . "\n"; // content will show up correctly in the shell
$name = (string)$xml->ftm->fairway_section->geo_object->name;
echo $name . "\n";

Running the code (in a Linux bash shell) moves the cursor upwards and then prints: bÃn - Rozb 741,85km ; DÄ (the cursor movement is of course caused by the incorrect characters that PHP prints)

I think that PHP converts its data to UTF-8 to store it in a string variable, so I presumed that using mb_convert_encoding to convert from UTF-8 to cp1250 would show the correct result, but it doesn't. I should also be able to store the data in a format that is compatible with all the other sources.

I don't know much about encodings/codepages, which is probably why I can't get it to work right. What I do know is that if I copy/paste the texts from the different languages into, for example, a new UltraEdit file, all of them show up right. How does UltraEdit handle this? Does it use UTF-8 (which I presume can show anything)?

How can I convert my data so that it will always show up correctly, whatever the encoding of the source?

Try iconv instead:

$str = iconv('UTF-8', 'WINDOWS-1250', $str);
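To put that one-liner in context, here is a minimal sketch of the whole round trip. It assumes the iconv extension is loaded and that the underlying libxml2 can decode Windows-1250 (which the asker's asXML() output suggests it can). The document, the root element, and the variable names are hypothetical sample data, not the asker's real file. The string SimpleXML returns is UTF-8 regardless of the encoding declared in the XML, so the idea is to keep it as UTF-8 for storage and to convert only for an output that really expects cp1250.

```php
<?php
// Hypothetical sample: a Windows-1250 document built from raw bytes.
// In that code page 0xEC = ě, 0xE8 = č, 0xED = í.
$doc = '<?xml version="1.0" encoding="windows-1250"?>'
     . "<root><name>D\xEC\xE8\xEDn</name></root>";

$xml  = new SimpleXMLElement($doc);
$name = (string) $xml->name;   // UTF-8, whatever the XML declaration says

// Keep $name as UTF-8 for storage; convert only at the output boundary,
// for example when printing to a terminal that is set to Windows-1250.
$forCp1250Terminal = iconv('UTF-8', 'WINDOWS-1250', $name);

echo bin2hex($name), "\n";               // 44c49bc48dc3ad6e (UTF-8 bytes of "Děčín")
echo bin2hex($forCp1250Terminal), "\n";  // 44ece8ed6e (the original cp1250 bytes)
```

If the terminal and the database connection are both set to UTF-8, the iconv step is not needed at all and $name can be printed and stored as it is.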

The problem is that your input file is malformed. There is no character ì (LATIN SMALL LETTER I WITH GRAVE) in Windows-1250. See here.

The closest character is U+00ED (LATIN SMALL LETTER I WITH ACUTE).

The fact that such a character shows correctly in the shell is likely a coincidence.
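One way to check what is really in the file is to look at the bytes rather than at the rendered characters. The sketch below assumes, without confirmation from the source, that the ì and è in the posted sample are the bytes 0xEC and 0xE8 as displayed by something reading them as ISO-8859-1. It uses iconv (assumed to be available) to decode the same byte string under both code pages; the variable names are illustrative.

```php
<?php
// Assumed raw bytes behind the "Dìèín" seen in the question.
$bytes = "D\xEC\xE8\xEDn";

// Decode the same bytes under two different single-byte code pages.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $bytes);
$asCp1250 = iconv('WINDOWS-1250', 'UTF-8', $bytes);

echo $asLatin1, "\n";  // Dìèín
echo $asCp1250, "\n";  // Děčín
```

If the file really contains those bytes, it is valid Windows-1250 and the ì only appears when the bytes are viewed as ISO-8859-1. Running a hex dump of test.xml would settle which case applies.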
