How can I convert an XML document from Latin-1 to UTF-8 in Perl?
We want to convert all the sites our company hosts from Latin-1 to UTF-8. After a lot of googling, we have our Perl script almost complete. The only thing still missing is the XML files.
What is the best way to convert XML from Latin-1 to UTF-8 and is it useful?
I am asking because we are unsure about it, since most entries on Google explain how to do the exact opposite. Some even say that utf8 may cause problems with XML. Can you enlighten us on the whole XML Encoding Issue?
What are you converting? The data or the XML tags or something else?
I think you just need to read it as Latin-1 and rewrite it as UTF-8 unless your source does something really weird. The decoding and encoding happen for you at the filehandle level. Once you have it in Perl, it's internally UTF-8 already.
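To make the filehandle idea concrete, here is a minimal sketch using only core Perl. The sub name latin1_xml_to_utf8 and the file paths are made up for illustration. It also assumes the file starts with an ordinary XML declaration and rewrites its encoding attribute, since a declaration that still says ISO-8859-1 would no longer match the bytes on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: re-encode one XML file from Latin-1 to UTF-8.
sub latin1_xml_to_utf8 {
    my ($in_path, $out_path) = @_;

    # The :encoding layer decodes Latin-1 bytes into Perl characters on read.
    open my $in, '<:encoding(iso-8859-1)', $in_path
        or die "Cannot read $in_path: $!";
    my $xml = do { local $/; <$in> };    # slurp the whole file
    close $in;

    # Assumption: the file begins with a conventional XML declaration.
    # Point its encoding attribute at UTF-8 so it matches the new bytes.
    $xml =~ s/\A(<\?xml[^>]*?encoding\s*=\s*)(["'])[^"']*\2/${1}${2}UTF-8${2}/;

    # The :encoding layer encodes characters as UTF-8 bytes on write.
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";
    print {$out} $xml;
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

You would call it as latin1_xml_to_utf8('old.xml', 'new.xml'). Writing to a separate output file keeps the original around in case the input was not really Latin-1.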
What do you have so far? What problems are you having?
Is your situation too complicated to merely use xmllint?
xmllint --encode utf8 --output filename.xml filename.xml.latin1
If you are using XML::Parser, see Juerd's Unicode Advice about that module.
If you are converting more than just XML files, iconv might help:
iconv -f ISO-8859-1 -t UTF-8 filename.txt.latin1 > filename.txt
I'd use xmllint --encode utf8 FILE-NAME. For example:
xmllint --encode utf8 --output test.xml test.xml
will correctly convert test.xml (whatever encoding it may have) to UTF-8, including the XML prologue.
As brian mentioned, it's internally UTF-8 in Perl. Perl will convert it whether you want it or not.
The trickery is connected to the UTF8 flag, which is a bit flag attached to each string. For the data that XML::Parser returns, that UTF8 flag is set.
If you ever want to get rid of this behaviour, clear the UTF8 flag. One way to do it is like this:
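sub de_utf8 {
    my ($str) = @_;        # work on a copy so the caller's string is untouched
    utf8::encode($str);    # characters -> UTF-8 octets; this clears the UTF8 flag
    return $str;
}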
This way, the resulting string will be the same byte data as the original string.
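A small sketch of what the flag does in practice, using only the built-in utf8:: functions. The sample character (e-acute) is just an illustration, and utf8::upgrade is used here to imitate a string that came back from a parser with the flag set.

```perl
use strict;
use warnings;

# A character string as a parser might return it: one character, e-acute.
my $chars = "\x{E9}";
utf8::upgrade($chars);    # force the UTF-8 internal form, which sets the flag

my $bytes = $chars;       # copy, so $chars stays a character string
utf8::encode($bytes);     # now two UTF-8 octets, with the flag off

printf "chars: length %d, flag %s\n",
    length $chars, utf8::is_utf8($chars) ? 'on' : 'off';
printf "bytes: length %d, flag %s\n",
    length $bytes, utf8::is_utf8($bytes) ? 'on' : 'off';
```

This prints length 1 with the flag on for the character string and length 2 with the flag off for the byte string, so the same data is counted in characters in one case and in octets in the other.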
EDIT: A bit off the topic of the OP... sorry.