Reading and Writing XML files with unknown encoding in Perl?
I am picking up pieces of someone else's large project and trying to right the wrongs. The problem is, I'm just not sure what the correct ways are.
So, I am cURLing a bunch of HTML pages, then writing them to files with simple commands like:
$src = `curl http://google.com`;
open FILE, ">output.html";
print FILE $src;
close FILE;
Now I want those to be saved as UTF-8. What are they saved as? Then I am reading the HTML file in using the same basic 'open' command, parsing the HTML with regex calls, and using string concatenation to make a big string and writing it to an XML file (using the same code as above). I have already started using XML::Writer instead, but now I must go through and fix the files that have inaccurate encoding.
So, I don't have the HTML anymore, but I still have the XML files, which have to display the proper characters. Here is an example: http://filevo.com/wkkixmebxlmh.html
The main problem is detecting and replacing the character in question with a "\x{2019}" that displays in editors properly. But I can't figure out a regex to actually capture the character in the wild.
UPDATE:
I still cannot detect the ALT-0146 character that's in the XML file I uploaded to Filevo above. I've tried opening it in UTF-8, and searching for /\x{2019}/, /chr(0x2019)/, and just /’/, nothing.
Discovering the encoding of an HTML document is hard. See http://blog.whatwg.org/the-road-to-html-5-character-encoding, which points out that it requires a "7-step algorithm; step 4 has 2 sub-steps, the first of which has 7 branches, one of which has 8 sub-steps, one of which actually links to a separate algorithm that itself has 7 steps... It goes on like that for a while."
This is what I used for my limited needs in parsing HTML files. Each pattern is matched against $_ and captures the declared character set in $1:
my $CHARACTER_SET_CLASS = '\w:.()-';
# X(HT)?ML: http://www.w3.org/International/O-charset
/\<\?xml [^>]*(?<= )encoding=[\'\"]?([$CHARACTER_SET_CLASS]+)/ ||
# X?HTML: http://blog.whatwg.org/the-road-to-html-5-character-encoding
/\<meta [^>]*\bcharset=["']?([$CHARACTER_SET_CLASS]+)/i ||
# CSS: http://www.w3.org/International/questions/qa-css-charset
/\@charset "([^\"]*)"/ ||
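The three patterns above can be wrapped into a small helper. This is only a minimal sketch, assuming the document text is already in a scalar; `declared_charset` is a hypothetical name (not from any module), and the chained `||` of the original snippet is replaced by plain early returns:

```perl
use strict;
use warnings;

my $CHARACTER_SET_CLASS = '\w:.()-';

# Hypothetical helper: returns the charset a document declares, or undef.
sub declared_charset {
    my ($text) = @_;
    # XML declaration: <?xml version="1.0" encoding="..."?>
    return $1 if $text =~ /<\?xml [^>]*(?<= )encoding=['"]?([$CHARACTER_SET_CLASS]+)/;
    # HTML: <meta ... charset=...>
    return $1 if $text =~ /<meta [^>]*\bcharset=["']?([$CHARACTER_SET_CLASS]+)/i;
    # CSS: @charset "...";
    return $1 if $text =~ /\@charset "([^"]*)"/;
    return;    # nothing declared
}

my $html = '<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">';
print declared_charset($html), "\n";    # prints: ISO-8859-1
```

A declaration only says what the author claimed; the bytes may still be in a different encoding, so treat the result as a hint.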
To make sure you are producing output in UTF-8, apply the utf8 layer to the output stream using binmode:
open FILE, '>output.html';
binmode FILE, ':utf8';
or in the 3-argument open call:
open FILE, '>:utf8', 'output.html';
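For a fuller picture, here is a minimal sketch of the same idea with a lexical filehandle and error checking. It uses the stricter `:encoding(UTF-8)` layer in place of `:utf8`, which is my substitution rather than something from the answer above; `write_utf8` and the temporary file are hypothetical and only there for illustration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write a character string to $path as UTF-8.
sub write_utf8 {
    my ($path, $text) = @_;
    open my $out, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";
    return;
}

# Write a string containing U+2019, then look at the bytes on disk.
my (undef, $path) = tempfile(UNLINK => 1);
write_utf8($path, "D\x{2019}Aloia");

open my $in, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;
print join(' ', unpack '(H2)*', $bytes), "\n";
# prints: 44 e2 80 99 41 6c 6f 69 61
```

The layer only encodes correctly if the string being printed holds decoded characters; printing bytes that were never decoded through it produces double-encoded output.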
Arbitrary input is trickier. If you are lucky, HTML input will tell you its encoding early on:
wget http://www.google.com/ -O foo ; head -1 foo
<!doctype html><html><head><meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1"><title>Google</title><script>window.google=
{kEI:"xgngTYnYIoPbgQevid3cCg",kEXPI:"23933,28505,29134,29229,29658,
29695,29795,29822,29892,30111,30174,30215,30275,30562",kCSI:
{e:"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562",ei:"xgngTYnYIoPbgQevid3cCg",expi:
"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562"},authuser:0,ml:function(){},kHL:"en",
time:function(){return(new Date).getTime()},
Ah, there it is: <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">. Now you may continue to read input as raw bytes and find some way to decode those bytes with the known encoding. CPAN can help with this.
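One way to do that decoding step is the Encode module, which ships with Perl. The following is a sketch under the assumption that the encoding name is already known; `to_utf8_bytes` is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert raw bytes in a known encoding to UTF-8 bytes.
sub to_utf8_bytes {
    my ($bytes, $encoding) = @_;
    my $chars = decode($encoding, $bytes);    # bytes -> Perl characters
    return encode('UTF-8', $chars);           # characters -> UTF-8 bytes
}

# 0xE9 is "e with acute" in ISO-8859-1; in UTF-8 it becomes C3 A9.
my $utf8 = to_utf8_bytes("caf\xE9", 'ISO-8859-1');
print join(' ', unpack '(H2)*', $utf8), "\n";    # prints: 63 61 66 c3 a9

# In Windows-1252, byte 0x92 (ALT-0146) decodes to U+2019.
printf "U+%04X\n", ord(decode('cp1252', "\x92"));    # prints: U+2019
```

Pages labelled ISO-8859-1 are often really Windows-1252 (browsers treat the two labels alike), so if curly quotes come out wrong after decoding as ISO-8859-1, trying `cp1252` may be worth it.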
I am referring to the updated part of your question (next time open a new one for a separate topic). This is a hex dump of your file (please refrain in the future from making helpers jump through burning hoops to get at your example data):
0000  3c 78 6d 6c 3e 0d 0a 3c  70 65 72 73 6f 6e 4e 61  <xml>␍␊< personNa
0010  6d 65 3e 47 2e 20 50 65  74 65 72 20 44 61 80 41  me>G. Pe ter Da�A
0020  6c 6f 69 61 3c 2f 70 65  72 73 6f 6e 4e 61 6d 65  loia</pe rsonName
0030  3e 0d 0a 3c 2f 78 6d 6c  3e 0d 0a                 >␍␊</xml >␍␊
You said you know the character should be ’, but it got totally mangled. That character is not a lone 0x80 byte in any encoding I know of (in Windows-1252 it is 0x92, in UTF-8 it is the three bytes E2 80 99). This looks like a paste accident where you transferred data between editors/clipboards instead of dealing with just files. If that's not the case, then your cow orker produced a wrong you are not able to right algorithmically.
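This also explains why searching for /\x{2019}/ finds nothing: the file contains the byte 0x80, not the character U+2019. To see what is really in such a file, read it as raw bytes and list everything above 0x7F. A minimal sketch, with `non_ascii_bytes` as a hypothetical helper name and the second row of the dump above as sample input:

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, byte value] for every byte above 0x7F.
sub non_ascii_bytes {
    my ($bytes) = @_;
    my @hits;
    while ($bytes =~ /([\x80-\xFF])/g) {
        push @hits, [ pos($bytes) - 1, ord $1 ];
    }
    return @hits;
}

# Sample input: bytes 0x10..0x1f of the dump, as a raw byte string.
my $row = "me>G. Peter Da\x80A";
printf "offset 0x%04x: byte 0x%02x\n", @$_ for non_ascii_bytes($row);
# prints: offset 0x000e: byte 0x80
```

For a real file, open it with the `<:raw` layer and slurp it into a scalar before passing it in, so that no decoding layer hides the actual bytes.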