Reading and Writing XML files with unknown encoding in Perl?

Posted 2023-03-09 01:30

I am picking up pieces of someone else's large project and trying to right the wrongs. The problem is, I'm just not sure what the correct ways are.

So, I am cURLing a bunch of HTML pages, then writing them to files with simple commands like:

$src = `curl http://google.com`;
open FILE, ">output.html";
print FILE $src;
close FILE;

Now I wanted those to be saved as UTF-8. What encoding are they actually saved in? Then I am reading the HTML file in using the same basic 'open' command, parsing the HTML with regex calls, using string concatenation to make a big string, and writing it to an XML file (using the same code as above). I have already started using XML::Writer instead, but now I must go through and fix the files that have inaccurate encoding.

So, I don't have the HTML anymore, but I still have the XML files, which have to display the proper characters. Here is an example: http://filevo.com/wkkixmebxlmh.html

The main problem is detecting and replacing the character in question with a "\x{2019}" that displays in editors properly. But I can't figure out a regex to actually capture the character in the wild.

UPDATE:

I still cannot detect the ALT-0146 character that's in the XML file I uploaded to Filevo above. I've tried opening it in UTF-8, and searching for /\x{2019}/, /chr(0x2019)/, and just /’/, nothing.

Discovering the encoding of an HTML document is hard. See http://blog.whatwg.org/the-road-to-html-5-character-encoding and especially that it requires a "7-step algorithm; step 4 has 2 sub-steps, the first of which has 7 branches, one of which has 8 sub-steps, one of which actually links to a separate algorithm that itself has 7 steps... It goes on like that for a while."

This is what I used for my limited needs in parsing HTML files.

my $CHARACTER_SET_CLASS = '\w:.()-';

     # X(HT)?ML: http://www.w3.org/International/O-charset
     /\<\?xml [^>]*(?<= )encoding=[\'\"]?([$CHARACTER_SET_CLASS]+)/ ||
     # X?HTML: http://blog.whatwg.org/the-road-to-html-5-character-encoding
     /\<meta [^>]*\bcharset=["']?([$CHARACTER_SET_CLASS]+)/i ||
     # CSS: http://www.w3.org/International/questions/qa-css-charset
     /\@charset "([^\"]*)"/ ||
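The fragment above stops at a trailing `||`, so it cannot run on its own. A minimal sketch of how the three patterns might be wrapped is shown here; the helper name `sniff_charset` is hypothetical and not part of the original answer, and whatever followed the last `||` is not reconstructed.

```perl
use strict;
use warnings;

my $CHARACTER_SET_CLASS = '\w:.()-';

# Hypothetical helper: returns the declared charset name, or undef if none is found.
sub sniff_charset {
    my ($text) = @_;
    for ($text) {    # alias $_ so the patterns can stay as written in the answer
        # X(HT)?ML declaration
        return $1 if /\<\?xml [^>]*(?<= )encoding=[\'\"]?([$CHARACTER_SET_CLASS]+)/;
        # X?HTML meta element
        return $1 if /\<meta [^>]*\bcharset=["']?([$CHARACTER_SET_CLASS]+)/i;
        # CSS @charset rule
        return $1 if /\@charset "([^\"]*)"/;
    }
    return;
}

# Usage: pass the first chunk of the document as a string.
my $found = sniff_charset('<?xml version="1.0" encoding="UTF-8"?>');
```

This only reads what the document claims about itself; it does not verify that the bytes really are in that encoding.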

To make sure you are producing output in UTF-8, apply the utf8 layer to the output stream using binmode

open FILE, '>output.html';
binmode FILE, ':utf8';

or in the 3-argument open call

open FILE, '>:utf8', 'output.html';
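As a rough sketch of the same idea with a lexical filehandle and error checking: the helper names `write_utf8` and `slurp_hex` are hypothetical, and the sketch swaps in the stricter `:encoding(UTF-8)` layer instead of `:utf8`. Reading the file back as raw bytes is one way to check what actually landed on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: write a character string to $path as UTF-8 bytes.
sub write_utf8 {
    my ($path, $text) = @_;
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    print {$fh} $text;
    close $fh or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: read a file as raw bytes and return them as a hex string.
sub slurp_hex {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode: read the whole file at once
    my $bytes = <$fh>;
    close $fh;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}
```

For example, writing the single character "\x{2019}" this way and passing the file to `slurp_hex` should show the three bytes `e2 80 99`, which is the UTF-8 form of ’.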

Arbitrary input is trickier. If you are lucky, HTML input will tell you its encoding early on:

wget http://www.google.com/ -O foo ; head -1 foo

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; 
charset=ISO-8859-1"><title>Google</title><script>window.google=
{kEI:"xgngTYnYIoPbgQevid3cCg",kEXPI:"23933,28505,29134,29229,29658,
29695,29795,29822,29892,30111,30174,30215,30275,30562",kCSI:
{e:"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562",ei:"xgngTYnYIoPbgQevid3cCg",expi:
"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562"},authuser:0,ml:function(){},kHL:"en",
time:function(){return(new Date).getTime()},

Ah, there it is: <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">. Now you may continue to read input as raw bytes and find some way to decode those bytes with the known encoding. CPAN can help with this.
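One possible way to do that decoding step is the core Encode module. The sketch below assumes the source encoding is already known (for instance from the meta tag); the helper name `recode_to_utf8` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw bytes in a known encoding into UTF-8 bytes.
sub recode_to_utf8 {
    my ($bytes, $charset) = @_;
    # FB_CROAK makes decode die on malformed input instead of silently substituting.
    my $text = decode($charset, $bytes, Encode::FB_CROAK);
    return encode('UTF-8', $text);
}

# Usage: the byte 0xE9 is "e with acute accent" in ISO-8859-1.
my $utf8_bytes = recode_to_utf8("caf\xe9", 'ISO-8859-1');
```

Keeping the input as bytes until this single decode call avoids accidentally decoding the same data twice.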

I am referring to the updated part of your question (next time open a new one for a separate topic). This is a hex dump of your file (please refrain in the future from making helpers jump through burning hoops to get at your example data):

0000  3c 78 6d 6c 3e 0d 0a 3c  70 65 72 73 6f 6e 4e 61  <xml>␍␤< personNa
0010  6d 65 3e 47 2e 20 50 65  74 65 72 20 44 61 80 41  me>G. Pe ter Da�A
0020  6c 6f 69 61 3c 2f 70 65  72 73 6f 6e 4e 61 6d 65  loia</pe rsonName
0030  3e 0d 0a 3c 2f 78 6d 6c  3e 0d 0a                 >␍␤</xml >␍␤
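A similar check can be done from Perl itself. This is only a sketch: the helper name `non_ascii_bytes` is hypothetical, and it assumes the file contents were read through a `:raw` filehandle so that no decoding has happened yet.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, byte value] pairs for every non-ASCII byte.
sub non_ascii_bytes {
    my ($bytes) = @_;
    my @hits;
    while ($bytes =~ /([^\x00-\x7f])/g) {
        # pos() points just past the match, so the byte itself is one position back.
        push @hits, [ pos($bytes) - 1, ord $1 ];
    }
    return @hits;
}

# Usage with the same bytes as the dump above.
my $sample = "<xml>\r\n<personName>G. Peter Da\x80Aloia</personName>\r\n</xml>\r\n";
printf "offset 0x%04x: byte 0x%02x\n", @$_ for non_ascii_bytes($sample);
```

Searching the raw bytes this way shows which byte is really in the file, which is why a pattern such as /\x{2019}/ finds nothing in it.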

You said you know the character should be ’, but it got totally mangled: the byte in the file is 0x80, and ’ can't be 0x80 in any encoding (in Windows-1252, for instance, ALT-0146 is the byte 0x92, while 0x80 is €). This looks like a paste accident where you transferred data between editors/clipboards instead of dealing with just files. If that's not the case, then your cow orker produced a wrong you are not able to right algorithmically.
