
Is there a way to check a file encoding using JavaScript?

Here's my case: I'm working on a very large project that contains lots of files. Some of these files are encoded in UTF-8, others in ANSI. We need to convert all the files to UTF-8, because we decided this will be the default in our next projects. This is a big concern because we're Brazilian and we have common words using characters like á, ç, ê, ü, etc., so having files in multiple character encodings has caused serious issues.

Anyway, I came across this JS file that converts ANSI files to UTF-8, copying them to another folder and preserving the originals:

var indir = "in";
var outdir = "out";
function ansiToUtf8(fin, fout) {
    var ansi = WScript.CreateObject("ADODB.Stream");
    ansi.Open();
    ansi.Charset = "x-ansi";
    ansi.LoadFromFile(fin);
    var utf8 = WScript.CreateObject("ADODB.Stream");
    utf8.Open();
    utf8.Charset = "UTF-8";
    utf8.WriteText(ansi.ReadText());
    utf8.SaveToFile(fout, 2 /*adSaveCreateOverWrite*/);
    ansi.Close();
    utf8.Close();
}
var fso = WScript.CreateObject("Scripting.FileSystemObject");
var folder = fso.GetFolder(indir);
var fc = new Enumerator(folder.files);
for (; !fc.atEnd(); fc.moveNext()) {
    var file = fc.item();
    ansiToUtf8(indir+"\\"+file.name, outdir+"\\"+file.name);
}

which I run from the command line with:

cscript /Nologo ansi2utf8.js

The problem is that this script runs through all the files, even the ones that are already in UTF-8, and this results in breaking my special characters. So I need to check if the file encoding is already UTF-8, and run my code only if it is ANSI. How can I do that?

Also, my script only runs through the 'in' folder. I'm still thinking of an easy way to make it go into the folders inside this folder and run there too.


Do your UTF-8 files have a byte order mark? If so, you can simply check whether the first 3 bytes are EF BB BF to determine whether a file is UTF-8. Otherwise, the standard method is to check whether the file is legal UTF-8 all the way through; if it is, it is most likely meant to be read as UTF-8. This works because ANSI text with accented characters (bytes above 7F) almost never happens to form valid UTF-8 sequences.
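Both checks could be sketched roughly as below. This is only an illustration, not a tested drop-in for the script in the question: it assumes you already have the file's raw bytes as a plain array of numbers from 0 to 255 (how you read them in WSH is not shown here), and the function names `hasUtf8Bom`, `isValidUtf8` and `looksLikeUtf8` are made up for this example.

```javascript
// Hypothetical helpers; `bytes` is assumed to be an array of numbers 0..255.

// True if the data starts with the UTF-8 byte order mark EF BB BF.
function hasUtf8Bom(bytes) {
    return bytes.length >= 3 &&
        bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

// True if the whole array is well-formed UTF-8 (pure ASCII also passes).
function isValidUtf8(bytes) {
    var i = 0, n = bytes.length;
    while (i < n) {
        var b = bytes[i];
        var extra; // number of continuation bytes expected after the lead byte
        if (b <= 0x7F) { i++; continue; }               // single-byte (ASCII)
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; } // 2-byte sequence
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; } // 3-byte sequence
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; } // 4-byte sequence
        else { return false; }                          // invalid lead byte

        if (i + extra >= n) { return false; }           // sequence is cut off

        // The first continuation byte has a narrower range for some lead
        // bytes; this rejects overlong forms, surrogates and values > U+10FFFF.
        var lo = 0x80, hi = 0xBF;
        if (b === 0xE0) { lo = 0xA0; }
        else if (b === 0xED) { hi = 0x9F; }
        else if (b === 0xF0) { lo = 0x90; }
        else if (b === 0xF4) { hi = 0x8F; }

        var c = bytes[i + 1];
        if (c < lo || c > hi) { return false; }
        for (var k = 2; k <= extra; k++) {
            c = bytes[i + k];
            if (c < 0x80 || c > 0xBF) { return false; }
        }
        i += extra + 1;
    }
    return true;
}

// Treat a file as UTF-8 if it has a BOM or validates all the way through.
function looksLikeUtf8(bytes) {
    return hasUtf8Bom(bytes) || isValidUtf8(bytes);
}
```

With something like this, the loop in the question would call the conversion only when `looksLikeUtf8` returns false. Note that a file containing only ASCII passes the check, which is harmless: such a file is byte-for-byte the same in ANSI and in UTF-8 without a BOM.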
