Is there a way to check a file's encoding using JavaScript?
Here's my case: I'm working on a very big project that contains lots of files. Some of these files are encoded in UTF-8, others in ANSI. We need to convert all the files to UTF-8, because we decided this will be the default in our next projects. This is a big concern because we're Brazilian and many common words use characters like á, ç, ê, ü, etc., so having files in several different character encodings has caused serious problems.
Anyway, I've ended up with this JS file, which converts ANSI files to UTF-8, copying them to another folder and preserving the originals:
    var indir = "in";
    var outdir = "out";

    // Reads fin as ANSI text and writes the same text to fout as UTF-8.
    function ansiToUtf8(fin, fout) {
        var ansi = WScript.CreateObject("ADODB.Stream");
        ansi.Open();
        ansi.Charset = "x-ansi";
        ansi.LoadFromFile(fin);

        var utf8 = WScript.CreateObject("ADODB.Stream");
        utf8.Open();
        utf8.Charset = "UTF-8";
        utf8.WriteText(ansi.ReadText());
        utf8.SaveToFile(fout, 2 /*adSaveCreateOverWrite*/);

        ansi.Close();
        utf8.Close();
    }

    // Convert every file directly inside indir (subfolders are not visited).
    var fso = WScript.CreateObject("Scripting.FileSystemObject");
    var folder = fso.GetFolder(indir);
    var fc = new Enumerator(folder.files);
    for (; !fc.atEnd(); fc.moveNext()) {
        var file = fc.item();
        ansiToUtf8(indir + "\\" + file.name, outdir + "\\" + file.name);
    }
I run it from the command line like this:
    cscript /Nologo ansi2utf8.js
The problem is that this script runs through all the files, even the ones that are already in UTF-8, and this results in breaking my special characters. So I need to check if the file encoding is already UTF-8, and run my code only if it is ANSI. How can I do that?
Also, my script only runs through the 'in' folder. I'm still looking for an easy way to make it go into the folders inside this folder and run there too.
Do your UTF-8 files have a byte order mark (BOM)? If so, you can simply check whether the first 3 bytes are EF BB BF to determine whether a file is UTF-8. Otherwise, the standard method is to check whether the file is legal UTF-8 all the way through; if it is, it is most likely supposed to be read as UTF-8.
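As a rough sketch of both checks, something like the following could work. It assumes you already have the file's contents as an array of byte values (0 to 255); how you read those bytes depends on your host and is left out here. The function names hasUtf8Bom, isValidUtf8 and looksLikeUtf8 are made up for this example, and the code sticks to plain old-style JavaScript so that it should also run under cscript.

```javascript
// Hypothetical helpers; "bytes" is assumed to be an array of numbers 0..255.

// True if the data starts with the UTF-8 byte order mark EF BB BF.
function hasUtf8Bom(bytes) {
    return bytes.length >= 3 &&
           bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

// True if the whole array is well-formed UTF-8 (pure ASCII also passes).
function isValidUtf8(bytes) {
    var i = 0, n = bytes.length;
    while (i < n) {
        var b = bytes[i], extra;
        if (b <= 0x7F) { i++; continue; }            // single-byte (ASCII)
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;  // 2-byte sequence
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;  // 3-byte sequence
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;                           // invalid lead byte

        if (i + extra >= n) return false;            // sequence is cut off

        // The first continuation byte has a narrower range for some lead
        // bytes, which rules out overlong forms, surrogates and > U+10FFFF.
        var lo = 0x80, hi = 0xBF;
        if (b === 0xE0) lo = 0xA0;
        else if (b === 0xED) hi = 0x9F;
        else if (b === 0xF0) lo = 0x90;
        else if (b === 0xF4) hi = 0x8F;
        if (bytes[i + 1] < lo || bytes[i + 1] > hi) return false;

        // Any remaining continuation bytes must be in 80..BF.
        for (var k = 2; k <= extra; k++) {
            if (bytes[i + k] < 0x80 || bytes[i + k] > 0xBF) return false;
        }
        i += extra + 1;
    }
    return true;
}

// Heuristic: treat the file as UTF-8 if it has a BOM or validates cleanly.
function looksLikeUtf8(bytes) {
    return hasUtf8Bom(bytes) || isValidUtf8(bytes);
}
```

With this, the conversion loop would call ansiToUtf8 only for files where looksLikeUtf8 returns false. Keep in mind this is a heuristic: an ANSI file containing only ASCII characters also passes the check, but such a file has the same bytes in both encodings, so skipping it does no harm.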