Encoding conversion for a large file
I am faced with a large (~18 GB) file, exported from SQL Server as a Unicode text file, which means its encoding is UTF-16 (little-endian). The file is now stored on a computer running Linux, but I have not figured out a way to convert it to UTF-8.
At first I tried iconv, but the file is too large for it. My next approach was to split the file and convert the pieces one by one, but that didn't work either: there were a lot of errors during the conversions.
So, any ideas on how to convert this to UTF-8? Any help will be much appreciated.
Since you're using SQL Server, I assume your platform is Windows. In the simplest case you can write a quick-and-dirty .NET application that reads the source line by line and writes the converted file as it goes. Something like this:
using System;
using System.IO;
using System.Text;

namespace UTFConv {
    class Program {
        static void Main(string[] args) {
            try {
                // Encoding.Unicode is UTF-16 little-endian, which matches the SQL Server export.
                Encoding encSrc = Encoding.Unicode;
                Encoding encDst = Encoding.UTF8;
                long lines = 0;
                // Streaming line by line keeps memory use constant, so the 18 GB size is not a problem.
                using (StreamReader src = new StreamReader(args[0], encSrc)) {
                    using (StreamWriter dest = new StreamWriter(args[1], false, encDst)) {
                        string ln;
                        while ((ln = src.ReadLine()) != null) {
                            lines++;
                            dest.WriteLine(ln);
                        }
                    }
                }
                Console.WriteLine("Converted {0} lines", lines);
            } catch (Exception x) {
                Console.WriteLine("Problem converting the file: {0}", x.Message);
            }
        }
    }
}
Just open Visual Studio, start a new C# Console Application project, paste this code in there, compile, and run it from the command line. The first argument is your source file, the second argument is your destination file. Should work.
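For example, assuming the compiled executable is named UTFConv.exe (both file names here are placeholders):

UTFConv.exe export-utf16.txt export-utf8.txt

One caveat with the line-by-line approach: ReadLine strips the original line terminators and WriteLine re-emits Environment.NewLine, so the output's line endings may differ from the input's (and a trailing newline may be added at the end). If you need the transcoding to be faithful to the original line endings, a chunk-based sketch along these lines should work instead of the ReadLine/WriteLine loop; the buffer size is an arbitrary choice:

using (StreamReader src = new StreamReader(args[0], Encoding.Unicode))
using (StreamWriter dest = new StreamWriter(args[1], false, Encoding.UTF8)) {
    // Copy fixed-size chunks of decoded characters. The writer's internal
    // encoder keeps state across calls, so a surrogate pair split at a
    // chunk boundary is still encoded correctly.
    char[] buffer = new char[1 << 16];
    int read;
    while ((read = src.Read(buffer, 0, buffer.Length)) > 0) {
        dest.Write(buffer, 0, read);
    }
}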