Efficient way to read and cut files
I have a few text files, about 2 GB each. I need to cut those files: whenever the '%%XGF NEW_SET' marker appears, I need to create a new file and store the following content in it. The marker appears roughly every 40-50 lines, and each line has about 4-20 characters. So I need to cut the big files into thousands of small ones and then process them later. I came up with sample code like this:
DirectoryInfo di = new DirectoryInfo(ConfigurationManager.AppSettings["BilixFilesDir"]);
var files = di.GetFiles();
int count = 0;
bool hasObject = false;
StringBuilder sb = new StringBuilder();
string line = "";

foreach (var file in files)
{
    using (StreamReader sr = new StreamReader(file.FullName, Encoding.GetEncoding(1250)))
    {
        while ((line = sr.ReadLine()) != null)
        {
            //when a new file starts
            if (line.Contains("%%XGF NEW_SET"))
            {
                //when a previous file existed I need to store the old one
                if (hasObject)
                {
                    File.WriteAllText(string.Format("{0}/{1}-{2}", ConfigurationManager.AppSettings["OutputFilesDir"], count++, file.Name), sb.ToString());
                    sb.Length = 0;
                    sb.Capacity = 0;
                }
                //setting the exist flag
                hasObject = true;
            }
            //when there is no new object
            else
            {
                //when an object exists, adding new lines
                if (hasObject)
                    sb.AppendLine(line);
            }
        }
        //when all work is done, saving the last object
        if (hasObject)
        {
            File.WriteAllText(string.Format("{0}/{1}-{2}", ConfigurationManager.AppSettings["OutputFilesDir"], count++, file.Name), sb.ToString());
            sb.Length = 0;
            sb.Capacity = 0;
        }
    }
}
So that's what my sample looks like, but I need high efficiency. Any ideas on how I can improve my solution? Thanks
What sort of efficiency do you need, compared with what your current code gives you?
Personally I'd probably do it slightly differently - keep a reader and a writer open all the time, and write each line that you read, unless it's a "cut" line, in which case you just close the existing writer and start a new one. I wouldn't particularly expect a difference in efficiency there, though.
I would eliminate the need for the StringBuilder completely by creating an output file stream that lines are written to until the next object comes. Then switch to a new file stream when a new object starts.
Thanks for all the tips. After taking them into consideration, I've modified my code into something like this:
DirectoryInfo di = new DirectoryInfo(ConfigurationManager.AppSettings["BilixFilesDir"]);
//getting all files from the dir
var files = di.GetFiles();
int count = 0;
bool hasObject = false;
string line = "";
StreamWriter sw = null;

foreach (var file in files)
{
    using (StreamReader sr = new StreamReader(file.FullName, Encoding.GetEncoding(1250)))
    {
        while ((line = sr.ReadLine()) != null)
        {
            //when a new file starts
            if (line.Contains("%%XGF NEW_SET"))
            {
                //closing the previous output file, if any
                if (hasObject)
                {
                    sw.Close();
                }
                //creating a new output file and setting the exist flag
                hasObject = true;
                sw = new StreamWriter(string.Format("{0}/{1}-{2}", ConfigurationManager.AppSettings["OutputFilesDir"], count++, file.Name));
                //Bill bill = new Bill();
            }
            else
            {
                //when an object exists, adding new lines
                if (hasObject)
                    sw.WriteLine(line);
            }
        }
        //when all work is done, closing the last object
        if (hasObject)
        {
            sw.Close();
            hasObject = false;
        }
    }
}
if (sw != null)
    sw.Dispose();
What do you think about something like that?
One more thing I need to do: my big file can store different kinds of documents. All of them have different start markers. Let's say there are 20 kinds of documents. Sometimes two kinds share the same start marker, but inside the document there are additional markings that let me recognise the type. For example, two documents both start with "%%XGF NEW_SET", but one has a marking like "BILL_A" later on and the other doesn't. And for every cut file I have to create one more file with some indexes from the document and a string containing the type. So basically, before saving with my StreamWriter I have to extract all those indexes and the document type - that's why I thought about the StringBuilder. So it's another place where I need high efficiency. Any good tips?
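What I have in mind is something like the sketch below: buffer only the current document's lines, inspect them for the type marker, then write the document and its companion index file in one go. The "BILL_A" check, the '#' index rule, the SaveDocument name and the ".idx" suffix are just placeholders I made up for illustration:

// requires .NET 4: using System; using System.Collections.Generic; using System.IO; using System.Linq;
// Writes one cut document plus a companion index file.
// docLines holds only the lines of the current document (40-50 lines),
// so memory stays small even for 2 GB source files.
static void SaveDocument(List<string> docLines, string outputDir, int count, string sourceName)
{
    // hypothetical type detection: look for a known sub-marker inside the document
    string type = docLines.Any(l => l.Contains("BILL_A")) ? "BILL_A" : "UNKNOWN";

    // hypothetical index extraction: here, just the positions of lines starting with '#'
    var indexes = docLines
        .Select((l, i) => new { Line = l, Position = i })
        .Where(x => x.Line.StartsWith("#"))
        .Select(x => x.Position);

    string baseName = Path.Combine(outputDir, string.Format("{0}-{1}", count, sourceName));
    File.WriteAllLines(baseName, docLines);                                        // the cut document itself
    File.WriteAllText(baseName + ".idx", type + ";" + string.Join(",", indexes));  // its companion index file
}

The splitting loop would then append each line to docLines instead of calling sw.WriteLine, and call SaveDocument whenever it hits the next "%%XGF NEW_SET" marker (plus once more at the end of the file).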
There are many different ways to read in and write out files in .NET. I have written a benchmark program and give the results in my blog:
http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp
I recommend using the Windows ReadFile and WriteFile methods if you need performance. Avoid any of the asynchronous methods, since my benchmark results show that you get better performance with synchronous I/O methods - at least for FileStream, which is the fastest .NET class for reading files. I wrote a C# class that encapsulates the ReadFile and WriteFile functionality, which makes it quite easy to use.
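To give a rough idea of what that involves - this is not the class from my download, just a bare-bones sketch of calling ReadFile through P/Invoke, with the write side and most error handling omitted:

// requires: using System; using System.Runtime.InteropServices; using Microsoft.Win32.SafeHandles;
class RawReader
{
    const uint GENERIC_READ = 0x80000000;
    const uint FILE_SHARE_READ = 0x00000001;
    const uint OPEN_EXISTING = 3;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern SafeFileHandle CreateFile(string lpFileName, uint dwDesiredAccess, uint dwShareMode,
        IntPtr lpSecurityAttributes, uint dwCreationDisposition, uint dwFlagsAndAttributes, IntPtr hTemplateFile);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool ReadFile(SafeFileHandle hFile, byte[] lpBuffer, uint nNumberOfBytesToRead,
        out uint lpNumberOfBytesRead, IntPtr lpOverlapped);

    // Reads the file in raw 64 KB chunks and hands each chunk to a callback.
    public static void ReadAll(string path, Action<byte[], int> onChunk)
    {
        using (SafeFileHandle handle = CreateFile(path, GENERIC_READ, FILE_SHARE_READ,
            IntPtr.Zero, OPEN_EXISTING, 0, IntPtr.Zero))
        {
            if (handle.IsInvalid)
                throw new System.ComponentModel.Win32Exception();

            byte[] buffer = new byte[65536];
            uint bytesRead;
            while (ReadFile(handle, buffer, (uint)buffer.Length, out bytesRead, IntPtr.Zero) && bytesRead > 0)
                onChunk(buffer, (int)bytesRead);
        }
    }
}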
Another interesting result: the benchmark compared reading line by line vs. reading data in blocks of 65,536 bytes and parsing the blocks into lines yourself. It turns out that reading the data in blocks and then splitting it into lines inside your program is more efficient. My download has some examples of how to do that.
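As a rough illustration of the block-and-parse idea (again, not the code from my download - a minimal sketch that assumes a single-byte encoding such as Windows-1250, so a block boundary never splits a character in half):

// requires: using System.Collections.Generic; using System.IO; using System.Text;
// Reads a file in 64 KB blocks and yields complete lines, carrying any
// partial line at the end of a block over into the next one.
static IEnumerable<string> ReadLinesInBlocks(string path, Encoding encoding)
{
    byte[] buffer = new byte[65536];
    string leftover = "";

    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, buffer.Length))
    {
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            string chunk = leftover + encoding.GetString(buffer, 0, read);
            string[] parts = chunk.Split('\n');

            // the last element is an incomplete line (or empty); keep it for the next block
            for (int i = 0; i < parts.Length - 1; i++)
                yield return parts[i].TrimEnd('\r');

            leftover = parts[parts.Length - 1];
        }
    }

    if (leftover.Length > 0)
        yield return leftover;
}

Your splitting loop could then iterate over ReadLinesInBlocks(file.FullName, Encoding.GetEncoding(1250)) instead of calling sr.ReadLine().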
I would love it if you would download it, try it out, and report back - either here or in a comment on my blog - on whether it is faster than StreamReader for you. According to my limited benchmarks, it is significantly faster.
Another idea to improve the performance of your program is to create multiple threads and have each thread process a file. Since you said that you have a few large files, I would break it up so that each large file has a separate thread.
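For example, with .NET 4's Parallel.ForEach - one of several ways to do this - the splitting logic from your question could be wrapped in a method (SplitFile below is just a placeholder name) and run once per file:

// requires: using System.Configuration; using System.IO; using System.Threading.Tasks;
var files = new DirectoryInfo(ConfigurationManager.AppSettings["BilixFilesDir"]).GetFiles();

// one file per work item; the scheduler decides how many run in parallel
Parallel.ForEach(files, file =>
{
    SplitFile(file);   // placeholder for the per-file reader/writer loop from the question
});

Note that a shared counter such as count would then have to become per-file, or be incremented with Interlocked.Increment, since several threads would be producing output files at the same time.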
If you are doing a lot of work with strings, then you should definitely be using StringBuilder. But, perhaps a more efficient way would be to read the data into a byte array and then build a byte array for output. I would be surprised if that was not more efficient than using StringBuilder.
Bob Bryan MCSD