
Efficient way to read and cut files

What I need to do is this: I have a few text files, about 2GB each. I need to cut those files: whenever the '%%XGF NEW_SET' mark appears, I need to create a new file and store the section in it. I think this mark appears about every 40-50 lines, and each line has about 4-20 chars. So I need to cut the big files into thousands of small ones and then process them later. I thought of sample code like this:

        DirectoryInfo di = new DirectoryInfo(ConfigurationManager.AppSettings["BilixFilesDir"]);
        var files = di.GetFiles();
        int count = 0;
        bool hasObject = false;
        StringBuilder sb = new StringBuilder();
        string line = "";
        foreach (var file in files)
        {
            using (StreamReader sr = new StreamReader(file.FullName,Encoding.GetEncoding(1250)))
            {
                while ((line = sr.ReadLine()) != null)
                {
                    //when new file starts
                    if (line.Contains("%%XGF NEW_SET"))
                    {
                        //when new file existed I need to store old one
                        if (hasObject)
                        {
                            File.WriteAllText(string.Format("{0}/{1}-{2}", ConfigurationManager.AppSettings["OutputFilesDir"], count++, file.Name), sb.ToString());
                            sb.Length = 0;
                            sb.Capacity = 0;

                        }
                        //setting exist flag 
                        hasObject = true;
                    }
                    //when there is no new object
                    else
                        //when object exists adding new lines
                        if (hasObject)
                            sb.AppendLine(line);
                }
                //when all work done saving last object
                if (hasObject)
                {
                    File.WriteAllText(string.Format("{0}/{1}-{2}", ConfigurationManager.AppSettings["OutputFilesDir"], count++, file.Name), sb.ToString());
                    sb.Length = 0;
                    sb.Capacity = 0;
                }
            }
        }

So my sample looks like that, but I need high efficiency. Any ideas on how I can improve my solution? Thanks.


What sort of efficiency do you need, compared with what your current code gives you?

Personally I'd probably do it slightly differently - keep a reader and a writer open all the time, and write each line that you read, unless it's a "cut" line, in which case you just close the existing writer and start a new one. I wouldn't particularly expect a difference in efficiency there though.


I would eliminate the need for StringBuilder completely, by creating an output file stream into which each line is written until the next object comes, then switching to a new file stream on a new object.
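A minimal sketch of the approach both answers describe, using the same config keys, marker, and encoding as the question (error handling omitted):

using System.Configuration;
using System.IO;
using System.Text;

// Stream each line straight into the current output file; a cut marker
// just closes the current writer and opens the next one, so memory use
// stays flat no matter how large the input files are.
int count = 0;
StreamWriter writer = null;
var di = new DirectoryInfo(ConfigurationManager.AppSettings["BilixFilesDir"]);
foreach (var file in di.GetFiles())
{
    using (var reader = new StreamReader(file.FullName, Encoding.GetEncoding(1250)))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Contains("%%XGF NEW_SET"))
            {
                if (writer != null)
                    writer.Close();     // finish the previous document
                writer = new StreamWriter(Path.Combine(
                    ConfigurationManager.AppSettings["OutputFilesDir"],
                    string.Format("{0}-{1}", count++, file.Name)));
            }
            else if (writer != null)
            {
                writer.WriteLine(line); // body line of the current document
            }
        }
    }
}
if (writer != null)
    writer.Close();                     // flush the last document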


Thanks for all the tips. After taking them into consideration I've modified my code into something like this:

DirectoryInfo di = new DirectoryInfo(ConfigurationManager.AppSettings["BilixFilesDir"]);
//getting all files from dir
var files = di.GetFiles();
int count = 0;
bool hasObject = false;
string line = "";
StreamWriter sw = null;
foreach (var file in files)
{
    using (StreamReader sr = new StreamReader(file.FullName, Encoding.GetEncoding(1250)))
    {
        while ((line = sr.ReadLine()) != null)
        {
            //when a new document starts
            if (line.Contains("%%XGF NEW_SET"))
            {
                //when a document was already open, store (close) it first
                if (hasObject)
                {
                    sw.Close();
                }
                //then always start a new output file - otherwise the second
                //marker would close the writer without reopening it, and the
                //next WriteLine would throw
                sw = new StreamWriter(string.Format("{0}/{1}-{2}", ConfigurationManager.AppSettings["OutputFilesDir"], count++, file.Name));
                hasObject = true;
                //Bill bill = new Bill();
            }
            else
                //when a document is open, keep appending its lines
                if (hasObject)
                    sw.WriteLine(line);
        }
        //when all work is done, save the last document of this file
        if (hasObject)
        {
            sw.Close();
            hasObject = false;
        }
    }
}
//all writers are already closed above; guard against the no-files case
if (sw != null)
    sw.Dispose();

What do you think about something like that?

One more thing I need to do: My big file can store different documents. All of them have different marking for start. Let's say there are 20 kinds of documents. Sometimes there is the same marking start but inside the document there are some additional markings that allow me to recognise type of document. What i mean is that for example 2 documents has the same marking start like "%%XGF NEW_SET" but one has latter on marking like "BILL_A" and other doesn't. And I have to create one more file for every cut file with some indexes from the document and a string which contains the type. So basicly before saving my StreamWriter I have to extract all those indexes and the type of document that's way I thought about the StringBuilder. So it's a next place when I need this high efficiency. Any good tips?
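One way to handle that without going back to a StringBuilder is to buffer only the current document's lines (40-50 short lines, so the memory cost is negligible) in a List<string>, scan them once when the document ends, and only then write both files. A rough sketch with hypothetical names: DetectType stands in for the check of the ~20 inner markers ("BILL_A" is the one from the question), and the index extraction is left as a stub:

using System.Collections.Generic;
using System.IO;

//hypothetical classifier: scans the buffered document for its inner marking
static string DetectType(List<string> lines)
{
    foreach (string l in lines)
        if (l.Contains("BILL_A"))
            return "BILL_A";
    //...checks for the other document types would go here...
    return "DEFAULT";
}

//writes the cut document plus its companion type/index file
static void FlushDocument(string outputDir, string sourceName, int count, List<string> lines)
{
    string baseName = Path.Combine(outputDir, string.Format("{0}-{1}", count, sourceName));
    File.WriteAllLines(baseName, lines.ToArray());
    //companion file: the detected type (real code would also append the extracted indexes)
    File.WriteAllText(baseName + ".type", DetectType(lines));
    //reset the buffer for the next document
    lines.Clear();
}

In the main loop, each line would be added to the list instead of being written immediately, and FlushDocument would run on every marker and at end-of-file.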


There are many different ways to read in and write out files in .NET. I have written a benchmark program and give the results in my blog:

http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp

I recommend using the Windows ReadFile and WriteFile methods if you need performance. Avoid any of the asynchronous methods, since my benchmark results show that you get better performance with synchronous I/O methods - at least for FileStream, which is the fastest .NET class for reading files. I wrote a class in C# that encapsulates the ReadFile and WriteFile functionality, which makes it quite easy to use.
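This is not that wrapper class, just an illustration of the P/Invoke involved: the kernel32 declaration below is the standard synchronous ReadFile signature, while the helper around it is a hypothetical sketch:

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class NativeIo
{
    //standard kernel32 declaration for synchronous ReadFile
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool ReadFile(SafeFileHandle hFile, byte[] lpBuffer,
        uint nNumberOfBytesToRead, out uint lpNumberOfBytesRead, IntPtr lpOverlapped);

    //illustrative helper: fills the buffer from the stream's handle, returns bytes read
    public static int Read(FileStream fs, byte[] buffer)
    {
        uint bytesRead;
        if (!ReadFile(fs.SafeFileHandle, buffer, (uint)buffer.Length, out bytesRead, IntPtr.Zero))
            throw new IOException("ReadFile failed, error " + Marshal.GetLastWin32Error());
        return (int)bytesRead;
    }
}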

Another interesting result: the benchmark also compared reading line by line vs. reading data in blocks of 65,536 bytes and parsing it into lines yourself. It turns out that reading the data in blocks and then parsing it into lines inside your program is more efficient. My download has some examples of how to do that.
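A sketch of the block-then-parse idea, assuming the 65,536-byte block size from the answer and a single-byte encoding such as the Windows-1250 used in the question (block boundaries can split characters in multi-byte encodings, which this simple version ignores):

using System.Collections.Generic;
using System.IO;
using System.Text;

//reads 64 KB blocks and splits them into lines, carrying the partial
//line at the end of each block over into the next one
static IEnumerable<string> ReadLinesInBlocks(string path, Encoding encoding)
{
    byte[] buffer = new byte[65536];
    string carry = "";
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            string chunk = carry + encoding.GetString(buffer, 0, read);
            string[] parts = chunk.Split('\n');
            //the last element may be an incomplete line - keep it for the next block
            for (int i = 0; i < parts.Length - 1; i++)
                yield return parts[i].TrimEnd('\r');
            carry = parts[parts.Length - 1];
        }
    }
    if (carry.Length > 0)
        yield return carry;
}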

I would love it if you would download it and try it out and report back either here or leave a comment on my blog if it is faster than StreamReader. According to my limited benchmarks, it is significantly faster.

Another idea to improve the performance of your program is to create multiple threads and have each thread process a file. Since you said that you have a few large files, I would break it up so that each large file has a separate thread.
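With .NET 4 that can be as simple as a Parallel.ForEach over the files. ProcessFile here is a hypothetical stand-in for the per-file splitting routine; note that a shared counter like the count variable in the code above would then need to be per-file (or incremented with Interlocked) to stay thread-safe:

using System.Configuration;
using System.IO;
using System.Threading.Tasks;

//one worker per large input file; each call must use its own
//writer and its own output-name counter
var files = new DirectoryInfo(ConfigurationManager.AppSettings["BilixFilesDir"]).GetFiles();
Parallel.ForEach(files, file => ProcessFile(file));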

If you are doing a lot of work with strings, then you should definitely be using StringBuilder. But, perhaps a more efficient way would be to read the data into a byte array and then build a byte array for output. I would be surprised if that was not more efficient than using StringBuilder.

Bob Bryan MCSD
