Any way to speed up this file parsing algorithm?
I wrote a small algorithm using LINQ to read in a bunch of files (about 30 MB in total) and store them in memory. Currently it takes about a minute for the program to finish reading in all the files, but I need this process to take only a few seconds.
Code:
List<ClimateDailyData> dailyData = new List<ClimateDailyData>();
if (File.Exists(FileName))
{
    StreamReader reader = new StreamReader(FileName);
    try
    {
        List<string[]> lines =
            Regex.Split(reader.ReadToEnd(), Environment.NewLine)
                .Where(l => !String.IsNullOrWhiteSpace(l) && !String.IsNullOrEmpty(l))
                .Select(l => l.Trim().Split(new char[] { ' ', '\t' })
                    .Where(f => !String.IsNullOrWhiteSpace(f) && !String.IsNullOrEmpty(f))
                    .Select(f => f.Trim())
                    .ToArray())
                .ToList();
        Latitude = double.Parse(lines[0][0]);
        Longitude = double.Parse(lines[0][1]);
        lines.RemoveRange(0, 2);
        foreach (string[] fields in lines)
        {
            ClimateDailyData dayData = new ClimateDailyData();
            dayData.DayDate = DateTime.ParseExact(fields[0], "yyyyMMdd",
                CultureInfo.InvariantCulture, DateTimeStyles.None);
            dayData.MaxTemp = double.Parse(fields[2]);
            dayData.MinTemp = double.Parse(fields[3]);
            dayData.Rain = double.Parse(fields[4]);
            dayData.Pan = double.Parse(fields[5]);
            dailyData.Add(dayData);
        }
    }
    finally { reader.Close(); }
}
SetValue(() => DailyData, dailyData);
Can anyone suggest how I could speed this code up? The majority of the time seems to be spent parsing the individual file fields (especially the date field).
However, if it cannot be sped up, I will simply make it so each individual file is loaded as required.
Thanks, Alex.
EDIT: Also, I decided to just store a few fields from each file rather than all the file data, and then load the rest of the data in a separate thread and make it available to the user as it finishes loading.
So now it only takes 2.7 seconds.
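Roughly, the background load looks like this (LoadRemainingData and OnRemainingDataLoaded are placeholder names for my real methods):

// using System.Threading.Tasks;
// Parse the summary fields synchronously, then hydrate the rest on a
// worker thread and marshal the result back to the UI thread.
Task.Factory.StartNew(() => LoadRemainingData(FileName))
    .ContinueWith(t => OnRemainingDataLoaded(t.Result),
                  TaskScheduler.FromCurrentSynchronizationContext());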
As noted in comments, it's an odd way of reading lines - but I wouldn't use File.ReadAllLines; I'd use File.ReadLines if you're using .NET 4, since that only reads one line at a time. Beyond that, you definitely don't need to call ToArray and ToList... I'd also use Select and ToList with Skip to create dailyData. Also, String.IsNullOrWhiteSpace already returns true for an empty string, so the extra String.IsNullOrEmpty checks are redundant and can be removed.
After splitting, you're currently trimming and removing any empty/whitespace entries. You can remove empty entries with StringSplitOptions.RemoveEmptyEntries, and if you're confident that the only whitespace in a line would be space or tab, you then don't need to worry about trimming or anything else. If you have other whitespace which needs trimming, it could still be a problem - but I doubt that's the case. One big benefit of that is that you can use the array returned by Split directly, rather than copying it to another collection.
private static readonly char[] Delimiters = { ' ', '\t' };
...
List<ClimateDailyData> dailyData;
if (!File.Exists(FileName))
{
    dailyData = new List<ClimateDailyData>();
}
else
{
    dailyData = File.ReadLines(FileName)
                    .Where(l => !String.IsNullOrWhiteSpace(l))
                    .Skip(2) // skip the two header lines (latitude/longitude etc.)
                    .Select(l => l.Trim()
                                  .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries))
                    .Select(fields => new ClimateDailyData
                    {
                        DayDate = DateTime.ParseExact(fields[0], "yyyyMMdd",
                                                      CultureInfo.InvariantCulture,
                                                      DateTimeStyles.None),
                        MaxTemp = double.Parse(fields[2]),
                        MinTemp = double.Parse(fields[3]),
                        Rain = double.Parse(fields[4]),
                        Pan = double.Parse(fields[5])
                    })
                    .ToList();
}
SetValue(() => DailyData, dailyData);
Here is my solution. It does not use those fancy LINQ queries :-), but it has some advantages:
- you don't read more than the needed fields
- you never allocate any List or Array, which saves memory especially for big collections, and should please the garbage collector (you can always do a new List<ClimateDailyData>(...) at the end if needed)

I also didn't use Split in case the lines are really long. It's all 'yield' based (the framework should have an IEnumerable version of Split, IMHO...).
public static IEnumerable<ClimateDailyData> ReadDailyData(string fileName)
{
    if (fileName == null)
        throw new ArgumentNullException("fileName");

    if (!File.Exists(fileName))
        yield break;

    int lineIndex = -1;
    foreach (string line in File.ReadLines(fileName))
    {
        lineIndex++;
        if (lineIndex == 0)
        {
            // handle latitude stuff here; note we skip the header line
            // instead of yielding an empty record for it
            continue;
        }

        ClimateDailyData dayData = new ClimateDailyData();
        int i = 0;
        foreach (string field in ReadDailyDataFields(line))
        {
            switch (i)
            {
                case 0:
                    dayData.DayDate = DateTime.ParseExact(field, "yyyyMMdd",
                        CultureInfo.InvariantCulture, DateTimeStyles.None);
                    break;
                case 2:
                    dayData.MaxTemp = double.Parse(field);
                    break;
                case 3:
                    dayData.MinTemp = double.Parse(field);
                    break;
                case 4:
                    dayData.Rain = double.Parse(field);
                    break;
                case 5:
                    dayData.Pan = double.Parse(field);
                    break;
                default:
                    break;
            }
            i++;
        }
        yield return dayData;
    }
}

public static IEnumerable<string> ReadDailyDataFields(string text)
{
    if (text == null)
        yield break;

    int lastPos = 0;
    for (int i = 0; i < text.Length; i++)
    {
        if ((text[i] == ' ') || (text[i] == '\t'))
        {
            if (i > lastPos)
            {
                string field = text.Substring(lastPos, i - lastPos).Trim();
                if (field.Length > 0)
                    yield return field;
                lastPos = i + 1;
            }
        }
    }
    if (text.Length > lastPos)
    {
        string field = text.Substring(lastPos, text.Length - lastPos).Trim();
        if (field.Length > 0)
            yield return field;
    }
}
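Usage could then look like this (reusing the SetValue call from the question), wrapping the iterator in a list only if you need one:

List<ClimateDailyData> dailyData = new List<ClimateDailyData>(ReadDailyData(FileName));
SetValue(() => DailyData, dailyData);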
Assuming that all the time is being spent creating lists and parsing dates, eliminating those could definitely help. Here's a combination of Jon's and Rick's answers:
List<ClimateDailyData> dailyData;
if (!File.Exists(FileName))
{
    dailyData = new List<ClimateDailyData>();
}
else
{
    dailyData = File
        .ReadLines(FileName)
        .Where(line => !String.IsNullOrWhiteSpace(line))
        .Skip(2)
        .Select(line => line.Split(new[] { ' ', '\t' },
                                   StringSplitOptions.RemoveEmptyEntries))
        .Select(fields => new ClimateDailyData
        {
            DayDate = new DateTime(
                int.Parse(fields[0].Substring(0, 4)),
                int.Parse(fields[0].Substring(4, 2)),
                int.Parse(fields[0].Substring(6, 2))),
            MaxTemp = double.Parse(fields[2]),
            MinTemp = double.Parse(fields[3]),
            Rain = double.Parse(fields[4]),
            Pan = double.Parse(fields[5])
        })
        .ToList();
}
SetValue(() => DailyData, dailyData);
It depends a lot on the size of your data, but if you're willing to put some effort in you can see much better performance than that in .NET.
...
I originally said this would certainly be faster - it's not. After inspecting the framework implementation, it seems DateTime.ParseExact uses a special string construct internally to parse out the date. That'll teach me for opening my mouth without profiling first :).
Getting rid of intermediate representations and working directly from char[]s will help for sure, though. For the fastest implementation, you want to pull from a FileStream into a fixed char[] buffer using a StreamReader, creating string instances only for conversion. I can also say for sure that Regex and String.Format will absolutely murder your performance.
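To illustrate the idea (this is a rough sketch, not the exact parser I wrote), here is what that looks like. It assumes the ClimateDailyData type and field layout from the question (date in field 0, max/min temperature in fields 2 and 3, rain in 4, pan in 5, two header lines); the buffer sizes are arbitrary:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class BufferedClimateReader
{
    // Sketch only: assumes every data line has at least six
    // whitespace-separated fields and that the first two non-blank
    // lines are headers, as in the question.
    public static IEnumerable<ClimateDailyData> Read(string fileName)
    {
        using (var reader = new StreamReader(fileName))
        {
            char[] buffer = new char[64 * 1024]; // one reusable read buffer
            char[] token = new char[32];         // scratch space for the current field
            string[] fields = new string[6];     // only the first six fields are kept
            int tokenLen = 0, fieldIndex = 0, lineIndex = 0;
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    char c = buffer[i];
                    if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
                    {
                        if (tokenLen > 0)
                        {
                            // The only string allocations: one per kept field.
                            if (fieldIndex < fields.Length)
                                fields[fieldIndex] = new string(token, 0, tokenLen);
                            fieldIndex++;
                            tokenLen = 0;
                        }
                        if (c == '\n' && fieldIndex > 0) // end of a non-blank line
                        {
                            if (lineIndex >= 2) // skip the two header lines
                                yield return ToDayData(fields);
                            lineIndex++;
                            fieldIndex = 0;
                        }
                    }
                    else if (tokenLen < token.Length)
                    {
                        token[tokenLen++] = c;
                    }
                }
            }
            // flush a final line that has no trailing newline
            if (tokenLen > 0 && fieldIndex < fields.Length)
                fields[fieldIndex++] = new string(token, 0, tokenLen);
            if (fieldIndex > 0 && lineIndex >= 2)
                yield return ToDayData(fields);
        }
    }

    private static ClimateDailyData ToDayData(string[] f)
    {
        return new ClimateDailyData
        {
            DayDate = DateTime.ParseExact(f[0], "yyyyMMdd",
                CultureInfo.InvariantCulture, DateTimeStyles.None),
            MaxTemp = double.Parse(f[2]),
            MinTemp = double.Parse(f[3]),
            Rain = double.Parse(f[4]),
            Pan = double.Parse(f[5])
        };
    }
}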
I recently wrote an XML parser using this technique combined with a yielding IEnumerable. Even on my SSD, disk access is over 95% of execution time.
However, I'm dealing with files in the 200 MB-2 GB range. This stuff is hard to get right if you're not used to it, and in your case going that far may be overkill.
Some hints:
- Measure where most of the time is spent. Don't guess! (A minimal Stopwatch sketch follows this list.)
- Use File.ReadAllLines so you get rid of the unneeded Regex.Split.
- The call to Trim and the check for emptiness can probably be turned around, so you can get rid of one check.
- If you change your iteration a little bit, you don't have to build a List out of the LINQ result. You can just use an iterator, which will probably reduce memory management overhead.
- Splitting a single line into an array would then happen in the inner loop, which once again will reduce memory allocation overhead.
- ClimateDailyData should get a useful constructor which manages the parsing job (see the sketch after this list). It's probably not performance relevant, but it will make the code cleaner, which is in general a good thing.
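For the first hint, a plain Stopwatch around each suspect piece is enough to find out where the time really goes; for example:

// Time one candidate bottleneck at a time instead of guessing.
var sw = System.Diagnostics.Stopwatch.StartNew();
// ... the parsing code under test ...
sw.Stop();
Console.WriteLine("Parsing took {0} ms", sw.ElapsedMilliseconds);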
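And a sketch of the constructor idea from the last hint, assuming the field layout from the question's code:

using System;
using System.Globalization;

public class ClimateDailyData
{
    public DateTime DayDate { get; set; }
    public double MaxTemp { get; set; }
    public double MinTemp { get; set; }
    public double Rain { get; set; }
    public double Pan { get; set; }

    public ClimateDailyData() { }

    // The parsing job lives with the type; field positions are taken
    // from the question (field 1 is unused there as well).
    public ClimateDailyData(string[] fields)
    {
        DayDate = DateTime.ParseExact(fields[0], "yyyyMMdd",
            CultureInfo.InvariantCulture, DateTimeStyles.None);
        MaxTemp = double.Parse(fields[2]);
        MinTemp = double.Parse(fields[3]);
        Rain = double.Parse(fields[4]);
        Pan = double.Parse(fields[5]);
    }
}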