Any way to speed up this file parsing algorithm?
I wrote a small algorithm using LINQ to read in a bunch of files (about 30 MB in total) and store them in memory. Currently it takes about a minute for the program to finish reading in all the files, but I need this process to take only a few seconds.
Code:
List<ClimateDailyData> dailyData = new List<ClimateDailyData>();
if (File.Exists(FileName))
{
    StreamReader reader = new StreamReader(FileName);
    try
    {
        List<string[]> lines =
            Regex.Split(reader.ReadToEnd(), Environment.NewLine)
                .Where(l => !String.IsNullOrWhiteSpace(l) && !String.IsNullOrEmpty(l))
                .Select(l => l.Trim().Split(new char[] { ' ', '\t' })
                    .Where(f => !String.IsNullOrWhiteSpace(f) && !String.IsNullOrEmpty(f))
                    .Select(f => f.Trim())
                    .ToArray())
                .ToList();
        Latitude = double.Parse(lines[0][0]);
        Longitude = double.Parse(lines[0][1]);
        lines.RemoveRange(0, 2);
        foreach (string[] fields in lines)
        {
            ClimateDailyData dayData = new ClimateDailyData();
            dayData.DayDate = DateTime.ParseExact(fields[0], "yyyyMMdd",
                CultureInfo.InvariantCulture, DateTimeStyles.None);
            dayData.MaxTemp = double.Parse(fields[2]);
            dayData.MinTemp = double.Parse(fields[3]);
            dayData.Rain = double.Parse(fields[4]);
            dayData.Pan = double.Parse(fields[5]);
            dailyData.Add(dayData);
        }
    }
    finally { reader.Close(); }
}
SetValue(() => DailyData, dailyData);
Can anyone suggest how I could speed this code up? The majority of the time seems to be spent parsing the individual file fields (especially the date field).
However, if it cannot be sped up, I will simply make it so each individual file is loaded as required.
Thanks, Alex.
EDIT: Also, I decided to just store a few fields from each file rather than all the file data, and then load the rest of the data in a separate thread and make it available to the user as it finishes loading.
So now it only takes 2.7 seconds.
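Roughly, the background load looks like this (LoadRemainingData and OnRemainingDataLoaded are placeholder names for my real methods):

// using System.Threading.Tasks;
// Parse the summary fields synchronously, then hydrate the rest on a
// worker thread and marshal the result back to the UI thread.
Task.Factory.StartNew(() => LoadRemainingData(FileName))
    .ContinueWith(t => OnRemainingDataLoaded(t.Result),
                  TaskScheduler.FromCurrentSynchronizationContext());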
As noted in comments, it's an odd way of reading lines - but I wouldn't use File.ReadAllLines; I'd use File.ReadLines if you're using .NET 4, since that only reads one line at a time. Beyond that, you definitely don't need to call ToArray and ToList... I'd also use Select and ToList with Skip to create dailyData. Also, String.IsNullOrWhiteSpace already returns true for an empty string, so the extra String.IsNullOrEmpty checks are redundant and can be removed.
After splitting, you're currently trimming and removing any empty/whitespace entries. You can remove empty entries with StringSplitOptions.RemoveEmptyEntries, and if you're confident that the only whitespace in a line would be space or tab, you then don't need to worry about trimming or anything else. If you have other whitespace which needs trimming, it could still be a problem - but I doubt that's the case. One big benefit of that is that you can use the array returned by Split directly, rather than copying it to another collection.
private static readonly char[] Delimiters = { ' ', '\t' };
...
List<ClimateDailyData> dailyData;
if (!File.Exists(FileName))
{
    dailyData = new List<ClimateDailyData>();
}
else
{
    dailyData = File.ReadLines(FileName)
                    .Where(l => !String.IsNullOrWhiteSpace(l))
                    .Skip(2) // skip the two header lines (latitude/longitude etc.)
                    .Select(l => l.Trim()
                                  .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries))
                    .Select(fields => new ClimateDailyData
                    {
                        DayDate = DateTime.ParseExact(fields[0], "yyyyMMdd",
                                                      CultureInfo.InvariantCulture,
                                                      DateTimeStyles.None),
                        MaxTemp = double.Parse(fields[2]),
                        MinTemp = double.Parse(fields[3]),
                        Rain = double.Parse(fields[4]),
                        Pan = double.Parse(fields[5])
                    })
                    .ToList();
}
SetValue(() => DailyData, dailyData);
Here is my solution. It does not use those fancy LINQ queries :-), but it has some advantages:
- you don't read more than the needed fields
- you never allocate any List or Array, which saves memory especially for big collections, and should please the garbage collector (you can always do a new List<ClimateDailyData>(...) at the end if needed)

I also didn't use Split in case the lines are really long. It's all 'yield' based (the framework should have an IEnumerable version of Split, IMHO...).
public static IEnumerable<ClimateDailyData> ReadDailyData(string fileName)
{
    if (fileName == null)
        throw new ArgumentNullException("fileName");

    if (!File.Exists(fileName))
        yield break;

    int lineIndex = -1;
    foreach (string line in File.ReadLines(fileName))
    {
        lineIndex++;
        if (lineIndex == 0)
        {
            // handle latitude stuff here; note we skip the header line
            // instead of yielding an empty record for it
            continue;
        }

        ClimateDailyData dayData = new ClimateDailyData();
        int i = 0;
        foreach (string field in ReadDailyDataFields(line))
        {
            switch (i)
            {
                case 0:
                    dayData.DayDate = DateTime.ParseExact(field, "yyyyMMdd",
                        CultureInfo.InvariantCulture, DateTimeStyles.None);
                    break;
                case 2:
                    dayData.MaxTemp = double.Parse(field);
                    break;
                case 3:
                    dayData.MinTemp = double.Parse(field);
                    break;
                case 4:
                    dayData.Rain = double.Parse(field);
                    break;
                case 5:
                    dayData.Pan = double.Parse(field);
                    break;
                default:
                    break;
            }
            i++;
        }
        yield return dayData;
    }
}

public static IEnumerable<string> ReadDailyDataFields(string text)
{
    if (text == null)
        yield break;

    int lastPos = 0;
    for (int i = 0; i < text.Length; i++)
    {
        if ((text[i] == ' ') || (text[i] == '\t'))
        {
            if (i > lastPos)
            {
                string field = text.Substring(lastPos, i - lastPos).Trim();
                if (field.Length > 0)
                    yield return field;
                lastPos = i + 1;
            }
        }
    }
    if (text.Length > lastPos)
    {
        string field = text.Substring(lastPos, text.Length - lastPos).Trim();
        if (field.Length > 0)
            yield return field;
    }
}
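Usage could then look like this (reusing the SetValue call from the question), wrapping the iterator in a list only if you need one:

List<ClimateDailyData> dailyData = new List<ClimateDailyData>(ReadDailyData(FileName));
SetValue(() => DailyData, dailyData);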
Assuming that all the time is being spent creating lists and parsing dates, eliminating those could definitely help. Here's a combination of Jon's and Rick's answers:
List<ClimateDailyData> dailyData;
if (!File.Exists(FileName))
{
    dailyData = new List<ClimateDailyData>();
}
else
{
    dailyData = File
        .ReadLines(FileName)
        .Where(line => !String.IsNullOrWhiteSpace(line))
        .Skip(2)
        .Select(line => line.Split(new[] { ' ', '\t' },
                                   StringSplitOptions.RemoveEmptyEntries))
        .Select(fields => new ClimateDailyData
        {
            DayDate = new DateTime(
                int.Parse(fields[0].Substring(0, 4)),
                int.Parse(fields[0].Substring(4, 2)),
                int.Parse(fields[0].Substring(6, 2))),
            MaxTemp = double.Parse(fields[2]),
            MinTemp = double.Parse(fields[3]),
            Rain = double.Parse(fields[4]),
            Pan = double.Parse(fields[5])
        })
        .ToList();
}
SetValue(() => DailyData, dailyData);
It depends a lot on the size of your data, but if you're willing to put some effort in you can see much better performance than that in .NET.
...
I originally said this would certainly be faster - it's not. After inspecting the framework implementation, it seems DateTime.ParseExact uses a special string construct internally to parse out the date. That'll teach me for opening my mouth without profiling first :).
Getting rid of intermediate representations and working directly from char[]s will help for sure, though. For the fastest implementation, you want to pull from a FileStream into a fixed char[] buffer using a StreamReader, creating string instances only for conversion. I can also say for sure that Regex and String.Format will absolutely murder your performance.
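To illustrate the idea (this is a rough sketch, not the exact parser I wrote), here is what that looks like. It assumes the ClimateDailyData type and field layout from the question (date in field 0, max/min temperature in fields 2 and 3, rain in 4, pan in 5, two header lines); the buffer sizes are arbitrary:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class BufferedClimateReader
{
    // Sketch only: assumes every data line has at least six
    // whitespace-separated fields and that the first two non-blank
    // lines are headers, as in the question.
    public static IEnumerable<ClimateDailyData> Read(string fileName)
    {
        using (var reader = new StreamReader(fileName))
        {
            char[] buffer = new char[64 * 1024]; // one reusable read buffer
            char[] token = new char[32];         // scratch space for the current field
            string[] fields = new string[6];     // only the first six fields are kept
            int tokenLen = 0, fieldIndex = 0, lineIndex = 0;
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    char c = buffer[i];
                    if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
                    {
                        if (tokenLen > 0)
                        {
                            // The only string allocations: one per kept field.
                            if (fieldIndex < fields.Length)
                                fields[fieldIndex] = new string(token, 0, tokenLen);
                            fieldIndex++;
                            tokenLen = 0;
                        }
                        if (c == '\n' && fieldIndex > 0) // end of a non-blank line
                        {
                            if (lineIndex >= 2) // skip the two header lines
                                yield return ToDayData(fields);
                            lineIndex++;
                            fieldIndex = 0;
                        }
                    }
                    else if (tokenLen < token.Length)
                    {
                        token[tokenLen++] = c;
                    }
                }
            }
            // flush a final line that has no trailing newline
            if (tokenLen > 0 && fieldIndex < fields.Length)
                fields[fieldIndex++] = new string(token, 0, tokenLen);
            if (fieldIndex > 0 && lineIndex >= 2)
                yield return ToDayData(fields);
        }
    }

    private static ClimateDailyData ToDayData(string[] f)
    {
        return new ClimateDailyData
        {
            DayDate = DateTime.ParseExact(f[0], "yyyyMMdd",
                CultureInfo.InvariantCulture, DateTimeStyles.None),
            MaxTemp = double.Parse(f[2]),
            MinTemp = double.Parse(f[3]),
            Rain = double.Parse(f[4]),
            Pan = double.Parse(f[5])
        };
    }
}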
I recently wrote an XML parser using this technique combined with a yielding IEnumerable. Even on my SSD, disk access is over 95% of execution time.
However, I'm dealing with files in the 200 MB-2 GB range. This stuff is hard to get right if you're not used to it, and in your case going that far may be overkill.
Some hints:
- Measure where most of the time is spent. Don't guess! (A minimal Stopwatch sketch follows this list.)
- Use File.ReadAllLines so you get rid of the unneeded Regex.Split.
- The call to Trim and the check for emptiness can probably be turned around, so you can get rid of one check.
- If you change your iteration a little bit, you don't have to build a List out of the LINQ result. You can just use an iterator, which will probably reduce memory management overhead.
- Splitting a single line into an array would then happen in the inner loop, which once again will reduce memory allocation overhead.
- ClimateDailyData should get a useful constructor which manages the parsing job (see the sketch after this list). It's probably not performance relevant, but it will make the code cleaner, which is in general a good thing.
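For the first hint, a plain Stopwatch around each suspect piece is enough to find out where the time really goes; for example:

// Time one candidate bottleneck at a time instead of guessing.
var sw = System.Diagnostics.Stopwatch.StartNew();
// ... the parsing code under test ...
sw.Stop();
Console.WriteLine("Parsing took {0} ms", sw.ElapsedMilliseconds);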
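And a sketch of the constructor idea from the last hint, assuming the field layout from the question's code:

using System;
using System.Globalization;

public class ClimateDailyData
{
    public DateTime DayDate { get; set; }
    public double MaxTemp { get; set; }
    public double MinTemp { get; set; }
    public double Rain { get; set; }
    public double Pan { get; set; }

    public ClimateDailyData() { }

    // The parsing job lives with the type; field positions are taken
    // from the question (field 1 is unused there as well).
    public ClimateDailyData(string[] fields)
    {
        DayDate = DateTime.ParseExact(fields[0], "yyyyMMdd",
            CultureInfo.InvariantCulture, DateTimeStyles.None);
        MaxTemp = double.Parse(fields[2]);
        MinTemp = double.Parse(fields[3]);
        Rain = double.Parse(fields[4]);
        Pan = double.Parse(fields[5]);
    }
}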