C# & LINQ; Grouping files into 5MB Groups
Greetings, I'm trying to write a LINQ query to run on a list of filenames which returns a list of files grouped into 5MB chunks. So each group will contain a list of filenames whose total size is 5MB at most.
I'm okay with LINQ but with this one I don't know where to begin. Help!
DirectoryInfo di = new DirectoryInfo(@"x:\logs");
List<string> FileList = di.GetFiles("*.xml").Select(f => f.FullName).ToList();
var Grouped = FileList =>
Yeah, you can do this with LINQ.
var groupedFiles = files.Aggregate(
new List<List<FileInfo>>(),
(groups, file) => {
List<FileInfo> group = groups.FirstOrDefault(
g => g.Sum(f => f.Length) + file.Length <= 1024 * 1024 * 5
);
if (group == null) {
group = new List<FileInfo>();
groups.Add(group);
}
group.Add(file);
return groups;
}
);
This algorithm is greedy. It just finds the first list it can shove the FileInfo into without blowing past the upper bound of 5MB. It isn't optimal in terms of minimizing the number of groups but you didn't state that as a constraint. I think an OrderBy(f => f.Length) before the call to Aggregate would help but I don't really have time to think deeply about that right now.
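On the ordering question: the usual bin-packing heuristic sorts largest-first rather than smallest-first ("first-fit decreasing"). Here is a rough sketch of that variant, not a tuned implementation. FilePacker and Pack are made-up names for illustration, and it keeps a running total per group instead of re-summing each group on every pass. A file larger than the limit simply ends up alone in its own group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library.
static class FilePacker
{
    // First-fit over items sorted largest-first (first-fit decreasing).
    public static List<List<T>> Pack<T>(IEnumerable<T> items, Func<T, long> size, long limit)
    {
        var groups = new List<List<T>>();
        var totals = new List<long>(); // running total for each group

        foreach (T item in items.OrderByDescending(size))
        {
            long s = size(item);

            // Index of the first group that still has room, or -1 if none does.
            int i = totals.FindIndex(t => t + s <= limit);
            if (i < 0)
            {
                groups.Add(new List<T>());
                totals.Add(0);
                i = groups.Count - 1;
            }

            groups[i].Add(item);
            totals[i] += s;
        }

        return groups;
    }
}
```

With the FileInfo array from the question's directory this would be called as FilePacker.Pack(files, f => f.Length, 5L * 1024 * 1024). It is still a heuristic, so it does not guarantee the minimum number of groups.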
Look at this StackOverflow question to start with. It addresses grouping into sublists. The trick then is detecting the size of the files in the group by clause. This may be an answer where not using LINQ may be clearer than using it.
Part of the problem is that you have a list of file names. You need a list of FileInfo objects so you can query the size of each file through LINQ. C# query syntax has a group ... by ... into construct that should be what you want.
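Getting from names to objects that know their size could look something like this. It is only a sketch: ToFileInfos is a made-up extension method name, and reading Length on a FileInfo whose file does not exist throws FileNotFoundException, so it assumes the names point at real files.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the framework.
static class FileNameExtensions
{
    // Wraps each file name in a FileInfo so that Length is available to later queries.
    public static List<FileInfo> ToFileInfos(this IEnumerable<string> fileNames)
    {
        return fileNames.Select(name => new FileInfo(name)).ToList();
    }
}
```

If the list comes from DirectoryInfo.GetFiles in the first place, it is simpler to keep the FileInfo[] it returns and skip the conversion to strings entirely.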
Here's one way:
- Define a type that takes a file size as input and returns a value which increments as a specified max is reached and resets. (This type is responsible for maintaining its own state.)
- Group by the values returned by this type.
Code example:
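// No idea what a better name for this would be...
class MaxAmountGrouper
{
    readonly long _max;
    int _id;
    long _current;

    public MaxAmountGrouper(long max)
    {
        _max = max;
    }

    // Returns the group id for this amount. If adding the amount would push
    // the running total past the max, a new group is started first, so a
    // group only exceeds the max when a single item is larger than the max.
    public int GetGroupId(long amount)
    {
        if (_current > 0 && _current + amount > _max)
        {
            _id++;
            _current = 0;
        }
        _current += amount;
        return _id;
    }
}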
Usage:
const int BytesPerMb = 1024 * 1024;
DirectoryInfo directory = new DirectoryInfo(@"x:\logs");
FileInfo[] files = directory.GetFiles("*.xml");
var grouper = new MaxAmountGrouper(5 * BytesPerMb);
var groups = files.GroupBy(f => grouper.GetGroupId((int)f.Length));
foreach (var g in groups)
{
long totalSize = g.Sum(f => f.Length);
Console.WriteLine("Group {0}: {1:F2} MB", g.Key, totalSize / (double)BytesPerMb);
foreach (FileInfo f in g)
{
Console.WriteLine("File: {0} ({1:F2} MB)", f.Name, f.Length / (double)BytesPerMb);
}
Console.WriteLine();
}
I would first throw the file list into a SQL table. Something like this, but with the size column included:
CREATE TABLE #DIR (fileName varchar(100))
INSERT INTO #DIR
EXEC master..xp_CmdShell 'DIR C:\RTHourly\*.xml /B'
Then it would be a SELECT statement, something like:
SELECT *,
CASE WHEN SIZE < 5 THEN 1
WHEN SIZE < 10 THEN 2
...
END AS Grouping
FROM #DIR
ORDER BY Grouping, FileName, Size
There is a security setting you have to change real quick on SQL Server to do this. See the blog posting HERE.