I have a non-performant method; how can I improve its efficiency?
I have a simple method to compare an array of FileInfo objects against a list of filenames to check which files have already been processed. The unprocessed list is then returned.
The loop of this method iterates over about 250,000 FileInfo objects. This is taking an obscene amount of time to complete.
The inefficiency is obviously the Contains method call on the processedFiles collection.
First, how can I verify that my suspicion about the cause is correct, and second, how can I improve the method to speed the process up?
public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
{
    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFiles.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}
A List<T>'s Contains method runs in linear time, since it potentially has to enumerate the entire list to prove the existence or non-existence of an item. I would suggest you use a HashSet<string> or similar instead. A HashSet<T>'s Contains method is designed to run in constant O(1) time, i.e., it shouldn't depend on the number of items in the set.
This small change should make the entire method run in linear time:
public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles,
    List<string> processedFiles)
{
    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    HashSet<string> processedFileSet = new HashSet<string>(processedFiles);
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFileSet.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}
I would suggest three improvements, if possible:

- For extra efficiency, store the processed files in a set at the source, so that this method takes an ISet<T> as a parameter. This way, you won't have to reconstruct the set every time.
- Try not to mix and match different representations of the same entity (string and FileInfo) in this fashion. Pick one and go with it.
- You might also want to consider the HashSet<T>.ExceptWith method instead of doing the looping yourself (see the sketch after this list). Bear in mind that this will mutate the collection.
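Here is a minimal sketch combining the first and third suggestions; the ISet<string> parameter, the GetUnprocessedFileNames name, and the switch to returning names rather than FileInfo objects are my own choices for illustration, not part of the original code:

using System.Collections.Generic;
using System.IO;
using System.Linq;

public static List<string> GetUnprocessedFileNames(FileInfo[] allFiles, ISet<string> processedFiles)
{
    // Project the FileInfo array down to names, then strip every processed
    // name in one call. ExceptWith mutates the set it is called on, which
    // is why a fresh set is built here rather than touching the parameter.
    var names = new HashSet<string>(allFiles.Select(f => f.Name));
    names.ExceptWith(processedFiles);
    return names.ToList();
}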
If you can use LINQ, and you can afford to build up a set on every call, here's another way:
public static IEnumerable<string> GetUnprocessedFiles
    (IEnumerable<string> allFiles, IEnumerable<string> processedFiles)
{
    // null-checks here
    return allFiles.Except(processedFiles);
}
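For example, hypothetical usage against this signature (the directory path and the LoadProcessedFileNames helper are made up for illustration):

FileInfo[] allFiles = new DirectoryInfo(@"C:\some\folder").GetFiles();
List<string> processedFiles = LoadProcessedFileNames(); // hypothetical helper

IEnumerable<string> unprocessed =
    GetUnprocessedFiles(allFiles.Select(f => f.Name), processedFiles);

Note that Except builds a set from processedFiles internally, so this still runs in roughly O(n + m) rather than O(n * m).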
I would try to convert the processedFiles List to a HashSet. With a list, it needs to iterate the list every time you call Contains. A HashSet lookup is an O(1) operation.
You could use a dictionary/hashtable-like class to speed up the lookup process significantly. Even translating the incoming List into a hashtable once, then using that one, will be much quicker than what you're using.
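A sketch of that one-time translation, with illustrative names; fileInfo stands for the current item of the question's loop, and a HashSet<string> would serve equally well here:

// Build the lookup once: O(n). The bool value is unused; the dictionary
// serves purely as a hash-based membership test.
var lookup = new Dictionary<string, bool>(processedFiles.Count);
foreach (string name in processedFiles)
    lookup[name] = true;

// Each subsequent check is an O(1) hash lookup:
bool alreadyProcessed = lookup.ContainsKey(fileInfo.Name);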
- Sort the searched array by file name
- Employ Array.BinarySearch<T>() to search the array (see the sketch after this list). This should come out at about O(log N) efficiency.
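A sketch of that approach, assuming an ordinal string comparison is acceptable:

// Sort once: O(m log m). Each lookup afterwards is O(log m).
string[] processed = processedFiles.ToArray();
Array.Sort(processed, StringComparer.Ordinal);

var unprocessed = new List<FileInfo>();
foreach (FileInfo fileInfo in allFiles)
{
    // BinarySearch returns a non-negative index when the value is found.
    if (Array.BinarySearch(processed, fileInfo.Name, StringComparer.Ordinal) < 0)
        unprocessed.Add(fileInfo);
}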
Checking whether a list contains an element is faster with a sorted list.
Just to be excessively pedantic ...
If you know that both lists are sorted (FileInfo lists often come pre-sorted, so this approach might be applicable to you), then you can achieve true linear performance without the time and memory overhead of a hashset. Hashset construction still requires linear time to build, so complexity is closer to O(n + m); the hashset also has to allocate internal object references for up to 250,000 strings in your case, and that is going to cost in GC terms.
Something like this half-baked generalisation might help:
public static IEnumerable<string> GetMismatches(IList<string> fileNames, IList<string> processedFileNames, StringComparer comparer)
{
    var filesIndex = 0;
    var procFilesIndex = 0;

    while (filesIndex < fileNames.Count)
    {
        if (procFilesIndex >= processedFileNames.Count)
        {
            // Processed list exhausted: everything left is a mismatch.
            yield return fileNames[filesIndex++];
        }
        else
        {
            var rc = comparer.Compare(fileNames[filesIndex], processedFileNames[procFilesIndex]);
            if (rc != 0)
            {
                if (rc < 0)
                {
                    // Current file sorts before the current processed name,
                    // so it cannot appear anywhere in the processed list.
                    yield return fileNames[filesIndex++];
                }
                else
                {
                    procFilesIndex++;
                }
            }
            else
            {
                // Names match: advance both cursors.
                filesIndex++;
                procFilesIndex++;
            }
        }
    }

    yield break;
}
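Hypothetical usage, with the caveat that both inputs must be sorted with the same comparer that is passed in, or the merge walk will produce wrong results:

var names = allFiles.Select(f => f.Name)
                    .OrderBy(n => n, StringComparer.OrdinalIgnoreCase)
                    .ToList();
processedFiles.Sort(StringComparer.OrdinalIgnoreCase);

var unprocessed = GetMismatches(names, processedFiles, StringComparer.OrdinalIgnoreCase)
    .ToList();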
I would strongly agree with Ani that sticking to a generic or canonical type is A Very Good Thing Indeed. But I'll give mine -1 for unfinished generalisation and -1 for elegance...