I have a non-performant method; how can I improve its efficiency?
I have a simple method to compare an array of FileInfo objects against a list of filenames to check which files have already been processed. The unprocessed list is then returned.
The loop of this method iterates over about 250,000 FileInfo objects. This is taking an obscene amount of time to complete.
The inefficiency is obviously the Contains method call on the processedFiles collection.
First, how can I verify that my suspicion about the cause is correct, and second, how can I improve the method to speed the process up?
public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
{
    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFiles.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}
A List<T>'s Contains method runs in linear time, since it potentially has to enumerate the entire list to prove the existence or non-existence of an item. I would suggest you use a HashSet<string> or similar instead. A HashSet<T>'s Contains method is designed to run in constant O(1) time, i.e., it shouldn't depend on the number of items in the set.
This small change should make the entire method run in linear time:
public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles,
    List<string> processedFiles)
{
    List<FileInfo> unprocessedFiles = new List<FileInfo>();
    HashSet<string> processedFileSet = new HashSet<string>(processedFiles);
    foreach (FileInfo fileInfo in allFiles)
    {
        if (!processedFileSet.Contains(fileInfo.Name))
        {
            unprocessedFiles.Add(fileInfo);
        }
    }
    return unprocessedFiles;
}
I would suggest three improvements, if possible:

- For extra efficiency, store the processed files in a set at the source, so that this method takes an ISet<T> as a parameter. This way, you won't have to reconstruct the set every time.
- Try not to mix and match different representations of the same entity (string and FileInfo) in this fashion. Pick one and go with it.
- You might also want to consider the HashSet<T>.ExceptWith method instead of doing the looping yourself (see the sketch after this list). Bear in mind that this will mutate the collection.
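Here is a minimal sketch combining the first and third suggestions; the ISet<string> parameter, the GetUnprocessedFileNames name, and the switch to returning names rather than FileInfo objects are my own choices for illustration, not part of the original code:

using System.Collections.Generic;
using System.IO;
using System.Linq;

public static List<string> GetUnprocessedFileNames(FileInfo[] allFiles, ISet<string> processedFiles)
{
    // Project the FileInfo array down to names, then strip every processed
    // name in one call. ExceptWith mutates the set it is called on, which
    // is why a fresh set is built here rather than touching the parameter.
    var names = new HashSet<string>(allFiles.Select(f => f.Name));
    names.ExceptWith(processedFiles);
    return names.ToList();
}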
If you can use LINQ, and you can afford to build up a set on every call, here's another way:
public static IEnumerable<string> GetUnprocessedFiles
    (IEnumerable<string> allFiles, IEnumerable<string> processedFiles)
{
    // null-checks here
    return allFiles.Except(processedFiles);
}
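For example, hypothetical usage against this signature (the directory path and the LoadProcessedFileNames helper are made up for illustration):

FileInfo[] allFiles = new DirectoryInfo(@"C:\some\folder").GetFiles();
List<string> processedFiles = LoadProcessedFileNames(); // hypothetical helper

IEnumerable<string> unprocessed =
    GetUnprocessedFiles(allFiles.Select(f => f.Name), processedFiles);

Note that Except builds a set from processedFiles internally, so this still runs in roughly O(n + m) rather than O(n * m).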
I would try to convert the processedFiles List to a HashSet. With a list, it needs to iterate the list every time you call Contains. A HashSet lookup is an O(1) operation.
You could use a dictionary/hashtable-like class to speed up the lookup process significantly. Even translating the incoming List into a hashtable once, then using that one, will be much quicker than what you're using.
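A sketch of that one-time translation, with illustrative names; fileInfo stands for the current item of the question's loop, and a HashSet<string> would serve equally well here:

// Build the lookup once: O(n). The bool value is unused; the dictionary
// serves purely as a hash-based membership test.
var lookup = new Dictionary<string, bool>(processedFiles.Count);
foreach (string name in processedFiles)
    lookup[name] = true;

// Each subsequent check is an O(1) hash lookup:
bool alreadyProcessed = lookup.ContainsKey(fileInfo.Name);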
- Sort the searched array by file name
- Employ Array.BinarySearch<T>() to search the array (see the sketch after this list). This should come out at about O(log N) efficiency.
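A sketch of that approach, assuming an ordinal string comparison is acceptable:

// Sort once: O(m log m). Each lookup afterwards is O(log m).
string[] processed = processedFiles.ToArray();
Array.Sort(processed, StringComparer.Ordinal);

var unprocessed = new List<FileInfo>();
foreach (FileInfo fileInfo in allFiles)
{
    // BinarySearch returns a non-negative index when the value is found.
    if (Array.BinarySearch(processed, fileInfo.Name, StringComparer.Ordinal) < 0)
        unprocessed.Add(fileInfo);
}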
Checking whether a list contains an element is faster with a sorted list.
Just to be excessively pedantic ...
If you know that both lists are sorted (FileInfo lists often come pre-sorted, so this approach might be applicable to you), then you can achieve true linear performance without the time and memory overhead of a hashset. Hashset construction still requires linear time to build, so complexity is closer to O(n + m); the hashset also has to allocate internal object references for up to 250,000 strings in your case, and that is going to cost in GC terms.
Something like this half-baked generalisation might help:
public static IEnumerable<string> GetMismatches(IList<string> fileNames, IList<string> processedFileNames, StringComparer comparer)
{
    var filesIndex = 0;
    var procFilesIndex = 0;

    while (filesIndex < fileNames.Count)
    {
        if (procFilesIndex >= processedFileNames.Count)
        {
            // Processed list exhausted: everything left is a mismatch.
            yield return fileNames[filesIndex++];
        }
        else
        {
            var rc = comparer.Compare(fileNames[filesIndex], processedFileNames[procFilesIndex]);
            if (rc != 0)
            {
                if (rc < 0)
                {
                    // Current file sorts before the current processed name,
                    // so it cannot appear anywhere in the processed list.
                    yield return fileNames[filesIndex++];
                }
                else
                {
                    procFilesIndex++;
                }
            }
            else
            {
                // Names match: advance both cursors.
                filesIndex++;
                procFilesIndex++;
            }
        }
    }

    yield break;
}
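Hypothetical usage, with the caveat that both inputs must be sorted with the same comparer that is passed in, or the merge walk will produce wrong results:

var names = allFiles.Select(f => f.Name)
                    .OrderBy(n => n, StringComparer.OrdinalIgnoreCase)
                    .ToList();
processedFiles.Sort(StringComparer.OrdinalIgnoreCase);

var unprocessed = GetMismatches(names, processedFiles, StringComparer.OrdinalIgnoreCase)
    .ToList();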
I would strongly agree with Ani that sticking to a generic or canonical type is A Very Good Thing Indeed. But I'll give mine -1 for unfinished generalisation and -1 for elegance...