
Is there any way I can speed up the opening and hashing of 15,000 small files in C#?

I'm computing SHA1 checksums for 15,000 images (40 KB - 1.0 MB each, approximately 1.8 GB total). I'd like to speed this up, as it is going to be a key operation in my program, and right now it is taking 500-600 seconds.

I've tried the following, which took 500 seconds:

public string GetChecksum(string filePath)
{
    using (FileStream fs = new FileStream(filePath, FileMode.Open))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(fs));
    }
}

Then I thought maybe the chunks SHA1Managed() was reading in were too small, so I used a BufferedStream and increased the buffer size to larger than any of the files I'm reading in.

public string GetChecksum(string filePath)
{
    using (var bs = new BufferedStream(File.OpenRead(filePath), 1200000))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(bs));
    }
}

This actually took 600 seconds.

Is there anything I can do to speed up these IO operations, or am I stuck with what I've got?


As per x0n's suggestion, I tried just reading each file into a byte array and discarding the result. It appears I'm IO bound, as that alone took ~480 seconds.
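
A minimal sketch of that read-only test (assuming filePaths holds the same 15,000 image paths used for hashing):

// Read every file and throw the bytes away, so only disk IO is measured.
foreach (string filePath in filePaths)
{
    byte[] discarded = File.ReadAllBytes(filePath);
}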


You are creating and destroying a SHA1Managed instance for EVERY file; this is horrifically inefficient. Create it once and call ComputeHash 15,000 times instead, and you'll get a huge performance increase (IMO).

public Dictionary<string, string> GetChecksums(string[] filePaths)
{
    // Size the dictionary up front and reuse a single hasher for every file.
    var checksums = new Dictionary<string, string>(filePaths.Length);

    using (SHA1Managed sha1 = new SHA1Managed())
    {
        foreach (string filePath in filePaths)
        {
            using (var fs = File.OpenRead(filePath))
            {
                checksums.Add(filePath, BitConverter.ToString(sha1.ComputeHash(fs)));
            }
        }
    }
    return checksums;
}

Hash objects are not free to create and destroy, so constructing a new one for every file adds avoidable overhead.

-Oisin


Profile it first.

Try dotTrace: http://www.jetbrains.com/profiler/
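
If a profiler is not to hand, a rough split between read time and hash time can be had with Stopwatch. This is only a sketch: filePaths and GetChecksum are assumed from the question, and the OS file cache will favour whichever loop runs second, so run the test twice and compare warm timings.

var readTimer = System.Diagnostics.Stopwatch.StartNew();
foreach (string filePath in filePaths)
{
    File.ReadAllBytes(filePath);          // IO only
}
readTimer.Stop();

var hashTimer = System.Diagnostics.Stopwatch.StartNew();
foreach (string filePath in filePaths)
{
    GetChecksum(filePath);                // IO + SHA1
}
hashTimer.Stop();

Console.WriteLine("Read only: {0} ms, read + hash: {1} ms",
    readTimer.ElapsedMilliseconds, hashTimer.ElapsedMilliseconds);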


You didn't say whether your operation is CPU bound or IO bound.

With a hash, I would suspect it is CPU bound. If it is CPU bound, you will see the CPU saturated (100% utilized) during the computation of the SHA hashes. If it is IO bound, the CPU will not be saturated.

If it is CPU bound, and you have a multi-CPU or multi-core machine (true for most laptops built in the last 2 years, and almost all servers built since 2002), then you can get an instant increase by using multiple threads and multiple SHA1Managed instances, computing the SHAs in parallel. If it's a dual-core machine, that's 2x. If it's a dual-core 2-CPU machine (a typical server), you'll get 4x throughput.

By the way, when a single-threaded program like yours "saturates" the CPU on a dual-core machine, it will show up as 50% utilization in Windows Task Manager.

You need to manage the workflow through the threads, to keep track of which thread is working on which file. But this isn't hard to do.
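
A minimal sketch of that approach, assuming .NET 4's Parallel.ForEach is available; the localInit/localFinally overload gives each worker thread its own hasher so no SHA1 instance is shared between threads (ParallelHasher is just an illustrative name):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

public static class ParallelHasher
{
    public static ConcurrentDictionary<string, string> GetChecksums(string[] filePaths)
    {
        var checksums = new ConcurrentDictionary<string, string>();

        Parallel.ForEach(
            filePaths,
            () => new SHA1Managed(),             // one hasher per worker thread
            (filePath, loopState, sha1) =>
            {
                using (var fs = File.OpenRead(filePath))
                {
                    checksums[filePath] = BitConverter.ToString(sha1.ComputeHash(fs));
                }
                return sha1;                     // hand the hasher back for reuse
            },
            sha1 => sha1.Dispose());             // dispose once per thread when done

        return checksums;
    }
}

If the earlier read-only test is right and the workload is IO bound on a single spindle, the parallel version may not help much, but on an SSD or a CPU-bound workload it scales roughly with the number of cores.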


Use a "ramdisk" - build a file system in memory.


Have you tried using the SHA1CryptoServiceProvider class instead of SHA1Managed? SHA1CryptoServiceProvider is implemented in native code, not managed code, and was much quicker in my experience. For example:

public static byte[] CreateSHA1Hash(string filePath)
{
    byte[] hash = null;

    using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
    {
        // 128 KB buffer, read-only, shared access.
        using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 131072))
        {
            hash = sha1.ComputeHash(fs);
        }
    }

    return hash;
}

Also, with 15,000 files I would use a file enumerator approach (i.e. the WinAPI functions FindFirstFile() and FindNextFile()) rather than the standard .NET Directory.GetFiles().

Directory.GetFiles loads all file paths into memory in one go, which is often much slower than enumerating the files directory by directory with the WinAPI functions.
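
If .NET 4 is available, Directory.EnumerateFiles gives the same lazy, streaming behaviour as the FindFirstFile/FindNextFile pair without hand-written P/Invoke. A sketch combining it with the CreateSHA1Hash method above (the directory path and search pattern are only examples):

// Lazily enumerate the image folder instead of materialising all 15,000 paths up front.
foreach (string filePath in Directory.EnumerateFiles(@"C:\images", "*.jpg", SearchOption.AllDirectories))
{
    byte[] hash = CreateSHA1Hash(filePath);
    // ...store or compare the hash here...
}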
