
Is there any way I can speed up the opening and hashing of 15,000 small files in C#?

I'm computing SHA1 checksums for 15,000 images (40 KB - 1.0 MB each, approximately 1.8 GB total). I'd like to speed this up, as it is going to be a key operation in my program, and right now it is taking 500-600 seconds.

I've tried the following, which took 500 seconds:

public string GetChecksum(string filePath)
{
    using (FileStream fs = new FileStream(filePath, FileMode.Open))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(fs));
    }
}

Then I thought maybe the chunks SHA1Managed() was reading in were too small, so I used a BufferedStream and increased the buffer size to larger than any of the files I'm reading in.

public string GetChecksum(string filePath)
{
    using (var bs = new BufferedStream(File.OpenRead(filePath), 1200000))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(bs));
    }
}

This actually took 600 seconds.

Is there anything I can do to speed up these IO operations, or am I stuck with what I've got?


As per x0n's suggestion, I tried just reading each file into a byte array and discarding the result. It appears I'm IO bound, as that alone took ~480 seconds.
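
A minimal sketch of that read-only test (assuming filePaths holds the same 15,000 image paths used for hashing):

// Read every file and throw the bytes away, so only disk IO is measured.
foreach (string filePath in filePaths)
{
    byte[] discarded = File.ReadAllBytes(filePath);
}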


You are creating and destroying a SHA1Managed instance for EVERY file; this is horrifically inefficient. Create it once and call ComputeHash 15,000 times instead, and you'll get a huge performance increase (IMO).

public Dictionary<string, string> GetChecksums(string[] filePaths)
{
    // Size the dictionary up front and reuse a single hasher for every file.
    var checksums = new Dictionary<string, string>(filePaths.Length);

    using (SHA1Managed sha1 = new SHA1Managed())
    {
        foreach (string filePath in filePaths)
        {
            using (var fs = File.OpenRead(filePath))
            {
                checksums.Add(filePath, BitConverter.ToString(sha1.ComputeHash(fs)));
            }
        }
    }
    return checksums;
}

Hash objects are not free to create and destroy, so constructing a new one for every file adds avoidable overhead.

-Oisin


Profile it first.

Try dotTrace: http://www.jetbrains.com/profiler/
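
If a profiler is not to hand, a rough split between read time and hash time can be had with Stopwatch. This is only a sketch: filePaths and GetChecksum are assumed from the question, and the OS file cache will favour whichever loop runs second, so run the test twice and compare warm timings.

var readTimer = System.Diagnostics.Stopwatch.StartNew();
foreach (string filePath in filePaths)
{
    File.ReadAllBytes(filePath);          // IO only
}
readTimer.Stop();

var hashTimer = System.Diagnostics.Stopwatch.StartNew();
foreach (string filePath in filePaths)
{
    GetChecksum(filePath);                // IO + SHA1
}
hashTimer.Stop();

Console.WriteLine("Read only: {0} ms, read + hash: {1} ms",
    readTimer.ElapsedMilliseconds, hashTimer.ElapsedMilliseconds);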


You didn't say whether your operation is CPU bound or IO bound.

With a hash, I would suspect it is CPU bound. If it is CPU bound, you will see the CPU saturated (100% utilized) during the computation of the SHA hashes. If it is IO bound, the CPU will not be saturated.

If it is CPU bound, and you have a multi-CPU or multi-core machine (true for most laptops built in the last 2 years, and almost all servers built since 2002), then you can get an instant increase by using multiple threads and multiple SHA1Managed instances, computing the SHAs in parallel. If it's a dual-core machine, that's 2x. If it's a dual-core 2-CPU machine (a typical server), you'll get 4x throughput.

By the way, when a single-threaded program like yours "saturates" the CPU on a dual-core machine, it will show up as 50% utilization in Windows Task Manager.

You need to manage the workflow through the threads, to keep track of which thread is working on which file. But this isn't hard to do.
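
A minimal sketch of that approach, assuming .NET 4's Parallel.ForEach is available; the localInit/localFinally overload gives each worker thread its own hasher so no SHA1 instance is shared between threads (ParallelHasher is just an illustrative name):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

public static class ParallelHasher
{
    public static ConcurrentDictionary<string, string> GetChecksums(string[] filePaths)
    {
        var checksums = new ConcurrentDictionary<string, string>();

        Parallel.ForEach(
            filePaths,
            () => new SHA1Managed(),             // one hasher per worker thread
            (filePath, loopState, sha1) =>
            {
                using (var fs = File.OpenRead(filePath))
                {
                    checksums[filePath] = BitConverter.ToString(sha1.ComputeHash(fs));
                }
                return sha1;                     // hand the hasher back for reuse
            },
            sha1 => sha1.Dispose());             // dispose once per thread when done

        return checksums;
    }
}

If the earlier read-only test is right and the workload is IO bound on a single spindle, the parallel version may not help much, but on an SSD or a CPU-bound workload it scales roughly with the number of cores.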


Use a "ramdisk" - build a file system in memory.


Have you tried using the SHA1CryptoServiceProvider class instead of SHA1Managed? SHA1CryptoServiceProvider is implemented in native code, not managed code, and was much quicker in my experience. For example:

public static byte[] CreateSHA1Hash(string filePath)
{
    byte[] hash = null;

    using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
    {
        // 128 KB buffer, read-only, shared access.
        using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 131072))
        {
            hash = sha1.ComputeHash(fs);
        }
    }

    return hash;
}

Also, with 15,000 files I would use a file enumerator approach (i.e. the WinAPI functions FindFirstFile() and FindNextFile()) rather than the standard .NET Directory.GetFiles().

Directory.GetFiles loads all file paths into memory in one go, which is often much slower than enumerating the files directory by directory with the WinAPI functions.
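
If .NET 4 is available, Directory.EnumerateFiles gives the same lazy, streaming behaviour as the FindFirstFile/FindNextFile pair without hand-written P/Invoke. A sketch combining it with the CreateSHA1Hash method above (the directory path and search pattern are only examples):

// Lazily enumerate the image folder instead of materialising all 15,000 paths up front.
foreach (string filePath in Directory.EnumerateFiles(@"C:\images", "*.jpg", SearchOption.AllDirectories))
{
    byte[] hash = CreateSHA1Hash(filePath);
    // ...store or compare the hash here...
}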
