Improve performance of sorting files by extension
With a given array of file names, the most simpliest way to sort it by file extension is like this:
Array.Sort(fileNames,
(x, y) => Path.GetExtension(x).CompareTo(Path.GetExtension(y)));
The problem is that on very long list (~800k) it takes very long to sort, while sorting by the whole file name is faster for a couple of seconds!
Theoretical, there is a way to optimize it: instead of using Path.GetExtension()
and compare the newly created extension-only-strings, we can provide a Comparison than compares the existing filename strings starting from the LastIndexOf('.')
without creating new strings.
Now, suppose i found the LastIndexOf('.')
, i want to reuse native .NET's StringComparer and apply it only to the part on string aft开发者_运维百科er the LastIndexOf('.')
, to preserve all culture consideration. Didn't found a way to do that.
Any ideas?
Edit:
With tanascius's idea to use char.CompareTo()
method, i came with my Uber-Fast-File-Extension-Comparer, now it sorting by extension 3x times faster! it even faster than all methods that uses Path.GetExtension()
in some manner. what do you think?
Edit 2:
I found that this implementation do not considering culture since char.CompareTo()
method do not considering culture, so this is not a perfect solution.
Any ideas?
public static int CompareExtensions(string filePath1, string filePath2)
{
if (filePath1 == null && filePath2 == null)
{
return 0;
}
else if (filePath1 == null)
{
return -1;
}
else if (filePath2 == null)
{
return 1;
}
int i = filePath1.LastIndexOf('.');
int j = filePath2.LastIndexOf('.');
if (i == -1)
{
i = filePath1.Length;
}
else
{
i++;
}
if (j == -1)
{
j = filePath2.Length;
}
else
{
j++;
}
for (; i < filePath1.Length && j < filePath2.Length; i++, j++)
{
int compareResults = filePath1[i].CompareTo(filePath2[j]);
if (compareResults != 0)
{
return compareResults;
}
}
if (i >= filePath1.Length && j >= filePath2.Length)
{
return 0;
}
else if (i >= filePath1.Length)
{
return -1;
}
else
{
return 1;
}
}
Create a new array that contains each of the filenames in ext.restofpath
format (or some sort of pair/tuple format that can default sort on the extension without further transformation). Sort that, then convert it back.
This is faster because instead of having to retrieve the extension many times for each element (since you're doing something like N log N
compares), you only do it once (and then move it back once).
Not the most memory efficient but the fastest according to my tests:
SortedDictionary<string, List<string>> dic = new SortedDictionary<string, List<string>>();
foreach (string fileName in fileNames)
{
string extension = Path.GetExtension(fileName);
List<string> list;
if (!dic.TryGetValue(extension, out list))
{
list = new List<string>();
dic.Add(extension, list);
}
list.Add(fileName);
}
string[] arr = dic.Values.SelectMany(v => v).ToArray();
Did a mini benchmark on 800k randomly generated 8.3 filenames:
Sorting items with Linq to Objects... 00:00:04.4592595
Sorting items with SortedDictionary... 00:00:02.4405325
Sorting items with Array.Sort... 00:00:06.6464205
You can write a comparer that compares each character of the extension. char
has a CompareTo()
, too (see here).
Basically you loop until you have no more chars left in at least one string or one CompareTo()
returns a value != 0.
EDIT: In response to the edits of the OP
The performance of your comparer method can be significantly improved. See the following code. Additionally I added the line
string.Compare( filePath1[i].ToString(), filePath2[j].ToString(),
m_CultureInfo, m_CompareOptions );
to enable the use of CultureInfo
and CompareOptions
. However this slows down everything compared to a version using a plain char.CompareTo()
(about factor 2). But, according to my own SO question this seems to be the way to go.
public sealed class ExtensionComparer : IComparer<string>
{
private readonly CultureInfo m_CultureInfo;
private readonly CompareOptions m_CompareOptions;
public ExtensionComparer() : this( CultureInfo.CurrentUICulture, CompareOptions.None ) {}
public ExtensionComparer( CultureInfo cultureInfo, CompareOptions compareOptions )
{
m_CultureInfo = cultureInfo;
m_CompareOptions = compareOptions;
}
public int Compare( string filePath1, string filePath2 )
{
if( filePath1 == null || filePath2 == null )
{
if( filePath1 != null )
{
return 1;
}
if( filePath2 != null )
{
return -1;
}
return 0;
}
var i = filePath1.LastIndexOf( '.' ) + 1;
var j = filePath2.LastIndexOf( '.' ) + 1;
if( i == 0 || j == 0 )
{
if( i != 0 )
{
return 1;
}
return j != 0 ? -1 : 0;
}
while( true )
{
if( i == filePath1.Length || j == filePath2.Length )
{
if( i != filePath1.Length )
{
return 1;
}
return j != filePath2.Length ? -1 : 0;
}
var compareResults = string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), m_CultureInfo, m_CompareOptions );
//var compareResults = filePath1[i].CompareTo( filePath2[j] );
if( compareResults != 0 )
{
return compareResults;
}
i++;
j++;
}
}
}
Usage:
fileNames1.Sort( new ExtensionComparer( CultureInfo.GetCultureInfo( "sv-SE" ),
CompareOptions.StringSort ) );
the main problem here is that you are calling Path.GetExtension multiple times for each path. if this is doing a quicksort then you could expect Path.GetExtension to be called anywhere from log(n) to n times where n is the number of items in your list for each item in the list. So you are going to want to cache the calls to Path.GetExtension.
if you were using linq i would suggest something like this:
filenames.Select(n => new {name=n, ext=Path.GetExtension(n)})
.OrderBy(t => t.ext).ToArray();
this ensures that Path.GetExtension is only called once for each filename.
精彩评论