Help understanding C# optimization
I was playing with C# and wanted to speed up a program. I made changes and was able to do so. However, I need help understanding why the change made it faster.
I've attempted to reduce the code to something easier to understand in a question. Score1 and Report1 are the slower way; Score2 and Report2 are the faster way. The first method stores a string and an int in a struct, in parallel. Then, in a serial loop, it walks an array of those structs and writes their data to a buffer. The second method writes the data to a string buffer in parallel, then, in a serial loop, writes that string data to a buffer. Here are some sample run times:
Run 1 Total Average Time = 0.492087 sec
Run 2 Total Average Time = 0.273619 sec
When I was working with an earlier non-parallel version of this, the times were almost the same. Why the difference with the parallel version?
Even if I reduce the loop in Report1 to write a single line of output to the buffer, it is still slower (total time about 0.42 sec).
Here is the simplified code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Threading.Tasks;
using System.IO;
namespace OptimizationQuestion
{
class Program
{
struct ValidWord
{
public string word;
public int score;
}
ValidWord[] valid;
StringBuilder output;
int total;
public void Score1(string[] words)
{
valid = new ValidWord[words.Length];
for (int i = 0; i < words.Length; i++)
{
StringBuilder builder = new StringBuilder();
foreach (char c in words[i])
{
if (c != 'U')
builder.Append(c);
}
if (words[i].Length == 3)
{
valid[i] = new ValidWord
{ word = builder.ToString(), score = words[i].Length };
}
}
}
public void Report1(StringBuilder outputBuffer)
{
int total = 0;
foreach (ValidWord wordInfo in valid)
{
if (wordInfo.score > 0)
{
outputBuffer.AppendLine(String.Format("{0} {1}", wordInfo.word.ToString(), wordInfo.score));
total += wordInfo.score;
}
}
outputBuffer.AppendLine(string.Format("Total = {0}", total));
}
public void Score2(string[] words)
{
output = new StringBuilder();
total = 0;
for (int i = 0; i < words.Length; i++)
{
StringBuilder builder = new StringBuilder();
foreach (char c in words[i])
{
if (c != 'U')
builder.Append(c);
}
if (words[i].Length == 3)
{
output.AppendLine(String.Format("{0} {1}", builder.ToString(), words[i].Length));
total += words[i].Length;
}
}
}
public void Report2(StringBuilder outputBuffer)
{
outputBuffer.Append(output.ToString());
outputBuffer.AppendLine(string.Format("Total = {0}", total));
}
static void Main(string[] args)
{
Program[] program = new Program[100];
for (int i = 0; i < program.Length; i++)
program[i] = new Program();
string[] words = File.ReadAllLines("words.txt");
Stopwatch stopwatch = new Stopwatch();
const int TIMING_REPETITIONS = 20;
double averageTime1 = 0.0;
StringBuilder output = new StringBuilder();
for (int i = 0; i < TIMING_REPETITIONS; ++i)
{
stopwatch.Reset();
stopwatch.Start();
output.Clear();
Parallel.ForEach<Program>(program, p =>
{
p.Score1(words);
});
for (int k = 0; k < program.Length; k++)
program[k].Report1(output);
stopwatch.Stop();
averageTime1 += stopwatch.Elapsed.TotalSeconds;
GC.Collect();
}
averageTime1 /= (double)TIMING_REPETITIONS;
Console.WriteLine(string.Format("Run 1 Total Average Time = {0:0.000000} sec", averageTime1));
double averageTime2 = 0.0;
for (int i = 0; i < TIMING_REPETITIONS; ++i)
{
stopwatch.Reset();
stopwatch.Start();
output.Clear();
Parallel.ForEach<Program>(program, p =>
{
p.Score2(words);
});
for (int k = 0; k < program.Length; k++)
program[k].Report2(output);
stopwatch.Stop();
averageTime2 += stopwatch.Elapsed.TotalSeconds;
GC.Collect();
}
averageTime2 /= (double)TIMING_REPETITIONS;
Console.WriteLine(string.Format("Run 2 Total Average Time = {0:0.000000} sec", averageTime2));
Console.ReadLine();
}
}
}
First of all, you are parallelizing the repeated runs. This will improve your benchmark time, but may not help out your real production run time very much. To accurately measure how long it will take to actually run through one word list, you need to have exactly one word list going at a time. Otherwise, the individual threads processing all the lists are competing with each other to some extent for system resources and the time per list suffers, even if the time to do all the lists in total is faster.
To speed up the time to process one word list, you want to process the individual words in the list in parallel, for exactly one list at a time. To get enough duration for a good measurement, either make the list very long or process the list many times in serial.
In your case, this gets a bit tricky because the StringBuilder needed for your final output is not documented as being thread-safe. It's not that bad, though. Here's an example of calling Parallel.ForEach for a single word list:
var locker = new Object(); // I'd actually make this static, but it ends up in a closure and so still works
var OutputBuffer = new StringBuilder(); // you can improve things further by estimating the final size and pre-allocating all the memory up front
int score = 0;
Parallel.ForEach(words, w =>
{
    if (w.Length != 3) return; // keep the original three-letter filter
    // We want to push as much of the work to the individual threads as possible.
    // Run in 1 thread, a StringBuilder per word would be bad.
    // Run in parallel, it lets us do a little more of the work outside of locked code.
    var buf = new StringBuilder(w.Length + 5);
    string line = buf.Append(w.Where(c => c != 'U').ToArray()).Append(' ').Append(w.Length).ToString();
    lock (locker)
    {
        OutputBuffer.AppendLine(line);
        score += w.Length;
    }
});
OutputBuffer.Append("Total = ").Append(score);
Just call that 20 times in a normal, sequentially processed for loop. Again, it might finish the benchmark a little slower, but I think it will perform a little faster in the real world because of the flaw in your benchmark. Also note that I typed this right into the reply window; I've never even tried to compile it, so it's not likely to be perfect right out of the gate.
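For completeness, the outer loop might look like this (ProcessWordList is a hypothetical wrapper around the Parallel.ForEach snippet above; Stopwatch comes from System.Diagnostics):
var stopwatch = Stopwatch.StartNew();
const int Reps = 20;
for (int i = 0; i < Reps; i++)
    ProcessWordList(words); // hypothetical wrapper around the parallel block
stopwatch.Stop();
Console.WriteLine("Average = {0:0.000000} sec", stopwatch.Elapsed.TotalSeconds / Reps);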
After fixing your benchmark to more accurately reflect how parallel code will impact your real-world processing time, the next step is to do some profiling to see where your program is actually spending its time. That is how you know which areas to look at for improvement.
Out of curiosity, I'd also like to know how this version performs:
var agg = words.Where(w => w.Length == 3)
               .Select(w => new { Word = new string(w.Where(c => c != 'U').ToArray()), Score = w.Length })
               .Aggregate(new { Score = 0, OutputBuffer = new StringBuilder() },
                          (a, x) => new { Score = a.Score + x.Score,
                                          OutputBuffer = a.OutputBuffer.AppendFormat("{0} {1}\n", x.Word, x.Score) });
agg.OutputBuffer.Append("Total = ").Append(agg.Score);
If performance is the primary concern, a struct should typically be no larger than a pointer; otherwise the overhead of passing it around grows (it is passed by value) and exceeds what it would cost to pass a reference. Microsoft's guideline is that anything under 16 bytes performs better as a struct, provided reference-type semantics aren't needed. Your struct contains a reference and an int, making it larger than a pointer, so you are paying that copying overhead.
See the When to use structs section of this article.
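If you want to verify the size, a quick probe (an assumption on my part: Unsafe.SizeOf needs the System.Runtime.CompilerServices.Unsafe package and wasn't available on older frameworks):
// Prints 16 on a 64-bit runtime: an 8-byte reference plus a 4-byte int, padded,
// which sits right at the edge of the 16-byte guideline.
Console.WriteLine(Unsafe.SizeOf<ValidWord>());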
I tried running it through a profiler, but I don't trust the results I got (Run1 takes less time than Run2 there), so there aren't any concrete answers here. My suspicion is that the valid[] array is the culprit:
That's a potentially large memory allocation that Run1 is doing and Run2 isn't. Allocating big chunks of memory can be time-consuming.
It's possible that array ends up far from the rest of the working data in physical memory. At the very least, it's big enough to land on the large object heap, whereas almost everything else here ends up on the stack or the small object heap. That might mean the Score1 function deals with more cache misses than Score2.
It might be a much smaller issue in the serial code, where this only happens once at any given time. When it happens across many threads simultaneously, though, the problem can compound: what originally cost an extra cache miss or two may now cause page faults.
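To put rough numbers on that (assumptions: a 64-bit runtime where each ValidWord is an 8-byte reference plus an int padded to 16 bytes, and an example list size borrowed from the 14,000 mentioned elsewhere on this page):
const int structSize = 16;      // assumed: 8-byte reference + 4-byte int + padding
const int lohThreshold = 85000; // .NET puts arrays over 85,000 bytes on the LOH
int wordCount = 14000;          // example list size only
Console.WriteLine(wordCount * structSize >= lohThreshold); // True: 224,000 bytes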
There is a post on CodeProject that helps answer that:
http://www.codeproject.com/KB/cs/foreach.aspx
There you will see that the generated code is slightly different, so over a long list you will lose a few cycles on those extra instructions, and that changes the final time.
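For illustration, these are the two loop shapes the article compares (for plain arrays the JIT usually generates equivalent code; for other collection types, foreach goes through an enumerator):
int[] data = { 1, 2, 3, 4 };
int sum1 = 0;
for (int i = 0; i < data.Length; i++)  // indexed access
    sum1 += data[i];
int sum2 = 0;
foreach (int x in data)                // compiles to the same loop for arrays
    sum2 += x;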
Well, I've just browsed over your code, and my first thought is allocation cost. In Score1 you perform a new memory allocation on every iteration:
valid[i] = new ValidWord
which in turn makes the runtime find free memory and either initialise it in place or allocate a new block, set up the values, and copy the newly created block over the original location (I forget which; that's not the point, though).
The point I'm trying to make is that you are now requiring the application to perform 14,000 memory read/write operations, each of which takes some (micro)seconds, and whenever new memory is allocated it must find a correctly sized free section.
Code performance analysis is a rather wide topic, and I guess only embedded programmers truly use it on a daily basis. Just remember that every statement you write has operations linked to it. Reading from a Vector<bool> versus a Vector<int>, for instance: the bool read is slower because the values are packed into bits that have to be extracted, whereas the int can be retrieved as a whole word of memory.
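As a rough .NET illustration of that point (using BitArray as the packed-bits analogue of Vector<bool>; needs using System.Collections):
var bits = new BitArray(1024);  // 32 bools packed per int
bool b = bits[100];             // read = index math + shift + mask
var ints = new int[1024];
int v = ints[100];              // read = one aligned load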
Well, that's my 2 cents' worth; I hope it gives you a better idea of what to look for. I have a good book at home on how to analyse your code line by line and the processing time each line costs. I'll see if I can get hold of it (I recently moved) and update this with the title for you.
What is the goal of the program? Score1 and Score2 don't tell us anything about what the algorithms are trying to do. It looks like it treats any three-letter word, with all capital 'U' characters removed, as a valid word to be added to the list?
What is the point of calling Parallel.ForEach on a bunch of Program instances when each one is passed exactly the same input? And creating a new StringBuilder for every word is not a good approach: you want to minimize new allocations in performance-critical areas to reduce how often the GC has to kick in. Something like the sketch below would help.
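A possible shape for that, untested: reuse one builder via Clear(), which keeps the internal buffer alive between words instead of allocating a fresh one each time.
var builder = new StringBuilder(16);    // one builder, reused for every word
foreach (string w in words)
{
    builder.Clear();                    // keeps the allocated buffer
    foreach (char c in w)
        if (c != 'U') builder.Append(c);
    // ... use builder.ToString() as before
}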
I ran your program on the text file: http://introcs.cs.princeton.edu/data/words.txt
- Run 1 Total Average Time = 0.160149 sec
- Run 2 Total Average Time = 0.056846 sec
Running it under the VS 2010 Sampling profiler shows that Report1 is roughly 78x slower than Report2 and accounts for most of the difference. Mainly due to all the string.Format and Append calls.
Score1 and Score2 are roughly the same in speed with Score1 going slightly slower because of additional time in StringBuilder.ctor and clr.dll.
But I suspect your algorithm could be rewritten without all the StringBuilders or allocations to be an order of magnitude faster. A rough sketch:
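(Untested, and only a hypothetical rewrite: it appends characters straight into one pre-sized output buffer instead of using string.Format and a per-word builder.)
var output = new StringBuilder(words.Length * 8);   // rough capacity guess up front
int total = 0;
foreach (string w in words)
{
    if (w.Length != 3) continue;                    // same filter as the original
    foreach (char c in w)
        if (c != 'U') output.Append(c);             // no per-word builder or Format call
    output.Append(' ').Append(w.Length).AppendLine();
    total += w.Length;
}
output.Append("Total = ").Append(total);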
Just an idea: I haven't made any measurements, but take this line for instance:
foreach (char c in words[i])
I think it would be better to create a temporary variable for the current word. Also, iterating a string through its enumerator might be slower than indexing it.
The code would become something like that:
var currentWord = words[i];
for (int j = 0; j < currentWord.Length; j++)
{
    char c = currentWord[j]; // note: indexed with j, not i
    // ...
}
The new allocations might also be a performance issue, as someone already pointed out. Like I said in my comment, adding more profiling data would help pinpoint exactly what's going on. Or have a look at the generated code.