Multi-part question about multi-threading, locks and multi-core processors (multi ^ 3)
I have a program with two methods. The first method takes two arrays as parameters, and performs an operation in which values from one array are conditionally written into the other, like so:
void Blend(int[] dest, int[] src, int offset)
{
    for (int i = 0; i < src.Length; i++)
    {
        int rdr = dest[i + offset];
        dest[i + offset] = src[i] > rdr? src[i] : rdr; 
    }
}
The second method creates two separate sets of int arrays and iterates through them such that each array of one set is Blended with each array from the other set, like so:
void CrossBlend()
{
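    int[][] set1 = new int[150][]; // 150 destination arrays of 75000 ints each
    for (int i = 0; i < set1.Length; i++) set1[i] = new int[75000];
    int[][] set2 = new int[25][]; // 25 source arrays of 10000 ints each
    for (int i = 0; i < set2.Length; i++) set2[i] = new int[10000];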
    for (int i1 = 0; i1 < set1.Length; i1++)
    {
        for (int i2 = 0; i2 < set2.Length; i2++)
        {
            Blend(set1[i1], set2[i2], 0); // or any offset, doesn't matter
        }
    }
}
First question: Since this approach is an obvious candidate for parallelization, is it intrinsically thread-safe? It seems like the answer is no, since I can conceive of a scenario (unlikely, I think) where one thread's changes are lost because of a different thread's nearly simultaneous operation.
If no, would this:
void Blend(int[] dest, int[] src, int offset)
{
    lock (dest)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr? src[i] : rdr; 
        }
    }
}
be an effective fix?
Second question: If so, what would be the likely performance cost of using locks like this? I assume that with something like this, if a thread attempts to lock a destination array that is currently locked by another thread, the first thread would block until the lock was released instead of continuing to process something.
Also, how much time does it actually take to acquire a lock? Nanosecond scale, or worse than that? Would this be a major issue in something like this?
Third question: How would I best approach this problem in a multi-threaded way that would take advantage of multi-core processors (and this is based on the potentially wrong assumption that a multi-threaded solution would not speed up this operation on a single core processor)? I'm guessing that I would want to have one thread running per core, but I don't know if that's true.
The potential contention in CrossBlend is on set1, the destination of the blend. Rather than using a lock, which is expensive relative to the amount of work you are doing, arrange for each thread to work on its own destination. That is, a given destination (the array at some index in set1) is owned by a given task. This is possible because the outcome is independent of the order in which CrossBlend processes the arrays.
Each task should then run just the inner loop in CrossBlend, and the task is parameterized with the index (or range of indices) of the dest array in set1 to use.
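As a rough illustration of the one-task-per-destination idea, here is a minimal sketch that uses Parallel.For from the Task Parallel Library (.NET 4 or later). The class name CrossBlendParallel and the choice of Parallel.For are assumptions made for this example; the answer does not prescribe a particular API. Blend is copied from the question, and the arrays are passed in as parameters so the sketch can be tried on small inputs.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical wrapper class, introduced only for this sketch.
static class CrossBlendParallel
{
    // Same logic as Blend in the question: keep the larger of the two values.
    static void Blend(int[] dest, int[] src, int offset)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr ? src[i] : rdr;
        }
    }

    // Each parallel iteration owns exactly one destination array (set1[i1]),
    // so no two threads write to the same array and no lock is needed.
    // The source arrays in set2 are only read, which is safe to share.
    public static void CrossBlend(int[][] set1, int[][] set2)
    {
        Parallel.For(0, set1.Length, i1 =>
        {
            for (int i2 = 0; i2 < set2.Length; i2++)
            {
                Blend(set1[i1], set2[i2], 0);
            }
        });
    }
}
```

Only the outer loop is parallelized here, so the inner loop over set2 still runs sequentially for each destination, which is what keeps the writes to a given destination on a single thread.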
You can also parallelize the Blend method, since each index is computed independently of the others, so there is no contention there. But on today's machines, with fewer than 40 cores, you will get sufficient parallelism just by threading the CrossBlend method.
To run effectively on multi-core you can either
- for N cores, divide the problem into N parts. Given that set1 is reasonably large compared to the number of cores, you could just divide set1 into N ranges, and pass each range of indices into N threads running the inner CrossBlend loop. That will give you fairly good parallelism, but it's not optimal. (Some threads will finish sooner and end up with no work to do.)
- A more involved scheme is to make each iteration of the CrossBlend inner loop a separate task. Have N queues (for N cores), and distribute the tasks amongst the queues. Start N threads, with each thread reading its tasks from a queue. If a thread's queue becomes empty, it takes a task from some other thread's queue.
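The first option above (split set1 into N index ranges, one per thread) might look roughly like the following. This is only a sketch: the class name RangePartitionedCrossBlend and the threadCount parameter are made up for the example, threadCount is assumed to be at least 1, and Blend is the method from the question.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper class, introduced only for this sketch.
static class RangePartitionedCrossBlend
{
    // Same logic as Blend in the question.
    static void Blend(int[] dest, int[] src, int offset)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr ? src[i] : rdr;
        }
    }

    // Assumes threadCount >= 1.
    public static void CrossBlend(int[][] set1, int[][] set2, int threadCount)
    {
        var threads = new Thread[threadCount];
        // Ceiling division so every destination index lands in some range.
        int chunk = (set1.Length + threadCount - 1) / threadCount;

        for (int t = 0; t < threadCount; t++)
        {
            // Declared inside the loop so each lambda captures its own copy.
            int start = Math.Min(t * chunk, set1.Length);
            int end = Math.Min(start + chunk, set1.Length);

            threads[t] = new Thread(() =>
            {
                // This thread is the only writer for set1[start..end-1].
                for (int i1 = start; i1 < end; i1++)
                {
                    for (int i2 = 0; i2 < set2.Length; i2++)
                    {
                        Blend(set1[i1], set2[i2], 0);
                    }
                }
            });
            threads[t].Start();
        }

        // Wait for all ranges to finish before returning.
        foreach (var thread in threads)
        {
            thread.Join();
        }
    }
}
```

A call such as RangePartitionedCrossBlend.CrossBlend(set1, set2, Environment.ProcessorCount) would use one thread per logical core. If threadCount exceeds set1.Length, the extra threads simply receive empty ranges.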
The second approach is best suited to irregularly sized tasks, or to cases where the system is also being used for other work, so that some cores are time-slicing between other processes and you cannot expect equal amounts of work to complete in roughly the same time on different cores.
The first approach is much simpler to code, and will give you a good level of parallelism.
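For comparison, the queue-based scheme could be sketched as below. Several things here are assumptions rather than part of the answer: the class name WorkStealingCrossBlend, the use of ConcurrentQueue<int>, and the task granularity (one task is one destination index, so a destination is never written by two threads at once). A production work-stealing scheduler would typically use per-thread deques and steal from the opposite end; this simplified version just drains the other queues in turn.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical wrapper class, introduced only for this sketch.
static class WorkStealingCrossBlend
{
    // Same logic as Blend in the question.
    static void Blend(int[] dest, int[] src, int offset)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr ? src[i] : rdr;
        }
    }

    // Assumes threadCount >= 1.
    public static void CrossBlend(int[][] set1, int[][] set2, int threadCount)
    {
        // One queue per worker; a task is the index of one destination array.
        var queues = new ConcurrentQueue<int>[threadCount];
        for (int q = 0; q < threadCount; q++)
        {
            queues[q] = new ConcurrentQueue<int>();
        }
        // Distribute all tasks round-robin before any worker starts.
        for (int i1 = 0; i1 < set1.Length; i1++)
        {
            queues[i1 % threadCount].Enqueue(i1);
        }

        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            int self = t; // per-iteration copy for the lambda to capture
            threads[t] = new Thread(() =>
            {
                // Drain our own queue first (k == 0), then take tasks from
                // the other queues. No tasks are added after startup, so one
                // pass over all queues is enough.
                for (int k = 0; k < threadCount; k++)
                {
                    var queue = queues[(self + k) % threadCount];
                    int i1;
                    while (queue.TryDequeue(out i1))
                    {
                        for (int i2 = 0; i2 < set2.Length; i2++)
                        {
                            Blend(set1[i1], set2[i2], 0);
                        }
                    }
                }
            });
            threads[t].Start();
        }

        foreach (var thread in threads)
        {
            thread.Join();
        }
    }
}
```

Because each destination index is dequeued exactly once, a thread that finishes early picks up leftover destinations from slower threads without any two threads sharing a destination.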
 