开发者

How can i parallelize this solver code

I a m trying to optimize and parallelize some code that does simulations in graph structures with links and nodes. A main hot spot is a loop like this:

void ExecuteAll()
{
    for(long i = TotalCount(); i >= 1; i--)
    {
        long k    = linkOrder[i];
        long link = firstLink[k];
        if (link == 0)
            continue;

        double d = 0.;
        for(; link != 0; link = nextLink[link])
        {
            long kk  = getNode(link);
            d  += fak[link]*res[kk];
        }
        d += res[k];
        double d2 = d/fak2[k];
        res[k]   = d2;
        res2[k] += d2;
    }
}

I reworked this to work with multiple threads by implementing a class like this:

class myDemoClass
{    
    bool volatile *isDone;
public:

    void ExecuteSlice() 
    {   
        for(long i = TotalCount() - mThreadIndex; i >= 1; i -= threadCount)
        {
            long k = linkOrder[i];
            Execute(k);
        }
    }

    void Execute(long k)
    {
        long link = firstLink[k];
        if (link == 0)
        {
            isDone[k] = true;
            return;
        }
        double d = 0.;
        for(; link != 0; link = nextLink[link])
        {
            long kk  = getNode(link);

            for(int x = 0; ! isDone[kk]; x++)
                {} // Wait until data is ready. Time too short for sleep or event

            d  += fak[link]*res[kk];
        }
        d += res[k];
        double d2 = d/fak2[k];
        res[k]   = d2;
        res2[k] += d2;

        isDone[k] = true;
    }
}

I create a instance of this class for each thread where each thread operates on a slice of all values for i. I introduced a new array bool volatile *isDone, as i must not use results of non processed nodes. I tried to insert a sleep or Event instead of the for(;...;){} loop but it turned out that the wait states are too short for this.

This seems to work fine. Only 10% of all calls to Execute() have to enter the waiting loop as the graph unfolds more and more from the starting point and the results are correct.

But surprisingly there is no measurable performance gain (or loss) on a 8 core XEON machine when usi开发者_如何转开发ng the new code with 8 threads. I know that this code is not optimal in terms of cache invalidation, mainly as isDone an res are written and read from multiple cores. But in most cases the dereferenced indexes are quite far away from each other. The graph has about 1.000.000 Nodes and links.

How can i improve the parallelism of this code?


You can't just use volatile like that to make code thread-safe. Volatile helps with single-threaded apps that map variables to external devices by ensuring the value is always re-read but for threads it is insufficient as not only can statements be reordered by the compiler but instructions can be reordered by the processor. You should use a proper multithreading primitive for this, either supplied by a library, or a compiler specific implementation (eg. Interlocked functions on Win32, or the Atomic Builtins on gcc). Similarly it's not clear that any of your other data structures are safe for multi-threaded modification.

As for performance, it's hard to tell what the problem is because we don't know anything about your graph structure and your code is too abstract to work out much about it. However, you seem to spend a lot of time iterating through links that may or may not have been processed yet. Ideally you'd do that the other way around and start by processing a link that has no dependencies, then when it's done, start on the links that depended on this one, and so on, meaning there is no waiting. Perhaps something like a topological sort would help here.


OpenMP is a de facto standard, well-supported in current compilers, for parallelization of array processing code into structurally identical tasks that you simply wish to run across multiple cores without worrying too much about thread management. Take a look at that, it may help clarify your thinking and may even solve the problem.

This is best used for purely CPU-bound tasks - it's not clear to me if that's the case for you since you refer to waiting for data to arrive? In that case, multithreading the logic may not help as much as you would expect or hope.


A few thoughts which may be useful:

  • To make sure your threading code works I would test by using some computation that is known to parallelize well (nothing immediately comes to mind as an example). Then you can test the single and multi-threaded implementations and make sure your overall threaded application works as expected.
  • Are you sure that your algorithm parallelizes well or even at all? As Steve mentions CPU bound computations are best suited for parallelism while your code looks like a mix of CPU and IO. Depending what getNode() does you may be IO bound which will limit what you can gain by use a multi-threaded algorithm. Profiling and benchmarking your code will help determine where your best gains from optimization can be applied.
  • Don't use volatile for multi-threaded synchronization as the other poster(s) mentioned. It may indeed work for you now but there is no guarantee that it won't break in the future. Consider the worst case scenario of it breaking subtly sometime months from now that slightly corrupts all your simulation results but not enough to be obvious.
  • The for "wait" loop is also suspect and should be changed for a proper threaded wait. When a thread is "waiting" in this loop it is eating up CPU time which could be better put to use by doing real work in another thread. If this wait loop is only being used 10% of the time your gains, if any, would be small but there is no guarantee its usage will always be this small.
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜