Partitioning A Big Calculation?

There's a little script that grabs a bunch of data from a database and does an iterative calculation. About 2500 rows are used in this calculation, so it's not a huge amount, but my boss wants me to partition the calculation anyway (as an exercise).

My general strategy (and I'm just shooting in the dark) is to hit the database, grab the first 50 rows, do each step in the calculation for those 50 rows, store the last row (as the calculation is iterative), grab the next 50 rows from the database and continue this process until all of the rows in the database have been accounted for.

Thoughts on my strategy? Any tips for doing this sort of thing?


One of the first things I learned in programming is that when you don't know how to code something, first write out the process (algorithm) you'd use to solve it yourself, step-by-step, then see how to translate that to code.

Sounds like a good first step for you would be to write out how you would solve the problem on paper--without worrying about partitioning. I know your problem isn't this trivial, but I'm going to use an example of summation.

To find the total of all the records, you would take record0 + record1 + record2 + ... + record2499 = Sum.

With that down, you can then go about seeing if it can be partitioned. For addition, that is easily done because addition is associative. Group up operations, and that's one partition.
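As a rough sketch of that grouping for the summation example only: the class name ChunkedSum and the chunkSize parameter are made up for illustration, and the values are assumed to already be loaded into an array. Each chunk is summed on its own, then the subtotals are added together.

```csharp
using System;

public static class ChunkedSum
{
    // Sums the values one fixed-size chunk at a time, then adds the subtotals.
    // Because addition is associative, the result equals a single pass over all values.
    public static long Sum(int[] values, int chunkSize)
    {
        if (chunkSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        }

        long total = 0;
        for (int start = 0; start < values.Length; start += chunkSize)
        {
            // The last chunk may be shorter than chunkSize.
            int end = Math.Min(start + chunkSize, values.Length);

            long subtotal = 0; // one partition's result
            for (int i = start; i < end; i++)
            {
                subtotal += values[i];
            }

            total += subtotal; // combine step
        }
        return total;
    }
}
```

Calling ChunkedSum.Sum(records, 50) on 2500 records walks 50 chunks. If your real calculation has no combine step as simple as the `total += subtotal` line, that is the part to work out on paper first.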

Now, if you can't find a way to partition the calculation manually, then it's going to be difficult to try to partition it in code.

But, my first step would be to work it out manually, then look for partition possibilities there.


Here is how I would do it.

  • Dedicate one thread for fetching data
  • Dedicate one thread for processing data

And the code might look like this.

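using System.Collections.Concurrent;
using System.Threading;

// Message, Packet and GetDataPacketFromDatabase are placeholders for your own types and data access.
public class Worker
{
  // BlockingCollection<T> is the blocking queue that ships with .NET; Take() waits until an item arrives.
  private BlockingCollection<Message> m_Queue = new BlockingCollection<Message>();

  public void Start()
  {
    var fetcher = new Thread(() => { Fetch(); });
    var processor = new Thread(() => { Process(); });
    fetcher.Start();
    processor.Start();
  }

  public void Fetch()
  {
     while (true)
     {
       // Expected to return null once there are no more rows.
       var packet = GetDataPacketFromDatabase();
       var message = new Message();
       message.Packet = packet;
       // The final message carries a null packet, which tells the processor to stop.
       m_Queue.Add(message);
       if (packet == null)
       {
         break; // Stop if there is nothing left to fetch.
       }
     }
  }

  public void Process()
  {
    while (true)
    {
      Message message = m_Queue.Take();
      if (message.Packet != null)
      {
        Accumulate(message.Packet);
      }
      else
      {
        break; // Stop if there is nothing left to process.
      }
    }
  }

  private void Accumulate(Packet p)
  {
    // Process the packet and accumulate the results.
  }
}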

I should point out that unless you are doing some seriously complex calculations on the returned data (via the Accumulate method in my example) then the processing thread will get starved of work and sit idle for most of the time. I suspect in that case that the whole premise of partitioning and parallelizing processing would wind up being slower than just fetching all 2500 rows at once and processing them serially.


As the calculations don't sound like they are dependent on each other, this is a perfect example of where threading provides benefits. Make N threads that each do the calculations for T/N records, where T is the total record count. Once all the threads have finished, you can do one step to combine all the subtotals generated by each thread.
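Here is a minimal sketch of that idea, using a plain sum as a stand-in for the real calculation. It only applies if the slices really are independent, which an iterative calculation may not be. The class name ParallelSum and the workerCount parameter are invented for illustration.

```csharp
using System;
using System.Threading;

public static class ParallelSum
{
    // Splits the values into workerCount contiguous slices, sums each slice
    // on its own thread, then combines the subtotals in a single final step.
    public static long Sum(int[] values, int workerCount)
    {
        if (workerCount <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(workerCount));
        }

        long[] subtotals = new long[workerCount];
        int sliceSize = (values.Length + workerCount - 1) / workerCount; // T / N, rounded up

        var threads = new Thread[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            int worker = w; // copy the loop variable so each lambda sees its own index
            threads[w] = new Thread(() =>
            {
                int start = Math.Min(worker * sliceSize, values.Length);
                int end = Math.Min(start + sliceSize, values.Length);
                long subtotal = 0;
                for (int i = start; i < end; i++)
                {
                    subtotal += values[i];
                }
                subtotals[worker] = subtotal; // each thread writes only its own slot
            });
            threads[w].Start();
        }

        // Wait for every thread before reading the subtotals.
        foreach (Thread thread in threads)
        {
            thread.Join();
        }

        long total = 0;
        foreach (long subtotal in subtotals)
        {
            total += subtotal;
        }
        return total;
    }
}
```

Each thread writes to its own array slot, so no locking is needed, and Join makes sure all the subtotals are complete before they are combined. For only 2500 rows, the cost of starting threads may well outweigh any gain.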


Without knowing the nature of the calculation, it's hard to say.

When people say partitioning, they usually imply that the data or process can be parallelized: the partitions are independent in some way, so each one can be handled on its own.

Typically, I don't think of 2500 rows as much. For something like this I might use a persisted computed column and handle the calculation in the database, perhaps with a trigger to recalculate when a row changes. Pulling rows out of a database for a calculation is often less efficient than having the database either store the result or calculate it on the fly.


Sounds like a job for database cursors (which may be slow), a while loop, or some other alternative.
