开发者

ExecutorService's surprising performance break-even point --- rules of thumb?

I'm trying to figure out how to correctly use Java's Executors. I realize submitting tasks to an ExecutorService has its own overhead. However, I'm surprised to see it is as high as it is.

My program needs to process huge amount of data (stock market data) with as low latency as possible. Most of the calculations are fairly simple arithmetic operations.

I tried to test something very simple: "Math.random() * Math.random()"

The simplest test runs this computation in a simple loop. The second test does the same computation inside a anonymous Runnable (this is supposed to measure the cost of creating new objects). The third test passes the Runnable to an ExecutorService (this measures the cost of introducing executors).

I ran the tests on my dinky laptop (2 cpus, 1.5 gig ram):

(in milliseconds)
simpleCompuation:47
computationWithObjCreation:62
computationWithObjCreationAndExecutors:422

(about once out of four runs, the first two numbers end up being equal)

Notice that executors take far, far more time than executing on a single thread. The numbers were about the same for thread pool sizes between 1 and 8.

Question: Am I missing something obvious or are these results expected? These results tell me that any task I pass in to an executor must do some non-trivial computation. If I am processing millions of messages, and I need to perform very simple (and cheap) transformations on each message, I still may not be able to use executors...trying to spread computations across multiple CPUs might end up being costlier than just doing them in a single thread. The design decision becomes much more complex than I had originally thought. Any thoughts?


import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecServicePerformance {

 private stat开发者_Go百科ic int count = 100000;

 public static void main(String[] args) throws InterruptedException {

  //warmup
  simpleCompuation();
  computationWithObjCreation();
  computationWithObjCreationAndExecutors();

  long start = System.currentTimeMillis();
  simpleCompuation();
  long stop = System.currentTimeMillis();
  System.out.println("simpleCompuation:"+(stop-start));

  start = System.currentTimeMillis();
  computationWithObjCreation();
  stop = System.currentTimeMillis();
  System.out.println("computationWithObjCreation:"+(stop-start));

  start = System.currentTimeMillis();
  computationWithObjCreationAndExecutors();
  stop = System.currentTimeMillis();
  System.out.println("computationWithObjCreationAndExecutors:"+(stop-start));


 }

 private static void computationWithObjCreation() {
  for(int i=0;i<count;i++){
   new Runnable(){

    @Override
    public void run() {
     double x = Math.random()*Math.random();
    }

   }.run();
  }

 }

 private static void simpleCompuation() {
  for(int i=0;i<count;i++){
   double x = Math.random()*Math.random();
  }

 }

 private static void computationWithObjCreationAndExecutors()
   throws InterruptedException {

  ExecutorService es = Executors.newFixedThreadPool(1);
  for(int i=0;i<count;i++){
   es.submit(new Runnable() {
    @Override
    public void run() {
     double x = Math.random()*Math.random();     
    }
   });
  }
  es.shutdown();
  es.awaitTermination(10, TimeUnit.SECONDS);
 }
}


  1. Using executors is about utilizing CPUs and / or CPU cores, so if you create a thread pool that utilizes the amount of CPUs at best, you have to have as many threads as CPUs / cores.
  2. You are right, creating new objects costs too much. So one way to reduce the expenses is to use batches. If you know the kind and amount of computations to do, you create batches. So think about thousand(s) computations done in one executed task. You create batches for each thread. As soon as the computation is done (java.util.concurrent.Future), you create the next batch. Even the creation of new batches can be done in parralel (4 CPUs -> 3 threads for computation, 1 thread for batch provisioning). In the end, you may end up with more throughput, but with higher memory demands (batches, provisioning).

Edit: I changed your example and I let it run on my little dual-core x200 laptop.

provisioned 2 batches to be executed
simpleCompuation:14
computationWithObjCreation:17
computationWithObjCreationAndExecutors:9

As you see in the source code, I took the batch provisioning and executor lifecycle out of the measurement, too. That's more fair compared to the other two methods.

See the results by yourself...

import java.util.List;
import java.util.Vector;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecServicePerformance {

    private static int count = 100000;

    public static void main( String[] args ) throws InterruptedException {

        final int cpus = Runtime.getRuntime().availableProcessors();

        final ExecutorService es = Executors.newFixedThreadPool( cpus );

        final Vector< Batch > batches = new Vector< Batch >( cpus );

        final int batchComputations = count / cpus;

        for ( int i = 0; i < cpus; i++ ) {
            batches.add( new Batch( batchComputations ) );
        }

        System.out.println( "provisioned " + cpus + " batches to be executed" );

        // warmup
        simpleCompuation();
        computationWithObjCreation();
        computationWithObjCreationAndExecutors( es, batches );

        long start = System.currentTimeMillis();
        simpleCompuation();
        long stop = System.currentTimeMillis();
        System.out.println( "simpleCompuation:" + ( stop - start ) );

        start = System.currentTimeMillis();
        computationWithObjCreation();
        stop = System.currentTimeMillis();
        System.out.println( "computationWithObjCreation:" + ( stop - start ) );

        // Executor

        start = System.currentTimeMillis();
        computationWithObjCreationAndExecutors( es, batches );    
        es.shutdown();
        es.awaitTermination( 10, TimeUnit.SECONDS );
        // Note: Executor#shutdown() and Executor#awaitTermination() requires
        // some extra time. But the result should still be clear.
        stop = System.currentTimeMillis();
        System.out.println( "computationWithObjCreationAndExecutors:"
                + ( stop - start ) );
    }

    private static void computationWithObjCreation() {

        for ( int i = 0; i < count; i++ ) {
            new Runnable() {

                @Override
                public void run() {

                    double x = Math.random() * Math.random();
                }

            }.run();
        }

    }

    private static void simpleCompuation() {

        for ( int i = 0; i < count; i++ ) {
            double x = Math.random() * Math.random();
        }

    }

    private static void computationWithObjCreationAndExecutors(
            ExecutorService es, List< Batch > batches )
            throws InterruptedException {

        for ( Batch batch : batches ) {
            es.submit( batch );
        }

    }

    private static class Batch implements Runnable {

        private final int computations;

        public Batch( final int computations ) {

            this.computations = computations;
        }

        @Override
        public void run() {

            int countdown = computations;
            while ( countdown-- > -1 ) {
                double x = Math.random() * Math.random();
            }
        }
    }
}


This is not a fair test for the thread pool for following reasons,

  1. You are not taking advantage of the pooling at all because you only have 1 thread.
  2. The job is too simple that the pooling overhead can't be justified. A multiplication on a CPU with FPP only takes a few cycles.

Considering following extra steps the thread pool has to do besides object creation and the running the job,

  1. Put the job in the queue
  2. Remove the job from queue
  3. Get the thread from the pool and execute the job
  4. Return the thread to the pool

When you have a real job and multiple threads, the benefit of the thread pool will be apparent.


The 'overhead' you mention is nothing to do with ExecutorService, it is caused by multiple threads synchronizing on Math.random, creating lock contention.

So yes, you are missing something (and the 'correct' answer below is not actually correct).

Here is some Java 8 code to demonstrate 8 threads running a simple function in which there is no lock contention:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleFunction;

import com.google.common.base.Stopwatch;

public class ExecServicePerformance {

    private static final int repetitions = 120;
    private static int totalOperations = 250000;
    private static final int cpus = 8;
    private static final List<Batch> batches = batches(cpus);

    private static DoubleFunction<Double> performanceFunc = (double i) -> {return Math.sin(i * 100000 / Math.PI); };

    public static void main( String[] args ) throws InterruptedException {

        printExecutionTime("Synchronous", ExecServicePerformance::synchronous);
        printExecutionTime("Synchronous batches", ExecServicePerformance::synchronousBatches);
        printExecutionTime("Thread per batch", ExecServicePerformance::asynchronousBatches);
        printExecutionTime("Executor pool", ExecServicePerformance::executorPool);

    }

    private static void printExecutionTime(String msg, Runnable f) throws InterruptedException {
        long time = 0;
        for (int i = 0; i < repetitions; i++) {
            Stopwatch stopwatch = Stopwatch.createStarted();
            f.run(); //remember, this is a single-threaded synchronous execution since there is no explicit new thread
            time += stopwatch.elapsed(TimeUnit.MILLISECONDS);
        }
        System.out.println(msg + " exec time: " + time);
    }    

    private static void synchronous() {
        for ( int i = 0; i < totalOperations; i++ ) {
            performanceFunc.apply(i);
        }
    }

    private static void synchronousBatches() {      
        for ( Batch batch : batches) {
            batch.synchronously();
        }
    }

    private static void asynchronousBatches() {

        CountDownLatch cb = new CountDownLatch(cpus);

        for ( Batch batch : batches) {
            Runnable r = () ->  { batch.synchronously(); cb.countDown(); };
            Thread t = new Thread(r);
            t.start();
        }

        try {
            cb.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }        
    }

    private static void executorPool() {

        final ExecutorService es = Executors.newFixedThreadPool(cpus);

        for ( Batch batch : batches ) {
            Runnable r = () ->  { batch.synchronously(); };
            es.submit(r);
        }

        es.shutdown();

        try {
            es.awaitTermination( 10, TimeUnit.SECONDS );
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } 

    }

    private static List<Batch> batches(final int cpus) {
        List<Batch> list = new ArrayList<Batch>();
        for ( int i = 0; i < cpus; i++ ) {
            list.add( new Batch( totalOperations / cpus ) );
        }
        System.out.println("Batches: " + list.size());
        return list;
    }

    private static class Batch {

        private final int operationsInBatch;

        public Batch( final int ops ) {
            this.operationsInBatch = ops;
        }

        public void synchronously() {
            for ( int i = 0; i < operationsInBatch; i++ ) {
                performanceFunc.apply(i);
            }
        }
    }


}

Result timings for 120 tests of 25k operations (ms):

  • Synchronous exec time: 9956
  • Synchronous batches exec time: 9900
  • Thread per batch exec time: 2176
  • Executor pool exec time: 1922

Winner: Executor Service.


I don't think this is at all realistic since you're creating a new executor service every time you make the method call. Unless you have very strange requirements that seems unrealistic - typically you'd create the service when your app starts up, and then submit jobs to it.

If you try the benchmarking again but initialise the service as a field, once, outside the timing loop; then you'll see the actual overhead of submitting Runnables to the service vs. running them yourself.

But I don't think you've grasped the point fully - Executors aren't meant to be there for efficiency, they're there to make co-ordinating and handing off work to a thread pool simpler. They will always be less efficient than just invoking Runnable.run() yourself (since at the end of the day the executor service still needs to do this, after doing some extra housekeeping beforehand). It's when you are using them from multiple threads needing asynchronous processing, that they really shine.

Also consider that you're looking at the relative time difference of a basically fixed cost (Executor overhead is the same whether your tasks take 1ms or 1hr to run) compared to a very small variable amount (your trivial runnable). If the executor service takes 5ms extra to run a 1ms task, that's not a very favourable figure. If it takes 5ms extra to run a 5 second task (e.g. a non-trivial SQL query), that's completely negligible and entirely worth it.

So to some extent it depends on your situation - if you have an extremely time-critical section, running lots of small tasks, that don't need to be executed in parallel or asynchronously then you'll get nothing from an Executor. If you're processing heavier tasks in parallel and want to respond asynchronously (e.g. a webapp) then Executors are great.

Whether they are the best choice for you depends on your situation, but really you need to try the tests with realistic representative data. I don't think it would be appropriate to draw any conclusions from the tests you've done unless your tasks really are that trivial (and you don't want to reuse the executor instance...).


Math.random() actually synchronizes on a single Random number generator. Calling Math.random() results in significant contention for the number generator. In fact the more threads you have, the slower it's going to be.

From the Math.random() javadoc:

This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator.


Firstly there's a few issues with the microbenchmark. You do a warm up, which is good. However, it is better to run the test multiple times, which should give a feel as to whether it has really warmed up and the variance of the results. It also tends to be better to do the test of each algorithm in separate runs, otherwise you might cause deoptimisation when an algorithm changes.

The task is very small, although I'm not entirely sure how small. So number of times faster is pretty meaningless. In multithreaded situations, it will touch the same volatile locations so threads could cause really bad performance (use a Random instance per thread). Also a run of 47 milliseconds is a bit short.

Certainly going to another thread for a tiny operation is not going to be fast. Split tasks up into bigger sizes if possible. JDK7 looks as if it will have a fork-join framework, which attempts to support fine tasks from divide and conquer algorithms by preferring to execute tasks on the same thread in order, with larger tasks pulled out by idle threads.


Here are results on my machine (OpenJDK 8 on 64-bit Ubuntu 14.0, Thinkpad W530)

simpleCompuation:6
computationWithObjCreation:5
computationWithObjCreationAndExecutors:33

There's certainly overhead. But remember what these numbers are: milliseconds for 100k iterations. In your case, the overhead was about 4 microseconds per iteration. For me, the overhead was about a quarter of a microsecond.

The overhead is synchronization, internal data structures, and possibly a lack of JIT optimization due to complex code paths (certainly more complex than your for loop).

The tasks that you'd actually want to parallelize would be worth it, despite the quarter microsecond overhead.


FYI, this would be a very bad computation to parallelize. I upped the thread to 8 (the number of cores):

simpleCompuation:5
computationWithObjCreation:6
computationWithObjCreationAndExecutors:38

It didn't make it any faster. This is because Math.random() is synchronized.


The Fixed ThreadPool's ultimate porpose is to reuse already created threads. So the performance gains are seen in the lack of the need to recreate a new thread every time a task is submitted. Hence the stop time must be taken inside the submitted task. Just with in the last statement of the run method.


You need to somehow group execution, in order to submit larger portions of computation to each thread (e.g. build groups based on stock symbol). I got best results in similar scenarios by using the Disruptor. It has a very low per-job overhead. Still its important to group jobs, naive round robin usually creates many cache misses.

see http://java-is-the-new-c.blogspot.de/2014/01/comparision-of-different-concurrency.html


In case it is useful to others, here are test results with a realistic scenario - use ExecutorService repeatedly until the end of all tasks - on a Samsung Android device.

 Simple computation (MS): 102
 Use threads (MS): 31049
 Use ExecutorService (MS): 257

Code:

   ExecutorService executorService = Executors.newFixedThreadPool(1);
        int count = 100000;

        //Simple computation
        Instant instant = Instant.now();
        for (int i = 0; i < count; i++) {
            double x = Math.random() * Math.random();
        }
        Duration duration = Duration.between(instant, Instant.now());
        Log.d("ExecutorPerformanceTest", "Simple computation (MS): " + duration.toMillis());


        //Use threads
        instant = Instant.now();
        for (int i = 0; i < count; i++) {
            new Thread(() -> {
                double x = Math.random() * Math.random();
            }
            ).start();
        }
        duration = Duration.between(instant, Instant.now());
        Log.d("ExecutorPerformanceTest", "Use threads (MS): " + duration.toMillis());


        //Use ExecutorService
        instant = Instant.now();
        for (int i = 0; i < count; i++) {
            executorService.execute(() -> {
                        double x = Math.random() * Math.random();
                    }
            );
        }
        duration = Duration.between(instant, Instant.now());
        Log.d("ExecutorPerformanceTest", "Use ExecutorService (MS): " + duration.toMillis());


I've faced a similar problem, but Math.random() was not the issue.
The problem is having many small tasks that take just a few milliseconds to complete. It is not much but a lot of small tasks in series ends up being a lot of time and I needed to parallelize.

So, the solution I found, and it might work for those of you facing this same problem: do not use any of the executor services. Instead create your own long living Threads and feed them tasks.

Here is an example, just as an idea don't try to copy paste it cause it probably won't work as I am using Kotlin and translating to Java in my head. The concept is what's important:

First, the Thread, a Thread that can execute a task and then continue there waiting for the next one:

public class Worker extends Thread {
  private Callable task;
  private Semaphore semaphore;
  private CountDownLatch latch;

  public Worker(Semaphore semaphore) {
    this.semaphore = semaphore;
  }

  public void run() {
    while (true) {
      semaphore.acquire(); // this will block, the while(true) won't go crazy
      if (task == null) continue;

      task.run();
      if (latch != null) latch.countDown();

      task = null;
    }
  }

  public void setTask(Callable task) {
    this.task = task;
  }

  public void setCountDownLatch(CountDownLatch latch) {
    this.latch = latch;
  }
}

There is two things here that need explanation:

  • the Semaphore: gives you control over how many tasks and when they are executed by this thread
  • the CountDownLatch: is the way to notify someone else that a task was completed

So this is how you would use this Worker, first just a simple example:

  Semaphore semaphore = new Semaphore(0); // initially the semaphore is closed
  Worker worker = new Worker(semaphore);
  worker.start();

  worker.setTask( .. your callable task .. );
  semaphore.release(); // this will allow one task to be processed by the worker

Now a more complicated example, with two Threads and waiting for both to complete using the CountDownLatch:

  Semaphore semaphore1 = new Semaphore(0);
  Worker worker1 = new Worker(semaphore1);
  worker1.start();

  Semaphore semaphore2 = new Semaphore(0);
  Worker worker2 = new Worker(semaphore2);
  worker2.start();

  // same countdown latch for both workers, with a counter of 2
  CountDownLatch countDownLatch = new CountDownLatch(2);
  worker1.setCountDownLatch(countDownLatch);
  worker2.setCountDownLatch(countDownLatch);

  worker1.setTask( .. your callable task .. );
  worker2.setTask( .. your callable task .. );
  semaphore1.release();
  semaphore2.release();

  countDownLatch.await(); // this will block until 2 tasks have been completed

And after that code runs you could just add more tasks to the same threads and reuse them. That's the whole point of this, reusing the threads instead of creating new ones.

It is unpolished as f*** but hopefully this gives you an idea. For me this was an improvement compared to no multi threading. And it was much much better than any executor service with any number of threads in the pool by far.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜