How can I get linear speed-up with more than 2 threads in my code?
I'm working on some OpenMP code. The code has to print to a file all prime numbers between 2 and 1000000. The serial algorithm takes 150 s to finish all the computation; with two threads
export OMP_NUM_THREADS=2
the code runs in 81 s (that means a speed-up of 1.85). But beyond 2 threads
export OMP_NUM_THREADS=3 (or 4)
the speed-up doesn't change: it stays at ~1.8.
I've also changed the scheduling, without any effect.
Here is my code, primes.cpp. You can copy and paste it into your editor and compile it with the following command:
~$ g++ primes.cpp -o primes -fopenmp
change the number of threads to 2 (or whatever you like)
~$ export OMP_NUM_THREADS=2
change the loop scheduling (static, dynamic, guided)
~$ export OMP_SCHEDULE=dynamic,100000
and run it with
~$ ./primes
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
#include <algorithm>
#include <time.h>
#include <omp.h>

#define SIZE 1000000

using namespace std;

int main(){
    // writes the list of prime numbers between 0 and SIZE to a file
    // variables
    int cprime;
    int chunk;
    int lap, loop, i;
    int isprime;
    int count;
    FILE * file;
    char * filename;
    time_t t1;
    vector<int> primelist;
    int thread_num;
    //omp_sched_t schedule;

    // initialisation
    t1 = time(NULL);
    chunk = 100000;
    count = 0;
    filename = (char *) malloc(sizeof(char)*100);
    strcpy(filename, "primes.txt");
    file = fopen(filename, "w");

    // ------------- ALGORITHM ---------------
    #pragma omp parallel private(thread_num)
    {
        thread_num = omp_get_thread_num();
        if(thread_num == 0)
            printf("%d processors are available for work\n", omp_get_num_threads());
        #pragma omp barrier
        #pragma omp critical
        {
            printf("I'm processor %d, ready for work\n", thread_num);
        }
    }

    #pragma omp parallel for private(cprime, loop, isprime) schedule(runtime) shared(primelist) reduction(+:count)
    for(cprime = 2; cprime < SIZE; cprime++){
        loop = 1;
        isprime = 1;
        // trial division: check whether cprime is a prime number
        while((++loop < cprime) && isprime){
            if(cprime % loop == 0) isprime = 0;
        }
        if(isprime) {
            #pragma omp critical
            {
                primelist.push_back(loop); // here loop == cprime
            }
            count++;
        }
        #pragma omp critical
        {
            if(cprime % chunk == 0)
                printf("Indicator from thread %d current(size N) : %d\n", omp_get_thread_num(), cprime);
        }
    }

    sort(primelist.begin(), primelist.end());
    lap = primelist.size();
    for(i = 0; i < lap; i++)
        fprintf(file, "%d\n", primelist[i]);
    fclose(file);

    printf("%d primes were discovered between 0 and %d, duration of the operation: %d s\n",
           count, SIZE, (int) difftime(time(NULL), t1));
    return 0;
}
Runtime environment information:
My computer has 4 processors. I've verified this in /proc/cpuinfo, where the entries go from processor : 0 to processor : 3, and all are Intel(R) Core(TM) i5 CPU M 600 @ 2.53 GHz.
Thanks for any reply
Check the CPU on the computer you're running on. If it doesn't have more than 2 physical cores, you're unlikely to see much improvement beyond two threads.
Be careful to account for hyper-threaded CPUs, which the OS presents as having twice as many cores as they really do.
The first thing I'd do, blindly, is use an OpenMP profiler, like the one in
http://www.vi-hps.org/datapool/page/18/fuerlinger.pdf
to figure out whether something is wrong with the parallelism. It may be that you are seriously contending on the push_back in the middle of the loop, and that takes time. Or perhaps the for loop is not properly parallelized, though a quick glance doesn't show me anything inherently wrong with it.
Next, remember to measure your code against the fastest known serial implementation. There is one in Knuth's TAOCP, based on a sieve, which is hard to beat with a parallel algorithm.
First of all, you should never expect linear speed-up from a trivial implementation. Only in very few cases will parallel implementations scale linearly for any number of cores.
But there are also a few problems with your code and with the way you are measuring your runtime. Both might give you the impression of a poor speed-up.
About your code: synchronization (in your case, the critical section) always slows down your software significantly. I've run into this problem several times myself, but in contrast to your problem I knew in advance how many elements my vector would hold, so I could resize it first and write elements directly at the right positions instead of appending them. That significantly speeds up the code on many processors. I don't have a similar solution for your problem, though.
There is also a small mistake in your code: your variable count
will not have a predictable value after concurrent updates. It should be updated in a critical section as well (or here you could use an atomic
operation). A much better approach is to make this variable effectively private to the for loop by using a reduction with +
, like this:
#pragma omp parallel for private(cprime, loop, isprime) reduction(+:count) schedule(runtime)
This yields the correct value of count
after the for loop finishes.
One note on schedule(runtime)
in your for
-statement: it means the schedule is read from the OMP_SCHEDULE environment variable when the program starts, so your export
statement is exactly what sets it. If you instead hard-code a different schedule clause in the pragma, the export will be ignored.
And now, here is the problem with timing your application: you are timing the whole application, not just the parallel for loop. Your measurement therefore includes the sequential sort, which restricts the speed-up you can expect. Furthermore, for the initial sequential baseline you should run the OpenMP build with a single thread: even with one thread, OpenMP has a small overhead, so it will be slower than a build without OpenMP. Measured that way, you might well see the expected 2x speed-up with two threads.