How to "multithread" C code
I have a number crunching application written in C. It is kind of a main loop that for each value calls, for increasin开发者_运维技巧g values of "i", a function that performs some calculations. I read about multithreading, and I am considering learning a bit about it, in C. I wonder if somehow general code like mine could be automatically multithreaded and how.
Thanks
P.D. To get an idea about my code, let's say that it is something like this:
main(...)
for(i=0;i<=ntimes;i++)get_result(x[i],y[i],result[i]);
...
void get_result(float x,float y,float result){
result=sqrt(log (x) + log (y) + cos (exp (x + y));
(and some more similar mathematical operations)
}
If the task is highly parallelizable and your compiler is modern, you could try OpenMP. http://en.wikipedia.org/wiki/OpenMP
One alternative to multithread your code would be using pthreads ( provides more precise control than OpenMP ).
Assuming x
, y
& result
are global variable arrays,
#include <pthread.h>
...
void *get_result(void *param) // param is a dummy pointer
{
...
}
int main()
{
...
pthread_t *tid = malloc( ntimes * sizeof(pthread_t) );
for( i=0; i<ntimes; i++ )
pthread_create( &tid[i], NULL, get_result, NULL );
... // do some tasks unrelated to result
for( i=0; i<ntimes; i++ )
pthread_join( tid[i], NULL );
...
}
(Compile your code with gcc prog.c -lpthread
)
You should have a look at openMP for this. The C/C++ example on this page is similar to your code: https://computing.llnl.gov/tutorials/openMP/#SECTIONS
#include <omp.h>
#define N 1000
main ()
{
int i;
float a[N], b[N], c[N], d[N];
/* Some initializations */
for (i=0; i < N; i++) {
a[i] = i * 1.5;
b[i] = i + 22.35;
}
#pragma omp parallel shared(a,b,c,d) private(i)
{
#pragma omp sections nowait
{
#pragma omp section
for (i=0; i < N; i++)
c[i] = a[i] + b[i];
#pragma omp section
for (i=0; i < N; i++)
d[i] = a[i] * b[i];
} /* end of sections */
} /* end of parallel section */
}
If you prefer not to use openMP you could use either pthreads or clone/wait directly.
No matter which route you choose you are just dividing up your arrays into chunks which each thread will process. If all of your processing is purely computational (as suggested by your example function) then you should do well to have only as many threads as you have logical processors.
There is some overhead with adding threads to do parallel processing, so make sure that you give each thread enough work to make up for it. Usually you will, but if each thread only ends up with 1 computation to do and the computations aren't that difficult to do then you may actually slow things down. You can always have fewer threads than you have processors if that is the case.
If you do have some IO going on in your work then you may find that having more threads than processors is a win because while one thread may be blocking waiting for some IO to complete another thread can be doing its computations. You have to be careful doing IO to the same file in threads, though.
If you are hoping to provide concurrency for a single loop for some kind of scientific computing or similar, OpenMP as @Novikov says really is your best bet; this is what it was designed for.
If you're looking to learn the more classical approach that you would more typically see in an application written in C... On POSIX you want pthread_create()
et al. I'm not sure what your background might be with concurrency in other languages, but before going too deeply into that, you will want to know your synchronization primitives (mutexes, semaphores, etc.) fairly well, as well as understanding when you will need to use them. That topic could be a whole book or set of SO questions unto itself.
C11 threads in glibc 2.28.
Tested in Ubuntu 18.04 (glibc 2.27) by compiling glibc from source: Multiple glibc libraries on a single host
Example from: https://en.cppreference.com/w/c/language/atomic
#include <stdio.h>
#include <threads.h>
#include <stdatomic.h>
atomic_int acnt;
int cnt;
int f(void* thr_data)
{
for(int n = 0; n < 1000; ++n) {
++cnt;
++acnt;
// for this example, relaxed memory order is sufficient, e.g.
// atomic_fetch_add_explicit(&acnt, 1, memory_order_relaxed);
}
return 0;
}
int main(void)
{
thrd_t thr[10];
for(int n = 0; n < 10; ++n)
thrd_create(&thr[n], f, NULL);
for(int n = 0; n < 10; ++n)
thrd_join(thr[n], NULL);
printf("The atomic counter is %u\n", acnt);
printf("The non-atomic counter is %u\n", cnt);
}
GitHub upstream.
Compile and run:
gcc -std=c11 main.c -pthread
./a.out
Possible output:
The atomic counter is 10000
The non-atomic counter is 8644
The non-atomic counter is very likely to be smaller than the atomic one due to racy access across threads to the non atomic variable.
TODO: disassemble and see what ++acnt;
compiles to.
POSIX threads
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <pthread.h>
enum CONSTANTS {
NUM_THREADS = 1000,
NUM_ITERS = 1000
};
int global = 0;
int fail = 0;
pthread_mutex_t main_thread_mutex = PTHREAD_MUTEX_INITIALIZER;
void* main_thread(void *arg) {
int i;
for (i = 0; i < NUM_ITERS; ++i) {
if (!fail)
pthread_mutex_lock(&main_thread_mutex);
global++;
if (!fail)
pthread_mutex_unlock(&main_thread_mutex);
}
return NULL;
}
int main(int argc, char **argv) {
pthread_t threads[NUM_THREADS];
int i;
fail = argc > 1;
for (i = 0; i < NUM_THREADS; ++i)
pthread_create(&threads[i], NULL, main_thread, NULL);
for (i = 0; i < NUM_THREADS; ++i)
pthread_join(threads[i], NULL);
assert(global == NUM_THREADS * NUM_ITERS);
return EXIT_SUCCESS;
}
Compile and run:
gcc -std=c99 pthread_mutex.c -pthread
./a.out
./a.out 1
The first run works fine, the second fails due to missing synchronization.
Tested on Ubuntu 18.04. GitHub upstream.
Depending on the OS, you could use posix threads. You could instead implement stack-less multithreading using state machines. There is a really good book entitled "embedded multitasking" by Keith E. Curtis. It's just a neatly crafted set of switch case statements. Works great, I've used it on everything from apple macs, rabbit semiconductor, AVR, PC.
Vali
Intel's C++ compiler is actually capable of automatically paralellizing your code. It's just a compiler switch you need to enable. It doesn't work as well as OpenMP though (ie. it doesn't always succeed or resulting program is slower). From Intel's website: "Auto-parallelization, which is triggered by the -parallel (Linux* OS and Mac OS* X) or /Qparallel (Windows* OS) option, automatically identifies those loop structures that contain parallelism. During compilation, the compiler automatically attempts to deconstruct the code sequences into separate threads for parallel processing. No other effort by the programmer is needed."
a good exercise for learning concurrent programming in any language would be to work on a thread pool implementation.
In this pattern you create some threads in advance. Those threads are treated as an resource. A thread pool object/structure is used to assign user defined task to those threads for execution. When the task is finished you can collect it's results. You can use thread pool as a general purpose design pattern for concurrency.
The main idea could look similar to
#define number_of_threads_to_be_created 42
// create some user defined tasks
Tasks_list_t* task_list_elem = CreateTasks();
// Create the thread pool with 42 tasks
Thpool_handle_t* pool = Create_pool(number_of_threads_to_be_created);
// populate the thread pool with tasks
for ( ; task_list_elem; task_list_elem = task_list_elem->next) {
add_a_task_to_thpool (task_list_elem, pool);
}
// kick start the thread pool
thpool_run (pool);
// Now decide on the mechanism for collecting the results from tasks list.
// Some of the candidates are:
// 1. sleep till all is done (naive)
// 2. pool the tasks in the list for some state variable describing that the task has
// finished. This can work quite well in some situations
// 3. Implement signal/callback mechanism that a task can use to signal that it has
// finished executing.
The mechanism for collecting data from tasks and the amount of threads used in pool should be chosen to reflect your requirements and the capabilities of the hardware and runtime environment.
Also please note that this pattern does not say anything how you should "synchronize" your tasks with each other/outside surroundings. Also error handling can be a bit tricky (example: what to do when one task fails). Those two aspects need to be thought in advance - they can restrict usage of thread pool pattern.
About thread pool:
http://en.wikipedia.org/wiki/Thread_pool_pattern
http://docs.oracle.com/cd/E19253-01/816-5137/ggedn/index.html
A good literature about pthreads to get going:
http://www.advancedlinuxprogramming.com/alp-folder/alp-ch04-threads.pdf
To specifically address the "automatically multithreaded" part of the OP's question:
One really interesting view of how to program parallelism was designed into a language called Cilk Plus invented by MIT and now owned by Intel. To quote Wikipedia, the idea is that
"the programmer should be responsible for exposing the parallelism, identifying elements that can safely be executed in parallel; it should then be left to the run-time environment, particularly the scheduler, to decide during execution how to actually divide the work between processors."
Cilk Plus is a superset of standard C++. It just contains a few extra keywords (_Cilk_spawn
, _Cilk_sync
, and _Cilk_for
) that allow the programmer to tag parts of their program as parallelizable. The programmer does not mandate that any code be run on a new thread, they just allow the lightweight runtime scheduler to spawn a new thread if and only if it is actually the right thing to do under the particular runtime conditions.
To use Cilk Plus, just add its keywords into your code, and build with Intel's C++ compiler.
You can use pthreads to perform multithreading in C. here is a simple example based on pthreads.
#include <pthread.h>
#include <stdio.h>
void *mythread1(); //thread prototype
void *mythread2();
int main(){
pthread_t thread[2];
//starting the thread
pthread_create(&thread[0],NULL,mythread1,NULL);
pthread_create(&thread[1],NULL,mythread2,NULL);
//waiting for completion
pthread_join(thread[0],NULL);
pthread_join(thread[1],NULL);
return 0;
}
//thread definition
void *mythread1(){
int i;
for(i=0;i<5;i++)
printf("Thread 1 Running\n");
}
void *mythread2(){
int i;
for(i=0;i<5;i++)
printf("Thread 2 Running\n");
}
Your code is not automatically multi-threaded by the compiler if that was your question. Please note that the C standards themselves know nothing about multi-threading, since whether you can use multi-threading or not does not depend on the language you use for coding, but on the destination platform you are coding for. Code written in C can run on pretty much anything for that a C compiler exists for. A C compiler even exists for a C64 computer (almost completely ISO-99 conform); however, to support multiple threads, the platform must have an operating system supporting this and usually this means that at least certain CPU functionality must be present. An operating system can do multithreading almost exclusively in software, this will be awfully slow and there won't be memory protection, but it is possible, however even in that case you need at least programmable interrupts.
So how to write multi-threaded C code depends entirely on the operating system of your target platform. There exists POSIX conform systems (OS X, FreeBSD, Linux, etc.) and systems that have their own library for that (Windows). Some systems have more than library for it (e.g. OS X has the POSIX Library, but there is also the Carbon Thread Manager you can use in C (though I think it is rather legacy nowadays).
Of course there exists cross-platform thread libraries and some modern compilers have support for things like OpenMP, where the compiler will automatically build code to create threads on your chosen target platform; but not many compilers do support it and those that do support it are usually not feature complete. Usually you get the widest system support by using POSIX threads, more often called "pthreads". The only major platform not supporting it is Windows and here you can use free 3rd party libraries like this one. Several other ports exists as well (Cygwin has one for sure). If you will have a UI one day of some kind, you may want to use a cross-platform library like wxWidgets or SDL, both offering consistent multi-thread support on all supported platforms.
If an iteration in loop is independent of the ones before it, then there's a very simple approach: try multi-processing, rather than multi-threading.
Say you have 2 cores and ntimes
is 100, then 100/2=50, so create 2 versions of the program where the first iterates from 0 to 49, the other from 50 to 99. Run them both, your cores should be kept quite busy.
This is a very simplistic approach, yet you don't have to mess with thread creation, synchronization, etc
I think all the answers lack a concrete example with implementation of threads across different function, passing parameters and some benchmarks:
// NB: gcc -O3 pthread.c -lpthread && time ./a.out
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <string.h>
#define bool unsigned char
#define true 1
#define false 0
typedef struct my_ptr {
long n;
long i;
} t_my_ptr;
void *sum_primes(void *ptr) {
t_my_ptr *my_ptr = ptr;
if (my_ptr->n < 0 ) // handle misused of function
return (void *)-1;
bool isPrime[my_ptr->i + 1];
memset(isPrime, true, my_ptr->i + 1);
if (my_ptr->n >= 2) { // only one even number can be prime: 2
my_ptr->n += 2;
}
for (long i = 3; i <= my_ptr->i ; i+=2) { // after what only odd numbers can be prime numbers
if (isPrime[i]) {
my_ptr->n += i;
}
for (long j = i * i; j <= my_ptr->i; j+=i*2) // Eratosthenes' Algo, sieve all multiples of current prime, skipping even numbers.
isPrime[j] = false;
}
//printf("%s: %ld\n", __func__, my_ptr->n); // a) if both 'a' and 'b' activated you will notice that both functions are computed asynchronously.
}
void *sum_square(void *ptr) {
t_my_ptr *my_ptr = ptr;
my_ptr->n += (my_ptr->i * my_ptr->i) >> 3;
//printf("%s: %ld\n", __func__, my_ptr->n); // b) if both 'a' and 'b' activated you will notice that both functions are computed asynchronously.
}
void *sum_add_modulo_three(void *ptr) {
t_my_ptr *my_ptr = ptr;
my_ptr->n += my_ptr->i % 3;
}
void *sum_add_modulo_thirteen(void *ptr) {
t_my_ptr *my_ptr = ptr;
my_ptr->n += my_ptr->i % 13;
}
void *sum_add_twice(void *ptr) {
t_my_ptr *my_ptr = ptr;
my_ptr->n += my_ptr->i + my_ptr->i;
}
void *sum_times_five(void *ptr) {
t_my_ptr *my_ptr = ptr;
my_ptr->n += my_ptr->i * 5;
}
void *sum_times_thirteen(void *ptr) {
t_my_ptr *my_ptr = ptr;
my_ptr->n += my_ptr->i * 13;
}
void *sum_times_seventeen(void *ptr) {
t_my_ptr *my_ptr = ptr;
my_ptr->n += my_ptr->i * 17;
}
#define THREADS_NB 8
int main(void)
{
pthread_t thread[THREADS_NB];
void *(*fptr[THREADS_NB]) (void *ptr) = {sum_primes, sum_square,sum_add_modulo_three, \
sum_add_modulo_thirteen, sum_add_twice, sum_times_five, sum_times_thirteen, sum_times_seventeen};
t_my_ptr arg[THREADS_NB];
memset(arg, 0, sizeof(arg));
long iret[THREADS_NB];
for (volatile long i = 0; i < 100000; i++) {
//print_sum_primes(&prime_arg);
//print_sum_square(&square_arg);
for (int j = 0; j < THREADS_NB; j++) {
arg[j].i = i;
//fptr[j](&arg[j]);
pthread_create( &thread[j], NULL, (void *)fptr[j], &arg[j]); // https://man7.org/linux/man-pages/man3/pthread_create.3.html
}
// Wait till threads are complete before main continues. Unless we
// wait we run the risk of executing an exit which will terminate
// the process and all threads before the threads have completed.
for (int j = 0; j < THREADS_NB; j++)
pthread_join(thread[j], NULL);
//printf("Thread 1 returns: %ld\n",iret1); // if we care about the return value
}
for (int j = 0; j < THREADS_NB; j++)
printf("Function %d: %ld\n", j, arg[j].n);
return 0;
}
Output:
Function 0: 15616893616113
Function 1: 41666041650000
Function 2: 99999
Function 3: 599982
Function 4: 9999900000
Function 5: 24999750000
Function 6: 64999350000
Function 7: 84999150000
Conclusion (using 8 threads)
- Without pthread but with optimization flag -O3: 9.2sd
- With pthread and no optimization flag: 31.4sd
- With pthread and optimization flag -O3: 17.8sd
- With pthread and optimization flag -O3 and without pthread_join: 2.0sd. However it doesn't compute the right output since different threads try to access my_ptr->i at the same time.
How comes that multithreading would be slower? It's very simple, initiating a thread is costly in term of cycle, so you have to be sure that your functions are rather complex. This first benchmark is slightly biased as the different functions are very easy to compute.
Conclusion (using 8 threads), replacing the content of each functions with sum_primes (to benchmark the benefits with more advanced computation)
- Without pthread but with auto-vectorization (-O3): 1mn14.4sd
- With pthread but without optimization flags: 2mn18.6sd
- With pthread and with auto-vectorization (-O3): 54.7sd
- With pthread, auto-vectorization and without pthread_join: 2.8sd. However it doesn't compute the right output since different threads try to access my_ptr->i at the same time.
Output:
Function 0: 15616893616113
Function 1: 15616893616113
Function 2: 15616893616113
Function 3: 15616893616113
Function 4: 15616893616113
Function 5: 15616893616113
Function 6: 15616893616113
Function 7: 15616893616113
This is a bit more representative of the real power of multithreading!
Final words
Hence, unless you are multi-threading with complex computation functions, OR in case you don't need to join the threads, it will probably not be worth it due to the cost of initiating threads and also joining them. But again, benchmark it!
Note that auto-vectorization (done through -O3) always yield significant positive results, as there is no cost for using SIMD.
NB2: You can use iret[j] =
to store the result of your thread, they will return 0 on success.
精彩评论