The cost of forking threads: How do I determine the minimum vector size to justify multithreading
I am attempting to calculate dot products of many vector pairs. Each dot product can use multiple threads, but no two or mo开发者_StackOverflow社区re dot products should be done concurrently due to data dependance. What is the minimum vector size here that might justify multithreading? I realize that no one is going to be able to give me an exact answer, so I'm just looking for rough estimates.
To add a little more context to the problem: My program needs to calculate a lot of dot products, but only one dot product can be done at a time.
double serial_dot_product(const double* x, const double* y, size_t length);
I want to rewrite dot product to make it multithreaded. But if length
is too small, using multiple threads doesn't gain anything.
EDIT. Thanks everyone for your input. Looks like the answer depends on too many factor - compiler, CPU, memory, threading library, etc.
The only real way to know is to try it out (and remember that the results may vary depending on what hardware you are running on). You could write a profiling routine to automate the test, and (if you find it's worth it to do so) even have that routine run briefly at your program's startup, to find out what the optimum value is on the current hardware.
#include <sys/time.h>
unsigned long long GetCurrentTime()
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (((unsigned long long)tv.tv_sec)*1000000) + ((unsigned long long)tv.tv_usec);}
}
void FindOptimumMethod()
{
unsigned long long lowestTime = ((unsigned long long)-1);
int bestVectorLength = 0;
int bestNumThreads = 0;
for (int vectorLen=1; vectorLen<100000; vectorLen++)
{
for (int numThreads=1; numThreads<16; numThreads++)
{
unsigned long long startTime = GetCurrentTime();
DoTheCalculation(numThreads, vectorLen);
unsigned long long elapsedTime = GetCurrentTime()-startTime;
if (elapsedTime < lowestTime)
{
lowestTime = elapsedTime;
bestVectorLength = vectorLen;
bestNumThreads = numThreads;
}
}
}
printf("The fastest way is %i threads using a vector length of %i\n", bestNumThreads, bestVectorLength);
}
My experience has been that threads are pretty heavy. Write it so you can pass variable sized blocks of work to the threads. Then you can adjust as necessary.
I would also consider using a library that can use threads or GPU. I would bet a GPU would be exceedingly good at doing dot products.
精彩评论