Questions About Threads
I am new to thread programing and I have a conceptual problem. I am doing matrix multiplication as a project for my class. However, I do it without using threads, and then using threads to compute the scalar product for each cell of the answer matrix, and then once again splitting up the first matrix into proportions so that each thread has a equal portion to compute. My problem is that the scalar product implementation finishes very quickly which is what I expect, but the third implementation doesn't computer the answer much faster than the nonthreaded implementation. For instance, if it were to use 2 threads, it would copute it in roughly half the time because it can work on both halves of the matrix at the same time but that is not the case at all. I feel like there is an issue in the third implementation, I don't think it operates in parallel, the code is below. Can anyone set me straight on this? Not all of the code is relevant to the question but I included it in case the problem is not local. Thanks,
Main Program:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include<fstream>
#include<string>
#include<sstream>
#include <matrix.h>
#include <timer.h>
#include <random_generator2.h>
const float averager=2.0; //used to find the average of the time taken to multiply the matrices.
//Precondition: The matrix has been manipulated in some way and is ready to output the statistics
//Outputs the size of the matrix along with the user elapsed time.
//Postconidition: The stats are outputted to the file that is specified with the number of threads used
//file name example: "Nonparrallel2.dat"
void output(string file, int numThreads , long double time, int n);
//argv[1] = the size of the matrix
//argv[2] = the number of threads to be used.
//argv[3] =
int main(int argc, char* argv[])
{
random_generator rg;
timer t, nonparallel, scalar, variant;
int n, total = 0, numThreads = 0;
long double totalNonP = 0, totalScalar = 0, totalVar = 0;
n = 100;
/*
* check arguments
*/
n = atoi(argv[1]);
n开发者_如何学Python = (n < 1) ? 1 : n;
numThreads = atoi(argv[2]);
/*
* allocated and generate random strings
*/
int** C;
int** A;
int** B;
cout << "**NOW STARTING ANALYSIS FOR " << n << " X " << n << " MATRICES WITH " << numThreads << "!**"<< endl;
for (int timesThrough = 0; timesThrough < averager; timesThrough++)
{
cout << "Creating the matrices." << endl;
t.start();
C = create_matrix(n);
A = create_random_matrix(n, rg);
B = create_random_matrix(n, rg);
t.stop();
cout << "Timer (generate): " << t << endl;
//---------------------------------------------------------Ends non parallel-----------------------------
/*
* run algorithms
*/
cout << "Running non-parallel matrix multiplication: " << endl;
nonparallel.start();
multiply(C, A, B, n);
nonparallel.stop();
//-----------------------------------------Ends non parallel----------------------------------------------
//cout << "The correct matrix" <<endl;
//output_matrix(C, n);
cout << "Timer (multiplication): " << nonparallel << endl;
totalNonP += nonparallel.user();
//D is the transpose of B so that the p_scalarproduct function does not have to be rewritten
int** D = create_matrix(n);
for (int i = 0; i < n; i++)
for(int j = 0; j < n; j++)
D[i][j] = B[j][i];
//---------------------------------------------------Start Threaded Scalar Poduct--------------------------
cout << "Running scalar product in parallel" << endl;
scalar.start();
//Does the scalar product in parallel to multiply the two matrices.
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++){
C[i][j] = 0;
C[i][j] = p_scalarproduct(A[i],D[j],n,numThreads);
}//ends the for loop with j
scalar.stop();
cout << "Timer (scalar product in parallel): " << scalar << endl;
totalScalar += scalar.user();
//---------------------------------------------------Ends Threaded Scalar Poduct------------------------
//---------------------------------------------------Starts Threaded Variant For Loop---------------
cout << "Running the variation on the for loop." << endl;
boost :: thread** thrds;
//create threads and bind to p_variantforloop_t
thrds = new boost::thread*[numThreads];
variant.start();
for (int i = 1; i <= numThreads; i++)
thrds[i-1] = new boost::thread(boost::bind(&p_variantforloop_t,
C, A, B, ((i)*n - n)/numThreads ,(i * n)/numThreads, numThreads, n));
cout << "before join" <<endl;
// join threads
for (int i = 0; i < numThreads; i++)
thrds[i]->join();
variant.stop();
// cleanup
for (int i = 0; i < numThreads; i++)
delete thrds[i];
delete[] thrds;
cout << "Timer (variation of for loop): " << variant <<endl;
totalVar += variant.user();
//---------------------------------------------------Ends Threaded Variant For Loop------------------------
// output_matrix(A, n);
// output_matrix(B, n);
// output_matrix(E,n);
/*
* free allocated storage
*/
cout << "Deleting Storage" <<endl;
delete_matrix(A, n);
delete_matrix(B, n);
delete_matrix(C, n);
delete_matrix(D, n);
//avoids dangling pointers
A = NULL;
B = NULL;
C = NULL;
D = NULL;
}//ends the timesThrough for loop
//output the results to .dat files
output("Nonparallel", numThreads, (totalNonP / averager) , n);
output("Scalar", numThreads, (totalScalar / averager), n);
output("Variant", numThreads, (totalVar / averager), n);
cout << "Nonparallel = " << (totalNonP / averager) << endl;
cout << "Scalar = " << (totalScalar / averager) << endl;
cout << "Variant = " << (totalVar / averager) << endl;
return 0;
}
void output(string file, int numThreads , long double time, int n)
{
ofstream dataFile;
stringstream ss;
ss << numThreads;
file += ss.str();
file += ".dat";
dataFile.open(file.c_str(), ios::app);
if(dataFile.fail())
{
cout << "The output file didn't open." << endl;
exit(1);
}//ends the if statement.
dataFile << n << " " << time << endl;
dataFile.close();
}//ends optimalOutput function
Matrix file:
#include <matrix.h>
#include <stdlib.h>
using namespace std;
int** create_matrix(int n)
{
int** matrix;
if (n < 1)
return 0;
matrix = new int*[n];
for (int i = 0; i < n; i++)
matrix[i] = new int[n];
return matrix;
}
int** create_random_matrix(int n, random_generator& rg)
{
int** matrix;
if (n < 1)
return 0;
matrix = new int*[n];
for (int i = 0; i < n; i++)
{
matrix[i] = new int[n];
for (int j = 0; j < n; j++)
//rg >> matrix[i][j];
matrix[i][j] = rand() % 100;
}
return matrix;
}
void delete_matrix(int** matrix, int n)
{
for (int i = 0; i < n; i++)
delete[] matrix[i];
delete[] matrix;
//avoids dangling pointers.
matrix = NULL;
}
/*
* non-parallel matrix multiplication
*/
void multiply(int** C, int** A, int** B, int n)
{
if ((C == A) || (C == B))
{
cout << "ERROR: C equals A or B!" << endl;
return;
}
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
{
C[i][j] = 0;
for (int k = 0; k < n; k++)
C[i][j] += A[i][k] * B[k][j];
}
}
void p_scalarproduct_t(int* c, int* a, int* b,
int s, int e, boost::mutex* lock)
{
int tmp;
tmp = 0;
for (int k = s; k < e; k++){
tmp += a[k] * b[k];
//cout << "a[k]= "<<a[k]<<"b[k]= "<< b[k] <<" "<<k<<endl;
}
lock->lock();
*c = *c + tmp;
lock->unlock();
}
int p_scalarproduct(int* a, int* b, int n, int m)
{
int c;
boost::mutex lock;
boost::thread** thrds;
c = 0;
/* create threads and bind to p_merge_sort_t */
thrds = new boost::thread*[m];
for (int i = 0; i < m; i++)
thrds[i] = new boost::thread(boost::bind(&p_scalarproduct_t,
&c, a, b, i*n/m, (i+1)*n/m, &lock));
/* join threads */
for (int i = 0; i < m; i++)
thrds[i]->join();
/* cleanup */
for (int i = 0; i < m; i++)
delete thrds[i];
delete[] thrds;
return c;
}
void output_matrix(int** matrix, int n)
{
cout << "[";
for (int i = 0; i < n; i++)
{
cout << "[ ";
for (int j = 0; j < n; j++)
cout << matrix[i][j] << " ";
cout << "]" << endl;
}
cout << "]" << endl;
}
void p_variantforloop_t(int** C, int** A, int** B, int s, int e, int numThreads, int n)
{
//cout << "s= " <<s<<endl<< "e= " << e << endl;
for(int i = s; i < e; i++)
for(int j = 0; j < n; j++){
C[i][j] = 0;
//cout << "i " << i << " j " << j << endl;
for (int k = 0; k < n; k++){
C[i][j] += A[i][k] * B[k][j];}
}
}//ends the function
My guess is that you're running into False Sharing. Try to use a local variable in p_variantforloop_t
:
void p_variantforloop_t(int** C, int** A, int** B, int s, int e, int numThreads, int n)
{
for(int i = s; i < e; i++)
for(int j = 0; j < n; j++){
int accu = 0;
for (int k = 0; k < n; k++)
accu += A[i][k] * B[k][j];
C[i][j] = accu;
}
}
Based on your responses in the comments, in theory, because you only have a single thread (i.e., CPU) available, all the threaded versions should be the same time as the single-threaded version or longer because of thread management overhead. You shouldn't be seeing any speedup at all since the time slice-taken to solve one part of the matrix is a time-slice that is stolen from another parallel task. With a single CPU you're only time-sharing the CPU's resources--there is no real parallel working going on in a given single slice of time. I would suspect then the reason your second implementation runs faster is because you're doing less pointer dereferencing and memory access in your inner loop. For example, in the main operation C[i][j] += A[i][k] * B[k][j];
from both multiply
and p_variantforloop_t
, you're looking at a lot of operations at the assembly level, many of them memory related. It would look something like the following in "assembly pseudo-code":
1) Move pointer value from address referenced by A
on the stack into register R1
2) Increment the address in register R1
by the value off the stack referenced by the variable i
, j
, or k
3) Move the pointer address value from the address pointed to by R1
into R1
4) Increment the address in R1
by the value off the stack referenced by the variable i
, j
, or k
5) Move the value from the address pointed to by R1
into R1
(so R1
now holds the value of A[i][k]
)
6) Do steps 1-5 for the address referenced by B
on the stack into register R2
(so R2
now holds the value of B[k][j]
)
7) Do steps 1-4 for the address referenced by C
on the stack into register R3
8) Move the value from the address pointed to by R3
into R4
(i.e., R4
holds the actual value at C[i][j]
)
9) Multiply registers R1
and R2
and store in register R5
10) Add registers R4
and R5
and store in R4
11) Move the final value from R4
back into the memory address pointed to by R3
(now C[i][j]
has the final result)
And that's assuming we have 5 general purpose registers to play with, and the compiler properly optimized your C-code to take advantage of them. I left the loop index variables i
, j
, and k
on the stack, so accessing those will take even more time than if they were in registers ... it really depends on how many registers your compiler has to play with on your platform. Additionally, if you compiled without any optimizations, you could be doing a lot more memory access off the stack, where some of these temp values are stored on the stack rather than in registers, and then reaccessed off the stack, which takes a lot longer than moving values between registers. Either way, the code above is a lot harder to optimize. It works, but if you're on a 32-bit x86 platform, then you're not going to have that many general purpose registers to play with (you should have at least 6 though). x86_64 has more registers to play with, but still, there are all the memory accesses to contend with.
On the other-hand an operation like tmp += a[k] * b[k]
from p_scalarproduct_t
in a tight inner loop is going to move MUCH faster ... here is the above operation in assembly pseudo-code:
There would be a small initialization step for the loop
1) Make tmp
a register R1
rather than stack variable, and initialize it's value to 0
2) Move the address value referenced by a
on the stack into R2
3) Add the value of s
off the stack to R2
and save resulting address in R2
4) Move the address value referenced by b
on the stack into R3
5) Add the value of s
off the stack to R3
and save resulting address in R3
6) Setup a counter in R6
initialized to e - s
After the one-time initialization we would begin the actual inner loop
7) Move the value from the address pointed to by R2
into R4
8) Move the value from the address pointed to by R3
into R5
9) Multiply R4
and R5
and store the results in R5
10) Add R5
to R1
and store the results in R1
11) Increment R2
and R3
12) Decrement counter in R6
until it reaches zero, where we terminate loop
I can't guarantee this is exactly how your compiler would setup this loop, but you can see in general with your scalar example there are less steps in the inner loop required, and more importantly less memory accesses. Therefore more can be done with operations that are solely using registers rather than operations that include memory locations and require a memory fetch, which is much slower than register-only operations. So in general it's going to move a lot faster, and that has nothing to-do with threads.
Finally, I notice you only have two nested loops for the scalar product, so it's complexity is O(N^2), where-as for your other two methods you have three nested loops for O(N^3) complexity. That's going to make a difference as well.
精彩评论