Disjoint grids on processor subsets and their communication in Scalapack
In summary, my question is about how to implement a matrix copy between two block-cyclically distributed matrices on two different process grids in Scalapack (BLACS). I'm attempting to implement this with the pdgemr2d_, which I use frequently in other cases where I am copying between two matrices on the same process grid.
Below is a fairly technical discussion of the state of the problem I'm running into. I've got it nailed down to a basic problem, but there doesn't appear to me to be a solution ... though there must be, as Scalapack specifically states that the type of operation I'm trying for is possible. I do not find adequate examples of this anywhere.
An initialization of a 1x1 compute grid in Scalapack running with MPI in C usually goes something like this:
[...]
int NUM_TASKS, RANK;
int MPI_STARTUP = MPI_Init (&argc, &argv);
if (MPI_STARTUP != MPI_SUCCESS)
MPI_Abort (MPI_COMM_WORLD, MPI_STARTUP);
MPI_Comm_size (MPI_COMM_WORLD, &NUM_TASKS);
MPI_Comm_rank (MPI_COMM_WORLD, &RANK);
int CONTEXT;
Cblacs_pinfo (&RANK, &NUM_TASKS);
Cblacs_get (-1, 0, &CONTEXT);
Cblacs_gridinit (&CONTEXT, "Row", 1, 1);
Cblacs_gridinfo (CONTEXT, &NPROW, &NPCOL, &MYPROW, &MYPCOL);
[...]
This code would generate a 1x1 grid regardless of the number of processors MPI knows about ({1, 1}, the size of the grid, is passed to Cblacs_gridinit). Here, CONTEXT indicates to Scalapack functions which grid we're working on (it is possible to use more than one simultaneously, and is generated by Cblacs_get). Cblacs_gridinfo sets NPROW and NPCOL as the number of processor rows and columns ({1, 1} in this case). MYPROW and MYPCOL indicate to each processor which grid block belongs to it. In this case, on a 1x1 grid, only one processor participates and its grid IDs are {0, 0}.
Initialization of a matrix descriptor for a simple block-cyclic distributed 100x100 matrix is typically also simple:
int info;
int desc[9];
int block_size = 32;
int zero = 0; int one = 1;
int num_rows = 100; int num_cols = 100;
int num_cols_local = numroc_ (&num_cols, &block_size, &mypcol, &zero, &npcol);
int num_cols_local_protect = MAX (1, num_cols_local);
int num_rows_local = numroc_ (&num_rows, &block_size, &myprow, &zero, &nprow);
int num_rows_local_protect = MAX (1, num_rows_local);
descinit_ (desc, &num_rows, &num_cols, &block_size, &block_size, &zero, &zero, \
&CONTEXT, &num_rows_local开发者_StackOverflow社区_protect, &info);
/* now allocate array with per-processor size num_cols_local_protect * num_rows_local_protect */
(We will see later why the "protect" variables are necessary, as on some processors num_cols_local or num_rows_local will be returned, quite correctly, as negative integers.)
Most of the above is self explanatory except for the &zeros passed to descinit_, which indicate the processor row on which the first row of the matrix is distributed, and the processor column on which the first column is distributed. These values have very explicit bounds when used in the descinit_ function. From the Fortran function itself,
[...]
ELSE IF( IRSRC.LT.0 .OR. IRSRC.GE.NPROW ) THEN
INFO = -6
ELSE IF( ICSRC.LT.0 .OR. ICSRC.GE.NPCOL ) THEN
INFO = -7
[...]
IRSRC and ICSRC we pass as zero here, since {0,0} is the proper index of our single grid block. Even if the grid were much larger, we would likely still pass {0,0}, as the first processor block would likely store the first row and column values.
When run on one processor, this works very well. The values on the only processor, RANK 0, for NPROW, NPCOL, MYPROW and MYPCOL, are 1, 1, 0, and 0, respectively. CONTEXT in this case is 0, its non-negativity indicating that the grid it refers to is active on this RANK. These values indicate the existence of a 1x1 process grid and the first processor has the correct indicates the correct process grid block belonging to RANK 0. In this case, it is the only block.
When run on two processors, however, things break down, and they shouldn't formally. On the first and second RANK, we have for CONTEXT, NPROW, NPCOL, MYPROW and MYCOL:
RANK 0: 0, 1, 1, 0, 0
RANK 1: -1, -1, -1, -1, -1
All values are negative. Most importantly, CONTEXT on RANK 1 is negative, indicating that this RANK does not participate in our 1x1 processor grid. Calling descinit_ now, immediately becomes a problem on all processors. If we reference the Fortran code from descinit_, we have (repeated from above for clarity):
[...]
ELSE IF( IRSRC.LT.0 .OR. IRSRC.GE.NPROW ) THEN
INFO = -6
ELSE IF( ICSRC.LT.0 .OR. ICSRC.GE.NPCOL ) THEN
INFO = -7
[...]
These restrictions make sense as long as each processor is participating in the grid. The indices cannot be negative or greater than or equal to the total number of rows or columns in the process grid, since such grid blocks don't exist!
On RANK 1 then, IRSRC is passed as zero, but NPROW and NPCOL are returned from initialization of the grid as -1, and therefore descinit_ will always fail.
All of the above can be easily overcome, inelegantly, by simply restricting the initialization of the matrix descriptor and all subsequent operations to processors who participate in the current grid. Something like:
if (CONTEXT > -1) {
[...]
However, I don't have just one processor grid, I have two, and I need them to communicate by means of the pdgemr2d_ function. The purpose of this function is to copy a subset of a distributed matrix A on one grid to a distributed matrix B on another grid. The grids need not be related to each other in any way, and may be partially or completely disjoint. This should be a trivial operation. Say, for example, I want to copy full matrix from a processor grid with context CONTEXT_A to a processor grid with context CONTEXT_B. The descriptors for the matrix in each context are given as desc_A and desc_B.
pdgemr2d_ (&num_rows, &num_cols, matrix_A, &one, &one, \
desc_A, matrix_B, &one, &one, desc_B, &CONTEXT_B);
This is also fairly self explanatory. It must run on all processors which either context has any grid members on. In my case, CONTEXT_A has a grid spanning all processors MPI knows about, and CONTEXT_B is a 1x1 single processor grid.
pdgemr2d_ must be supplied with a context identifier englobing at least all processors included in both CONTEXT_A and CONTEXT_B, and for those processors not belonging to CONTEXT_A or CONTEXT_B, the element desc_A[CTXT] or desc_B[CTXT], respectively, must be set to -1 on that processor.
descinit_, in theory, does this elegantly, since the CONTEXT values returned by Cblacs_gridinit are -1 on any processor not participating in that context's grid. However, descinit_ will not generate the correct matrix descriptor on any processor which does not participate in the grid due to the restriction detailed above for negative values of NPROW and NPCOL.
To do proper disjoint grid communication, such a matrix descriptor must be defined on all processors which participate in either context.
Clearly, pdgemr2d_ cannot have been written with this as an insurmountable flaw, as the function description in the code specifically states:
PDGEMR2D copies a submatrix of A on a submatrix of B. A and B can have different distributions: they can be on different processor grids, they can have different blocksizes, the beginning of the area to be copied can be at a different places on A and B.
Thanks very much for any help, I know this is a fairly specialized question.
I had similar difficulties in trying to figure out how to use PDGEMR2D, and thought I'd share my conclusions here.
In short, don't use the supplied DESCINIT subroutine if you're trying to set up multiple contexts. Its error checking assumes that all processes participate in the context for the array descriptor being initialized, which isn't the case if you're trying to use PDGEMR2D.
You can easily initialize your own descriptor without using DESCINIT, as it's just an integer array of length 9. The first 8 fields (dtype, ctxt, m, n, mb, nb, csrc, and rsrc) are global, and should have the same value on all processes. Only the 9th field (lld) is local, and its value is ignored on processes that are not members of the context in which the array is being defined.
The example program pdgemrdrv.c in the ScaLAPACK sources (online version here) was useful for me when I was trying to figure this stuff out. It contains a lot of unnecessary complication, but you can deduce the following key points:
- Array descriptors should be initialized globally (on both sending and receiving contexts).
- Memory for the arrays need only be allocated on the contexts where the arrays live.
- PDGEMR2D should be called globally (on both sending and receiving contexts). Only processes that are members of either the sending or receiving context will actually participate.
Hope this helps. Cheers.
精彩评论