how to optimize dgemm? No.1


Kazushige Goto (goto@statabo.rim.or.jp)
Mon, 30 Nov 1998 23:19:31 +0900


Recently, some people asked me how to optimize my dgemm routine.
I wrote some information in dgemm_k.S source code, but it's not
enough.

So, I decided to explain some opimization method of Alpha step by
step. Because it takes too long to explain.

First, I make a sample & simple benchmark program. Please try to type
"make" and benchmark show you actual speed with MFlops.

I explain only Non-transposed x Non-transposed routine. If you have
any questions, please tell me.

Originally, BLAS dgemm kernel routine is programmed as follows.
It's really pretty simple, but it takes too long(about 23MFlops,
600MHz 21164).

void dgemm_nn(int m, int n, int k, double alpha, double *a,
              int lda, double *b, int ldb,
              double *c, int ldc){

  int i, j, l;

  for (j = 0; j < n; j++) {
    for (i = 0; i < m; i++) {
      for (l = 0; l < k; l++) {
          c[i + j * ldc] += alpha * b[l + j * ldb] * a[i + l * lda];
        }
    }
  }
}

1. loop exchangeable

This dgemm_nn routine works fine even if any loop is exchanged.

 << 44MFLops >>

  for (j = 0; j < n; j++) {
    for (l = 0; l < k; l++) {
      for (i = 0; i < m; i++) {
          c[i + j * ldc] += alpha * b[l + j * ldb] * a[i + l * lda];
        }
    }
  }

or

 << 40MFLops >>

  for (l = 0; l < k; l++) {
    for (j = 0; j < n; j++) {
      for (i = 0; i < m; i++) {
          c[i + j * ldc] += alpha * b[l + j * ldb] * a[i + l * lda];
        }
    }
  }

or
....

So we can choose a loop to get best performance.

2. Inner dot product

A basic strategy of Alpha(also, all other RISC machines) is to
"calculate innder dot product routine". Because "store" operation
also causes data traffics.

It means, we can choose one of this.

  for (j = 0; j < n; j++) {
    for (i = 0; i < m; i++) {
      for (l = 0; l < k; l++) {
          c[i + j * ldc] += alpha * b[l + j * ldb] * a[i + l * lda];
        }
    }
  }

or

  for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
      for (l = 0; l < k; l++) {
          c[i + j * ldc] += alpha * b[l + j * ldb] * a[i + l * lda];
        }
    }
  }

c[i+j*ldc] is a constant value in most inner loop. So we can change
it to

 << 27MFLops >>

  double temp;

  for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
      temp = 0.0;
      for (l = 0; l < k; l++) {
          temp += b[l + j * ldb] * a[i + l * lda];
        }
      c[i + j * ldc ] += alpha * temp;
    }
  }

or

 << 22MFLops >>

  double temp;

  for (j = 0; j < n; j++) {
    for (i = 0; i < m; i++) {
      temp = 0.0;
      for (l = 0; l < k; l++) {
          temp += b[l + j * ldb] * a[i + l * lda];
        }
      c[i + j * ldc ] += alpha * temp;
    }
  }

It's still slow, but it'll be much faster as optimizing dgemm_nn.

Continue to No.2

Thanks,
  goto@statabo.rim.or.jp




This archive was generated by hypermail 2.0b3 on Mon Nov 30 1998 - 22:10:34 EST