Kazushige Goto (goto@statabo.rim.or.jp)
Sat, 28 Nov 1998 21:53:28 +0900
Hi,
# I'm pleased to organize a mailing list on high-performance computing.
From: Martin Kahlert <martin.kahlert@mchp.siemens.de>
Subject: Questions on performance methods
martin> I would like to help improving the performance of Alpha software, too.
martin> It seems to me, that Mr. Goto is T H E guru of alpha performance.
martin> Hope he's listening.
I can show you how to optimize, but the approach depends entirely on the
source code. Of course, there are some common "rules" for getting faster
code, but applying them alone makes the code only slightly faster than
before (or almost the same).
# I'm not a guru, though :-)
For example,
From: Jonathan L Dubois <dubois@bec.physics.udel.edu>
Subject: request for perf improvement suggestions
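#include <math.h>   /* for sqrt() */

/* Store the Euclidean distance between the 3-component vectors
   x1 and x2 in *rij. */
void r12(double *rij, double *x1, double *x2){
    int i;
    double tmp1, tmp2;
    tmp2 = 0;
    for(i=0;i<3;i++){
        tmp1 = x1[i]-x2[i];
        tmp2 += tmp1*tmp1;   /* accumulate the squared difference */
    }
    tmp2 = sqrt(tmp2);
    *rij = tmp2;
}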
First, you must profile your routines (with the -pg option, you can
check items 1 and 2 easily).
1. Find the bottleneck subroutine.
2. Check whether this bottleneck routine is called a few times
   or many times.
   If it is called many times, the call overhead matters, so
   making it an inline function or a macro is a good way to
   improve performance.
3. Check whether this routine calls other subroutines.
   Alpha has many registers, but the compiler cannot make full use of
   them if the subroutine calls other subroutines (egcs/gcc is not good
   at optimizing such a case).
Obviously, the subroutine "r12" is called many times and calls the sqrt
routine, and the bottleneck is sqrt. Of course, you can modify it as
Mr. Stefan Schroepfer suggested, but it will be only slightly faster
(5 to 10%).
Actually, the sqrt routine is pretty "sparse": because of dependencies
between its calculations, the processor is often left waiting. If it's
possible to collect the values in a temporary array and process them all
at once, you may use my sqrtv (vectorized sqrt) routine. This sqrtv
routine is 5 times faster than the normal sqrt routine. But you showed
only the kernel routine, so I do not know whether it can be made faster
or not.
In this case, what matters is how the r12 routine is called.
So, you should show the routines that call r12.
martin> How good is the scheduling of egcs these days on alpha?
martin> Can I describe something like 'the alpha can do an floating point
martin> add a fp mult and two int ops in one cycle, when they are in a block
martin> starting on an x-Byte boundary' to the egcs-scheduler? Or is this even
martin> implemented?
Now, egcs has a good scheduler, except for complex values, I think. We
need an optimizing preprocessor like KAP (SUIF is freely available, but
it's not good enough).
Thanks,
goto@statabo.rim.or.jp