pshimizu@fa2.so-net.ne.jp
Mon, 30 Nov 1998 03:09:03 +0900
In article <19981129124047T.goto@statabo.rim.or.jp>
goto@statabo.rim.or.jp writes:
>>
>> 3. using vectorlized sqrtiv routine.
>>
>> The sqrti routine is still bottle neck. We can modify to use
>> vectorlized sqrti routine. But it'll be slower if the probability of
>> "rij < bi" is low. If it's much improved(depend on your source
>> parameter), this routine can be faster still more.
>>
In practice, the vectorlized sqrt/sqrti may need a (optional) mask
vector argument. In that case, we can set the mask for
((rijp2 > bp2) && (i != j)), where rijp2 is the 2nd power of rij and
bp2 is the 2nd power of b.
When I disscussed about vectorlised sqrt with Mr. Yamada in the Japanese
mailing list, the prototype routine was coded only to test the
effectiveness of the vectorlised routines. Both I and Mr.Yamada are
familliar with these conditional vectorlising and we intended not to
make the prototype as conditional routine for simplicity.
The masked vector routines will be useful when a vectorlizing
compiler is published to RISC communities.
But the limited memory band width may restrict the performance of the
code when the vector is too sparse. If one has time to tune, some compaction
of the vector with using the dense vector routine will be more effective.
I show the modified Goto's routine below (without debugging).
Naohiko Shimizu
http://shimizu-lab.et.u-tokai.ac.jp/
----------------------------------------------------------------------------
void dsqrtiv(double *, double *, int);
int find_jastrow(double *result,int j,struct coord *q,double r[3],double b){
int i,vcnt;
double tmp, tmp1, tmp2, tmp3;
double tmp4, tmp5, tmp6, tmp7;
double vtmp[8];
double r1, r2, r3;
double bi,bp2;
double vtemp[N+8];
r1 = r[0];
r2 = r[1];
r3 = r[2];
tmp = 1.0;
vcnt = 0;
bp2 = b*b;
if(b>0){
/*
To reduce inner loop conditional branch, I separate the loop.
This may not be required if the compiler is smart enough.
When b**2 < rij, we should return immediately, without calling sqrt.
Then I add one conditional branch *_*.
*/
for(i=0;i < j;i++){
tmp1 = q[i].x[0]-r1;
tmp1 = tmp1*tmp1;
tmp2 = q[i].x[1]-r2;
tmp2 = tmp2*tmp2;
tmp3 = q[i].x[2]-r3;
tmp3 = tmp3*tmp3;
if((vtemp[vcnt++] = tmp1 + tmp2 + tmp3) > bp2) {
*result = 0.0;
return 0;
}
}
for(i=j+1;i < N;i++){
tmp1 = q[i].x[0]-r1;
tmp1 = tmp1*tmp1;
tmp2 = q[i].x[1]-r2;
tmp2 = tmp2*tmp2;
tmp3 = q[i].x[2]-r3;
tmp3 = tmp3*tmp3;
if((vtemp[vcnt++] = tmp1 + tmp2 + tmp3) > bp2) {
*result = 0.0;
return 0;
}
}
dsqrtiv(vtemp, vtemp, vcnt);
bi = 1.0/b;
tmp = tmp1 = tmp2 = tmp3 = 1.0;
tmp4 = tmp5 = tmp6 = tmp7 = 1.0;
/*
Unrolling and software pipelining the loop.
*/
i = 0;
vtmp[0] = vtemp[i];
vtmp[1] = vtemp[i+1];
vtmp[2] = vtemp[i+2];
vtmp[3] = vtemp[i+3];
vtmp[4] = vtemp[i+4];
vtmp[5] = vtemp[i+5];
vtmp[6] = vtemp[i+6];
vtmp[7] = vtemp[i+7];
for(i=0;i < vcnt-8; i+=8){
tmp *= (1.0 - b * vtemp[0]);
tmp1 *= (1.0 - b * vtemp[1]);
tmp2 *= (1.0 - b * vtemp[2]);
tmp3 *= (1.0 - b * vtemp[3]);
tmp4 *= (1.0 - b * vtemp[4]);
tmp5 *= (1.0 - b * vtemp[5]);
tmp6 *= (1.0 - b * vtemp[6]);
tmp7 *= (1.0 - b * vtemp[7]);
vtmp[0] = vtemp[i+8];
vtmp[1] = vtemp[i+9];
vtmp[2] = vtemp[i+10];
vtmp[3] = vtemp[i+11];
vtmp[4] = vtemp[i+12];
vtmp[5] = vtemp[i+13];
vtmp[6] = vtemp[i+14];
vtmp[7] = vtemp[i+15];
}
/*
The rest of the loop will be here.
*/
for(i=0;i < vcnt - i;i++)
tmp *= (1.0 - b * vtmp[i]);
tmp = tmp*tmp1*tmp2*tmp3*tmp4*tmp5*tmp6*tmp7;
}
*result = tmp;
return 1;
}
This archive was generated by hypermail 2.0b3 on Mon Nov 30 1998 - 22:10:34 EST