Axp-List Archive
RE: FFTW performance

Subject: RE: FFTW performance
From: Tom Morris (Tom.Morris@api-networks.com)
Date: Tue Apr 10 16:42:35 2001


On Tuesday, April 10, 2001 3:33 AM, chu@tes-mail.jpl.nasa.gov
[SMTP:chu@tes-mail.jpl.nasa.gov] wrote:

> Are there any special flags (besides -fast) that will generate better
code?

To answer the last question first, since it seems to have attracted the
most comment, ccc -fast will typically get you 90% of the way there and
is probably all you need for most applications.

> The CPU benchmark that we use
> most often is running FFTs. More specifically, we use the FFTW package
> from MIT.

Even more specifically, I understand from a private conversation with
Eugene that their benchmark uses a single size of 131072 points for
the FFT test. As it happens, this is in a region where FFTW seems
to choose an algorithm which is particularly sub-optimal for Alpha.

If you look at http://www.fftw.org/benchfft/results/alpha467.html
you can actually see that this behavior goes back all the way
to EV56 and FFTW2.0 For powers of two at 16K points and above,
CWP significantly outperforms that algorithm that FFTW is using.

Running benchFFT with FFTW2.1.3 and CXML 4.0.0 shows
FFTW to be 2-5 times faster for powers of 2 between 2 and 64.
Between 128 and 4K both packages are within 10-15% of each
other. FFTW is 30% faster at 8K. At 16k and above, CXML is
1.5 to 5 times faster than FFTW. Specifically at 128K, CXML
is 2.6 times faster than FFTW. Comparing the performance
for CXML and CWP, I'd guess that CXML actually uses the
CWP algorithm in this region.

For non-powers of 2, or at least those that are tested by
benchfft, FFTW is consistently better than most other
algorithms/packages and CXML is very erratic in its
performance.

This isn't really the kind of problem that will be helped by
compiler switches or even better compilers. This is
largely a data flow problem and the data access pattern
is something intrinsic to the algorithm. The compiler's
going to have little luck fixing it up after the fact to try
and match the machine's architecture.

I haven't looked at what FFTW is doing specifically, but
it's obviously not making effective use of the high speed
DDR cache on the 833 MHz Slot B module.

If FFTs are important to you, it's worth having a look at
the different algorithms available, because they have
different strengths and weaknesses and VERY different
performance characteristics.

For Eugene's application at a size of 128K points, it
looks like CXML or CWP is the way to go. At smaller
sizes, FFTW is pretty hard to beat.

Tom

_______________________________________________
Axp-list mailing list
Axp-list@redhat.com
https://listman.redhat.com/mailman/listinfo/axp-list



This archive was generated by hypermail version 2a22 on Sat May 5 06:18:12 2001 PDT
Send any problems or questions about this archive to webmaster@alphalinux.org.