Like many scientists, we’re interested in using graphics cards to increase the performance of some of our numerical code. With the addition of CUDA to the supported list of technologies on Mac OS X, I’ve started looking more closely at the architecture and tools for implementing numerical code on the GPU. This won’t be a CUDA tutorial, per se. Those will come later. However, I wanted to take some time to do a few comparisons between some CPU-based technologies and their GPU equivalents.

I’m a huge fan of the FFT. We rely heavily on them in structural biology. Intellectually, they simply fascinate me. I don’t know why, but they do. So I’ve chosen to look at FFT performance for this article.

Background

The CUDA SDK comes with a few numerical libraries like BLAS, as well as a full suite of FFTs. Mac OS X also ships with a set of numerical libraries via the Accelerate framework. So for this test I chose to benchmark 1D, power-of-2, complex FFTs using the standard CUDA and Apple implementations, from 32 to 1048576 elements (2^5 – 2^20). The FFTs performed are all single precision (since the graphics card we have only supports single precision floating point arithmetic). All timings use mach_absolute_time() and are reported in milliseconds. And finally, for each calculation size, the stages are run twice in full and the timings are taken on the second pass (e.g. full CUDA calc 1, then full CUDA calc 2, which is timed). This was done because there were issues related to page faults when new memory allocations occurred. The times reported for each stage are an average of three runs.
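For reference, the timing boils down to something like the following (a minimal sketch; the helper function below is mine, not code from the benchmark):

  #include <mach/mach_time.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Convert a mach_absolute_time() interval to milliseconds. */
  static double elapsed_ms(uint64_t start, uint64_t end)
  {
      mach_timebase_info_data_t info;
      mach_timebase_info(&info);            /* ratio of ticks to nanoseconds */
      double ns = (double)(end - start) * info.numer / info.denom;
      return ns / 1.0e6;
  }

  int main(void)
  {
      uint64_t t0 = mach_absolute_time();
      /* ... stage being timed ... */
      uint64_t t1 = mach_absolute_time();
      printf("stage took %.3f ms\n", elapsed_ms(t0, t1));
      return 0;
  }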

Part of the power of CUDA is that it allows you to create kernels that can run concurrently (multithreaded). The CUDA FFT implementation is multithreaded, although I can’t say for certain at what point the implementation splits from single threaded to multithreaded (it may be on all of the time). For Accelerate the FFTs are single threaded to the best of my knowledge.

The test system specs are:

  Mac Pro
  Model Identifier: MacPro3,1
  Processor Name: Quad-Core Intel Xeon
  Processor Speed: 2.8 GHz
  Number Of Processors: 2
  Total Number Of Cores: 8
  L2 Cache (per processor): 12 MB
  Memory: 8 GB
  Bus Speed: 1.6 GHz

  NVIDIA GeForce 8800 GT
  Type: Display
  Bus: PCIe
  PCIe Lane Width: x16
  VRAM (Total): 512 MB

I need to point out that we aren’t doing an apples-to-apples comparison (terrible pun entirely intended). So, to be somewhat fair, I’ve broken the FFT calculation timings down into stages. Each stage corresponds to one of the steps required to perform a single FFT from start to finish, and I’ll describe those stages in more detail below. Each FFT is actually two FFTs (a forward and an immediate reverse transform). No normalization of the output data was performed for the timings; however, normalization was performed when assessing the error in the implementations.

Stage                 Apple            CUDA
Mem Alloc and Init    malloc           malloc/cudaMalloc
FFT Setup             Weights Setup    Plan Setup
Data Transport        ctoz             cudaMemcpy (to device)
FFT                   fft_zip          cufftExecC2C
Data Transport        ztoc             cudaMemcpy (from device)

Accelerate Stages

For the Accelerate Framework the stages are:

Mem Alloc and Init – Memory is allocated with malloc in DSPComplex format. Three arrays are allocated: one to hold the initial interleaved data (the data set), and two more to take the interleaved data and hold it in split-complex form. The memory is then initialized with float values running from 0 to N for the real component; the imaginary component is zeroed out. If performing a lot of FFTs you’d probably allocate this memory once and reuse it over and over for optimal performance.

FFT Setup – This is the weights setup (twiddle factors). The weights only need to be set up once (meaning you can reuse them, and should). The weights calculation is slow (often slower than the FFT itself). Since these tests time the FFT from start to finish, the weights aren’t cached.

Data Transport – Since the calculation uses main memory for storage and the CPU has direct access to it, this stage refers entirely to the complex-to-split-complex operation (ctoz).

FFT – Performs a forward and immediate reverse transform of the data using fft_zip (a split complex, in place FFT).

Data Transport – This converts the data from split complex back to interleaved and is the opposite of the initial Data Transport stage (ztoc).
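Putting the stages together, the Accelerate path looks roughly like the following (a minimal sketch using the vDSP_-prefixed names, where the post’s ctoz/fft_zip/ztoc correspond to vDSP_ctoz/vDSP_fft_zip/vDSP_ztoc; the benchmark’s actual code may differ, and error handling is omitted):

  #include <Accelerate/Accelerate.h>
  #include <stdlib.h>

  void accelerate_fft_roundtrip(vDSP_Length log2n)
  {
      vDSP_Length n = 1UL << log2n;

      /* Mem Alloc and Init: interleaved data plus split-complex storage. */
      DSPComplex *data = malloc(n * sizeof(DSPComplex));
      float *realp = malloc(n * sizeof(float));
      float *imagp = malloc(n * sizeof(float));
      DSPSplitComplex split = { realp, imagp };

      for (vDSP_Length i = 0; i < n; i++) {
          data[i].real = (float)i;   /* real part runs 0 .. N-1 */
          data[i].imag = 0.0f;       /* imaginary part zeroed */
      }

      /* FFT Setup: weights (twiddle factors); cache and reuse in real code. */
      FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);

      /* Data Transport: interleaved complex -> split complex. */
      vDSP_ctoz(data, 2, &split, 1, n);

      /* FFT: in-place forward then immediate reverse transform. */
      vDSP_fft_zip(setup, &split, 1, log2n, kFFTDirection_Forward);
      vDSP_fft_zip(setup, &split, 1, log2n, kFFTDirection_Inverse);

      /* Data Transport: split complex -> interleaved complex. */
      vDSP_ztoc(&split, 1, data, 2, n);

      /* The unnormalized forward/inverse pair leaves the data scaled by n;
         divide through to recover the input (done for error checks only). */
      float scale = 1.0f / n;
      vDSP_vsmul(realp, 1, &scale, realp, 1, n);
      vDSP_vsmul(imagp, 1, &scale, imagp, 1, n);

      vDSP_destroy_fftsetup(setup);
      free(data); free(realp); free(imagp);
  }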

CUDA Stages

For CUDA the stages are:

Mem Alloc and Init – Memory is allocated for both data (malloc) and device storage (cudaMalloc). The data is in complex format and the device memory is in cufftComplex format. The data is initialized the same way as for Accelerate.

FFT Setup – CUDA uses plans, similar to FFTW. cufftPlan1d was used to create the plan, with CUFFT_C2C as the transform type. Only one plan was calculated, and it serves both the forward and reverse transforms. Since the implementation is opaque, it’s not obvious what FFT algorithm will be used. However, like the weights setup for Accelerate, you only need to do the plan calculation once and can then cache the result for future use.

Data Transport – The data set that was initialized above is copied to the device using cudaMemcpy.

FFT – Performs an in-place forward and immediate reverse FFT using cufftExecC2C, passing in the appropriate direction for each transform.

Data Transport – This stage is the opposite of the first Data Transport stage, moving the results from the device back into the system’s main memory using cudaMemcpy.
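The CUDA path, following the same stage breakdown, looks roughly like this (again a sketch, not the benchmark code itself; error checking is omitted):

  #include <cufft.h>
  #include <cuda_runtime.h>
  #include <stdlib.h>

  void cuda_fft_roundtrip(int n)
  {
      size_t bytes = n * sizeof(cufftComplex);

      /* Mem Alloc and Init: host buffer (malloc) and device buffer (cudaMalloc). */
      cufftComplex *host = (cufftComplex *)malloc(bytes);
      cufftComplex *device = NULL;
      cudaMalloc((void **)&device, bytes);

      for (int i = 0; i < n; i++) {
          host[i].x = (float)i;   /* real part runs 0 .. N-1 */
          host[i].y = 0.0f;       /* imaginary part zeroed */
      }

      /* FFT Setup: a single C2C plan, reused for both directions. */
      cufftHandle plan;
      cufftPlan1d(&plan, n, CUFFT_C2C, 1);

      /* Data Transport: copy the data set onto the card. */
      cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);

      /* FFT: in-place forward then immediate reverse transform. */
      cufftExecC2C(plan, device, device, CUFFT_FORWARD);
      cufftExecC2C(plan, device, device, CUFFT_INVERSE);

      /* Data Transport: copy the results back to main memory. */
      cudaMemcpy(host, device, bytes, cudaMemcpyDeviceToHost);

      /* CUFFT transforms are unnormalized; divide by n to recover the input
         (done for error checks only, not in the timed runs). */
      for (int i = 0; i < n; i++) {
          host[i].x /= n;
          host[i].y /= n;
      }

      cufftDestroy(plan);
      cudaFree(device);
      free(host);
  }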

Results

For the results sections, plots of the various stages (with commentary) are provided. In some cases a plot may only show a subset of the data to make it easier to see the differences. Not all plots show the entire data set (not to hide anything, but usually because the data trend is similar across all tests). The number of elements is reported on the X-axis, and the times on the Y-axis (in milliseconds). I also want to give a shout-out to David Adalsteinsson and his excellent program DataGraph. I initially tried plotting with Excel and a few other tools, but making the plots with DataGraph was far easier, and it provides a lot of options for configuring how the plots look. If you are looking for a lightweight, user-friendly plotting program, I’d highly recommend it.

Memory Allocation and Initialization

For the memory allocation stage, there’s nothing terribly shocking here. It does look like CUDA has special routines for allocating memory of certain sizes (and in fact malloc does as well). There is a sharp transition between 256 and 512 elements (2048 and 4096 bytes, respectively) for CUDA. But after 65k elements, the timings are almost identical.

FFT Setup

The trend here is pretty obvious: the plan calculation is very lightweight and changes little as the problem size increases. The weights calculation is relatively slow, and you can see that by the time you hit 65k elements it is about 60x slower than the plan calculation. Again, in both cases these only really need to be calculated once and then cached. For 2^20 elements the times are 209 ms (Accelerate) and 0.86 ms (CUDA).

Data Transport

The complex to split-complex transformation is very fast (it is a vectorized operation). One of the biggest issues with using GPUs is getting data on and off the card, and it shows somewhat in this stage. At this point, we are copying the data from main system memory to the main memory block on the video card. As you’ll see later, getting data onto the card is much faster than getting it back. Historically this makes sense: data sent to the graphics card was headed for one place (the display), where its journey ends. You will notice, however, that at 2^17 elements the ctoz transform slows down, which may be related to the data beginning to fall out of cache.

FFT Forward and Reverse

This is the important part. Remember, we are performing both a forward and a reverse transform in the timing. What you can see is that up to 2^14 elements, the Accelerate FFTs are much faster than CUDA’s (again, to the best of my knowledge the Accelerate FFT used here is single-threaded). For CUDA, the 2^14 case is a reproducible anomaly, and I suspect it’s related to some edge case in the plan setup, although I can’t be certain. After 2^14 elements, the CUDA FFT timings are flat, and this is almost certainly due to the massive multithreading capabilities of the card. The card I have access to supports up to 128 concurrent threads (yes, 128 individual threads that have full access to each “core” of execution). Compared to the CPU that is 16x more “cores”, and compared to the Accelerate implementation that’s 128x more.

Data Transport

For Accelerate (ztoc, split-complex to complex), the transform is relatively fast; beyond 2^16 elements (65536) it’s slightly slower, but not terribly so. For CUDA, this is where the current technology starts breaking down. Getting data off of the card is just plain slow, so it isn’t something you want to do a lot. Rather, with current technology you’ll want to find ways to minimize having to retrieve data from the card, caching as much as possible if repeated calculations are necessary. This is largely implementation dependent. For the largest case tested, 2^20 elements, the time taken by cudaMemcpy is 97 ms versus 4.8 ms for ztoc. This is clearly an area that needs improvement, and I’d imagine it’s one that video card makers are dedicating more effort to.

Total Times

Again, these are the timings for performing one complete cycle of FFT calculations from cold start to finish. A lot of the time, after the initial setup you’d only perform some subset of the stages (see below). From these plots you can see that for small cases Accelerate is the winner, and after that CUDA takes over, although in both cases the differences may not be enough to justify one technology over the other. However, if we look only at the time spent on the stages that you must perform for every FFT calculation (assuming interleaved complex data is the default data type, i.e. requiring the ctoz and ztoc transforms), the situation is different:

In this case it’s clear that Accelerate is the winner across the entire set of sizes tested, although the gap looks like it begins to close for the largest case. But still, that’s an important consideration given the current technology and the steps required to complete the FFT.

Throughout this I’ve avoided talking about errors, and I probably should have addressed this sooner. One of the reasons to perform the forward and immediate reverse FFT in these tests is to gauge how accurate the FFT implementation is. Tested this way, you should get back exactly what you put in (after normalization corrections, since most FFT implementations use performance tricks that result in scaling of the results).
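In code, that check amounts to something like the following (a rough sketch; the helper name and tolerance are my own choices, not from the benchmark):

  #include <math.h>

  /* After the forward/inverse round trip, normalize and compare element-wise
     against the original (purely real) input. */
  int count_mismatches(const float *orig_re, const float *re, const float *im,
                       int n, float tol)
  {
      int bad = 0;
      for (int i = 0; i < n; i++) {
          float r = re[i] / n;   /* undo the round-trip scaling */
          float c = im[i] / n;   /* imaginary part should come back ~0 */
          if (fabsf(r - orig_re[i]) > tol || fabsf(c) > tol)
              bad++;
      }
      return bad;
  }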

I cannot honestly say that I performed a rigorous error analysis. However, looking at the output results (after normalizing) for some of the smaller cases, on average the CUDA FFT implementation returned results that were less accurate than the Accelerate FFT. Element-wise, 1 out of every 16 elements was incorrect for a 128-element FFT with CUDA, versus 1 out of 64 for Accelerate. The differences are most probably due to roundoff error. Again, with current technology CUDA is at a disadvantage since it doesn’t support the full set of IEEE rounding modes (this is expected to change in upcoming hardware), and I suspect this is where the problem originates rather than it being an implementation issue.

Final Thoughts

I have to say I’m impressed with both technologies. On one hand you have access to massive amounts of computing power in the GPU. And in the near future I imagine that many of the issues I’ve noted here will begin to be addressed (faster data transfer rates off the card, double precision and full support for rounding modes and IEEE-754 compliance).

On the other hand, you have Accelerate, a very comprehensive suite of routines and libraries that is highly optimized for all versions of Mac OS X. More importantly, it ships with every copy of Mac OS X, so you can write applications that use the library and it will automatically do the right thing whether it’s running on a PPC G3 or a brand-new Intel-based Mac. Accelerate also supports both single and double precision floating point operations. And in Leopard everything has been retuned, so the performance of Accelerate (as well as the standard math library) is greatly improved.