Performance Tutorial For Mac (Part II): Using Accelerate


In this tutorial we are going to take a closer look at the Accelerate framework: how to use it, and some of its features. It’s been a while since I spoke about Accelerate, so if you don’t recall the general overview you can view the information here. The examples presented here focus on vDSP, the digital signal processing library within Accelerate. Since vDSP is a large library, we’ll concentrate on Apple’s FFT implementation and extend their 2D implementation to perform 3D FFT operations. The tutorial will be presented over the coming weeks, as there is a lot to discuss.

You can download a copy of the source projects here.
The vDSP manual can be found here.

Let’s first discuss what exactly we want to calculate. The goal in this example is to perform a complex-to-complex, in-place 2D FFT. For this tutorial we’ll be performing small FFTs so we can easily analyze the output; however, the content will be presented as if it were for a larger system. We’ll also restrict ourselves to considering only power-of-two FFTs. vDSP does allow 3×2^N and 5×2^N sizes as intermediate non-power-of-two options, but it doesn’t support general mixed-radix FFTs. For that you will need something like FFTW.

As an aside, if you are only in need of FFTs, FFTW is a good compromise. Why? On a Mac, for FFTs that Accelerate handles, FFTW will actually use Accelerate’s routines. For most of the stuff I do, I’m explicitly exploiting the Accelerate framework’s strengths, so I typically don’t use FFTW (and don’t like having to package an additional library or add a layer of abstraction). YMMV.

Initial Setup

Open the Xcode project and under Source select part0.c. Set the build target to part0 as well. The bulk of the code will remain the same throughout this tutorial series, so let’s begin by looking over the basic structure.

We start out of course by including some standard headers followed by the Accelerate header.

#include <Accelerate/Accelerate.h>

We’ll need a MAX routine for later in the code (when setting up twiddle factors).

#define MAX(a,b)  ( ((a)>(b)) ? (a) : (b) )

Going through the code on the page, we set up some convenience variables to control the FFT dimensions. We’ll be performing radix-2 based FFTs, so our array dimensions need to be powers of two. To get the exponent of the base-2 value we can do something like:

/* Get log (base 2) of each dimension */
int logX = (int)log2((double)nx);
int logY = (int)log2((double)ny);

Now we’ll set up the data. Again, we will be performing a complex-to-complex, in-place 2D FFT. vDSP.h defines a struct for Complex (interleaved) data types:

struct DSPComplex {
  float               real;
  float               imag;
};
typedef struct DSPComplex               DSPComplex;

The interleaved type contains the real and imaginary values. There is an equivalent struct for double-precision values, but on a PPC (at least), to get the vector versions of the FFTs you’ll need your data in single-precision format. We’ll allocate enough memory for MxN complex elements.

Note here that we are using malloc. Why? There are a few reasons. The less important one is that typically you’ll calculate a large FFT and will need more memory than you can get, or would want to put, on the stack. The second and more important reason is that the memory needs to be 16-byte aligned for the vector versions of the functions to work; malloc and calloc return memory allocated on a 16-byte boundary. You can align allocations on the stack using compiler attributes, but it’s not something we will discuss here.

Diversion – 16-byte boundaries

Without doing a complete rehash of how memory is stored, let’s briefly discuss memory alignment. Most CPUs require memory addresses to be evenly divisible by the type size of the element. For example, a 32-bit integer needs to be stored at an address that is evenly divisible by 4. On some systems, a misaligned address results in a seg fault. On others it results in a misaligned data load, which incurs a large performance penalty. Typically, the compiler takes care of this stuff for you by aligning (or padding, as in the case of structs) to the proper alignment in memory.

Now you may be thinking that a float is 32 bits, so alignment should also be on a 4-byte boundary. And you are correct. However, we want to use the vector hardware, and a vector load is 16 bytes (regardless of the element type), not 4 bytes. So we need everything on a 16-byte boundary.

Continuing

Once we have the memory allocated, we’ll fill it with some numbers (code not shown here, see the project file). For this tutorial we’ll just fill the real portion of the struct with numbers from 0 to MxN-1 and the imaginary portion with MxN-1 down to 0. This will make looking at the output easy. In memory our data will look like this:

0 255 1 254 2 253 3 252 4 251....

since it’s in complex (interleaved) format. Ok, that is our data that we will be performing the FFT on. Now let’s set up our FFT arrays.

FFT Memory and Twiddle Factor Setup

vDSP.h defines another struct called DSPSplitComplex:

struct DSPSplitComplex {
  float *             realp;
  float *             imagp;
};
typedef struct DSPSplitComplex          DSPSplitComplex;

Basically it contains two pointers to memory that we will allocate shortly. One memory block will contain all of the real values of the complex data type, and the other will contain the imaginary values for the array.

Diversion – Complex data formats

Ok… at this point you might be asking, why are we allocating arrays that hold each component separately if we just went through the trouble of creating interleaved data? Well… this is partly due to historical reasons and probably somewhat due to bookkeeping. Languages like Fortran support a complex data type (the C99 standard does as well) and most early number crunching applications were written in… you guessed it, Fortran. Also, having the real and imaginary components right next to each other in memory makes bookkeeping and certain types of math a bit easier. For performance, though, it’s a bit of a drag. And in terms of commodity vector engines (such as Altivec and SSE) it’s a performance killer, as you have to permute the data a lot and lose the intrinsic single-instruction multiple data (SIMD) model.

So, although the data comes to us in an interleaved format, it’s better to split up the data for performance reasons. The amount of time we save using an optimized algorithm more than makes up for the time spent splitting and merging the data.

Continuing with the code, we’ll allocate memory for MxN real elements and MxN imaginary elements. We’ll do this in main memory (the same requirements as for the data array are still in effect).

/* Setup and allocate some memory for performing the FFTs */
DSPSplitComplex input;
	
input.realp = (float*)malloc(dims*sizeof(float));
input.imagp = (float*)malloc(dims*sizeof(float));

With all of that in place, we need to set up our weights (or twiddle factors). This is a bit more involved (only slightly), so let’s dissect this:

FFTSetup fft_weights;
fft_weights = create_fftsetup(MAX(logX,logY),kFFTRadix2);

The FFTSetup is an opaque data type that is returned by the create_fftsetup function. The data returned contains a memory block that has been allocated and filled with our twiddle factors, along with bookkeeping information about the setup. Looking at create_fftsetup we can see it takes two values: a vDSP_Length (an integer) and the FFT type (radix 2, radix 3, or radix 5). How do we choose the values to pass in? The radix type is easy. We are performing a power-of-two FFT, so the radix is 2. Now what about the vDSP_Length?

The twiddle factors (weights) that are generated are calculated based on the longest dimension of the 2D FFT. As long as your two dimensions are close in size, you can use one weights array for both dimensions. So basically, we want to take the maximum base-2 value and use that to generate our weights. If our dimensions were 16 x 1024, we’d want two weights arrays. In this case the dimensions are the same, so we only really need one. We do the MAX test here for the sake of completeness.

Now that we have our input data and the memory allocated to split the data into (one real and one imaginary array), we can perform the split using ctoz (complex interleaved, c, to split complex, z):

Preparing for the FFT

/* Split the complex (interleaved) data into two arrays */
ctoz(data,2,&input,1,dims);

Input data = data
cStride = 2 (each real element is separated by a stride of 2, as is each imaginary element)
zStride = 1 (each element is placed into the output arrays with a stride of 1)
size = the total number (MxN) of elements to process

Here we pass in our input data and the DSPSplitComplex struct that will receive the output. The even-offset floats (the real parts) are copied into the real array, and the odd-offset floats (the imaginary parts) into the imaginary array.

Performing the FFT

Now let’s do the actual FFT. Again, we are doing 2D (fft2d) split complex (z), in place FFT (ip). Now I say in place, but vDSP will create a scratch buffer for internal operations for us. So the data will be processed and moved around, but will be returned to us in the same memory location that we sent in:

/* Perform the FFT */
fft2d_zip(fft_weights,&input,1,0,logX,logY,FFT_FORWARD);

We pass in our weights (twiddle factors) and the input split complex data (DSPSplitComplex) as the first two parameters. The next parameter is the row stride; we want to process each successive row, so the stride is 1.

The next value is the column stride; this parameter is tricky. According to the documentation:

The default column stride equals the row stride multiplied by the column count.

Note that it is the row stride times the column COUNT that defines the column stride. If you set it to zero, vDSP will just calculate it for you, taking it to mean you want to process every column. If you want to pass a value yourself, it will be 1 (the row stride) times the column count (nx = 8): 1 × 8 = 8. So our column stride is 8. When in doubt, just set it to 0 and let vDSP do the work.

The next two values are the base 2 exponent in the X and Y directions (M, N). And the last value is the direction to calculate the FFT. Here we want to do the forward transformation so we set it to FFT_FORWARD (which is an enumerated type that defines the sign of the exponent used in the FFT).

Finalizing the FFT

Now that we’ve done our FFT (pretty easy so far, huh?), we need to get our data back into the form we sent it in. We’ll use ztoc (split complex to complex interleaved) to do that; it’s basically the inverse operation of ctoz:

/* Merge the separate data (real and imaginary) parts into a single array */
ztoc(&input,1,data,2,dims);

And then we can print out the data and do some cleanup:

/* Print out some data to check the calculation */
for(i=0;i<dims;i++){
	printf("%f %f\n",data[i].real,data[i].imag);
}

/* Free up the allocations */
free(input.realp);
free(input.imagp);

destroy_fftsetup(fft_weights);

Putting It All Together

Ok, so now we know what the basic layout is, let’s go ahead and run the calculation. To execute the program, make sure part0 is set as the target and click Build and Go in Xcode. You should see output in the Run Log window like that shown at this link.

Congratulations, you’ve now performed an FFT using Accelerate. Now let’s discuss a few things about performance before ending.

Memory Allocations

As I mentioned in another post, allocating memory can take some time (as does freeing it). This is mostly due to the hit incurred while zeroing out memory as you access a new page (zero-fill page faults). In this example, since we are doing everything in “main”, we perform one allocation and one deallocation. However, if you are going to be performing a lot of FFTs (and the dimensions are the same), it may be worthwhile to allocate the FFT arrays (input.realp and input.imagp) once at the beginning of program execution and deallocate them once you don’t need them anymore. I’ve seen cases where simply doing this speeds up a calculation by a factor of 8 (yes, 8). So it’s something to consider.

FFT Weights (Twiddle Factors)

The weights array is very, very computationally expensive to calculate (you’re essentially doing a bunch of slow trigonometric calculations), and generating it can be as much as 50x slower than the calculation of the FFT itself. As with the memory allocation, if the dimensions of your FFT aren’t changing much (or at all), then it’s best to allocate this once and reuse it each time you call the FFT routine. Even if the dimensions change slightly (and are smaller than the original allocation), it might not matter too much. Again, it’s something to consider.

Complex Interleaved to Split Complex (and vice versa)

Altivec simply rocks when compared to SSE and its variants. There is probably a more technical way of saying that, but I can’t think of one right now. One of the areas where Altivec really shines is in its vector permute and vector merge capabilities. The ctoz and ztoc transformations use these very heavily (as does another function I’ll introduce you to in a later installment). But even then, there are limits to how fast these operations can be performed. So if you are able to keep your data in a split complex format (de-interleaved), do it. In some cases it’s not possible, but if you are designing an application from scratch, consider using a split complex format. It might make certain calculations a bit more… well, complex, but in terms of performance it can be a win (especially as Altivec falls by the wayside in favor of SSE).

Wrap-up

This first part of the tutorial is now complete. With the majority of the setup out of the way, the next installment (next week, I promise) will move a bit faster and cover more ground. In particular we will verify the results of our calculation. Then, moving forward, we will extend the FFT from 2D to 2D-planar to a full 3D FFT. Until next time….

