A couple of weeks ago, Apple introduced Snow Leopard, an OS release that left many users cold, but got developers very warm. Snow Leopard has a few new under-the-hood changes that are designed to make the life of a programmer that much easier in the heterogeneous, multi-core society that we now live in.

To begin with, there is OpenCL. OpenCL is a technology for computing with heterogeneous hardware such as GPUs and CPUs. It is being covered on MacResearch in David Gohara’s excellent series of screencasts.

The other big technology is Grand Central Dispatch (GCD), and that’s what I want to look at here. Grand Central is a new approach to parallelism: it does away with the old threaded models and replaces them with a packet-based approach. In this tutorial, we’ll take a look at how it works, contrast it with other approaches to concurrency, and get our hands dirty in some GCD code.

Multi-Threading

Traditionally, concurrency on shared-memory computers like today’s multicore systems was achieved via multithreading. A developer could create extra threads of execution, which would share the same memory space, but could be run on separate logical CPUs (e.g., cores). If the developer split the work up between these different threads, a genuine performance boost could be achieved.
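
As a minimal sketch of what this looks like in practice (the names processRows, NUM_THREADS, and N here are just placeholders for illustration, not code from this article), two pthreads each take a contiguous chunk of rows from the shared work:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 2
#define N 8

// Each thread processes its own contiguous chunk of rows. The data is shared,
// but the loop indices live on each thread's own stack.
static void *processRows(void *arg) {
    long threadId = (long)arg;
    int rowsPerThread = N / NUM_THREADS;
    for (int row = (int)threadId * rowsPerThread; row < ((int)threadId + 1) * rowsPerThread; ++row) {
        printf("thread %ld handling row %d\n", threadId, row);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; ++t)
        pthread_create(&threads[t], NULL, processRows, (void *)t);
    for (int t = 0; t < NUM_THREADS; ++t)
        pthread_join(threads[t], NULL);
    return 0;
}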

But there are some serious drawbacks to multithreading in this way. First, because threads are spawned in the application space, there is generally no way of knowing how many threads should be created for optimal performance. An application does not know the state of the system or other processes running on the system, so has no way of knowing what the optimal number of threads should be.

A second problem with multithreading is that threads are expensive to start and stop, and swap in and out. An application may start and stop many threads while it is running, and the operating system has to swap these threads so that processes can share the available CPUs. This wastes time. It would be better if the OS could regulate the number of threads to avoid excessive swapping and spawning.

Lastly, threading is particularly susceptible to so-called race conditions. Because threads tend to share data, it is very easy for one thread to corrupt the data of another, or for a particular sequence of events to lead to an unexpected result. In short, the outcome is determined by a race between threads, and avoiding race conditions is not as easy as it may seem.
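
To make that concrete, here is a small illustrative sketch (again, not from the blur example): two threads increment a shared counter with no locking, and because the read-modify-write steps interleave, the final total typically comes up short of the expected 2,000,000.

#include <pthread.h>
#include <stdio.h>

#define INCREMENTS 1000000

static long counter = 0;   // shared by both threads, with no locking at all

static void *incrementLoop(void *unused) {
    for (int i = 0; i < INCREMENTS; ++i) {
        counter += 1;      // read-modify-write: the two threads can interleave here
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, incrementLoop, NULL);
    pthread_create(&t2, NULL, incrementLoop, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    // Expected 2000000, but unsynchronized updates usually lose increments.
    printf("counter = %ld\n", counter);
    return 0;
}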

Packet-Based Concurrency

Since Intel hit the MHz wall and turned to multi-core for salvation, software engineers have been looking for ways of utilizing those extra cores without over-complicating their code. Multithreading frankly scares the wits out of most software developers, so another approach was needed.

Apple too went looking for solutions, and found inspiration in an unexpected area: telephony. Originally, in order to call someone, you needed a direct line running from your telephone to the telephone of the person you were calling. This involved making contact with an operator, who would physically plug in a jack so that you could make the call, just like in those old movies. Telephony hit its MHz wall many years ago: directly connecting telephones simply does not scale well.

The solution they found was to split conversations into small packets of data. The advantage of doing this was that the packets could all travel down a single wire, and be reassembled at the end into the sound you hear. Packets could even travel different routes to get to the final destination. It was no longer necessary to have a dedicated line between caller and callee.

The approach of Grand Central is analogous to the revolution in telephony. Developers no longer have their dedicated threads, nor do they spawn or destroy threads. All of that has been moved down into the operating system. Instead, the developer delivers packets of computation, which can then be scheduled to execute by the operating system. Threads are still used — just like wires are still used in telephony — but they are used in a different way: they don’t perform a single dedicated task in a single application, but instead run packets of work from many different sources in the application.

Another useful analogy is the one right in front of your nose, the name ‘Grand Central’ itself. The name is very apt, and the railway is a good way to explain how GCD works. Imagine the tracks are threads, and the trains are packets of work to be executed. The operating system puts in place the railway network, with its tracks and signals. The developer supplies the trains, and the OS ensures they get to where they are going. Before GCD, you had to build the whole railway network yourself!

Threading Example

We’ll get to GCD itself in a minute, but first let’s consider a traditional multithreading technology: OpenMP, which is quite widely used in science. We’ll then morph this threaded example to use GCD and see where the similarities and differences lie.

OpenMP is an approach to multithreading that uses compiler directives to parallelize serial code, usually at the level of loops. Here is some C code that takes some image data, and applies a primitive blurring algorithm to it.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Compile with OpenMP enabled, e.g. gcc -fopenmp blur.c -lm
int main (int argc, const char * argv[]) {
    const int N = 1000;
    const int BLURRADIUS = 7;

    // Heap allocation: two N*N float images are too large for the stack.
    // calloc also zero-fills originalImage, so the example has defined input data.
    float *originalImage = calloc(N*N, sizeof(float));
    float *blurredImage  = malloc(N*N * sizeof(float));
    int row, col, i, j;

    #pragma omp parallel for private (col, i, j)
    for ( row = 0; row < N; ++row ) {
        for ( col = 0; col < N; ++col ) {
            int pixIndex = row*N + col;
            blurredImage[pixIndex] = 0.0f;
            for ( i = MAX(0, row-BLURRADIUS); i < MIN(row+BLURRADIUS, N); ++i ) {
                for ( j = MAX(0, col-BLURRADIUS); j < MIN(col+BLURRADIUS, N); ++j ) {
                    int blurPixIndex = i*N + j;
                    float distance = (i-row)*(i-row) + (j-col)*(j-col);
                    distance = sqrt(distance);
                    blurredImage[pixIndex] += exp(-distance*distance) * originalImage[blurPixIndex];
                }
            }
        }
    }

    free(originalImage);
    free(blurredImage);
    return 0;
}

This code won’t win any awards, but it is interesting to note that it takes only a single directive, in the form of a C pragma, to parallelize the algorithm and get close to linear speedup.

#pragma omp parallel for private (col, i, j)
for ( row = 0; row < N; ++row ) {

This OpenMP directive tells the compiler that the for loop that follows can be executed concurrently; that the iterations of the loop are fully independent of one another. What the compiler will do is inject code just before the loop to spawn threads, and code after the loop to clean them up again. It will then distribute the iterations of the loop as evenly as possible across the spawned threads.

The generated code would look conceptually like this:

// Spawn threads here

int rowsPerThread = N / numThreads;
int row;
for ( row = threadId * rowsPerThread; row < (threadId+1) * rowsPerThread; ++row ) {
    InnerLoop(parentStack);
}

// Tear down threads here

The content of the main loop is moved to a function, which has been called InnerLoop here. Any variables that are shared between threads are passed to the function in the parentStack argument. Variables that should be kept private to each thread, such as the variables col, i, and j from the original code, are made stack variables inside the function.

void InnerLoop(void *parentStack)
{
    int col, i, j;

All of this happens at compile time. You could write equivalent code by hand using the pthreads library, but OpenMP makes it easier, with the compiler handling the gritty details.

Grand Central Dispatch

So how much does GCD differ from the approach used in OpenMP? Actually, surprisingly little. In both cases the spawning and tearing down of threads is hidden from the programmer. Each technology is based on a language extension, with OpenMP using preprocessor directives, and GCD adopting blocks (sometimes called ‘closures’).

From a technological point of view, what GCD is doing is very similar to OpenMP. The differences lie in the flexibility of GCD’s block-based approach, and in the fact that threads are managed at the system level, rather than by the application developer.

The similarities between GCD and OpenMP are also evident when you look at the source code.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <dispatch/dispatch.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main (int argc, const char * argv[]) {
    const int N = 1000;
    const int BLURRADIUS = 7;

    // Heap allocation again: the pointers are captured by the block below,
    // and the images are too large for the stack in any case.
    float *originalImage = calloc(N*N, sizeof(float));
    float *blurredImage  = malloc(N*N * sizeof(float));

    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply(N, queue, 
        ^(size_t blockRow) {
            // Work with a signed copy of the iteration index so that the
            // MAX/MIN clamping behaves the same as in the OpenMP version.
            int row = (int)blockRow;
            int col, i, j;
            for ( col = 0; col < N; ++col ) {
                int pixIndex = row*N + col;
                blurredImage[pixIndex] = 0.0f;
                for ( i = MAX(0, row-BLURRADIUS); i < MIN(row+BLURRADIUS, N); ++i ) {
                    for ( j = MAX(0, col-BLURRADIUS); j < MIN(col+BLURRADIUS, N); ++j ) {
                        int blurPixIndex = i*N + j;
                        float distance = (i-row)*(i-row) + (j-col)*(j-col);
                        distance = sqrt(distance);
                        blurredImage[pixIndex] += exp(-distance*distance) * originalImage[blurPixIndex];
                    }
                }
            }
        }
    );

    free(originalImage);
    free(blurredImage);
    return 0;
}

GCD replaces the for loop with a function call: dispatch_apply. This function performs essentially the same task as the #pragma omp directive in the previous example: it dispatches the iterations of the loop to different threads for execution.

The first argument to dispatch_apply is just the number of iterations in the loop. The second is a dispatch queue, in this case the default queue. Dispatch queues are used to submit work to GCD. You can create your own queues, or just use the default as in this example.
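
As a brief sketch of the queue API (the label com.example.blur is just a placeholder): the global queue is shared and concurrent, while a queue you create yourself with dispatch_queue_create is a private serial queue that you must release when you are done with it.

#include <stdio.h>
#include <dispatch/dispatch.h>

int main(void) {
    // The shared, concurrent queue used in the blur example above.
    dispatch_queue_t global = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    // A private serial queue; the label is arbitrary and shows up in debugging tools.
    dispatch_queue_t myQueue = dispatch_queue_create("com.example.blur", NULL);

    dispatch_sync(myQueue, ^{ printf("running on my private queue\n"); });
    dispatch_sync(global,  ^{ printf("running on the default global queue\n"); });

    // Queues you create yourself must be released; the global queues are not.
    dispatch_release(myQueue);
    return 0;
}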

The third argument to dispatch_apply probably looks very strange if you have never seen a C block. A block is basically an anonymous function. It has a return value (optional), and an argument list. It has no name, but a caret (^) is used where a function name would be expected.

What makes a block special, and different from a function, is that it can access variables in its enclosing (parent) scope. In that sense, it is similar to the InnerLoop function from the OpenMP example, which, if you recall, was passed an argument containing stack variables. With a block, you can achieve the same thing, but without having to explicitly pass stack variables via the argument list, or even write a separate function.
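
Here is a minimal stand-alone sketch of the block syntax (the variable names are purely illustrative). A block reads captured variables as constant copies; to let it modify a variable from the enclosing scope, that variable must carry the __block storage qualifier.

#include <stdio.h>

int main(void) {
    int factor = 3;              // captured by the block as a read-only copy
    __block int total = 0;       // __block lets the block modify this variable

    // A block literal: the caret takes the place of a function name.
    void (^accumulate)(int) = ^(int value) {
        total += value * factor; // reads 'factor' and updates 'total' from the enclosing scope
    };

    accumulate(1);
    accumulate(2);
    printf("total = %d\n", total);   // prints 9
    return 0;
}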

The dispatch_apply function will dispatch one block instance to the queue for each iteration of the loop. The system will then execute the blocks from the queue across the available threads. When all blocks have run, dispatch_apply will return.
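
A stripped-down sketch makes that behaviour easy to see: the iterations may run concurrently and in any order, but dispatch_apply itself does not return until every one of them has finished.

#include <stdio.h>
#include <dispatch/dispatch.h>

int main(void) {
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    dispatch_apply(4, queue, ^(size_t i) {
        printf("iteration %zu\n", i);    // iterations may appear in any order
    });

    printf("all iterations finished\n"); // always printed after the four lines above
    return 0;
}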

It should be noted that GCD has many functions, and dispatch_apply is but one. A more general function, which allows you to submit a single block for asynchronous execution, is dispatch_async. If you only learn one GCD function, make it dispatch_async.
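
Here is a small sketch of dispatch_async on its own; the semaphore is just one way to keep this command-line example alive until the block has run, since dispatch_async returns immediately.

#include <stdio.h>
#include <dispatch/dispatch.h>

int main(void) {
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_semaphore_t done = dispatch_semaphore_create(0);

    // dispatch_async returns immediately; the block runs later on a GCD-managed thread.
    dispatch_async(queue, ^{
        printf("doing background work\n");
        dispatch_semaphore_signal(done);
    });

    printf("submitted, carrying on with other work\n");

    // Wait here so the process does not exit before the block has run.
    dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
    dispatch_release(done);
    return 0;
}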

Performance

So how fast is it? I haven’t performed any extensive comparisons, but here are some timings for this example using the Debug compile options.

Technology       Number of Processors   Time (s)
OpenMP           1                      11.363
OpenMP           2                      6.199
Grand Central    -                      5.561

The scaling of this example is quite good, at least going from one to two cores on my MacBook Pro. With OpenMP, you can set the number of threads used, because threading is handled by the developer. With GCD, you don’t tell it how many threads you want; it dynamically adjusts depending on system load and other variables. Also, GCD will generally limit the number of threads to avoid swapping, which is wasteful.

It is interesting that GCD is slightly faster in this case than OpenMP with two threads; GCD is very lightweight. In any case, it is encouraging to see the new technology working well in practice as well as in theory.

For a more thorough benchmarking tool, take a look at Apple’s Dispatch_Compared code sample.

Update

Changed some terminology that implied the thread pool was system-wide. The pool is controlled by the system, but it is actually still an application-local pool.

Also added the link to the dispatch benchmark tool.