Do you seek ever increasing speed from your computers? Do you think multicore chips are a sign of a healthy chip industry? In major trade journals, most articles on the subject seem to uncritically accept multicore as the processor solution moving forward, without suggesting any viable alternative. Meanwhile, the chipmaker giants, Intel, IBM, and AMD, increasingly emphasize multicore chips. But I believe the hardware problems that were serious enough to prompt such fierce competitors to agree on multicore herald a major transition in the computer industry whose full consequences, especially for software writers, are not yet appreciated.

The reign of the single-processor computer is over

The popularized version of Moore’s law, which expects performance per processing element to double every couple of years, has ended. While Dr. Gordon Moore’s original observation, that transistor counts double every 18 to 24 months, still holds, by 2006 the popular expectation of doubling performance per processing core did not. Intel, as well as IBM, AMD, and others, could not produce faster processors because the chips ran too hot, so Intel saw it necessary to put two cores on one chip to increase total performance. Read another way, that decision says Intel ran out of ideas to improve performance per chip save one: the technology of “copy and paste”.

Processor makers today offer chips with multiple “cores” for a good reason. In 2004, microprocessor manufacturers industry-wide hit a wall: heat prevented higher clock speeds, with Intel delaying, then canceling, its 4 GHz chip.
In 2005, Apple surprised the world when it decided to switch from PowerPC to Intel because it saw a growth path in Intel processors that IBM did not pursue with PowerPC. In 2006, Intel introduced their Core Duo processor, a chip that, as far as a software writer is concerned, has two processors. At the same time, they announced that a 4-core chip was forthcoming, and both AMD and Intel slated 8- and 16-core chips for 2007 and beyond. Intel CEO Paul Otellini even previewed an 80-core chip as a way of pointing to the future. Others have since prototyped 64-core chips.

As before (recall the introduction of SIMD units such as AltiVec and SSE), chipmakers have recast their problem in a way that transfers it to software writers. This time software parallelism must carry on where hardware parallelism cannot. As far as a software programmer is concerned, a core is a processor. To make the best use of multicore hardware, software writers must choose an efficient parallel programming paradigm and apply it carefully.

Is shared-memory multithreading the answer to multicore?

By far the most commonly discussed programming method for multicore is multithreading. This parallel computing paradigm assumes many concurrent threads of execution that all share access to the same memory. The problems with this approach are two-fold:

Software

Because memory is shared, threads may step on each other’s work, potentially producing erroneous results at random. Determinism, formerly a defining feature of a computer, is easily obliterated, leaving the programmer to track down and eliminate such nondeterminism. E. A. Lee of UC Berkeley writes in The Problem with Threads: “… we in fact require that programmers of multithreaded systems be insane.” I highly recommend Lee’s thorough and thoughtful analysis to anyone.

Creators of typical shared memory implementations recognize that such machines are inherently nondeterministic, so their solution is to have the software programmer apply mechanisms that prune away the nondeterminism. Specifically, the shared memory with threads approach relies on locks or semaphores, which can all too easily negate any parallel performance gain.

When an unforeseen race condition occurs, the output is random and not repeatable, so diagnosing a problem whose symptom is never the same twice is deeply frustrating. Circumstances can even arise where the right answer results on one system while random answers appear on another, further frustrating the writer. Such shared memory issues are very difficult to isolate and solve.
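
To make the hazard concrete, here is a minimal sketch in C with POSIX threads (my own illustration, not code from any system discussed here): two threads increment a shared counter, and without a lock the total comes out wrong and different on nearly every run; adding the mutex restores the right answer but serializes exactly the work we hoped to parallelize.

    /* A minimal sketch, not production code: two threads add to one shared
     * counter.  Compile with: cc -pthread race.c */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    long counter = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Unsynchronized read-modify-write: the classic race. */
    void *racy(void *arg)
    {
        for (long i = 0; i < N; i++)
            counter++;
        return NULL;
    }

    /* Correct, but every increment now waits on the mutex. */
    void *locked(void *arg)
    {
        for (long i = 0; i < N; i++) {
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static long run(void *(*fn)(void *))
    {
        pthread_t t1, t2;
        counter = 0;
        pthread_create(&t1, NULL, fn, NULL);
        pthread_create(&t2, NULL, fn, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return counter;
    }

    int main(void)
    {
        printf("unlocked: %ld of %d\n", run(racy), 2 * N);   /* usually short, varies run to run */
        printf("locked:   %ld of %d\n", run(locked), 2 * N); /* always correct, but serialized */
        return 0;
    }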

Proponents of shared memory argue that writing code for such systems is easier than using distributed memory with message passing. For trivially parallelizable examples the code does look easier, but most of the interesting problems have inherent internal dependencies that cannot be eliminated, and applying the shared memory paradigm to them runs into fundamental issues. Most language-based multithreaded solutions obfuscate these data dependencies, producing code full of cryptic directives and hidden effects, all the more confusing for software writers.

OpenMP Programming Directives
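
As a sketch of what such directives look like (a generic example assuming a compiler with OpenMP support, not any particular vendor’s code), the pragma below asks the compiler to split a loop across threads. The loop’s entire data dependency rides on the reduction clause; omit it and the code still compiles and runs, silently reintroducing the race described above.

    /* A generic illustration of OpenMP directives.
     * Compile with OpenMP enabled, e.g.: cc -fopenmp sum.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* The directive splits the loop across threads; the reduction clause
         * declares the data dependency on "sum".  Leave it out and the loop
         * races on "sum" and returns a wrong, varying answer. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("%d threads available, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }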

Hardware

Data is commonly served from memory over a shared bus that the processing cores’ transaction requests can easily overwhelm. Beyond 16 cores, this memory bus is so taxed that hardware makers must design much more expensive and complex technology to compensate. Another reference I recommend is Chapter 6 of In Search of Clusters by Gregory F. Pfister, who gives an excellent description of the complications a hardware designer faces in maintaining cache coherence and data rates between processors and memory on such hardware.
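
One rough way to see the bus problem for yourself is a microbenchmark along these lines (a sketch with assumed array sizes, not a calibrated measurement): each thread streams through its own private array, so there is no locking and nothing shared at all, yet on a common memory bus the aggregate rate typically stops improving after only a few cores.

    /* A rough bandwidth sketch; the array size is an assumption chosen to
     * exceed any cache.  Each thread touches only its own private array. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define WORDS (16 * 1024 * 1024)          /* 128 MB of doubles per thread */

    int main(void)
    {
        int max_threads = omp_get_max_threads();
        for (int t = 1; t <= max_threads; t *= 2) {
            omp_set_num_threads(t);
            double start = omp_get_wtime();
            #pragma omp parallel
            {
                double *a = malloc(WORDS * sizeof(double));
                double s = 0.0;
                for (long i = 0; i < WORDS; i++) a[i] = (double)i;  /* write stream */
                for (long i = 0; i < WORDS; i++) s += a[i];         /* read stream */
                if (s < 0.0) printf("%f\n", s);   /* defeat dead-code elimination */
                free(a);
            }
            double secs = omp_get_wtime() - start;
            printf("%2d thread(s): %.0f MB/s aggregate\n",
                   t, t * 2.0 * WORDS * sizeof(double) / secs / 1e6);
        }
        return 0;
    }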

Except for the most data-independent problems, memory contention and data congestion were already issues on the most recent single-processor personal computers. Isn’t that why Apple invested so much in making the system bus, memory, and I/O of the first Power Mac G5 faster than those of any Mac before it? Now we have a Mac Pro whose 8 cores can easily overwhelm an even faster system bus.

This problem occurs not only in scientific computing but also in Apple’s latest H.264 compression for HD video. Benchmarks show that the Apple-supplied H.264 QuickTime compressor flatlines beyond 4 cores and is no faster when using 8. The same benchmarks show only about a 20% gain when using 4 cores versus 2. Clearly some sort of data bottleneck is holding performance back, despite the advanced skills of Apple’s software writers. (It turns out that, outside of game developers, few apply multithreading well. And yet 16-core chips are to come?)
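
That flatline is what Amdahl’s law predicts whenever a sizable portion of the work remains effectively serial, whether by algorithm or by a saturated bus. The sketch below assumes a hypothetical 50% parallel fraction, not Apple’s actual profile, yet it reproduces the shape of those benchmarks: about 20% more speed at 4 cores than at 2, and little gained beyond that.

    /* Amdahl's law with a hypothetical 50% parallel fraction; these are not
     * measurements of Apple's compressor, just the shape such a bottleneck
     * produces. */
    #include <stdio.h>

    int main(void)
    {
        const double p = 0.5;   /* assumed fraction of work that parallelizes */
        for (int cores = 1; cores <= 16; cores *= 2) {
            double speedup = 1.0 / ((1.0 - p) + p / cores);
            printf("%2d cores: %.2fx\n", cores, speedup);
        }
        /* Prints 1.00x, 1.33x, 1.60x, 1.78x, 1.88x: going from 2 to 4 cores
         * gains only about 20%, and no core count ever reaches 2x. */
        return 0;
    }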

“Those who do not listen to history are doomed to repeat it.”

A central debate of high-performance computing (HPC) in the 1990’s was between two camps: shared memory with threads versus distributed memory with message passing. Silicon Graphics, Inc. (SGI) became the prime corporate advocate of the shared memory approach. While other companies like Intel, Cray, IBM, and Fujitsu abandoned shared memory in favor of distributed memory in their HPC offerings, SGI, at its peak, built impressive boxes with 256 processors all sharing memory, at great expense. SGI helped build technologies we know today as OpenMP and OpenGL; the shared memory approach served graphics well because such applications are often easy to parallelize. But, when asked to build 1024-processor systems, even SGI would have to build 4 nodes of 256 processors each connected via a network. Software and hardware layers would recreate the illusion of a shared memory system, but its speed would at best be limited by the network, just as with message passing. While today distributed memory systems operate with many thousands of processors, SGI itself encountered a practical limit to the pure shared memory approach it advocated.

So in the 21st century, what was SGI’s fate? SGI was unable to produce technology sufficiently more powerful and more economical than its competitors’ in the HPC arena. The corporate fortunes of SGI fared accordingly, with its stock trading under 50 cents per share after 2001 and sinking below NYSE minimums. In May 2006, SGI filed for Chapter 11 bankruptcy and announced major layoffs, leaving its stock worthless. Only in the following October did the company find new life after a complete financial overhaul.

Today, Intel is advocating threads, but that also means Intel is tacitly advocating the shared memory approach championed by SGI from the 1990’s until it declared bankruptcy. The essential comparison I see is this: Intel is following the path SGI already trod, only this time, because of Intel’s microprocessor dominance, the entire personal computer industry is following the path to doom blazed by SGI. And we already know where the old SGI’s path ended: technological and financial ruin.
Does this make sense?

Meanwhile, all new Macs and nearly all new PCs are multicore, and the most discussed approach to apply multicore is that of yesteryear’s doomed SGI.

What is HPC using?

For a decade, clusters and supercomputers at the major supercomputing centers around the world have adopted the MPI standard on distributed memory hardware. Today, distributed-memory MPI is the de facto standard at the San Diego Supercomputer Center, the National Center for Supercomputing Applications, the National Energy Research Scientific Computing Center, Lawrence Livermore National Laboratory, and many more. In their annual reports, these organizations highlight the accomplishments achieved with their hardware, and it goes without saying that those applications use the distributed-memory MPI approach. Even the new SGI’s product line includes distributed-memory, message-passing hardware designs with MPI support. While novel alternatives are on the horizon, HPC practice implies that distributed memory with message passing is the best multiple-processor programming paradigm yet.
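
For contrast, here is a minimal sketch of the distributed-memory, message-passing style those centers standardized on: every MPI process (rank) owns its own memory, and data moves only through explicit communication calls such as MPI_Reduce.

    /* A minimal sketch of the message-passing style: each rank owns its own
     * memory, and data moves only through explicit communication. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = rank + 1.0;            /* each rank computes on its own data */
        double total = 0.0;

        /* Explicit communication: combine the partial results onto rank 0. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d ranks, total = %f\n", size, total);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with, say, mpirun -np 4, the same source runs unchanged on one multicore desktop or across the nodes of a cluster; the explicit sends, receives, and reductions are what keep the data dependencies visible instead of hidden.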

My prediction is that, sooner or later, the entire industry will evolve to use some sort of message passing, perhaps something MPI-like, to wrangle all those processors. (IBM’s Cell processor already has explicit interconnects between its SPEs, suggesting the same direction even within a chip.) So why not get a head start and use multiple processors the way HPC does today, bypassing the forthcoming Multithreading Meltdown?

With the prospect of more systems with four, eight, or more processors in a machine, it falls to software programmers to use these systems efficiently. Although the personal computer industry is trending toward shared memory with threads, the HPC industry has already shown that path to be a mistake. The need to program parallel computers has arrived at our desktops, and the lessons learned by the scientists who apply computing can show us how to use multicore, and beyond, if heeded.