Simple approach to battle memory bandwidth limitations
I've written before on this issue of memory bandwidth bottlenecks for certain
type of scientific codes (i.e. memory "bound" codes) on current multi-core Macs.
The main point is that with processors getting faster and faster (higher clock rates
and inclusion of additional cores) at a much more rapid rate compared with
memory subsystems, often times the bottleneck can be fetching data to or
from memory. Here is a link to an article I wrote sometime ago about this:
MacResearch.org: MEMORY PERFORMANCE
I did a simple experiment earlier this week relating to these memory bandwidth
issues on multi-core systems and got some interesting (maybe even surprising!) results.
The motivation of this experiment was the observation that when one has a large
computation domain, the memory bandwidth limitation on current architectures can be
severe enough that one does not obtain any significant speedup from using more
processor cores. The idea I had was to try reduce the memory usage of my code
drastically even if it implied a significant increase in raw computation, and yet
expect an overall speed up! Seemed like a pretty crazy plan.
My main astrophysics code, much like many other codes, has an initial
routine that computes all the PDE coefficients (essentially all the values that
don't change during the simulation) on the entire computational
grid and stores them for later retrieval and use, as needed. My simple plan
was to change this approach and not store these coefficients at all, but
rather, compute them on the fly, over-and-over again, at every time-step!
And still expect overall better performance.
When I implemented this very simple approach the memory footprint
of my code dropped by a factor of two. For others, this drop may be even more
significant, depending on how much is being stored in their code. Then, I ran the
code on a single-core of my Mac Pro. The modified (reduced memory code), for
a relatively large computational domain, ran a few percent (single-digit) slower
than the original. I had expected this outcome to be much worse.
Then came the big surprise. When I ran both the codes on 4-cores (these
versions of the codes use OpenMP threads) I actually got an overall speedup!
The modified code ran 15-20% faster than the original. On full 8-cores, this
speedup was nearly 50%!
It seems that the improvement in scaling with the processor cores in the modified
code was significant enough to not only overcome the increase in the raw
computation, but also yield an overall significant speedup. It seems that our
programming 101 approach may benefit from a reversal of the commonly taught
paradigm to "pre-compute and store as much as possible"!
To summarize, not only does my code run faster now, it uses significantly less memory.
So, actually, you can have your cake and eat it too :-).



Comments
pre-compute
It seems that our
programming 101 approach may benefit from a reversal of the commonly taught
paradigm to "pre-compute and store as much as possible"
Actually, this reversal happened some years ago. For example, from
The Art of Unix Programming (2003):
Precomputation may seem efficient because it minimizes total use of processor cycles, but processor cycles are cheap.
Most modern CPUs (like in your Mac, but not in current-gen game consoles) hide memory latencies somewhat with large caches, integrated memory controllers, and out-of-order execution/deep-pipelining. But every generation of machine the gap widens and computations that were previously expensive enough to merit precomputing become so fast that memory latency eventually dominates.
Some other memory related things that will help you improve performance of your astrophysics code:
Instruments.app and the CHUD tools can help you track most of these.
OpenMP is very convenient, but in performance bottlenecks you may benefit from manually implementing threading/locking. As always, the compiler does the best it can but sometimes you can do better and in performance-sensitive parts of your code the development cost of doing so may be justified by the runtime benefits.
Re: pre-compute
Nice post xbox! Good overview of many different types of optimizations.
The only thing I want to add is that you need to be careful not to prematurely optimize. In other words, if you are going to use these rather low level techniques, make sure first that the code you are working on is really in need of optimizing (using tools like Instruments and CHUD). If a section of code takes an insignificant portion of the runtime, optimizing it is time wasted.
Some of the points in the post are also potentially cases of premature optimization. For example, unrolling loops these days by hand should only be done in the most critical code. Often the compiler can do it better than you can do it yourself, and although your hand optimized code may work great on the current CPUs, a few generations further and it may be slower. If you can leave it to the compiler, do. (Often just using a HPC library like the BLAS routines in Accelerate is a good alternative to doing your own loop unrolling.)
I think the lesson of this article in general is that what is fast today may not be fast tomorrow. Caching data was once fast, now it is often slow. You can't always avoid the problems this throws up, but you need to be careful not to excessively optimize your code for one set of circumstances, only to have it dismally slow in another set (this is the no free lunch theorem).
Drew
---------------------------
Drew McCormack
http://www.maccoremac.com
http://www.macanics.net
http://www.macresearch.org