Introduction
R is a venerable, free software environment for statistical computing and graphics. It has a large user community and an impressive number of available add-on packages. Notable among these is the Bioconductor family of packages, which constitutes a comprehensive statistical computing environment for bioinformatics (with a focus on microarray data).
Fortunately, R is very well supported on OS X. In fact, the R for Mac installation provides a very Cocoa-esque editing and execution interface, with Mac-specific amenities such as Quartz-based plotting windows, all wrapped up in a nice OS X application bundle. You can download the OS X installer for R from any of the CRAN mirrors.
One particularly useful R package is the Simple Network of Workstations, or SNOW package. SNOW provides parallelized R functions that piggyback on several distributed computing back-end technologies, such as the Message Passing Interface (MPI) or the Parallel Virtual Machine (PVM). For a nice quick reference of SNOW functions I recommend the SNOW Simplified website.
A Case Study
Recently I needed to perform hierarchical clustering with multiscale bootstrap analysis on a massive microarray data set. Thankfully there is an R package called pvclust that trivially facilitates such analyses. The non-trivial aspect was the CPU power needed to carry out the analysis with 10,000 bootstrap permutations. A quick test on my MacBook Pro laptop indicated it would take days to execute on a single machine. Luckily the pvclust package provides a parallelized bootstrapping function based on SNOW, so I turned to our lab's Linux cluster, but shuffling the data across the network to the compute nodes slowed things down too much (this cluster has only gigabit Ethernet interconnects).
After wasting so much time with the Linux cluster I realized that I had overlooked one of the newest members of my computing arsenal. I recently acquired a Mac Pro workstation with dual quad-core 3.0 GHz Xeon processors and 16GB of RAM, which I have dubbed the MacZilla. Foolishly, I had ignored the baby supercomputer sitting next to my desk.
I installed SNOW on the MacZilla using the GUI package manager offered by the R GUI for Mac. I decided to use the MPI back end for SNOW and installed the LAM/MPI package (preferred by SNOW) using MacPorts. To use the MPI back end with SNOW you also need to install the Rmpi package. If you installed LAM/MPI using MacPorts you will need to specify the following configure flag when you install Rmpi:
--with-mpi=/opt/local/
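If you would rather install Rmpi from within R than through the GUI package manager, the same flag can be passed via the configure.args argument of install.packages(); a minimal sketch, assuming the default MacPorts prefix of /opt/local:
install.packages("Rmpi", configure.args="--with-mpi=/opt/local/")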
This ensures that the Rmpi library is linked using the correct MPI implementation. With Rmpi in place you have all you need to use the parallel SNOW functions. The first step in any parallel SNOW code is to create a cluster object:
library(snow)
cl <- makeCluster(16, type="MPI")
The above commands load the snow package and create a 16-node MPI cluster object. This cluster object is then passed as an argument to any SNOW-based function. For example:
> A <- matrix(1:10, 5, 2)
> A
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> parApply(cl, A, 1, sum)
[1] 7 9 11 13 15
When you no longer need the cluster object, you must call stopCluster() to dispose of it:
stopCluster(cl)
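Putting the pieces together, the parallel pvclust call for an analysis like mine looks roughly like the following. This is only a sketch: my.expression.data stands in for a hypothetical expression matrix (pvclust clusters its columns), and the method.hclust and method.dist values simply spell out pvclust's defaults.
library(snow)
library(pvclust)

cl <- makeCluster(16, type="MPI")
result <- parPvclust(cl, my.expression.data,
                     method.hclust="average",    # average-linkage clustering
                     method.dist="correlation",  # correlation-based distance
                     nboot=10000)                # 10,000 bootstrap permutations
stopCluster(cl)

plot(result)                # dendrogram annotated with bootstrap p-values
pvrect(result, alpha=0.95)  # highlight clusters with strong support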
Using my MacZilla workstation I created a 16-node MPI cluster and carried out my bootstrap analyses in a single hour, instead of the many hours, if not days, it might have taken on our Linux cluster (due to the data shuffling over the network). The main benefit is the availability of so many CPU cores on the same memory and CPU buses. Given the many high-level MPI interfaces for R and other scripting languages (Perl, Python, Ruby, etc.), the 8-core Mac Pro gives you a lot of computing bang for your buck.