Literate Reporting, anyone?

I recently discovered Sweave, and have become enamored with Literate Reporting. (That is, the idea of including the report text and the calculations behind the report in the same document, which can be woven to create a final report that includes both, or can be tangled to get just the calculations without the extra verbiage. This helps with transparency and reproducibility in research.)

As I began typing this, it struck me that perhaps Mathematica, Maple, and other such programs have a good enough notebook environment that people use them for Literate Reporting, without even knowing the term. Is that the case? Or do most people use their scientific-program-of-choice to crunch the numbers and make the graphs, then bring them by hand into a LaTeX-based program (or, shudder, Word) to create the final article?

Sweave is a tool that works with LaTeX and R, so that you can embed R commands into LaTeX as a .Stex document, and then Sweave it (through R) into a .tex document. The embedded R commands can have a variety of flags such that: 1) the R code is echoed in a tt-style, 2) the textual output of the R code is shown in a tt-style, 3) neither the R code nor its output is shown, or 4) both are shown. You can also include the graphical output of R plotting commands, and you can include the results of R commands inline, in the context's style. (You might wonder why you'd want option #3, above, but that allows you to do calculations and save them into variables for later display. Or you could use it simply to save your final calculations and graph-making in the same document as your report, for future reference.)

You can also use R libraries that output TeX, such as xtable, to include tables and other kinds of output from the R code as well.

Perhaps some of this is not of use in many printed journals, where space is at a premium and you would not want to show all your calculations. On the other hand, with many journals going online, it makes more sense. And with the option to not display many of your calculations directly, you can still include the calculations and text in a single document that is easier to share with other researchers, who can examine your analysis and try to reproduce it.

Not to mention that, at least in my schoolwork, it's easy to be a bit disorganized and end up with an R workspace with the proper stuff in it, but no real organization. So once I've copied/pasted into LaTeX, the chain of what I did is broken and I'd be hard-pressed to tweak calculations/graphs a couple of months later.

Compendium

What you describe is similar to the "compendium" idea proposed by Gentleman and Temple Lang:

http://www.bepress.com/bioconductor/paper2/

I find their argument convincing for the value of interactive documents, but I don't know that I agree that they should be the center of a research package. I think a smarter make system would be the way to go. If I want to correct a spelling mistake, I shouldn't need to rerun all my computations (which can be a difficulty with such documents). I think weaves combined with an intuitive make system would be the best of both worlds.

I'm currently working on a project documenting some existing research to be co-published with a paper. My opinion may change after I have more experience in the subject.

Smart Make

I'd like to hear you elaborate on what you'd like to see a smart make system do.

I'm guessing that a lot of hard work on Sweave might handle your example of fixing a typo in the text without rerunning all the R code. It seems to me that it could be done with several changes:

  1. It would have to read all chunks before executing any of them, which I imagine it does not currently do.
  2. It would cache incoming chunks and outgoing results.
  3. It would snoop on all possible R data reading mechanisms, to determine what variables are set from external sources, such as read.table. It would then save a date/time stamp of modification, or perhaps a hash or something, so it knows if the data had changed.
  4. It could then do something fairly simple (more complicated versions do more, but get more and more complicated to figure out all the possible loopholes), like:
    • If any external data files changed, throw away all caches and calculate everything from the start.
    • If any of the chunks had changed (as indicated by disagreement with the cached versions), throw away
      all caches and calculate everything from the start.
    • Otherwise, simply weave in the cached results of the chunks, with no other processing.
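The simple all-or-nothing version of the steps above can be sketched roughly as follows. This is only an illustration in Python, not anything Sweave actually does: the names fingerprint, weave_with_cache and run_all are made up for the example, and it assumes the list of external data files is handed in by the caller rather than discovered by snooping on R.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(chunks, data_files):
    """Hash every code chunk and every external data file into one digest."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk.encode("utf-8"))
        digest.update(b"\0")  # separator so chunk boundaries matter
    for path in sorted(data_files):
        digest.update(str(path).encode("utf-8"))
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def weave_with_cache(chunks, data_files, run_all, cache_path):
    """Return (results, reran), rerunning everything only if something changed.

    run_all is a hypothetical callable that executes all chunks in order
    and returns a list of their textual results.
    """
    cache_file = Path(cache_path)
    current = fingerprint(chunks, data_files)
    if cache_file.exists():
        cached = json.loads(cache_file.read_text())
        if cached["fingerprint"] == current:
            # No chunk and no data file changed: reuse the cached results.
            return cached["results"], False
    # Something changed (or there is no cache yet): throw it all away.
    results = run_all(chunks)
    cache_file.write_text(
        json.dumps({"fingerprint": current, "results": results})
    )
    return results, True
```

Because the prose around the chunks is not part of the fingerprint, fixing a typo in the text would reuse the cached results without running any code.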

Perhaps the master program could actually track data file changes, and tell Sweave. Having Sweave attempt to determine what files are accessed seems problematic to me, since there are so many ways in a program -- and R in particular -- to read in data.

I don't think there's any practical way to have only some R chunks re-computed while others are not, since later chunks can depend on the results of earlier ones. (And this seems necessary to have any kind of coherent calculation throughout the paper.)

Even if data file changes were not detected, making the rest of these changes would allow you to fix your non-R-typo without re-invoking R. I guess I just work with small datasets. ;-)

silver bullet?

I found this idea fascinating. Though I don't do research that involves that much computation, the issue of "notebooking" the little bioinformatics work I do is real. It takes some real effort to force myself to keep track of what I do. I find it less natural than keeping a notebook for benchwork, but maybe that is simply because I have done benchwork for so many years that it has become second nature.

The problem with Sweave for me is that it is tied to R apparently. So it only works when you use R. But for the little bioinformatic stuff I do, I keep using various tools, writing scripts here and there, manipulating data in Textmate + regex, etc... Also, what about different versions of the databases that are used in bioinformatics? That makes for irreproducible results!

Anyway, I would love to hear about a silver bullet for this. Maybe a separate app that can receive input from the command-line (or e.g. via AppleScript embedded in your script), so that whatever you run is logged into a centralized "notebook" document?
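As a very rough idea of what such a logger could look like, here is a minimal sketch in Python. It is not an existing tool: the function name run_and_log and the notebook file layout are invented for the example, and it only records the command and its exit status, not tool or database versions.

```python
import datetime
import shlex
import subprocess


def run_and_log(command, notebook_path):
    """Run a command and append what was run, and how it ended, to a notebook file.

    command is a list of arguments, e.g. ["sort", "hits.txt"].
    """
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(command, capture_output=True, text=True)
    with open(notebook_path, "a", encoding="utf-8") as notebook:
        # One entry per run: timestamp, the exact command line, exit status.
        notebook.write(f"[{started}] {shlex.join(command)}\n")
        notebook.write(f"  exit status: {result.returncode}\n")
    return result
```

Every script or one-off command would then go through run_and_log, so the notebook file accumulates a dated record of what was actually run.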

Noweb

This literate reporting idea seems to be similar to the literate programming concept put forward by Donald Knuth. There is a decent introduction on Wikipedia and a literate programming webpage, and a tool called noweb (that I don't have any experience with) seems to be useful for combining source code in any language with TeX to typeset explanations and mathematics. Probably worth a look.

Sweave uses noweb

In fact, Sweave does use noweb syntax (as well as an alternative, LaTeX-like syntax). It may be that noweb can be the foundation for further work, or it might be something like how Apple has implemented bundles in MacOS X (i.e. folders of files).
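For anyone who hasn't seen the noweb syntax: a code chunk starts with a line like <<name>>= and ends with a line containing just @, and everything else is documentation. Here is a small sketch, in Python, of how the "tangle" step could pull the code out of such a file. It is a simplified illustration (the function names are mine, chunk options such as echo=FALSE are ignored, and noweb's chunk references are not handled), not how Sweave or noweb is actually implemented.

```python
import re

# A chunk header looks like "<<name, options>>=" on a line of its own.
CHUNK_START = re.compile(r"^<<(.*)>>=\s*$")


def split_noweb(text):
    """Split noweb-style source into ("doc", body) and ("code", body) parts."""
    parts = []
    kind, lines = "doc", []
    for line in text.splitlines():
        if kind == "doc" and CHUNK_START.match(line):
            parts.append((kind, lines))
            kind, lines = "code", []
        elif kind == "code" and (line == "@" or line.startswith("@ ")):
            parts.append((kind, lines))
            kind, lines = "doc", []
        else:
            lines.append(line)
    parts.append((kind, lines))
    # Drop parts that contain no lines at all.
    return [(k, "\n".join(ls)) for k, ls in parts if ls]


def tangle(text):
    """Return only the code, dropping the prose."""
    return "\n".join(
        body for kind, body in split_noweb(text) if kind == "code"
    )
```

A "weave" step would walk the same list of parts, copying the documentation through and replacing each code part with its typeset code and/or output.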

Literate programming has never really caught on, and I'd guess for a couple of reasons:

1. It used LaTeX, which is useful for books, publication standards, or text with a lot of formulas in it, but is pretty baroque in general.

2. The description of a program is really not equivalent to a scientific paper, which will be full of background information, proofs, descriptions of data, and other things that don't have clear correspondences to programming.

3. Having a more capable tool to document your code does not change the environmental factors that have historically led to poor code documentation. Especially if the tool requires learning another programming language in order to do it.

4. Program documentation issues also haven't historically been caused by separation of documents, and programming has had code organization tools (source code control, make, IDEs, etc.) for a long time.

But these problems aren't really applicable to scientific reporting, in my opinion, making Literate Reporting (a.k.a. Repeatable Experiments, etc ...) something that can happen. I mean, a lot of scientific reporting already takes place in LaTeX, and there is no standard for gathering data, analysis, and reporting in a package, much less tracking changes or providing mechanisms to share -- or even create and modify -- all of these files in groups.

Re: Literate Reporting, anyone?

Just in response to the reference to Mathematica: Yes indeed Mathematica is an ideal environment for this... and it is perhaps the most sophisticated one that I know of.

David Reiss
http://Scientificarts.com/worklife

Caching, ODFweave

There are at least two R packages that support caching objects for Sweave: cacheSweave and weaver.

ODFweave lets you do the same things as Sweave in a word processing document. (Support for spreadsheets and presentations is "experimental".) This makes the approach accessible to a much wider range of people, including MSWord users who have installed Sun's ODF plugin.

Nice

Excellent leads on cacheSweave and weaver! (Not sure I'd personally go with the ODF option.) Unfortunately, I can't find the RRPM project, which sounds like it is a lot like what we've described here.

Thanks!

Re: Smart Make

Mostly, I'm just looking for a make system that addresses some of the well-documented issues in the classic UNIX make utility (tabs vs. spaces, the difficulty of embedding a more modern programming language). There are make utilities for Python, Ruby, and Perl. They're all good. The compendium authors only seem to consider the older UNIX make utility in their discussion. If this were the only make system available I would agree with them, but I think there are better options.

The real point of a make system is explicit dependency information. If I need to create a .tex document, I can state which other documents (e.g. charts, tables, R source files) it depends on. If any of those documents change (heuristically determined by the time stamp on the file), the make system knows that it needs to rebuild any documents that depend on that file.

For example, I am creating a PDF that includes a graph that is built from an R source file. I start writing my .tex and when I want to build my PDF, I run some make command (say I'm doing it with Rake, it might look like "$ rake build:pdf"). The rake system knows that the build:pdf command depends on the chart.pdf file, which depends on the build:chart task, which depends on the chart.R file. If chart.pdf is newer than chart.R, we know that we don't need to rebuild it when creating the PDF. If chart.pdf is older than chart.R, we know that I've made a change to chart.R and that change needs to be propagated to chart.pdf and, in turn, the final PDF which depends on chart.pdf.
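To make the time-stamp logic concrete, here is a minimal sketch of it in Python rather than as a Rakefile. It is only an illustration of the idea, not how Rake or make is implemented: the functions needs_rebuild and build and the rules dictionary are invented for the example, and it assumes every source file named as a dependency already exists on disk.

```python
import os


def needs_rebuild(target, dependencies):
    """True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


def build(target, rules):
    """Bring target up to date, rebuilding its dependencies first.

    rules maps a target file to (dependencies, action), where action is a
    callable that regenerates the target. Files without a rule are treated
    as hand-edited sources. Returns the list of targets that were rebuilt.
    """
    if target not in rules:
        return []
    dependencies, action = rules[target]
    rebuilt = []
    for dep in dependencies:
        rebuilt += build(dep, rules)
    # Rebuild if a dependency was just regenerated or is newer on disk.
    if rebuilt or needs_rebuild(target, dependencies):
        action()
        rebuilt.append(target)
    return rebuilt
```

With one rule saying chart.pdf depends on chart.R, and another saying the final PDF depends on the .tex file and chart.pdf, editing only the .tex file would rerun the final step but leave chart.pdf alone, while editing chart.R would rebuild both.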

If I dump everything into a single file, there is no way to do this update propagation. I see references to some caching mechanisms, which may be a good solution too. A make system is a fairly general tool that can handle this workflow and others, which is what attracts me to it. As projects grow and change I think the make tool has the best chance of adapting with it, whereas a document weave is more constrained.

Again, I wish I had more practical reporting experience to draw upon. These are just my opinions and inclinations from my time as a programmer. There is a good chance my opinion will change as I transition to the world of academia.

Cheers,
-Mark