One of the great things about developing for the Mac or iPhone is the suite of performance and debugging tools that Apple provides: GDB, Shark, Instruments, etc. However, there’s one tool which has received a lot less attention than it should, potentially because of its perceived complexity. That would be the DTrace framework, which lets you probe the inner workings of your applications or your overall system. As I’ll show you here through a series of practical examples, it’s a lot easier to work with than you might think.

DTrace is a framework whereby probes placed throughout the operating system and within applications can be observed and custom actions can be taken when these probes are triggered. What this means is that by using DTrace, you can examine the command flow of any application while it’s running on the system, or even look into the kernel itself.

The D scripting language used to control what happens when probes fire can sift through the data returned and perform complex analysis to express results in almost any format you desire. Custom DTrace scripts can let you focus in on the particular aspect of an application you’re testing and obtain clear results in minutes.

Why use DTrace?

DTrace was introduced by Sun Microsystems in Solaris 10, and Sun later released its code as part of OpenSolaris. This is of interest to Mac developers because Apple integrated DTrace into Leopard and added a provider for tracking Objective-C method calls. The command-line tools for working with DTrace are distributed with every Leopard (and soon, Snow Leopard) operating system installation.

Why would you want to use DTrace instead of the other developer tools that Apple provides? Many of those who work with DTrace are fond of saying that it lets you ask specific questions about your system or applications and easily arrive at answers. Simple scripts, such as the examples I provide here, can be quickly written to resolve slowdowns or bugs related to a particular aspect of your application.

As mentioned, the DTrace tools are present on every Leopard installation and don’t even need the Xcode suite to be installed. This means that you can easily test applications in the field, while they’re running on your end-user’s computer. The application doesn’t even need to be restarted, because DTrace scripts can latch on to running processes.

Finally, DTrace integrates well with the other developer tools and can provide a supplemental perspective on a problem. In fact, Instruments, one of the core items in a modern Mac / iPhone developer’s toolkit, derives a good deal of its functionality from DTrace. In the second half of this article I’ll even show you how to use DTrace scripts within Instruments to extend its core capabilities.

DTrace scripts

It always helps to start with a practical example of what you can do with a technology, so let’s take a look at a very simple DTrace script. Save the following script to disk with a name like dtracedrawrecttiming.d:


#pragma D option quiet

objc$target::-drawRect?:entry
{
	start[probemod] = timestamp;	
}

objc$target::-drawRect?:return
{
	printf("%30s %10s Execution time: %u us\n", probemod, probefunc, (timestamp - start[probemod]) / 1000);
}
	

This script times the execution of the -drawRect: method call for every distinct class within a running application, giving us some indication about the drawing performance of that application.

In order to run this script, we’ll first need to attach it to the process ID of an application. Open the Terminal application and run a ps -A command to list running processes. Pick a Cocoa application and make a note of its process ID (on the left-hand side of the listing).

While at the command line, type


sudo dtrace -s dtracedrawrecttiming.d -p [pid]
	

where [pid] stands for the process ID you obtained in the previous step. The dtrace command-line utility requires root access to run your script, due to the low level at which probes operate, so sudo is used to temporarily grant that access. The -s option tells DTrace to run from a script file. While you can provide the full text of your script at the command line, I prefer to do all my work within script files.

If you instead wanted DTrace to start your application and automatically attach the script to it, so that you could even log startup-related events, you could use a command like:


sudo dtrace -s dtracedrawrecttiming.d -c [Path to executable]
	

Once you execute one of these commands, DTrace will process your script for a little bit and then start responding to probes. Because we’ve set this one up to listen for -drawRect: calls, go find the Cocoa application you attached it to and play with its interface. If you watch the terminal, you should start seeing output like the following:


       NSSplitView -drawRect: Execution time: 68 us
            NSView -drawRect: Execution time: 10 us
         NSControl -drawRect: Execution time: 129 us
      NSScrollView -drawRect: Execution time: 98 us
       NSTableView -drawRect: Execution time: 460 us
 NSTableHeaderView -drawRect: Execution time: 143 us
...
		

This script is tracking how long, in microseconds, it takes to run -drawRect: for each class every single time that method is called. To kill the script, go to the terminal and hit Ctrl-C.

For a Cocoa programmer, information like this is incredibly valuable, yet we were able to gather it using just nine lines of code. To understand how this script works, let’s step through it and pick apart each line.

	
#pragma D option quiet

The D language has a very C-like syntax. Part of that syntax includes preprocessor directives. The one on the first line of this script simply sets the quiet DTrace option, which suppresses the normal console logging you get from scripts like this. I added this so that we would only see the specific results we were printing out and not any other information that DTrace normally logs.

	
objc$target::-drawRect?:entry
	

This line shows off one of the key elements of the D language, called a probe descriptor. A probe descriptor determines when actions will be taken in response to things happening within the target application. It acts as a filter for all of the probes in your application or system, letting you only respond to certain events or classes of events.

A probe descriptor is of the form


provider:module:function:name	
	

A provider is a source for specific types of probes into the system. In the example above, we’re using the Objective-C provider objc, which requires a specific process ID to latch onto, thus the $target macro variable. $target will be replaced by the process ID of the application we latch on to when we run the script.

In the case of Objective-C, the module is the name of the class we’re filtering the probes by. We left this blank in our example, which DTrace simply takes to mean that we don’t care what class the probes come from. As I’ll demonstrate later, you can use wildcards in any of the fields within a probe descriptor to do some pretty interesting filtering.
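For example, here is a minimal sketch of a clause that uses a wildcard in the module field (the My prefix is purely hypothetical; substitute your own class prefix). It counts entries into any method of any class whose name starts with My, and the tally prints automatically when the script exits:


/* Count method entries for any class whose name starts with "My". */
objc$target:My*::entry
{
	/* probemod holds the class name; count() tallies probe firings per class. */
	@callsbyclass[probemod] = count();
}
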

Again, in the case of Objective-C the function maps to a method within a class that we’re filtering on. Because DTrace uses colons within the probe descriptor format, Objective-C methods need to have any colons replaced with question marks, thus the -drawRect? used above.

The last entry in the list is the name of the probe. In our example, we use both entry and return. These are probes that fire at the start and end of a method call, respectively.

	
{
	start[probemod] = timestamp;	
}

After the probe descriptor comes the action block, or what you’d like to happen when that probe is triggered. Again, the D language follows a C-like syntax.

In this portion of the script, we are determining the current timestamp (in nanoseconds) using the built-in variable timestamp and storing it for later comparison. We are storing that timestamp within an associative array called start. This array is using the probemod variable as the key for storage within the array. probemod is another built-in variable, this one representing the string description of the class running the -drawRect: method. By using the name of the class as the key, we can keep separate timings for -drawRect: calls in different classes.

Note that we never defined the start variable, as this is not necessary in the D language.

Skipping the second probe descriptor, we come to

	
{
	printf("%30s %10s Execution time: %u us\n", probemod, probefunc, (timestamp - start[probemod]) / 1000);
}

When the -drawRect: method has finished executing, we want to print out how long it took to run. For this, we use a standard printf statement that many C programmers will find quite familiar. To identify the class and method, we print out the probemod and probefunc variables, respectively. The duration is calculated by subtracting the stored timestamp from the current one, then dividing by 1000 to present the result in microseconds rather than nanoseconds.

While at first a DTrace script may look pretty cryptic, it actually has a rather simple structure. Even this basic example lets you probe behavior that would be difficult to observe using the standard development tools, doesn’t require any custom logging code to be inserted within the application, and can be run on any Leopard system.

Data analysis using aggregates

DTrace has some powerful facilities for collecting and reporting data, so that you can avoid digging through reams of results. In particular, the D language has a variable type called an aggregate, which makes it easy to get counts, rolling averages, sums, or other values, and present them in interesting ways.
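As a quick taste of the syntax before a real example, here is a minimal sketch of the three aggregating functions used below, applied to the write() system calls made by the target process (the aggregation names are my own invention). Run it against a process ID with -p, as before, and the tallies print when you hit Ctrl-C:


/* A sketch: count the target's write() system calls, total the bytes
   requested, and keep a running average of the request size. */
syscall::write:entry
/pid == $target/
{
	@writes = count();
	@totalbyteswritten = sum(arg2);
	@averagewritesize = avg(arg2);
}
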

Again, the best way to illustrate this is by example. While working on the Core Plot open source project, I ran across some slowdowns in redrawing scaled graphs. Now, I could use Shark or Instruments to dig down to which methods were the hotspots in the code, but sorting through all the data provided by those tools might take a little time. I wondered if you couldn’t just trace the total execution times and call counts for each method in the Core Plot framework, then sort those to get a quick readout of the worst offenders.

The following DTrace script keeps track of that data for every method called from a class with a CP prefix (the one used by Core Plot). When you’re done running it against a process, hitting Ctrl-C will stop the logging and produce a nicely formatted report that sorts the methods by the total time the application spent within them, in descending order.

	
#pragma D option quiet
#pragma D option aggsortrev

dtrace:::BEGIN
{
	printf("Sampling Core Plot methods ... Hit Ctrl-C to end.\n");
	starttime = timestamp;		
}

objc$target:CP*::entry
{
	starttimeformethod[probemod,probefunc] = timestamp;
	methodhasenteredatleastonce[probemod,probefunc] = 1;
}

objc$target:CP*::return
/methodhasenteredatleastonce[probemod,probefunc] == 1/
{
	this->executiontime = (timestamp - starttimeformethod[probemod,probefunc]) / 1000;
	@overallexecutions[probemod,probefunc] = count();
	@overallexecutiontime[probemod,probefunc] = sum(this->executiontime);
	@averageexecutiontime[probemod,probefunc] = avg(this->executiontime);
}

dtrace:::END 
{
	milliseconds = (timestamp - starttime) / 1000000;
	normalize(@overallexecutiontime, 1000);
	printf("Ran for %u ms\n", milliseconds);
	printf("%30s %30s %20s %20s %20s\n", "Class", "Method", "Total CPU time (ms)",  "Executions", "Average CPU time (us)");
	printa("%30s %30s %20@u %20@u %20@u\n", @overallexecutiontime, @overallexecutions, @averageexecutiontime);
}

This produces results that look like the following:

	
        Class                         Method  Total CPU time (ms)           Executions Average CPU time (us)
      CPLayer                -drawInContext:                 6995                  352                19874
       CPPlot                -drawInContext:                 5312                   88                60374
CPScatterPlot      -renderAsVectorInContext:                 4332                   44                98455
CPXYPlotSpace        -viewPointForPlotPoint:                 3208                 4576                  701
       CPAxis               -layoutSublayers                 2050                   44                46595
CPXYPlotSpace -viewCoordinateForViewLength:linearPlotRange:plotCoordinateValue:                 1870                 9152
...

There are a number of new DTrace items introduced in this script, so let’s walk through them.

	
	dtrace:::BEGIN
	dtrace:::END 	

These are probe descriptors that let you perform actions at the very beginning and end of a DTrace script. In this case, we use them to display some instructions and mark the start time at the beginning, then print out the results at the end.

	
	objc$target:CP*::entry

This is the same style of probe descriptor as before, but notice that we’ve now set something in the module portion. The CP with a wildcard will cause any class starting with the prefix CP to be matched. All that’s required for you to apply this same script to your codebase is to change the CP* to match your particular namespace. You can even remove that class filter to see all Objective-C messages, but you might drop a few calls due to the quantity of data coming in, and the results could become cluttered.

		
	methodhasenteredatleastonce[probemod,probefunc] = 1;

Associative arrays can use more than one key for storing values. Because we want to split apart methods with the same name, but different classes (and vice versa), we key off of both the class and method. The particular value we are setting here is a flag to notify that this method has been entered at least once.

When attaching this script to a running process, you will sometimes catch methods in the middle of executing, where only the return probe is hit, and not the entry. We want to ignore those, so as to not skew our results.

		
	/methodhasenteredatleastonce[probemod,probefunc] == 1/

This line is how we prevent DTrace from analyzing methods where we didn’t catch the entry probe. This is what’s called a predicate, and is an additional condition that must be true for the actions related to the fired probe to execute. If the flag we’re setting on entry of a method is missing for this method and class, then we don’t perform the following action block.
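Predicates can combine multiple conditions. As a hedged sketch (the 1 ms threshold is arbitrary), an extra clause like this one could sit alongside the script above and count only the calls that ran for longer than a millisecond:


objc$target:CP*::return
/methodhasenteredatleastonce[probemod,probefunc] == 1 && (timestamp - starttimeformethod[probemod,probefunc]) / 1000 > 1000/
{
	/* Only fires when we saw the entry probe and the call took over 1 ms. */
	@slowcalls[probemod, probefunc] = count();
}
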

		
	this->executiontime = (timestamp - starttimeformethod[probemod,probefunc]) / 1000;

In this statement, we’re setting a value to a local variable, rather than a global one, by using the this-> prefix. Local variables only exist for the duration of the action block, so they’re good for temporary stores, like the timestamp result we’re calculating here. You can also create thread-specific variables using self->, if you want a global variable that has a unique per-thread value.
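As a sketch of the self-> form (applied to the same CP classes), a thread-local variable could replace the associative array for the entry timestamp; because each thread gets its own copy, simultaneous calls on different threads can’t overwrite each other’s start times:


objc$target:CP*::entry
{
	/* Thread-local start time: each thread records its own value. */
	self->entrytime = timestamp;
}

objc$target:CP*::return
/self->entrytime/
{
	/* Accumulate time spent in CP* methods, broken down by thread ID (tid). */
	@timeperthread[tid] = sum((timestamp - self->entrytime) / 1000);
	self->entrytime = 0;
}
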

			
	@overallexecutions[probemod,probefunc] = count();
	@overallexecutiontime[probemod,probefunc] = sum(this->executiontime);
	@averageexecutiontime[probemod,probefunc] = avg(this->executiontime);

These statements introduce a set of DTrace aggregations. Aggregations are unique structures that let you do threadsafe accumulations of various types, and they take the form of variables preceded by @. To set values in an aggregation, you use an aggregation function, like the ones above. In order of appearance, they increment the count of an aggregation, add a value to an aggregating sum, and cause a value to be incorporated within the running average of an aggregation.

As mentioned, these aggregations are threadsafe, so you don’t need to worry about losing values when an aggregation is hit by multiple threads at the same time.

				
	normalize(@overallexecutiontime, 1000);

You can do post-processing on aggregations, such as with this function, which divides all values in the aggregation by the amount specified. Another function you could use would be clear(), which will reset an aggregation’s values to 0.
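DTrace also offers trunc(), which discards all but the top entries of an aggregation. As a sketch (reusing the aggregation names from the script above), an alternative END block could report only the ten most expensive methods, in milliseconds:


dtrace:::END
{
	/* Convert the summed microsecond totals to milliseconds... */
	normalize(@overallexecutiontime, 1000);
	/* ...then keep only the ten methods with the largest totals. */
	trunc(@overallexecutiontime, 10);
	printa("%30s %30s %20@u\n", @overallexecutiontime);
}
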

				
	printa("%30s %30s %20@u %20@u %20@u\n", @overallexecutiontime, @overallexecutions, @averageexecutiontime);

This last line is how we print the results of the aggregations to the screen. It uses the same format as the more familiar printf command, with a few tweaks. The arguments supplied to the format string follow the order of fields within your aggregations. The first two are the keys used in the aggregation (class and method), followed by the integer values for each separate aggregation (with an extra @ sign for those elements in the format string). This one statement groups the values from the aggregations by their keys, then sorts them (by default on the first aggregation in the list, though this can be changed) and displays them according to the format string.

The output of the aggregation values was tweaked a little by the following line:

	
	#pragma D option aggsortrev

By default, aggregations will print in order from the smallest item in the sorted field to the largest, but this DTrace option reverses that order. This makes the report a little clearer, in my opinion.

This example script presented me with extremely valuable information, filtered and sorted so that I could get a clear answer as to what might be slowing down the resizing of a Core Plot graph. As expected, the fundamental graphical layout routines account for a lot of the CPU cycles when resizing a graph. However, the -viewPointForPlotPoint: and -viewCoordinateForViewLength:linearPlotRange:plotCoordinateValue: methods clearly stick out in this list: these relatively simple methods are having a significant impact on performance, which makes them good candidates for optimization. Looking down the list, I can also see a few other methods that might warrant further examination.

Once again, the script code to gather this information is relatively simple and straightforward, once you understand the D language syntax.

Time profiling

Another great aspect of DTrace is how it handles multithreading, as mentioned in the discussion about aggregates. In addition to the threadsafe structures it uses, and thread-specific variables you can create, DTrace can track conditions like which CPU core is executing the part of a program that fired a probe.

A multithreading-related problem is what caused me to start tinkering with DTrace in the first place. At our company, we have a robotic system with a high-end FireWire CCD camera attached to one arm. I recently rewrote our video processing code to be based around NSOperation and NSOperationQueue, instead of the manually managed threads we had been using. After I finished this conversion, our software ended up being much slower than I had expected. Both Shark and Instruments hinted at a potential cause, but neither provided a clear answer.

In response to this, I created the following DTrace script, which shows off a few more features for profiling an application:

	
#pragma D option quiet

dtrace:::BEGIN
{
	printf("Sampling... Hit Ctrl-C to end.\n");
	start = timestamp;		
}

objc$target::-main:entry
{
	@operationspercpu[cpu] = count();	
	@totaloperationexecutions[probemod] = count();

	@multicoredistributionofalloperations = lquantize((cpu + 1), 1, 16, 1);
	@multicoredistributionofoperations[probemod] = lquantize((cpu + 1), 1, 16, 1);

	starttimeformethod[probemod] = timestamp;	
	cpustarttimeformethod[probemod] = vtimestamp;	
	methodhasenteredatleastonce[probemod] = 1;
}

objc$target::-main:return
/methodhasenteredatleastonce[probemod] == 1/
{
	@overallexecutions[probemod] = count();
	@averageexecutiontime[probemod] = avg((timestamp - starttimeformethod[probemod]) / 1000);
	@averagecpuexecutiontime[probemod] = avg((vtimestamp - cpustarttimeformethod[probemod]) / 1000);
}

tick-1s
{
	printa("%30s %10@u operations / s\n", @totaloperationexecutions);

	clear(@totaloperationexecutions);
}

dtrace:::END 
{
	seconds = (timestamp - start) / 1000000000;

	normalize(@averageexecutiontime, 1000);
	normalize(@averagecpuexecutiontime, 1000);

	printf("Ran for %u seconds\n", seconds);
	printf("%30s %20s %20s %20s\n", "Operation", "Executions","Average time (ms)", "CPU average time (ms)");
	printa("%30s %20@u %20@u %20@u\n", @overallexecutions, @averageexecutiontime, @averagecpuexecutiontime);

	printf("\nCPU core distribution of all operations:\n");
	printa(@multicoredistributionofalloperations);

	printf("CPU core distribution, by operation type:\n");
	printa(@multicoredistributionofoperations);
}

The script tracks entry and return for the -main method in all classes. This method is where the heavy lifting of an NSOperation takes place, so the script effectively monitors the execution of every NSOperation subclass created within the application.

While the script is running, it produces output every second that breaks down how many of each operation type ran in the last second. That logged information looks like this:


SPCameraFrameEncodingOperation         30 operations / s
    SPCameraFrameGrabOperation         30 operations / s

This regular output is produced using a new type of probe descriptor:


	tick-1s

This creates a probe that fires at an interval of 1 second. The duration of this interval can be changed by replacing the 1s value with time values in a number of units, such as ms or ns, or even frequency in units of hz.

Within the following action block, we simply log the aggregated execution counts of each operation type, then reset the aggregation for the start of the next interval.
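For example, a sketch of the same periodic report emitted ten times per second (tick-10hz would be an equivalent way of expressing this frequency):


tick-100ms
{
	/* Report and reset the per-operation counts every 100 ms instead of every second. */
	printa("%30s %10@u operations in the last 100 ms\n", @totaloperationexecutions);
	clear(@totaloperationexecutions);
}
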

Even more interesting is the output when the script finishes:


                     Operation           Executions    Average time (ms) CPU average time (ms)
SPCameraFrameEncodingOperation                  315                   24                   23
    SPCameraFrameGrabOperation                  319                   32                    6

CPU core distribution of all operations

           value  ------------- Distribution ------------- count    
             < 1 |                                         0        
               1 |@@@@@@@@@@@@@@@@@@@@                     322      
               2 |@@@@@@@@@@@@@@@@@@@@                     314      
               3 |                                         0        

CPU core distribution, by operation type

  SPCameraFrameGrabOperation                        
           value  ------------- Distribution ------------- count    
             < 1 |                                         0        
               1 |@@@@@@@@@@@@@@@@@@@@@@@@                 188      
               2 |@@@@@@@@@@@@@@@@@                        132      
               3 |                                         0        

  SPCameraFrameEncodingOperation                    
           value  ------------- Distribution ------------- count    
             < 1 |                                         0        
               1 |@@@@@@@@@@@@@@@@@                        134      
               2 |@@@@@@@@@@@@@@@@@@@@@@@                  182      
               3 |                                         0        
	

At the top, we see a standard breakdown of the total number of executions recorded for each type of operation, along with the average time each operation took to run and the average active CPU time it required. The difference between the two timings is that the first measures elapsed wall-clock time, including any time the operation spent waiting, while the second (based on vtimestamp) counts only the time the thread was actively executing on a CPU.

The application was capturing a video frame from a camera with one type of operation (SPCameraFrameGrabOperation), converting its colorspace, and handing it off to the UI to be displayed. While this was going on, MPEG4 video was being recorded, so a second type of operation (SPCameraFrameEncodingOperation) was encoding each video frame for output to that video file. As you can see, the frame grabbing operation spends most of its time idle, waiting for a new video frame from the camera for it to process. The encoding operation, on the other hand, works hard throughout its lifetime to compress its video frame.

Below the performance breakdown of the operations themselves, we see an interesting series of charts that present the distribution of operations between the two CPU cores of this machine. As you can see, the NSOperationQueues do a nice job of evening out the workload on this system.

Those charts are produced using a new aggregate function shown in this line of code:


@multicoredistributionofalloperations = lquantize((cpu + 1), 1, 16, 1);

DTrace has two functions, quantize and lquantize, that can generate histograms like the ones shown by placing data within bins. In this line we use lquantize, which takes as arguments the value to bin, the lowest value to tabulate, the highest value to tabulate, and the size of each bin in that range. The built-in cpu variable tells us which CPU core was running the current operation; we add 1 to it because cores are numbered starting at 0, and lquantize lumps any value below the lower bound (1 here) into a catch-all "< 1" bucket.

These quantized aggregates automatically produce nicely formatted histograms when they are output with printa.
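The related quantize() function needs no bounds at all; it bins values into power-of-two buckets. As a sketch (reusing the -main timing idea from the script above), this would chart the distribution of operation run times, in microseconds, per operation class:


objc$target::-main:entry
{
	self->mainstart = timestamp;
}

objc$target::-main:return
/self->mainstart/
{
	/* Power-of-two histogram of how long each -main invocation ran, in us. */
	@runtimes[probemod] = quantize((timestamp - self->mainstart) / 1000);
	self->mainstart = 0;
}
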

When I ran this against my application, I found that the camera frame processing operations were taking 35 ms to run, which means that they weren’t able to handle the 30 frames per second of video that was being fed in (1 s / 30 = 33 ms), yet they really were only using 6 ms of actual CPU time to execute. It turned out that in my haste to port the processing routines across, I had copied across code that induced a 29 ms sleep with every frame. I had done this on the manually threaded version to work around some expensive locks, which were no longer needed with the NSOperationQueue. This one slow step was jamming up my whole video processing pipeline.

With this script, I was able to identify and fix my stupid mistake in half an hour, something I couldn’t do in days of examination using Shark and Instruments. As a side benefit, I discovered that one of our camera types was capturing at 50 frames per second instead of 30, pointing out a settings bug in the application.

To be continued

In the second half of this article, I’ll explain how to insert your own custom DTrace static probes within an application, and show you how easy it is to extend the capabilities of Instruments using DTrace.