Before continuing with the performance tutorial (which I promise will start back up very soon), there is one important topic to discuss: timing routines. To understand how changes in your code affect performance, you need a metric of some sort (you all know this, you're MacResearchers!). There are a number of methods people use for timing routines, some more portable than others. In this tutorial I'm going to talk about Mach Absolute Time.
Why Mach?
Timing routines vary depending on the time domain you need to measure. In some cases a function such as clock() or getrusage(), plus some simple math, is sufficient (and portable). If the timing routine is going to be used outside of your own development environment, then portability is important. I don't use these. Why?
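For reference, the portable approach looks something like the sketch below, using clock() (note that clock() measures CPU time consumed by your process, not wall-clock time):

#include <stdio.h>
#include <time.h>

int main( void )
{
    clock_t start = clock();

    //... work you want to time ...

    clock_t stop = clock();

    //clock() counts in implementation-defined ticks;
    //CLOCKS_PER_SEC converts the difference to seconds
    printf( "CPU time: %f seconds\n", (double)(stop - start) / CLOCKS_PER_SEC );
    return 0;
}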
For me, typical issues when tuning code are:
1) I only need the timing routine for immediate testing.
2) I don’t like relying on multiple functions to cover different time domains. Their behavior may not be consistent.
3) Sometimes I need a very high resolution timer.
Welcome mach_absolute_time!
mach_absolute_time is a CPU/bus-dependent function that returns a value based on the number of "ticks" since the system started up. It's not well documented. But the lack of concrete documentation shouldn't be a deterrent to using it: its behavior is guaranteed on Mac OS X, and (by differencing two readings) it covers any time range the system clock covers, from nanoseconds up until the system halts. Would I use it in production code? Probably not. But for testing, it's a sure thing.
There are a couple of items we need to consider in using mach_absolute_time.
1) How do we get the current Mach Absolute Time?
2) How do we convert it to a meaningful number (since I said it’s CPU/Bus dependent)?
Getting the mach_absolute_time
This one is easy:
#include <stdint.h>
#include <mach/mach_time.h>

uint64_t start = mach_absolute_time();

//... work you want to time ...

uint64_t stop = mach_absolute_time();
Done. We typically take two values since we want to see the time elapsed.
Converting mach_absolute_time differences to seconds
This one is a bit more complex because we need to get the system timebase that mach_absolute_time is…well, based on. The code below is derived from code that Ian Ollmann posted some time back on one of the Apple mailing lists:
#include <stdint.h>
#include <mach/mach_time.h>

//Raw mach_absolute_times going in, difference in seconds out
double subtractTimes( uint64_t endTime, uint64_t startTime )
{
    uint64_t difference = endTime - startTime;
    static double conversion = 0.0;

    if( conversion == 0.0 )
    {
        mach_timebase_info_data_t info;
        kern_return_t err = mach_timebase_info( &info );

        //Convert the timebase into seconds
        if( err == 0 )
            conversion = 1e-9 * (double) info.numer / (double) info.denom;
    }

    return conversion * (double) difference;
}
The most important call here is the call to mach_timebase_info. We pass in a struct that it fills with the timebase values. Once we have the system timebase, we can generate a conversion factor. The ratio of the numerator (info.numer) to the denominator (info.denom) converts raw ticks into nanoseconds; here I've multiplied it by 1e-9 to get seconds (something my brain understands a little better). Finally, we take the difference of the two Mach Absolute Times and multiply by our conversion factor to get the true time difference.
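If you're curious what the timebase actually is on your machine, you can print the raw values directly (a minimal sketch; the numbers are CPU/bus dependent):

#include <stdio.h>
#include <mach/mach_time.h>

int main( void )
{
    mach_timebase_info_data_t info;

    //numer/denom is the tick-to-nanosecond conversion ratio
    if( mach_timebase_info( &info ) == 0 )
        printf( "timebase: %u/%u\n", info.numer, info.denom );

    return 0;
}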
Wow. Now you're probably thinking: how is this better than using clock? Seems like a lot of work. Well, it is, sort of. That's why it's in a function. We just pass our values in and get the answer back.
Example
Let’s do a quick example. Below is the entire code listing (including the mach function). You can compile it using: gcc mach.c -o mach
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <mach/mach_time.h>

//Raw mach_absolute_times going in, difference in seconds out
double subtractTimes( uint64_t endTime, uint64_t startTime )
{
    uint64_t difference = endTime - startTime;
    static double conversion = 0.0;

    if( conversion == 0.0 )
    {
        mach_timebase_info_data_t info;
        kern_return_t err = mach_timebase_info( &info );

        //Convert the timebase into seconds
        if( err == 0 )
            conversion = 1e-9 * (double) info.numer / (double) info.denom;
    }

    return conversion * (double) difference;
}

int main() {
    int i, j, count;
    uint64_t start, stop;
    double current = 0.0;
    double answer = 0.0;
    double elapsed = 0.0;

    int dim1 = 256;
    int dim2 = 256;
    int size = 4*dim1*dim2;

    //Allocate some memory and warm it up
    double *array = (double*)malloc(size*sizeof(double));
    for(i=0;i<size;i++) array[i] = (double)i;

    count = 5;
    for(i=0;i<count;i++){
        start = mach_absolute_time();

        //do some work
        for(j=0;j<size;j++){
            answer += sqrt(array[j]);
        }

        stop = mach_absolute_time();
        current = subtractTimes(stop,start);
        printf("Time for iteration: %1.12lf for answer: %1.12lf\n",current, answer);
        elapsed += current;
    }

    printf("\nAverage time in seconds = %1.12lf for answer: %1.12lf\n",elapsed/count,answer);

    free(array);
    return 0;
}
What are we doing here? For this example we have a reasonably sized array of doubles that we populate with some numbers, then sum the square roots of those numbers. For testing purposes we iterate over that calculation 5 times. After each iteration we print the time taken, and when the calculation is done we report the average. On my PowerMac G5 (2.5 GHz) this is what I get:
[bigmac:~/misc] macresearch% gcc mach.c -o mach
[bigmac:~/misc] macresearch% ./mach
Time for iteration: 0.006717496412 for answer: 89478229.125529855490
Time for iteration: 0.007274204955 for answer: 178956458.251062750816
Time for iteration: 0.006669191332 for answer: 268434687.376589745283
Time for iteration: 0.006953711252 for answer: 357912916.502135872841
Time for iteration: 0.007582157340 for answer: 447391145.627681851387

Average time in seconds = 0.007039352258 for answer: 447391145.627681851387
Note that I turned optimizations off in this case, because compilers have ways of optimizing away brain-dead loops like this. Again, this is just an example; if it were real code, reasonable optimizations would be on.
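If you do want to benchmark a loop like this with optimizations on, one common trick is to make the result observable so the compiler can't discard the work. A minimal sketch (the sink variable here is just an illustration):

#include <math.h>

volatile double sink;   //a volatile store cannot be optimized away

double sum_sqrt( const double *array, int size )
{
    double answer = 0.0;
    int j;

    for( j = 0; j < size; j++ )
        answer += sqrt( array[j] );

    sink = answer;   //the result is now observable, so the loop must run
    return answer;
}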
Ok… So a couple of points about the example.
First, the dimensions I used were set larger than my caches. I did this on purpose, since typically we need to be aware of behavior in cases where data falls out of cache (as it will here, on my system at least; on a Mac Pro it won't, necessarily). We'll talk more about cache issues in the future. Of course, this is a contrived example, but it's something to always think about (if you don't know your machine's cache sizes, see the sketch just below).
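You can ask the kernel for the cache sizes (a minimal sketch; hw.l2cachesize is the key on my systems, but the available keys and value sizes can vary by model):

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main( void )
{
    int64_t l2 = 0;
    size_t len = sizeof( l2 );

    //Ask for the L2 cache size in bytes
    if( sysctlbyname( "hw.l2cachesize", &l2, &len, NULL, 0 ) == 0 )
        printf( "L2 cache: %lld bytes\n", (long long) l2 );

    return 0;
}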
Second, you’ll notice I put in a comment after the memory allocation about warming up the memory. What do I mean?
This is something that typically isn’t a concern in the real world, as memory is often already touched by the time you need it. But when putting together smaller test cases to measure function performance it can be a real issue that affects your measurements.
When dynamic memory is allocated, the first time it is accessed the memory manager needs to zero it out. (On OS X, regardless of how you allocate dynamically, whether via malloc, calloc, etc., all memory is zeroed before being handed to the user.***) The memory is zeroed as a security precaution: you don't want to hand someone memory that still contains sensitive information, such as decryption keys.
***I stand corrected on this statement; see Ian Ollmann's note below regarding zeroing behavior with malloc. As he states, if you always need zeroed memory, use calloc instead of malloc.***
That zeroing process incurs a penalty (and is flagged by the system as a zero-fill page fault). So, to make our timings a bit more accurate, we touch the memory once before we start using it, ensuring we don't take the zero-fill page-fault hit inside the timed region.
Let’s test that. Change the line:
for(i=0;i<size;i++) array[i] = (double)i;
to:
//for(i=0;i<size;i++) array[i] = (double)i;
And rerun the test:
[bigmac:~/misc] macresearch% ./mach
Time for iteration: 0.009478866798 for answer: 0.000000000000
Time for iteration: 0.004756880234 for answer: 0.000000000000
Time for iteration: 0.004927868215 for answer: 0.000000000000
Time for iteration: 0.005227029674 for answer: 0.000000000000
Time for iteration: 0.004891864428 for answer: 0.000000000000

Average time in seconds = 0.005856501870 for answer: 0.000000000000
Note that the first iteration through the loop takes about twice as long as the subsequent ones. Also note that, conveniently, our answers are all zero! Again, the memory has been zeroed out; if we were handed memory from the heap as-is, we'd have nonsensical numbers for the answers. The fact that the later iterations also got faster relative to the earlier run is due to how the sqrt function is implemented, and is for the moment beyond the scope of this tutorial (we'll get to that kind of stuff later).
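If you'd rather observe those page faults directly than infer them from timings, getrusage can report fault counts around the first touch (a minimal sketch; my understanding is that zero-fill faults show up in the minor-fault count):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>

int main( void )
{
    struct rusage before, after;
    int i, size = 4 * 256 * 256;
    double *array = (double*) malloc( size * sizeof(double) );

    getrusage( RUSAGE_SELF, &before );

    //First touch: each new page should fault in here
    for( i = 0; i < size; i++ ) array[i] = (double) i;

    getrusage( RUSAGE_SELF, &after );

    printf( "minor faults during first touch: %ld\n",
            after.ru_minflt - before.ru_minflt );

    free( array );
    return 0;
}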
One final, but important, point. Don't rely on the zeroing behavior for all memory. In particular, a fixed-size local (stack) array is not zeroed, and it's very much possible the memory handed to you will be polluted, as in a case such as:
double array[3][3];
which, on my system, produces:
-1.99844 -1.29321e-231 -1.99844
-3.30953e-232 -5.31401e+303 0
1.79209e-313 3.3146e-314 0
So be careful!
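If you always need zeroed memory, the fixes are simple; a minimal sketch (calloc for the heap, explicit initialization for the stack):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main( void )
{
    //Heap: calloc guarantees zeroed memory
    double *heap = (double*) calloc( 9, sizeof(double) );

    //Stack: automatic storage is not zeroed, so clear it yourself
    double stack[3][3];
    memset( stack, 0, sizeof(stack) );

    printf( "%g %g\n", heap[0], stack[0][0] );  //both print 0

    free( heap );
    return 0;
}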