Code faster code.

Unrolling Loops With Duff’s Device

Loops are a daily part of programming - if not, you’re probably doing something terribly wrong, or incomprehensibly right. For the rest of us, loops are these rather innocent looking pieces of code that reduce the amount of typing we have to do. Oftentimes they make it possible for us to do a wide variety of things otherwise not possible or practical.

My biggest problem with tight loops in performance-critical areas is that every time the loop iterates, the CPU blocks until it can know whether to continue with the loop or not.

Timing NSCoding

It sure has been a little while. This time I decided to do some legwork on how fast NSCoding really is. For those unfamiliar with it, NSCoding is the tools that Apple provides in the Foundation framework for serializing Objective-C objects. While it isn’t nearly as slick as some tools, it does get the job done. Because you have to write your deserializer logic yourself, it also lends itself to migration-friendly deserializers.

So what I want to do is time the difference in speed between the two underlying backends for NSKeyedArchiver and NSKeyedUnarchiver. The first (and default) is an XML-based encoding system. This is used all over Cocoa. In fact, if you’ve ever created a XIB file in Interface Builder, you’ve used it. IB is actually a really nifty tool. It works by creating all these user interface objects and then serializing them to disk. They pop out of the archive without needing to run a lot of redundant code. It’s spiffy.

Measuring Q_rsqrt - Part II

I bit the bullet and spent the half hour or so to install MinGW onto my desktop PC, a Windows 7 Professional 64-bit Edition workstation running on an AMD Phenom II X4 955 Black Edition chip. Naturally I was quite excited to see the difference in floating point performance! Only, the Phenom II’s floating point outperformed Q_rsqrt about half the time! On the strange numbers which aren’t natural roots, even!

So I put on my thinking cap, and this is what I’ve come to (also after chatting with Keith Bauer, who also warns me to be careful of things I don’t know).

Measuring Q_rsqrt - Part I

Out of pure curiosity, I decided to measure the performance difference between Q_rsqrt and math.h’s sqrt call. Using my rdtsc code, it’s simple enough to concoct a test.

The results I found were that in the event that the number n is a natural root (for instance, sqrt(4.0f) == 2.0f) the math.h implementation is faster. For virtually all other numbers, Q_rsqrt is faster.

Why GNU Grep Is Fast

Thanks to John Reagan for linking this on the BerkTIPGlobal mailing list! The following is an excellent message from an ex-GREP maintainer about how it gets its speed. Enjoy!

The Peril of Autorelease

A while back, on 9 August 2010, a short conversation took place in the #iDevGames channel on Freenode:

9 August 2010, irc.freenode.net/#iDevGames
[16:31] <+mercy_____> #define killItWithFire(a) [(a) release]
[16:33] <+[Araelium]Seth> #define deathByUngaBunga(p) free(p);
[16:40] <+cmiller> #define deathByThousandPapercuts(a) NSAutoreleasePool *pool_#a = \
                    [[NSAutoreleasePool alloc] init]; [(a) autorelease]; [pool_#a release]

RDTSC for AT&T Syntax

I said I’d translate my little RDTSC magic to AT&T syntax so it’d be compatible with GAS assemblers. For those of us using things like GCC and LLVM.

#include <stdint.h>

uint64_t r;
uint64_t rdtsc() {
      : "=A" (r)
      : /* no inputs */ );
  return r;

For the record, I actually prefer the Intel-style of assembler. It’s more clean-looking to me.

Anyways, this is the GAS-compatible version of my RDTSC utility. Happy coding!

EDIT: In retrospect, I should note that this doesn’t involve as much with differences between the AT&T and Intel syntaxes as it does with retrieving the results, which is a difference between the MSVC inline assembler and the GCC inline assembler.

Accurate Instrumentation

You can do a lot with a code profiler. I love Instruments (yes, I’m a Mac head). But when you’re in the middle of some white-kuckles optimization, all the profilers in the world are just too coarse to accurately measure performance. To where do you turn for help?

Assuming you’re on an Intel chip (or AMD) you can use a handy little tool called RDTSC. RDTSC, simply put, will grab the current clock count from the CPU.