C++ Gymnastics

I have the opportunity to play with a highly parallel machine (think >40 processors/cores), and I’ve started thinking about smart ways to make use of this power. For low-level libraries, you can play all kinds of tricks with C++ templates, a technique known as template metaprogramming. (Templates are expanded by the compiler proper, not by the preprocessor.)

I’m going to describe one that I thought of.

Modern processors have special registers for vector processing. For example, if you were asked to add two arrays of four single-precision floating-point values element by element, you’d write something like this:

float a[4], b[4], c[4];
for (int i=0; i<4; i++)
    c[i] = a[i] + b[i];

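The SSE addps instruction adds four packed single-precision floating-point values simultaneously. In C and C++ it is exposed as the _mm_add_ps intrinsic from <xmmintrin.h>, and __m128 is the type that holds four packed floats.

#include <xmmintrin.h>

__m128 a, b, c;
c = _mm_add_ps(a, b);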
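
To make the snippet above concrete, here’s a minimal sketch that loads four floats from ordinary arrays, adds them with one packed operation, and stores the result. It assumes an x86 target with SSE enabled (the default on x86-64). The helper name add4 is something I made up for illustration, and I use the unaligned load/store intrinsics so the arrays don’t need 16-byte alignment.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper: c[0..3] = a[0..3] + b[0..3] using one packed add.
void add4(const float* a, const float* b, float* c) {
    __m128 va = _mm_loadu_ps(a);     // unaligned load of a[0..3]
    __m128 vb = _mm_loadu_ps(b);     // unaligned load of b[0..3]
    __m128 vc = _mm_add_ps(va, vb);  // four additions at once (addps)
    _mm_storeu_ps(c, vc);            // unaligned store into c[0..3]
}
```

Calling add4(a, b, c) on three float[4] arrays gives the same result as the scalar loop earlier.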

If you have longer vectors, the logical extension of this idea is to process as many elements as possible in packed groups of four and then handle the remaining elements in the normal way. In C, you’d write something like this:

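#include <xmmintrin.h>

/* LEN is the vector length, assumed here to be at least 4 */
const int LENN = LEN / 4;  /* number of full groups of four */
const int LENR = LEN % 4;  /* leftover elements, 0 to 3 */
__m128 a[LENN], b[LENN], c[LENN];
float ar[3], br[3], cr[3];  /* sized for the worst case, since LENR can be 0 */

/* packed part: four additions per iteration */
for (int i=0; i<LENN; i++)
    c[i] = _mm_add_ps(a[i], b[i]);

/* remainder: one addition per iteration */
for (int i=0; i<LENR; i++)
    cr[i] = ar[i] + br[i];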

The problem in C is that when the length is only known at run time, the second loop is still compiled in and its loop test still runs, even if every vector your program uses has a length that is a multiple of four. This isn’t optimal.

Using template metaprogramming, you’d create a template that takes the length of the vector as a template argument. LENN and LENR then become compile-time constants, so whenever LENR is zero a decent compiler can optimize the second loop away entirely.
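
Here’s a rough sketch of what such a template could look like. The names Vec and add are my own placeholders, not an existing library. I’ve also taken one liberty compared with the C version: the data lives in a single flat float array and is loaded with unaligned intrinsics, rather than in separate __m128 and remainder arrays. As before, it assumes an x86 target with SSE.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical fixed-length vector; the length is part of the type.
template <std::size_t LEN>
struct Vec {
    static_assert(LEN > 0, "a Vec needs at least one element");
    static constexpr std::size_t LENN = LEN / 4;  // full groups of four
    static constexpr std::size_t LENR = LEN % 4;  // leftover elements
    float data[LEN];
};

template <std::size_t LEN>
Vec<LEN> add(const Vec<LEN>& a, const Vec<LEN>& b) {
    Vec<LEN> c = {};
    // Packed part: four additions per iteration.
    for (std::size_t i = 0; i < Vec<LEN>::LENN; i++) {
        __m128 va = _mm_loadu_ps(a.data + 4 * i);
        __m128 vb = _mm_loadu_ps(b.data + 4 * i);
        _mm_storeu_ps(c.data + 4 * i, _mm_add_ps(va, vb));
    }
    // Remainder: the bounds are compile-time constants, so when LENR == 0
    // the compiler can see that this loop never runs and drop it.
    for (std::size_t i = LEN - Vec<LEN>::LENR; i < LEN; i++)
        c.data[i] = a.data[i] + b.data[i];
    return c;
}
```

With this in place, add() on two Vec<8> values only needs the packed loop, while add() on two Vec<6> values runs one packed iteration and two scalar ones.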

The second advantage is that each instantiation of the template is a distinct type, which gives you compile-time type safety. Adding vectors of different lengths becomes a compile error (a mistake that happens all the time with void* tricks in vanilla C).
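
The type-safety claim can be checked by the compiler itself. Below is a small, scalar-only sketch; FixedVec, add and can_add are names I invented for this example. The can_add trait uses SFINAE to ask whether add(A, B) would compile, and the static_asserts confirm that same-length addition is accepted while mixed-length addition is rejected.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical fixed-length vector; FixedVec<4> and FixedVec<8> are unrelated types.
template <std::size_t LEN>
struct FixedVec {
    float data[LEN];
};

// Both arguments must deduce the same LEN, otherwise no overload matches.
template <std::size_t LEN>
FixedVec<LEN> add(const FixedVec<LEN>& a, const FixedVec<LEN>& b) {
    FixedVec<LEN> c = {};
    for (std::size_t i = 0; i < LEN; i++)
        c.data[i] = a.data[i] + b.data[i];
    return c;
}

// Compile-time probe: true only if the expression add(A, B) is well-formed.
template <class A, class B, class = void>
struct can_add : std::false_type {};

template <class A, class B>
struct can_add<A, B, decltype(void(add(std::declval<A>(), std::declval<B>())))>
    : std::true_type {};

static_assert(!std::is_same<FixedVec<4>, FixedVec<8>>::value,
              "different lengths are different types");
static_assert(can_add<FixedVec<4>, FixedVec<4>>::value,
              "same-length vectors can be added");
static_assert(!can_add<FixedVec<4>, FixedVec<8>>::value,
              "mixed-length addition does not compile");
```

In ordinary code you wouldn’t need the trait: writing add(v4, v8) directly simply fails to compile, because the compiler cannot deduce a single LEN for both arguments.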

I hope to share more tricks in the future, but that’s all for now.
