C++ Gymnastics
Posted in Computing 1 year, 2 months ago
I have the opportunity to play with a highly parallel machine (think more than 40 processors/cores) and am starting to think of smart ways to make use of this power. For low-level libraries, you can play all kinds of tricks with C++ templates, a technique known as template metaprogramming.
I’m going to describe one that I thought of.
Modern processors have special registers for vector processing. For example, if you were asked to add two arrays of four single-precision floating point values element by element, you’d write something like this:
float a[4], b[4], c[4];
for (int i=0; i<4; i++)
    c[i] = a[i] + b[i];
The SSE addps instruction adds two sets of four packed single-precision floating point values in a single instruction. __m128 is the type that holds four packed floats.
__m128 a, b, c;
c = _mm_add_ps(a, b);  // intrinsic from <xmmintrin.h> that compiles to addps
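To make the snippet above concrete, here is a minimal sketch that loads four floats from ordinary arrays, adds them with SSE, and stores the result. The function name `add4` is made up for illustration, and I'm assuming an x86 target where `<xmmintrin.h>` is available. I use the unaligned load/store intrinsics so the arrays don't need 16-byte alignment.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86 target

// Hypothetical helper: c[i] = a[i] + b[i] for i = 0..3, done with one SSE add.
void add4(const float* a, const float* b, float* c) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    const __m128 va = _mm_loadu_ps(a);        // load four floats (no alignment required)
    const __m128 vb = _mm_loadu_ps(b);
    const __m128 vc = _mm_add_ps(va, vb);     // four additions at once
    _mm_storeu_ps(c, vc);                     // write the four sums back
}
```

Calling `add4(a, b, c)` on three `float[4]` arrays gives the same result as the scalar loop earlier.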
If you had longer vectors, the logical extension of this idea is to split the vector into groups of four, process those groups with SSE, and then handle the remaining elements (fewer than four) in the normal way. In C, you’d write something like this:
const int LENN = LEN / 4;
const int LENR = LEN % 4;
__m128 a[LENN], b[LENN], c[LENN];
float ar[LENR], br[LENR], cr[LENR];
for (int i=0; i<LENN; i++)
    c[i] = _mm_add_ps(a[i], b[i]);  // four floats per iteration
for (int i=0; i<LENR; i++)
    cr[i] = ar[i] + br[i];          // leftover elements, one at a time
The problem in C is that when the length is only known at run time, the second loop is still compiled in and its condition is still checked, even if the vector length happens to be a multiple of four. This isn’t optimal.
Using template metaprogramming, you’d create a template that takes the length of the vector as a template argument. LENN and LENR then become compile-time constants, and a decent compiler will be able to optimize the second loop away when LENR is zero.
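Here is a rough sketch of what such a template could look like. The names `Vec` and `add` are my own inventions for this example, and I store plain floats with unaligned loads rather than arrays of __m128, which is just one possible layout. It assumes C++11 (for `constexpr`) and an x86 target.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86 target

// Hypothetical fixed-length vector: the length N is a template argument.
template <int N>
struct Vec {
    static_assert(N > 0, "vector length must be positive");
    float data[N];

    float& operator[](int i) {
        assert(i >= 0 && i < N);  // debug-mode bounds check
        return data[i];
    }
    const float& operator[](int i) const {
        assert(i >= 0 && i < N);
        return data[i];
    }
};

// Element-wise sum of two vectors of the same compile-time length.
template <int N>
Vec<N> add(const Vec<N>& a, const Vec<N>& b) {
    constexpr int LENN = N / 4;  // number of full groups of four
    constexpr int LENR = N % 4;  // leftover elements, known at compile time
    Vec<N> c;
    for (int i = 0; i < LENN; i++) {
        const __m128 va = _mm_loadu_ps(a.data + 4 * i);
        const __m128 vb = _mm_loadu_ps(b.data + 4 * i);
        _mm_storeu_ps(c.data + 4 * i, _mm_add_ps(va, vb));
    }
    // The trip count is a constant; when LENR == 0 the compiler is free
    // to drop this loop from the generated code entirely.
    for (int i = 0; i < LENR; i++) {
        const int k = 4 * LENN + i;
        c.data[k] = a.data[k] + b.data[k];
    }
    return c;
}
```

With `Vec<8>` the remainder loop has zero iterations by construction, while `Vec<6>` runs one SSE add followed by two scalar adds. Whether the empty loop really disappears depends on your compiler and optimization flags, so check the generated assembly if it matters.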
The second advantage is that each instantiation of the template becomes a new type, which gets you compile-time type safety. Adding vectors of different lengths becomes a compile error (a mistake that happens all the time with void* tricks in vanilla C).
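A small sketch of the type-safety point, again with made-up names (`Vec`, `add`) and a plain scalar loop to keep it short. The `static_assert` lines check at compile time that different lengths really are different types, and the commented-out call shows the kind of mistake the compiler rejects.

```cpp
#include <type_traits>

// Hypothetical fixed-length vector: each N yields a distinct type.
template <int N>
struct Vec {
    float data[N];
};

// Only accepts two vectors with the same N.
template <int N>
Vec<N> add(const Vec<N>& a, const Vec<N>& b) {
    Vec<N> c;
    for (int i = 0; i < N; i++)
        c.data[i] = a.data[i] + b.data[i];
    return c;
}

// Vec<4> and Vec<8> are unrelated types as far as the compiler is concerned.
static_assert(!std::is_same<Vec<4>, Vec<8>>::value,
              "different lengths are different types");

// Adding two Vec<4> values yields a Vec<4>.
static_assert(std::is_same<decltype(add(Vec<4>{}, Vec<4>{})), Vec<4>>::value,
              "sum has the same length as its operands");

// Vec<4> x{};
// Vec<8> y{};
// add(x, y);  // does not compile: N cannot be deduced as both 4 and 8
```

Uncommenting the last three lines inside a function should give a template deduction error rather than a silent out-of-bounds read at run time.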
I hope to share more tricks in the future, but that’s all for now.



