Streaming Instructions

Yesterday, Intel disclosed more information about its forthcoming processors, codenamed Penryn. Quoting Ars Technica:

Penryn’s back end boasts two major advances over its predecessor. First is a new radix-16 divider that offers a 2x performance improvement on division operations vs. Core 2 Duo. The fast divider also speeds up a range of operations that depend on the divider hardware, like the square root function. Penryn’s SQRT operation is 4x the speed of Core 2.

The other major back-end improvement is support for the SSE4 extensions, a group of 50 new vector instructions aimed at speeding up media and other data-parallel applications. SSE4 will be paired with a new “Super Shuffle Engine,” a full-width, single-pass, 128-bit shuffle unit. This will enable Penryn’s vector hardware to perform 128-bit shuffle operations (e.g. pack, unpack, packed shift) in a single clock cycle. The beefed up shuffle capabilities will help Penryn align incoming vector data in the SSE registers so that the execution hardware can go to work on it.

SSE4

What interests me are the 50 new streaming instructions. Taking a quick look at the list, I found dpps, the packed single-precision floating-point dot product. The document doesn’t state how many cycles this instruction takes, but in my opinion it is a huge win.

Mathematically, the dot product doesn’t fit the usual mold of a vector instruction: instead of working on each lane independently, it has to sum across the lanes of a single register. Before SSE3, the fastest way to do a dot product was a vector multiply (mulps) followed by shuffles and adds to sum the components pairwise. With the introduction of SSE3, you could use the horizontal add instruction haddps to cut down on the component-wise addition. This is still not much better.
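To make the comparison concrete, here is a rough sketch in C using compiler intrinsics rather than raw assembly. The function names (dot4_scalar, dot4_sse, dot4_sse3, dot4_sse41, dot4) are my own and purely illustrative. Each SIMD variant is guarded by the compiler's feature macro, so the file should still build, falling back to the scalar version, when the matching instruction set is not enabled.

```c
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE: mulps, shufps, addps */
#endif
#if defined(__SSE3__)
#include <pmmintrin.h>   /* SSE3: haddps */
#endif
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1: dpps */
#endif

/* Plain scalar reference. */
static inline float dot4_scalar(const float a[4], const float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

#if defined(__SSE__)
/* SSE: one mulps, then two shuffle+add steps to sum the four lanes. */
static inline float dot4_sse(const float a[4], const float b[4]) {
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); /* [p0 p1 p2 p3] */
    /* Swap neighbours and add: [p0+p1, p0+p1, p2+p3, p2+p3] */
    __m128 s = _mm_add_ps(p, _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1)));
    /* Swap halves and add: every lane now holds the full sum. */
    s = _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(s);
}
#endif

#if defined(__SSE3__)
/* SSE3: one mulps, then two haddps. */
static inline float dot4_sse3(const float a[4], const float b[4]) {
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    p = _mm_hadd_ps(p, p);   /* [p0+p1, p2+p3, p0+p1, p2+p3] */
    p = _mm_hadd_ps(p, p);   /* every lane holds the full sum */
    return _mm_cvtss_f32(p);
}
#endif

#if defined(__SSE4_1__)
/* SSE4.1: a single dpps. Mask 0xF1 = multiply all four lanes (high nibble),
   store the sum in lane 0 only (low nibble). */
static inline float dot4_sse41(const float a[4], const float b[4]) {
    return _mm_cvtss_f32(_mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1));
}
#endif

/* Pick the newest variant that was compiled in. */
static inline float dot4(const float a[4], const float b[4]) {
#if defined(__SSE4_1__)
    return dot4_sse41(a, b);
#elif defined(__SSE3__)
    return dot4_sse3(a, b);
#elif defined(__SSE__)
    return dot4_sse(a, b);
#else
    return dot4_scalar(a, b);
#endif
}
```

With GCC or Clang, passing -msse3 or -msse4.1 enables the corresponding path. Fewer instructions does not automatically mean fewer cycles, so it is worth measuring on real hardware before picking one.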

It’s extremely hard to write a physics engine without the use of dot products. I’m pretty sure a user will want to see the kinetic energy of the system, which is now only one instruction away.

T = m ( v \cdot v )

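(The trick is to scale the mass by a half ahead of time, so m above stands for half the physical mass and the usual factor of 1/2 costs nothing at run time.)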
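As a sketch of how that might look in a physics engine, the C snippet below stores the pre-halved mass next to a padded velocity and computes the energy with one dpps plus one multiply. The particle struct and the function names are hypothetical, made up for this illustration. Without SSE4.1 enabled at compile time it falls back to plain scalar code.

```c
#include <stddef.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1: dpps */
#endif

/* Hypothetical particle layout: velocity padded to four floats so it
   can be loaded into one SSE register. */
typedef struct {
    float v[4];        /* vx, vy, vz, padding (ignored) */
    float half_mass;   /* 0.5 * physical mass, computed once up front */
} particle;

/* T = half_mass * (v . v) for one particle. */
static inline float kinetic_energy(const particle *p) {
#if defined(__SSE4_1__)
    __m128 v = _mm_loadu_ps(p->v);
    /* Mask 0x71: multiply lanes 0..2 only (skip the padding lane),
       write the sum to lane 0. */
    float vv = _mm_cvtss_f32(_mm_dp_ps(v, v, 0x71));
#else
    float vv = p->v[0] * p->v[0] + p->v[1] * p->v[1] + p->v[2] * p->v[2];
#endif
    return p->half_mass * vv;
}

/* Sum of the kinetic energies of n particles. */
static inline float total_kinetic_energy(const particle *ps, size_t n) {
    float t = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        t += kinetic_energy(&ps[i]);
    }
    return t;
}
```

For example, a particle of physical mass 2 (so half_mass is 1) moving at (3, 4, 0) has v . v = 25 and therefore T = 25.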

To see which instruction sets your CPU supports on Linux, type the following at the command line and look at the flags line (SSE4.1 shows up there as sse4_1):

cat /proc/cpuinfo
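If you would rather check from inside a program, the same file can be parsed directly. The C sketch below is a minimal take on that, assuming an x86 Linux system where /proc/cpuinfo contains a line beginning with "flags". The helper names line_has_flag and cpu_has_flag are my own, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears in `line` as a whole whitespace-separated
   token, so that "sse4" does not match "sse4_1". */
int line_has_flag(const char *line, const char *flag) {
    size_t n = strlen(flag);
    if (n == 0) {
        return 0;
    }
    for (const char *p = strstr(line, flag); p != NULL; p = strstr(p + 1, flag)) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[n];
        int ends = (end == '\0' || end == ' ' || end == '\t' || end == '\n');
        if (starts && ends) {
            return 1;
        }
    }
    return 0;
}

/* Return 1 if the first "flags" line of /proc/cpuinfo lists `flag`,
   0 if it does not, and -1 if the file cannot be opened.
   Assumes the flags line fits in the buffer below. */
int cpu_has_flag(const char *flag) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL) {
        return -1;
    }
    char line[8192];
    int found = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0) {
            found = line_has_flag(line, flag);
            break;   /* every core reports the same flags */
        }
    }
    fclose(f);
    return found;
}
```

Calling cpu_has_flag("sse4_1") would then tell you whether dpps is available. Note that SSE3 is listed under the name pni rather than sse3.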
