Streaming Instructions
Yesterday, Intel disclosed more information about its forthcoming processors, codenamed Penryn. Quoting Ars Technica:
Penryn’s back end boasts two major advances over its predecessor. First is a new radix-16 divider that offers a 2x performance improvement on division operations vs. Core 2 Duo. The fast divider also speeds up a range of operations that depend on the divider hardware, like the square root function. Penryn’s SQRT operation is 4x the speed of Core 2.
The other major back-end improvement is support for the SSE4 extensions, a group of 50 new vector instructions aimed at speeding up media and other data-parallel applications. SSE4 will be paired with a new “Super Shuffle Engine,” a full-width, single-pass, 128-bit shuffle unit. This will enable Penryn’s vector hardware to perform 128-bit shuffle operations (e.g. pack, unpack, packed shift) in a single clock cycle. The beefed up shuffle capabilities will help Penryn align incoming vector data in the SSE registers so that the execution hardware can go to work on it.

What interests me most is the 50 new streaming instructions. Taking
a quick look at the list of new instructions, I found dpps,
the vectorized floating-point dot product. The document
doesn’t state the number of cycles this instruction requires, but
in my opinion, this is a huge win.
Mathematically, the dot product is not a natural fit for vector
instructions: it reduces two vectors to a single scalar, so the
products have to be summed across lanes rather than lane by lane.
The fastest way to do a dot product in the pre-SSE3 era
was a vector multiply, mulps, followed by adding the
components pairwise. With the introduction of SSE3, you could use
the horizontal add instruction, haddps, to shorten the
component-wise addition. This is still not much better.
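To make the comparison concrete, here is a rough sketch of the three approaches using compiler intrinsics. The function names are my own, the target attributes assume GCC or Clang on x86, and this says nothing about cycle counts, which the document does not give.

```c
#include <xmmintrin.h>  /* SSE:    _mm_mul_ps, _mm_shuffle_ps, _mm_add_ps */
#include <pmmintrin.h>  /* SSE3:   _mm_hadd_ps */
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* SSE only: mulps, then sum the four lanes with shuffles and adds. */
static float dot4_sse(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);  /* p0, p1, p2, p3 */
    /* Swap neighbours and add: p0+p1, p0+p1, p2+p3, p2+p3 */
    __m128 s = _mm_add_ps(p, _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1)));
    /* Swap halves and add: every lane now holds the full sum. */
    s = _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(s);
}

/* SSE3: mulps, then two haddps instead of the shuffle/add pairs. */
__attribute__((target("sse3")))
static float dot4_sse3(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);
    p = _mm_hadd_ps(p, p);  /* p0+p1, p2+p3, p0+p1, p2+p3 */
    p = _mm_hadd_ps(p, p);  /* full sum in every lane */
    return _mm_cvtss_f32(p);
}

/* SSE4.1: a single dpps. In the mask 0xF1, the high nibble selects all
   four lanes for the multiply and the low nibble writes the sum to lane 0. */
__attribute__((target("sse4.1")))
static float dot4_sse41(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}
```

All three return the same value for the same inputs; the difference is only in how many instructions the reduction takes.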
It’s extremely hard to write a physics engine without the use of dot products. I’m pretty sure a user will want to see the kinetic energy of the system, which is now only one instruction away.

(The trick is to scale the mass by a half.)
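As a sketch of what I mean, assuming a single particle whose velocity sits in the first three lanes of an SSE register: the kinetic energy is (m/2)(v · v), so once the velocity has been scaled by m/2, one dpps finishes the job. The function name and data layout here are hypothetical, and the target attribute assumes GCC or Clang.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Kinetic energy of one particle: (m/2) * (v . v).
   v holds (vx, vy, vz, unused); the fourth lane is ignored. */
__attribute__((target("sse4.1")))
static float kinetic_energy(float mass, __m128 v)
{
    /* Scale the velocity by half the mass. */
    __m128 half_mv = _mm_mul_ps(_mm_set1_ps(0.5f * mass), v);
    /* Mask 0x71: multiply lanes 0..2 only, write the sum to lane 0. */
    return _mm_cvtss_f32(_mm_dp_ps(half_mv, v, 0x71));
}
```

If an engine already stores the scaled velocity per particle, the multiply drops out and only the dpps remains.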
To see which instructions your CPU supports on Linux, type at the command line:
cat /proc/cpuinfo
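The same check can be done from a program by scanning the "flags" line of that file. Below is a minimal sketch; the helper names are mine, and it assumes the kernel reports SSE4.1 under the flag name sse4_1, which is how current Linux kernels list it.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated word in `line`. */
static int line_has_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = line;
    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p += n;  /* partial match (e.g. "sse3" inside "ssse3"): keep looking */
    }
    return 0;
}

/* Return 1 if a "flags" line in /proc/cpuinfo lists `flag`,
   0 if none does, and -1 if the file cannot be opened. */
static int cpu_has_flag(const char *flag)
{
    char line[8192];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return -1;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0)
            found = line_has_flag(line, flag);
    }
    fclose(f);
    return found;
}
```

Calling cpu_has_flag("sse4_1") would then tell you whether dpps is available. Matching whole words matters, because a plain substring search for sse3 would also hit ssse3.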
And here’s the official press release:
http://www.intel.com/pressroom/archive/releases/20070328fact.htm?cid=rss-90004-c1-163976