Optimized APP supporting MMX + SSE (but not SSE2, 3)

Message boards : Number crunching : Optimized APP supporting MMX + SSE (but not SSE2, 3)



Joined: 23 Sep 07
Posts: 13
Credit: 220,526
RAC: 2
Message 1966 - Posted: 5 Mar 2011, 6:34:02 UTC
Last modified: 5 Mar 2011, 6:37:01 UTC

Optimized APP supporting MMX + SSE (but not SSE2, 3) might be in order.

I get the impression that the core application and the wrapper might perform better if compiled with MMX and SSE optimizations.

I don't see many gains coming from SSE2 or SSE3, but these should be considered if SSE optimization yields substantial performance gains.


MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, introduced in 1996 with their P5-based Pentium line of microprocessors, designated as "Pentium with MMX Technology".

It developed out of a similar unit first introduced on the Intel i860. MMX is a processor supplementary capability that is supported on recent IA-32 processors by Intel and other vendors.

MMX defined eight registers, known as MM0 through MM7 (henceforth referred to as MMn).

To avoid compatibility problems with the context switch mechanisms in existing operating systems, these registers were aliases for the existing x87 FPU stack registers (so no new registers needed to be saved or restored).

Hence, anything that was done to the floating point stack would also affect the MMX registers and vice versa. However, unlike the FP stack, the MMn registers are directly addressable (random access).

Each of the MMn registers holds 64 bits (the mantissa-part of a full 80-bit FPU register). The main usage of the MMX instruction set is based on the concept of packed data types, which means that instead of using the whole register for a single 64-bit integer, two 32-bit integers, four 16-bit integers, or eight 8-bit integers may be processed concurrently.

The mapping of the MMX registers onto the existing FPU registers made it somewhat difficult to work with floating point and SIMD data in the same application.

To maximize performance, programmers often used the processor exclusively in one mode or the other, deferring the relatively slow switch between them as long as possible.

Because the FPU stack registers are 80 bits wide, the upper 16 bits of the stack registers go unused in MMX, and these bits are set to all ones, which makes them NaNs or infinities in the floating point representation. This can be used to decide whether a particular register's content is intended as floating point or SIMD data.

MMX provides only integer operations. When originally developed, for the Intel i860, the use of integer math made sense (both 2D and 3D calculations required it), but as graphics cards that did much of this became common, integer SIMD in the CPU became somewhat redundant for graphical applications. On the other hand, the saturation arithmetic operations in MMX could significantly speed up some digital signal processing applications.


In computing, Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series processors as a reply to AMD's 3DNow! (which had debuted a year earlier).

SSE contains 70 new instructions.

SSE was originally known as KNI for Katmai New Instructions (Katmai being the code name for the first Pentium III core revision). During the Katmai project Intel was looking to distinguish it from their earlier product line, particularly their flagship Pentium II.

It was later renamed ISSE, for Internet Streaming SIMD Extensions, then SSE. AMD eventually added support for SSE instructions, starting with its Athlon XP and Duron (Morgan core) processors.

Intel's first IA-32 SIMD effort, MMX, had two main problems: it re-used the existing floating point registers, making the CPU unable to work on both floating point and SIMD data at the same time, and it only worked on integers.

SSE originally added eight new 128-bit registers known as XMM0 through XMM7.

The AMD64 extensions from AMD (originally called x86-64 and later duplicated by Intel) add a further eight registers XMM8 through XMM15. There is also a new 32-bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64-bit operating mode.

Each register packs together:
four 32-bit single-precision floating point numbers or
two 64-bit double-precision floating point numbers or
two 64-bit integers or
four 32-bit integers or
eight 16-bit short integers or
sixteen 8-bit bytes or characters.

The integer operations have instructions for signed and unsigned variants. Integer SIMD operations may still be performed with the eight 64-bit MMX registers.

Because these 128-bit registers are additional program states that the operating system must preserve across task switches, they are disabled by default until the operating system explicitly enables them.

This means that the OS must know how to use the FXSAVE and FXRSTOR instructions, the extended instruction pair that saves and restores all x87 and SSE register state at once. This support was quickly added to all major IA-32 operating systems.

Because SSE adds floating point support, it sees much more use than MMX. The addition of SSE2's integer support makes SSE even more flexible. While this makes MMX largely redundant, MMX operations can still be executed in parallel with SSE operations, offering further performance increases in some situations.

The first CPU to support SSE, the Pentium III, shared execution resources between SSE and the FPU. While a compiled application can interleave FPU and SSE instructions side-by-side, the Pentium III will not issue an FPU and an SSE instruction in the same clock-cycle.

This limitation reduces the effectiveness of pipelining, but the separate XMM registers do allow SIMD and scalar floating point operations to be mixed without the performance hit from explicit MMX/floating point mode switching.

And if you want to research all the CPU instruction extensions

Profile TJM
Project administrator
Project developer
Project scientist

Joined: 25 Aug 07
Posts: 843
Credit: 189,261,557
RAC: 124,589
Message 1980 - Posted: 29 Mar 2011, 7:55:27 UTC - in response to Message 1966.  
Last modified: 5 Apr 2011, 11:29:19 UTC

The optimized apps we use here were made by trial and error. Many apps were built with different flags, then tested, and the fastest were selected.
Stefan Krah's enigma app we use is written in C; I think the original code is already very efficient, as I couldn't tweak much to get better performance.
You might be right about SSE, as the Pentium III app (it was SSE+MMX) was the fastest here for a long time, especially on Linux.
Later, when new compilers came out, we found that on Intel processors (C2Ds and Core series) SSE2 gives slightly better performance, and if I remember correctly, the current app in the C2D app pack was built with the Intel compiler, -O2 optimizations + SSE2.
If I remember correctly, higher levels of SSE didn't change anything, and in fact on most of the C2D processors SSE3/4 apps were a bit slower.
I also found out that on the slower C2D processors (probably the ones with a small cache) an app built for the Pentium M runs faster.
M4 Project homepage
M4 Project wiki
Profile [AF>Libristes] Dudumomo

Joined: 15 Feb 09
Posts: 20
Credit: 196,260
RAC: 0
Message 1985 - Posted: 2 Apr 2011, 11:27:30 UTC - in response to Message 1980.  

Oh yeah!
I remember having tested every optimisation on a Centrino T9400 and an AMD X2 6000.
Indeed, the PIII app was the fastest.
And the fastest for Intel was not the fastest for AMD, obviously.

Ahh nostalgiaaa :p
Profile TJM
Project administrator
Project developer
Project scientist

Joined: 25 Aug 07
Posts: 843
Credit: 189,261,557
RAC: 124,589
Message 1987 - Posted: 5 Apr 2011, 11:51:36 UTC

AMD also suffered from compiler errors: one of the fastest apps for Intel was built with -fschedule-insns and had manually unrolled loops. Unfortunately, there was a bug in GCC, and I was unable to build the AMD version from the same source, because compiling with -fschedule-insns for any of the AMD family processors always ended with an error message.
M4 Project homepage
M4 Project wiki


Copyright © 2018 TJM