AMD Compilation Flags in gcc.
Message boards : Number crunching : AMD Compilation Flags in gcc.
| Author | Message |
|---|---|
|
Hi All, | |
| ID: 643 · Rating: 0 · rate:
| |
Hi All, To build this one I used gcc 3.2, cflags: -march=pentium3 -O3 ____________ M4 Project homepage M4 Project wiki | |
| ID: 644 · Rating: 0 · rate:
| |
No -ffast-math flag? Hmmmm. Too bad gcc 3.2 only has -m32 and -m64 flags. I might try that, but I'm not sure how the optimizations in 4.3.1 (amdfam10) compare with 3.2 (m64) in the compiled code, and other than installing 3.2 on my box, I'm not sure how else to evaluate the two. I wonder how -march=pentium3 would do with the flags -ffast-math, -funroll-all-loops and -fpeel-loops. I'd think the AMD optimizations would do as well if not better on the Intel platform. I'll try it this weekend if I have time. Mike Doerner PS Also, the compile of the executable is called out in the Makefile. If I want to play with Intel's compiler or gcc 3.2, how do I get the Makefile to recognize an alternate compiler? The compile seems to go to gcc by default, or at least openSUSE's default. | |
| ID: 645 · Rating: 0 · rate:
| |
You could try setting up a virtual machine and installing an older Linux distribution on it; the gcc I used was Mandrake 9's default version.
"I'd think the AMD optimizations would do as well if not better on the Intel platform. I'll try it this weekend if I have time." I think an executable compiled with Intel's C compiler will outperform anything built with gcc on Intel's processors. Take a look at this host: http://www.enigmaathome.net/results.php?hostid=1369 It's a 996 MHz PIII with PC-133 SDRAM running an Intel-optimized executable. | |
| ID: 646 · Rating: 0 · rate:
| |
|
I think I got an App faster than the PIII 32-bit app for the Phenom in 64-bit mode....:-)
Ran the start command twice... 2nd compile.....
The 1st compile seemed to equal the speed of the PIII app, but after re-compiling with -fprofile-use it seems to have optimized it further. I didn't see any *.da files like the gcc 4.2.0 documentation stated would be there, but I figured, what the heck, I'll give it a shot anyway. Also, if you'd like to download my executable, here it is: http://members.cox.net/mdoerner1/enigma_Phenom64 I haven't zipped it or anything, it's just the raw executable. Let me know if my time savings are correct. The ACML libraries are a bit out of my league, and not having the time to analyze the source code, I'm not sure if the BLAS functions would save any time in computation. Mike Doerner PS Here are the latest AMD compilation flags for gcc: http://developer.amd.com/assets/AMD_GCC_Quick_Reference_Guide080509.pdf PPS Strangely enough, the -march code was 1 second slower than -mtune. Go figure. | |
| ID: 650 · Rating: 0 · rate:
| |
"I think I got an App faster than the PIII 32-bit app for the Phenom in 64-bit mode....:-)" I can't try your executable because the only Phenom I have (the BOINC server) is running a 32-bit OS (Debian 4). I built a 32-bit version with the same cflags, and it seems to run a bit faster than the previous one I used. Btw, if the benchmark results between two apps are similar, try running a real workunit and check its runtime. The benchmark is specific: it runs enigma on a very short ciphertext (shorter than hceyz72). Sometimes a very small difference (like 1-3% in benchmark time) can turn into a huge speedup on longer workunits. | |
| ID: 655 · Rating: 0 · rate:
| |
|
Glad to see those flags work for you then. I've also overclocked my processor since I've optimized my 64-bit app, so it's harder to compare apples to apples so to speak. However, I'll see if I can come up with some sort of comparison. You're right though, a 4 second savings on a 3.25 minute run time equates to around a minute savings on each awgly100_0 that is run. Considering the amount of workunits left, that will help get us to the end quicker. | |
| ID: 656 · Rating: 0 · rate:
| |
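The "around a minute" estimate in the post above can be checked with a few lines of Python. This is only a sketch: it assumes the benchmark's relative saving carries over unchanged to a full workunit, and the 2765.39-second workunit time is borrowed from a figure quoted elsewhere in this thread. The function name `projected_saving` is made up for illustration.

```python
def projected_saving(bench_saving_s, bench_runtime_s, workunit_runtime_s):
    """Scale a benchmark saving up to a longer workunit.

    Assumption: the fraction of time saved is the same on the
    benchmark and on the real workunit.
    """
    fraction_saved = bench_saving_s / bench_runtime_s
    return fraction_saved * workunit_runtime_s


# 4 seconds saved on a 3.25-minute benchmark run
bench_runtime_s = 3.25 * 60
saving_s = projected_saving(4, bench_runtime_s, 2765.39)
print(f"fraction saved: {4 / bench_runtime_s:.2%}")
print(f"projected saving per workunit: {saving_s:.1f} s")
```

Under those assumptions the saving comes out at a little under a minute per workunit, which agrees with the estimate in the post.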
|
These times are at 2.6 GHz....
And the Intel....
My Phenom is currently clocked at 2.93 GHz, and I'm getting 2,487.11 secs CPU time for awgly100_0_2153304_r0 w/ my new optimized app. 2.6 / 2.93 = 0.887, so 2765.39 should be 2453.93 secs if I had overclocked earlier with the old optimized app. Right in the ballpark, but hard to say if I've actually saved a minute of computing time or not, as I see completion times change between the different awgly100's. I'll have to just watch it I guess....:-) Mike Doerner | |
| ID: 657 · Rating: 0 · rate:
| |
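The clock-scaling arithmetic in the post above, written out as a small Python sketch. It assumes CPU time scales inversely with core clock, which ignores memory speed and other effects, so treat the result as a rough guide only. The function name `scale_cpu_time` is hypothetical; the numbers are the ones given in the post.

```python
def scale_cpu_time(cpu_time_s, old_clock_ghz, new_clock_ghz):
    """Estimate CPU time at a new clock speed.

    Assumption: run time is inversely proportional to clock speed.
    """
    return cpu_time_s * old_clock_ghz / new_clock_ghz


old_time_s = 2765.39   # old optimized app at 2.6 GHz
measured_s = 2487.11   # new optimized app at 2.93 GHz

predicted_s = scale_cpu_time(old_time_s, 2.6, 2.93)
gap_s = measured_s - predicted_s

print(f"old app scaled to 2.93 GHz: {predicted_s:.2f} s")
print(f"measured minus scaled:      {gap_s:.2f} s")
```

The scaled figure matches the 2453.93 secs in the post, and the measured time sits about 33 seconds above it. With completion times varying between workunits, a single comparison like this cannot show whether the new build is really ahead.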
|
Here's some of the low times since I compiled the app.... | |
| ID: 658 · Rating: 0 · rate:
| |
|
Try -ftree-vectorize to enable vectorization, since all x86-64 systems must support at least SSE2. | |
| ID: 662 · Rating: 0 · rate:
| |
"Try -ftree-vectorize to enable vectorization, since all x86-64 systems must support at least SSE2." Howdie, What is vectorization and how will it help the Enigma code run faster?:-) Mike Doerner | |
| ID: 665 · Rating: 0 · rate:
| |
|
Vectorization means having the compiler generate SIMD (single instruction, multiple data) code. In other words, a single operation, say a multiply, is applied to several pieces of data at the same time. | |
| ID: 668 · Rating: 0 · rate:
| |
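A conceptual model of the SIMD idea described above, sketched in Python. Plain Python does not emit SSE instructions, so this only mimics the bookkeeping: the scalar version performs one multiply per step, while the modelled vector version handles a group of elements per step. The lane count of 4 corresponds to four 32-bit floats in one 128-bit SSE register; the function names are invented for illustration.

```python
LANES = 4  # four 32-bit floats fit in one 128-bit SSE register


def multiply_scalar(a, b):
    """One multiply per loop step, as in unvectorized code."""
    out = []
    steps = 0
    for x, y in zip(a, b):
        out.append(x * y)
        steps += 1
    return out, steps


def multiply_simd_model(a, b, lanes=LANES):
    """Model of a vectorized loop: each step covers `lanes` elements.

    On real hardware each slice below would be a single multiply
    instruction; here it is ordinary Python, for illustration only.
    """
    out = []
    steps = 0
    for i in range(0, len(a), lanes):
        out.extend(x * y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        steps += 1
    return out, steps


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [2.0] * 8

scalar_result, scalar_steps = multiply_scalar(a, b)
simd_result, simd_steps = multiply_simd_model(a, b)

print(scalar_result == simd_result)  # same answers either way
print(scalar_steps, simd_steps)      # fewer steps in the SIMD model
```

Both versions give the same products; the model just needs fewer steps. Whether a real loop benefits depends on whether the compiler can prove its iterations are independent.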
|
Thanks for explaining that. I'll try recompiling it with that flag enabled.... | |
| ID: 672 · Rating: 0 · rate:
| |
|
DOH! Looking at the gcc documentation, -ftree-vectorize is turned on by the -O3 flag. Looks like it's already on. | |
| ID: 673 · Rating: 0 · rate:
| |
"DOH! Looking at the gcc documentation, -ftree-vectorize is turned on by the -O3 flag. Looks like it's already on." Yes, but only in GCC 4.3 or later, not in earlier versions. And only when the target supports SSE: that is always the case on x86-64, while for 32-bit x86 you have to specify -msse2, and then the executable will fail on processors that don't support SSE2. HTH | |
| ID: 675 · Rating: 0 · rate:
| |
|
OK, now I have a better understanding of how the gcc flags interact. I have recompiled again with your suggestion, and it looks like I'm 6 to 7 seconds faster than TJM's 32-bit P3 app. I am using gcc 4.3.1 w/ openSUSE 11.0. 1st compile. Profiling didn't seem to generate much of a speed improvement compared to just a regular compile without profiling enabled. Got any other tricks up your sleeve?;-) The -ftree-parallelize-loops=4 just seems to run faster, compared to using 2 and 8 as the other values I tried. Fortunately, the task did NOT thread to the other cores on my Phenom when I tried this. PS I've also tried export CFLAGS='-m64 -O3 -funroll-all-loops -ffast-math -march=amdfam10 -msse4a -mabm -fprefetch-loop-arrays -combine -fwhole-program' but it seems to run at the same speed as the previously listed flags, FWIW. Mike Doerner | |
| ID: 676 · Rating: 0 · rate:
| |
Naturally, as the Phenom has 4 cores. But I wonder why you didn't see the other cores busy. Threads don't show up in top as separate entries, but you would notice that the cores are busy. Or perhaps the compiler couldn't parallelize the code.
Not surprising, as neither SSE4A nor ABM adds critical instructions for typical programs. Most of them are for bit manipulation. Also, when building for x86-64, the options -m64 and -msse2 are the default. So there's no need to specify them, though they don't hurt either. Now, an interesting option would be -msse3. SSE3 adds helpful instructions for matrix multiplication and operations on complex numbers. So if Enigma relies on algorithms using such operations, it might benefit too. Unfortunately, according to my tentative analysis here, systems supporting SSE3 are not as common as one might expect. So if you want the application to run well on most systems, it's better not to use it, and to drop -march=amdfam10 (AKA -march=barcelona) as well. For the x86 application, you might want to try the option -mpc64, which reduces the precision of x87 floating-point calculations from 80 to 64 bits, making it faster and giving results similar to those produced by x86-64. HTH | |
| ID: 677 · Rating: 0 · rate:
| |
Supposedly AMD's are faster in floating point math than the Intels, but looking at the results so far, the Intel is about 25% faster at integer math than my AMD processor. The times to complete suggest little floating point math is used to generate the results, making my AMD look like an abacus compared to the Intel.:-) Since enigma was never designed as a multi-threaded application (which BOINC would maybe choke on if the 4 tasks became multi-threaded) that may explain why turning on the -ftree-parallelize-loops didn't do anything. The only other trick I can think of to speed this up is to try incorporating the ACML libraries from AMD into the enigma.c code (and maybe other source files) and calling up some of the BLAS routines to speed up integer operations. Being a Mechanical Engineer, computer programming is not my strength. ;-) I don't know if calling up the ACML headers is enough, or if the ACML has specific syntax to include its functionality into the code. Mike Doerner | |
| ID: 678 · Rating: 0 · rate:
| |
"Supposedly AMD's are faster in floating point math than the Intels, but looking at the results so far, the Intel is about 25% faster at integer math than my AMD processor." AMD does pretty well in FP, but with Intel's Core2 the gap is smaller, if not closed, and Core2 zooms past AMD in integer. Core2 is truly a very good architecture. "The only other trick I can think of to speed this up is to try incorporating the ACML libraries from AMD into the enigma.c code (and maybe other source files) and calling up some of the BLAS routines to speed up integer operations." Perhaps, but I've never used ACML before. It does have many routines rewritten in assembler to extract every bit of performance from AMD processors, but not all were rewritten in assembler. So it may be a matter of making sure that the routines important to Enigma are indeed optimized. HTH | |
| ID: 679 · Rating: 0 · rate:
| |
