New Optimized apps for 64-bit Linux

Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 874 - Posted: 4 May 2009, 3:00:44 UTC
Last modified: 4 May 2009, 3:24:13 UTC

Hi All,

With AMD's release of the Open64 compiler for Linux, I thought I would try my hand at recompiling the enigma application. I have always been interested in speeding up these applications because I have felt that a 64-bit application should perform the same function faster than a 32-bit application, especially one as mathematically driven as enigma. I have compiled my own application for my AMD Phenom in the past, but the improvement in 64-bit Linux was only a few seconds vs. the 32-bit applications TJM posted on the site last year. This has changed dramatically with AMD's recent release of Open64 v4.2.2. I have seen a significant decrease in computation time for these new 64-bit applications compared to the 32-bit optimized ones released last year. Here is a list of the times I measured with the benchmark program.

My system: AMD Phenom 9950, overclocked to 3.05 GHz, running 64-bit Linux (openSUSE 11.1, KDE 4.2.2)

Default Application 32-bit – 3m 48.202s
TJM's Pentium 3, 32-bit application compiled on gcc 3.2 – 3m 15.155s (fastest of TJM's optimized apps)
My 64-bit Phenom gcc 4.3 app – 3m 8s (approx.)

AMD has also helped the gcc folks with more optimizations in v4.4
My 64-bit Phenom gcc 4.4 app – 2m 54s (getting better...more on this in the conclusions)

I have compiled both 32-bit and 64-bit applications under Open64. The 32-bit applications either showed no improvement, or in one case, marginal improvement.

Open64 32-bit Athlon – 3m 14.443s (basically even with TJM's P3 app)
Open64 32-bit Opteron – 3m 27.920s
Open64 32-bit Athlon64 – 3m 31.448s
Open64 32-bit Athlon64fx – 3m 34.46s
Open64 32-bit Phenom (Barcelona) – 3m 31.595s

Now for the good part. The 64-bit Linux optimized applications REALLY PUT THE HAMMER DOWN! I do not know what voodoo the engineers at AMD exercised when they optimized this compiler, but run time drops by roughly 24% compared to TJM's P3 optimized app. Here are the numbers....

Open64 64-bit Opteron – 2m 29.500s
Open64 64-bit Athlon64 – 2m 28.851s
Open64 64-bit Athlon64fx – 2m 28.427s
Open64 64-bit Phenom (Barcelona) – 2m 28.410s

So the AMD optimizations give roughly the same result across the board: run time is about 24% lower than with TJM's P3 app and about 35% lower than with the default app. For more info, I also ran some Intel optimizations...
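For anyone who wants to check the percentages against the times listed above, here is a minimal sketch. It assumes the figures are meant as reductions in wall-clock time, and the helper names `to_seconds` and `reduction_pct` are made up for illustration.

```shell
#!/bin/sh
# Convert "minutes seconds" into seconds, e.g. 3 15.155 -> 195.155
to_seconds() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%.3f\n", m * 60 + s }'
}

# Percentage reduction in run time going from $1 (old) to $2 (new), both in seconds
reduction_pct() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

default=$(to_seconds 3 48.202)    # default 32-bit app
p3=$(to_seconds 3 15.155)         # TJM's P3 gcc 3.2 app
phenom64=$(to_seconds 2 28.410)   # Open64 64-bit Phenom (Barcelona) build

echo "vs TJM's P3 app: $(reduction_pct "$p3" "$phenom64")%"
echo "vs default app:  $(reduction_pct "$default" "$phenom64")%"
```

Swapping in any of the other 64-bit times changes the result by well under one percentage point, since they all sit within about three seconds of each other.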

Open64 64-bit Core – 2m 29.632s
Open64 64-bit Wolfdale – Could not compile (some B.S. about Phenoms not having SSE2 instructions)
Open64 64-bit Xeon – 2m 31.182s
Open64 64-bit EM64T – 2m 28.998s
Open64 64-bit Pentium4 – 2m 29.622s

Since there is such a substantial performance improvement, I have taken the Open64 apps I compiled and placed them in a file for downloading here......

Mike D's Linux Apps

...if you would like to use one of them (32-bit or 64-bit, AMD or Intel optimized), be my guest. I would like someone to compile a 64-bit Wolfdale app so I can see whether an Open64 Wolfdale build running on an Intel Core 2 Duo or Quad would post even quicker times than the ones I've shown here.

Conclusions:

1.) The 32-bit Open64 apps are just as fast as, but not faster than, TJM's Intel P3 optimized app from last year.
2.) The 64-bit Open64 applications I compiled are significantly faster than any 32-bit application, whether compiled with gcc or Open64. (They are available for anyone to use; just click on the link above to download. I have included both 32-bit and 64-bit apps in case someone else would like to test them.)
3.) Windows users may not be left out here. gcc 4.4 supposedly includes more optimizations for AMD processors (as shown by my gcc 4.4 result above). I would think someone running 64-bit WinXP, Vista, or Win7 could download a 64-bit gcc 4.4 for their version of Windows and compile these apps to find out whether they get any improvement over TJM's P3 app. Also, I would imagine Open64 may be ported to Windows some day (I can't see AMD letting this work only on Linux).

Thanks to TJM for letting the code out so us unemployed mechanical engineers can have something meaningful and rewarding to do besides changing dirty diapers during the day....;-)

If you have any questions, best way to get a hold of me is by email at mdoerner1 (at) cox (dot) net

Mike Doerner

PS The default flags in conf-cc are changed from gcc -Wall -W -O3 to opencc -Wall -W -Ofast -m64

The only change I made to the conf-ld file was switching the compiler from gcc to opencc and adding the -ipa -m64 flags, which the compiler needed to build the code successfully.

The flags I used for gcc-4.4 are gcc-4.4 -Wall -W -O3 -m64 -march=amdfam10 -combine -funroll-all-loops.
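For anyone repeating the build: the Makefile takes the compile command from the first line of conf-cc (it runs `head -1 conf-cc`), so the compiler swap comes down to rewriting that one line. A minimal sketch, run on a scratch copy rather than the real source tree; the helper name `set_first_line` is hypothetical, and the -march value is just one of the targets used in this thread.

```shell
#!/bin/sh
# Replace only the first line of a conf file, leaving any later lines untouched.
set_first_line() {
    file=$1
    new_line=$2
    tmp=$(mktemp) || return 1
    awk -v line="$new_line" 'NR == 1 { print line; next } { print }' "$file" > "$tmp" &&
        mv "$tmp" "$file"
}

# Demonstration on a scratch copy that starts with the stock gcc flags.
scratch=$(mktemp -d)
printf '%s\n' 'gcc -Wall -W -O3' > "$scratch/conf-cc"

set_first_line "$scratch/conf-cc" 'opencc -Wall -W -Ofast -m64 -march=barcelona'
head -1 "$scratch/conf-cc"
```

The same helper could be pointed at conf-ld, but the full original contents of that file are not shown here, so that line is left out of the sketch.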
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 879 - Posted: 4 May 2009, 21:48:42 UTC - in response to Message 874.  

I forgot to mention: for the Open64 32-bit and 64-bit apps I also added -march=barcelona, or athlon64, or athlon64fx, etc. to the flags to target the right architecture.

Mike Doerner
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 902 - Posted: 11 May 2009, 11:02:26 UTC
Last modified: 11 May 2009, 11:04:08 UTC

Howdy All,

Just to eliminate confusion, you must use the app_test_522.tgz file from TJM's Optimized App Thread along with the executables shown above. You must still follow TJM's Optimized app procedure as shown in this thread....

v5.22 Experimental optimized app

1st, grab this file....app_test_522.zip

...the difference is instead of using the files from the 2nd link (test.tgz) you would use one of the executable files from my link above.

If you have any questions, please email or send me a private message and we'll get you straightened out as best as we can. ;-)

Mike Doerner
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 919 - Posted: 12 May 2009, 21:36:27 UTC - in response to Message 902.  

I've added a 32-bit Intel Core architecture optimized app compiled with Open64 to the file, for those who refuse to realize 64-bit is the only way to go....;-)

Mike D
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 936 - Posted: 22 May 2009, 3:08:56 UTC

DOH!

Do not use the app_test_522.zip link in the above post; it is for the WINDOWS platform. The link you want is this one.....for Linux.....

app_test_522.tgz

Mike D
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 938 - Posted: 22 May 2009, 17:23:54 UTC
Last modified: 22 May 2009, 17:25:02 UTC

Here is the performance boost I'm seeing (so far) on my Phenom. The only minor tweak I've made on my system is going from 3.05GHz to 3.10GHz, but this only shaves a few seconds off the small tasks, maybe a minute or 2 off the large tasks. I've gone from 1775 RAC to 2330 RAC, maybe a bit more. We'll see here shortly.
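To put the RAC change in relative terms, a quick sketch (the helper name `increase_pct` is illustrative only). RAC is a running average that lags behind actual throughput, so treat the result as a rough figure, not a benchmark.

```shell
#!/bin/sh
# Percentage increase going from $1 (before) to $2 (after)
increase_pct() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

echo "RAC change: +$(increase_pct 1775 2330)%"
```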


Profile TJM
Project administrator
Project developer
Project scientist
Joined: 25 Aug 07
Posts: 843
Credit: 68,800,688
RAC: 376,128
Message 949 - Posted: 25 May 2009, 14:01:44 UTC - in response to Message 938.  
Last modified: 25 May 2009, 14:05:35 UTC

I've tried to build a Wolfdale app, but the compiler says that the target processor does not support SSE2. That's quite weird, because I also tried on a machine with a Wolfdale processor and got the same result. Buninek from BOINC@Poland also tried with no luck, but here is his other app (xeon x64):

http://dl.getdropbox.com/u/349831/enigma.xeon.x86_64.tar.bz2

So far the fastest app runs at a speed similar to one of my older 64-bit executables built with the Intel compiler, which was slightly slower than the old PIII app. Tested on Q6600, E7200 and E5200.
Now I'm trying to build something faster for Athlon 64/64 X2; it seems unfair that these processors run enigma slower than a Pentium III at less than half their clock speed (a 996MHz PIII runs faster than a 2.4GHz Athlon 64).
M4 Project homepage
M4 Project wiki
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 951 - Posted: 25 May 2009, 22:09:23 UTC - in response to Message 949.  
Last modified: 25 May 2009, 22:54:18 UTC

I can let the guys at Open64 know that -march=wolfdale chokes on Intel and AMD processors, giving the same bogus "No SSE2" error.

So the 32-bit gcc compiled P3 app runs faster than either Open64 or Intel compiled 64-bit code? That's weird, because when I ran and tested the P3 gcc 32-bit app in the benchmark it was slower than the Open64 64-bit code. Maybe we're finally running into a processor architecture situation here?

Just so I understand, are you saying a PIII at 1.0GHz beats a 2.4GHz Athlon64 in raw time?!?!?! Or are you saying clock-for-clock the PIII is more efficient? (i.e. let's say the P3 completes a task in 1000 seconds, are you saying the Athlon64 completes it in 2400 seconds, 1200 secs, or 990 secs?)

I'm glad to hear the Intel Compiler code is as efficient as the Open64 code. If the Gentoo guys make a distro with Open64 instead of GCC, I just may have to switch distros from openSUSE to Gentoo..... :)

Mike Doerner

PS I don't want to open another can of worms here, but I've heard the C code isn't as efficient mathematically as Fortran (which is why it won't die). Not that a code re-write is possible this late in the game, but that architecture optimization will only get us so far.....
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 952 - Posted: 25 May 2009, 22:30:09 UTC - in response to Message 951.  
Last modified: 25 May 2009, 22:43:18 UTC

When you change the flag from -Ofast to -O0 thru -O3 you get this error....

mdoerner@Linux-QuadZilla:~/Xfers/enigma-suite-0.76> make clean
rm -f *.o tools/*.o
mdoerner@Linux-QuadZilla:~/Xfers/enigma-suite-0.76> make
( cat warn-auto.sh; \
echo exec "`head -1 conf-cc`" '-c ${1+"$@"}' \
) > compile
chmod 755 compile
./compile enigma.c
Error loading wolfdale.so: wolfdale.so: cannot open shared object file: No such file or directory
make: *** [enigma.o] Error 2
mdoerner@Linux-QuadZilla:~/Xfers/enigma-suite-0.76>


Looks like they forgot to package something in the distro. I've heard the bug update will be pushed out the door by AMD early June, maybe they can fix this by then....

Mike D
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 953 - Posted: 25 May 2009, 22:41:33 UTC - in response to Message 949.  

TJM wrote:
    Buninek from BOINC@Poland also tried with no luck, but here is his other app (xeon x64):

    http://dl.getdropbox.com/u/349831/enigma.xeon.x86_64.tar.bz2


I tried running that app on my box and it choked.....

mdoerner@Linux-QuadZilla:~/Xfers/enigma_benchmark> ./start
./start: line 4: ./enigma: No such file or directory

real 0m0.000s
user 0m0.000s
sys 0m0.000s
mdoerner@Linux-QuadZilla:~/Xfers/enigma_benchmark>


Ran many other apps just fine. Does it only work on Xeon?

Mike D
Profile TJM
Project administrator
Project developer
Project scientist
Joined: 25 Aug 07
Posts: 843
Credit: 68,800,688
RAC: 376,128
Message 954 - Posted: 25 May 2009, 23:34:24 UTC - in response to Message 951.  
Last modified: 25 May 2009, 23:41:51 UTC



mdoerner wrote:
    Just so I understand, are you saying a PIII at 1.0GHz beats a 2.4GHz Athlon64 in raw time?!?!?!


Yep, that's exactly what I'm saying. A 1GHz PIII Coppermine beats a 2.4GHz Athlon64 by around 3-4 minutes on the shortest workunits.
But...
I think that's the only situation where an app built with the Intel C Compiler is much faster than anything else. I forgot the exact numbers and I don't have a PIII box here anymore, but the speedup gained by replacing the PIII-gcc app with the PIII-Intel one was just insane: more than twice the speed of the PIII-gcc app and around 3 times the speed of the base app.

If I remember correctly, 1GHz PIII needed around 40 minutes to complete hceyz72/0, and the fastest result I've seen from Athlon 64/2.4GHz is around 44 minutes.
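Those two figures can also be read clock-for-clock, which was the other half of the question. A rough sketch using only the numbers quoted above (the helper name `gcycles` is made up, and the 40 and 44 minute times are from memory, so the ratio is approximate):

```shell
#!/bin/sh
# Clock cycles spent on a task, in billions: minutes * 60 * clock in GHz
gcycles() {
    awk -v min="$1" -v ghz="$2" 'BEGIN { printf "%.0f\n", min * 60 * ghz }'
}

p3=$(gcycles 40 1.0)    # 1GHz PIII, about 40 minutes
a64=$(gcycles 44 2.4)   # 2.4GHz Athlon 64, about 44 minutes

echo "PIII:      $p3 billion cycles"
echo "Athlon 64: $a64 billion cycles"
awk -v a="$a64" -v p="$p3" 'BEGIN { printf "Athlon 64 / PIII cycle ratio: %.2f\n", a / p }'
```

On these numbers the Athlon 64 spends roughly 2.6 times as many cycles on the same workunit, so the gap is much larger clock-for-clock than the raw 4 minutes suggest.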




mdoerner wrote:
    I tried running that app on my box and it choked.....

    mdoerner@Linux-QuadZilla:~/Xfers/enigma_benchmark> ./start
    ./start: line 4: ./enigma: No such file or directory

    real 0m0.000s
    user 0m0.000s
    sys 0m0.000s
    mdoerner@Linux-QuadZilla:~/Xfers/enigma_benchmark>

    Ran many other apps just fine. Does it only work on Xeon?



I guess it's not a statically linked app and some of your libraries are the wrong version.
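One way to narrow this down is sketched below. A "No such file or directory" error from the shell can mean the file really is not there under the name the start script expects, or that the dynamic loader the binary asks for is missing, so it helps to check both. The helper name `inspect_binary` is hypothetical; `file` and `ldd` are the standard tools and are skipped if not installed.

```shell
#!/bin/sh
# Print basic facts about a binary that refuses to start.
inspect_binary() {
    bin=$1
    if [ ! -e "$bin" ]; then
        echo "$bin: missing"
        return 1
    fi
    [ -x "$bin" ] || echo "$bin: not executable (try chmod +x)"
    # file(1) reports 32-bit vs 64-bit and static vs dynamic linking
    if command -v file >/dev/null 2>&1; then
        file "$bin" || true
    fi
    # ldd(1) lists shared libraries; unresolved ones show up as "not found"
    if command -v ldd >/dev/null 2>&1; then
        ldd "$bin" || true
    fi
    return 0
}

inspect_binary ./enigma || echo "check the file name the start script expects"
```

For a statically linked build, `ldd` normally reports that the file is not a dynamic executable, which is the outcome you want when library versions differ between machines.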
M4 Project homepage
M4 Project wiki
thinking_goose
Joined: 12 Nov 07
Posts: 116
Credit: 1,105,645
RAC: 0
Message 955 - Posted: 26 May 2009, 2:48:32 UTC

I always thought the p3's were good. Is there an app for the Turion, or is it compatible with the Athlon64 app?
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 956 - Posted: 26 May 2009, 3:03:41 UTC - in response to Message 955.  

I would just use the Athlon64 code for now. There is no Turion flag in Open64.

Mike Doerner
Profile TJM
Project administrator
Project developer
Project scientist
Joined: 25 Aug 07
Posts: 843
Credit: 68,800,688
RAC: 376,128
Message 965 - Posted: 27 May 2009, 11:30:44 UTC - in response to Message 956.  

I've just tested the 32-bit Athlon64 app from your pack on my Athlon64/2.4GHz.
Old app (the best app I could build with gcc) benchmark: 3m 27s (*)
Your app: 2m 26s

Now that's a serious speedup :-D

(*) - these runtimes cannot be compared with runtimes from other processors/apps, because I used a modified, shorter benchmark file.

M4 Project homepage
M4 Project wiki
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 966 - Posted: 27 May 2009, 12:27:33 UTC - in response to Message 965.  
Last modified: 27 May 2009, 13:11:37 UTC

Glad to see someone got some benefit out of it. :-) At least the Athlon64's won't get beat by the P3's anymore....;-)

It's hard to tell how well an application optimized for an earlier processor will run when you're running a later generation processor....

Mike Doerner

PS I just looked at your computers on the project. Your AMD Athlon box seems to be having a lot of errors right now. I hope that isn't the fault of the app I made......
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 1029 - Posted: 6 Jun 2009, 14:48:59 UTC

AMD has released Open64 4.2.2.1, so I was able to compile a 64-bit Wolfdale app for the Intel guys. The 32-bit Wolfdale app will have to wait a bit, as Open64 4.2.2.1 has introduced a bug with the -m32 flag...:-(. As soon as they get it fixed I'll add it to the archive.

Mike D
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 1033 - Posted: 7 Jun 2009, 14:37:09 UTC - in response to Message 1029.  
Last modified: 7 Jun 2009, 15:02:27 UTC

OK, the 32-bit wolfdale app has been added. For some reason, when compiling with -m32 flag, you gotta use gcc 4.3.3. AMD says the preferred version is gcc 4.1.2 or 4.2, but that only works with the -m64 flag. Oh well.

Also, I have added 2 apps (32-bit and 64-bit) to the file, with -march=anyx86 enabled. I'm gonna try it on an old, old, old 233MHz Pentium MMX portable I've had sitting around doing nothing for a while. I put DSL Linux on it (a Debian derivative) with a 2.4.31 version of the kernel. Unfortunately, it picked up a hceyz72_1_ task and it's crunching on the default app right now, so it might not be until tomorrow before it's done crunching on it....:-( How computers have come along in a decade.....

Odd thing is the anyx86 and barcelona flags seem to compute in the same time on my Phenom box, so I'll give it a whirl and see how the times come out on regular tasks.
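For comparing builds back to back, something along these lines would do. It is only a sketch: it assumes the benchmark directory has a `start` script that times `./enigma`, as in the transcript earlier in the thread, and both the helper name `bench_build` and the build file names in the example are placeholders.

```shell
#!/bin/sh
# Run the benchmark once with the given build copied in as ./enigma.
bench_build() {
    bench_dir=$1
    build=$2
    cp "$build" "$bench_dir/enigma" || return 1
    chmod +x "$bench_dir/enigma"
    echo "== $build"
    ( cd "$bench_dir" && ./start )
}

# Example (paths are placeholders):
#   for b in builds/enigma.anyx86 builds/enigma.barcelona; do
#       bench_build "$HOME/Xfers/enigma_benchmark" "$b"
#   done
```

Running each build more than once and comparing the "real" times helps rule out noise from whatever else the machine is crunching.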

Mike D
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 1034 - Posted: 8 Jun 2009, 10:54:53 UTC - in response to Message 1033.  

Tried running the anyx86 code on the Pentium mobile, and it choked. Not sure if it's because the Pentium is on the 2.4.31 kernel or if I need to add more flags to make the libraries static or what.....

Mike Doerner
Profile [AF>Libristes] Dudumomo
Joined: 15 Feb 09
Posts: 20
Credit: 196,260
RAC: 0
Message 1035 - Posted: 8 Jun 2009, 12:18:10 UTC - in response to Message 1034.  
Last modified: 8 Jun 2009, 12:20:52 UTC

I've tried the wolfdale and anyX86_64 apps; both give me a computation error on my 64-bit Gentoo (2.6.28-gentoo-r5).

Edit :
the error is : wrapper: running ../../projects/www.enigmaathome.net/enigma_0.76_i686-pc-linux-gnu (-R -o results.txt 00trigr.cur 00bigr.cur 00ciphertext)
../../projects/www.enigmaathome.net/enigma_0.76_i686-pc-linux-gnu: error while loading shared libraries: libmv.so.1: cannot open shared object file: No such file or directory
app
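A missing shared library like this can be confirmed before the client ever runs the task by feeding `ldd` output through a small filter. A minimal sketch; the helper name `missing_libs` is made up, and the sample input below is canned to mirror the error above, not captured from a real run. libmv.so.1 is presumably a runtime library that ships with the Open64 compiler, but that is an assumption.

```shell
#!/bin/sh
# Read ldd(1) output on stdin and print the names of unresolved libraries.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Canned ldd-style lines; on a real system use:
#   ldd path/to/enigma_binary | missing_libs
printf '\tlibmv.so.1 => not found\n\tlibc.so.6 => /lib/libc.so.6 (0xf7d00000)\n' | missing_libs
```

An empty result means every library resolved; anything listed has to be installed on the host or avoided by using a statically linked build.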
Profile mdoerner
Volunteer developer
Volunteer tester
Joined: 30 Jul 08
Posts: 202
Credit: 6,998,388
RAC: 0
Message 1036 - Posted: 8 Jun 2009, 12:24:58 UTC - in response to Message 1035.  
Last modified: 8 Jun 2009, 12:49:32 UTC

Hmmm.... I might have to statically link the libraries for you (though you should consider installing the required libraries in the future). Maybe it's time to come up with a separate archive since those files will be so much bigger....

PS Here's the new link. Right now there's only a 32-bit AnyX86 statically linked file. Let me know if you have any further problems.

Mike D's Linux App Statically Linked - AnyX86


Mike Doerner
Copyright © 2017 TJM