mersenneforum.org


Dresdenboy 2003-05-07 11:14

Mlucas probably very fast on AMD64 platforms
 
Hello,

The following image shows that 189.lucas, part of the SPEC 2000 suite and based on Mlucas code, runs slightly faster on an Athlon 64 (here with 1MB L2, like the Opteron) than on a P4 with an 80% clock-frequency advantage (score of 880 vs. 864).

[img]http://www.heise.de/ct/03/01/018/bild.gif[/img]

One could try to compile Mlucas for AMD64 platforms using compilers like the Portland Group F90 compiler, which I mentioned in the Hardware forum.

The only thing preventing me from testing this is that I don't have access to an Opteron system (although entry-level systems start at $1200+).

Next year such x86-64 platforms (maybe also from Intel) will become relevant for distributed computing.

I'm also looking forward to EWM's work on parallel FP/integer calculations for Alpha CPUs, since that would also make sense on the Opteron. :)

Regards,
Matthias

ewmayer 2003-05-09 17:22

Thanks for the info, Dresdenboy. But some caveats: the FFT code I used for my submission to the SpecFP suite is light-years removed from the one in the current version of Mlucas, so it's not at all clear whether the performance difference seen in your chart will carry over to the current code.

Also, you don't need an f90 compiler anymore, since the latest development version of Mlucas is in C, and is at ftp://hogranch.com/pub/mayer/src/C . I haven't personally built it on an Athlon (just on my P3 using CodeWarrior, which gives a binary that is roughly 50% as fast as Prime95 running on the same machine - no surprise there, since I've done little x86-oriented tuning), but Tom Cage ( k5gj@earthlink.net ) regularly does so.

Re. the mixed float/int code, that has been on hold due to the demands of my work-for-pay job, and the fact that the little Mersenne-related code development time that has left me has mostly gone into coding a fast C-based sieve factorer. I've gotten the factoring code to really blast on the Alpha and Itanium (both of which have excellent 64x64==>128-bit multiply capability), and with the help of Tom C. and especially Klaus Kastens, gotten pretty good performance on the Mac/PPC, as well. This effort will also help in the mixed float/int code, though, because figuring out how to do speedy wide integer multiply on the various platforms is crucial to that.

I'm hopeful that this kind of code could run well on the AMD Opteron, too, since I hear those also have good 64x64==>128-bit multiply capability. They need 4 cycles to get a 128-bit integer product, which is 2x as many as the Alpha and Itanium, but especially in non-factoring code this slight extra cycle count can be hidden by interleaving the integer muls with other integer operations that are going on.
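
For readers unfamiliar with the 64x64==>128-bit multiply discussed above, here is a minimal portable C sketch of what that operation computes. It is not taken from the Mlucas or factoring sources; the function name `mul64x64_128` is hypothetical, and it builds the product from four 32x32 partial products, which is roughly what a compiler has to do on hardware without a wide multiply.

```c
#include <stdint.h>

/* Hypothetical helper (not from Mlucas): full 128-bit product of two
   unsigned 64-bit integers, returned as high and low 64-bit halves. */
static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    /* Split each operand into 32-bit halves. */
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    /* Four 32x32 -> 64-bit partial products; none can overflow. */
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Sum the three terms that land in bits 32..63; the carry out of
       this sum (mid >> 32) belongs to the high half. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    *lo = (mid << 32) | (p0 & 0xFFFFFFFFu);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

On a CPU with a native wide multiply, all of this collapses into one or two instructions, which is why the hardware capability matters so much for factoring code.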

nomadicus 2003-05-10 04:36

[quote="ewmayer"]Re. the mixed float/int code, . . . I've gotten the factoring code to really blast on the Alpha and Itanium (both of which have excellent 64x64==>128-bit multiply capability), [/quote]
I love that Alpha. Thanks again for all your help ewmayer. :D

Dresdenboy 2003-05-11 11:09

Oh, I had overlooked that the latest Mlucas sources I saw were C+ASM code.

Well then, we could try the Portland Group's C/C++ compilers, which are in the same workstation compiler suite as their F90 compiler. And since it is free, we could give it a try.

But first I have to find out whether these compiler binaries will run on standard platforms, since they are compiled for Opteron. If they are meant to run in 32-bit mode (Windows and 32-bit Linux), then they should at least run on P4s, since they may need SSE2. I'll try them today.

Despite the 4-cycle latency on AMD64 CPUs, it's at least possible to pipeline the multiplies by starting a 64-bit mul every 2 cycles.

Regards,
DDB
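
As an illustration of the latency-hiding point in the posts above, here is a rough C sketch that runs two independent modular-powering chains in one loop, so that a multiply from one chain can be issued while the other is still in flight. This is not the actual factoring code: the names `mulmod` and `pow2_mod_pair` are hypothetical, it assumes the GCC/Clang `unsigned __int128` extension, and the `%` reduction is used only for clarity (a tuned factorer would use a cheaper reduction). Whether the multiplies really overlap depends on the compiler and the CPU.

```c
#include <stdint.h>

/* Hypothetical helper: (a * b) mod m via a 128-bit intermediate product.
   Assumes the GCC/Clang unsigned __int128 extension. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)(((unsigned __int128)a * b) % m);
}

/* Compute 2^p mod q0 and 2^p mod q1 with left-to-right binary powering.
   The two chains never depend on each other, so their multiplies can
   be pipelined by the hardware. */
static void pow2_mod_pair(uint64_t p, uint64_t q0, uint64_t q1,
                          uint64_t *r0, uint64_t *r1)
{
    uint64_t x0 = 1 % q0, x1 = 1 % q1;
    for (int bit = 63; bit >= 0; --bit) {
        x0 = mulmod(x0, x0, q0);   /* chain 0: square */
        x1 = mulmod(x1, x1, q1);   /* chain 1: square, independent of chain 0 */
        if ((p >> bit) & 1) {
            x0 = mulmod(x0, 2, q0);  /* multiply by the base 2 */
            x1 = mulmod(x1, 2, q1);
        }
    }
    *r0 = x0;
    *r1 = x1;
}
```

For example, with p = 11 both q0 = 23 and q1 = 89 give a result of 1, which reflects the fact that 2^11 - 1 = 2047 = 23 * 89.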

Dresdenboy 2003-06-12 09:30

[quote="ewmayer"]I'm hopeful that this kind of code could run well on the AMD Opteron, too, since I hear those also have good 64x64==>128-bit multiply capability. They need 4 cycles to get a 128-bit integer product, which is 2x as many as the Alpha and Itanium, but especially in non-factoring code this slight extra cycle count can be hidden by interleaving the integer muls with other integer operations that are going on.[/quote]

One addition:
The Itanium 2 with the Madison core will reach speeds of 1.4 to maybe 1.7 GHz by year end. Alpha CPUs also lie in this range, but the Opteron (and especially the smaller-core Athlon 64) will be at 2.4 to 3 GHz by then (if a certain AMD rep's statement is true). Right now we have 1.8GHz Opterons (unfortunately the price is a bit higher now than in April because of demand) and soon 2GHz (Cray is already getting 2GHz chips). Together with the PPC970 this will create a wide base of mainstream 64-bit PCs. Intel will surely follow in the next few years.

Also, the number of 64-bit CPUs in PDAs will grow, because Microsoft will once again support MIPS in coming PocketPC OS releases. If there are enough of them, and the cost of letting them run all the time is acceptable (it shouldn't hurt too much if everything else is off), then we could try to find a new user base there. StrongARM users can already use Nick's StrongARM client, so the others could join in.

Regards,
Matthias


All times are UTC.