2003-04-19, 11:59   #1
TauCeti

Mar 2003
Braunschweig, Germany

2·113 Posts

Opteron 244 (1.8 GHz) benched with Prime95 22.12

Some Opteron news...

I just got my paper copy of the German publication c't magazine. Andreas Stiller ran some benchmarks on an Opteron 244 system. Prime95 v22.12 W32 on the 1.8 GHz Opteron showed about half the performance of a P IV 3.06 GHz.

There was also a hint that George Woltman is already working on some Opteron optimizations.

I am still undecided whether I should wait for the Atestosteron 64 ;) The Optimisteron is far too expensive for me *g*

Tau

2003-04-19, 15:52   #2
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

8,167 Posts

The only optimization I plan is to prefetch 64-byte cache lines instead of 128-byte cache lines. This will be a modest improvement, but it looks like the P4 will be the CPU of choice for some time to come.

2003-04-19, 16:57   #3
gbvalor

Aug 2002

3×37 Posts

Prime95 on Opterons

Hi, George,

I'm now writing some SSE2 code for Glucas with Opterons in mind. I see I still have a chance to reduce the gap between Glucas and Prime95 if you don't touch the code too much ;).

Regards.

Guillermo.

2003-04-20, 02:47   #4
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1111111100111₂ Posts

An excellent review at http://www.xbitlabs.com/articles/cpu/display/athlon64.html does not bode well for the Opteron. From the conclusion: "Athlon 64 is not very successful in traditional calculating tasks, such as scientific calculations"

2003-04-20, 08:46   #5
gbvalor

Aug 2002

3×37 Posts

Hi,

> http://www.xbitlabs.com/articles/cpu/display/athlon64.html does not bode well for the Opteron.

I'm afraid that most of these tests were made using 32-bit software written for the Pentium 4. Yes, the low core frequency is a problem, but the Opteron also has 8 additional 128-bit registers, which could reduce register pressure (not to mention its 16 64-bit integer registers). I think that once software begins to use all these advantages, the gap will shrink drastically.

IMHO, AMD urgently needs a good compiler.

Guillermo

2003-04-23, 08:37   #6
BranMuffin

Dec 2002

2×5 Posts

Testing the Code on the Hardware

I might be able to come up with some actual Hammer hardware to test code optimizations.

2003-04-23, 17:12   #7
gbvalor

Aug 2002

111₁₀ Posts

Hi,

Quote:
 I might be able to come up with some actual hammer hardware to test code optimizations.
Then, if you like, I would ask you to test the new beta code for Glucas that I'm currently writing for Opterons. Which OS and compiler are you talking about?

Guillermo.

2003-04-25, 12:46   #8
Dresdenboy

Apr 2003
Berlin, Germany

192 Posts

Quote:
 Originally Posted by Prime95 An excellent review at http://www.xbitlabs.com/articles/cpu.../athlon64.html does not bode well for the Opteron. From the conclusion: Athlon 64 is not very successful in traditional calculating tasks, such as scientific calculations
I don't know what changes happened between xbit's 3 months old (week 01/2003) engineering sample and the current Opteron release, but I'm sure that there are at least small differences. A bigger difference is the single memory channel although that doesn't matter in this review as they wisely used the same memory configuration for all computers.

The cause of the Athlon 64's lower performance in this review is the code. An Athlon's schedulers can "peephole"-optimize the P3/P4 code fed to them a bit, but that can't work wonders if the available resources aren't used efficiently. We know that running P3 code on a P4, and vice versa, can cause significant differences in performance, and this was one reason for the P4's poor performance in its first months on the market.

I'd also recommend these reviews because of their detailed information regarding FPU/FFT speed, SSE2 and so on:
Ace's Hardware has a good review (look at page 14 for compiled C-code FFT and other stuff):
and tecchannel (German) also looks closely at SSE2, SMP and so on:
http://www.tecchannel.de/hardware/1164/index.html

Regards,
DDB

BTW, 1.4 GHz Opterons are around $280 and will go down during the next months. But oc'ed 1700+ ($50-$60) running at 2.2 GHz or more (at 1.5-1.6 V) are also welcome for computation ;)

2003-04-25, 18:09   #9
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

17747₈ Posts

From the same Ace's Hardware article, the graph on this page http://www.aceshardware.com/read.jsp?id=55000253 shows a huge difference in L2 cache bandwidth. Prime95 reads and writes a lot of data to the L2 cache, so this could have a significant impact.

Without having an Opteron on hand, I think the biggest problem the Opteron has competing with a P4 running prime95 is raw clock speed. Both have a theoretical throughput of one FPU mul and one FPU add per clock cycle. Since both are doing a pretty good job of approaching this theoretical limit (a P4 reaches about 55% of the theoretical maximum FPU throughput), the Opteron cannot overcome the 3.06 vs 1.8 GHz speed disadvantage.

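
The clock-speed argument in this post can be checked with back-of-the-envelope arithmetic. The Python sketch below uses only the figures quoted in the post (one mul plus one add per clock, about 55% efficiency on the P4, 3.06 vs 1.8 GHz). The function names are made up for illustration, and applying the P4's 55% figure to the Opteron is an assumption, not a measurement.

```python
# Figures taken from the post; everything else is labeled as an assumption.
FLOPS_PER_CLOCK = 2      # one FPU mul + one FPU add per cycle (both CPUs)
P4_CLOCK_GHZ = 3.06
OPTERON_CLOCK_GHZ = 1.8
P4_EFFICIENCY = 0.55     # fraction of theoretical peak quoted for the P4


def peak_gflops(clock_ghz):
    """Theoretical peak in GFLOPS for a core issuing one mul and one add per clock."""
    return FLOPS_PER_CLOCK * clock_ghz


def sustained_gflops(clock_ghz, efficiency):
    """Peak throughput scaled by the fraction actually reached."""
    return peak_gflops(clock_ghz) * efficiency


p4_sustained = sustained_gflops(P4_CLOCK_GHZ, P4_EFFICIENCY)

# ASSUMPTION: the Opteron reaches the same 55% of peak as the P4.
opteron_sustained = sustained_gflops(OPTERON_CLOCK_GHZ, P4_EFFICIENCY)

# With equal efficiency the ratio is just the clock ratio, 3.06 / 1.8.
p4_advantage = p4_sustained / opteron_sustained

# Efficiency the Opteron would need to match the P4's sustained rate.
opteron_efficiency_to_match = p4_sustained / peak_gflops(OPTERON_CLOCK_GHZ)

print(f"P4 sustained:      {p4_sustained:.2f} GFLOPS")
print(f"Opteron sustained: {opteron_sustained:.2f} GFLOPS (assumed 55%)")
print(f"P4 advantage:      {p4_advantage:.2f}x")
print(f"Opteron needs {opteron_efficiency_to_match:.1%} of peak to draw level")
```

Under these assumptions the P4 comes out about 1.7 times faster, and the Opteron would have to sustain roughly 93.5% of its theoretical peak just to draw level with a P4 running at 55%.
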
2003-04-26, 10:10   #10
gbvalor

Aug 2002

3×37 Posts

Quote:
 Originally Posted by Prime95 From the same Ace's Hardware article, the graph on this page http://www.aceshardware.com/read.jsp?id=55000253 shows a huge difference in L2 cache bandwidth. Prime95 reads and writes a lot of data to the L2 cache so this could have a significant impact.
This AMD64 L2 cache disadvantage can be partially offset by using its 8 additional registers. I mean that we can keep some of the more critical data in registers, so we don't need to store it to the cache and read it back again.

George, how many load/store and clock cycles could you have saved with 8 more registers?
Quote:
 Without having an Opteron on hand, I think the biggest problem the Opteron has competing with a P4 running prime95 is raw clock speed. Both have a theoretical throughput of one FPU mul and one FPU add per clock cycle. Since both are doing a pretty good job of approaching this theoretical limit (a P4 reaches about 55% of the theoretical maximum FPU throughput), the Opteron cannot overcome the 3.06 vs 1.8 GHz speed disadvantage.
I agree with that.

BTW, is there any advantage to the 64-bit integer multiply in the factoring task? The Opteron takes only 4 clocks for a 64x64=128-bit mul, and imuls have a throughput of one per clock.
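
To illustrate why a 64x64=128-bit multiply could matter for factoring: trial factoring a Mersenne number 2^p - 1 (p an odd prime) boils down to computing 2^p mod q for candidate factors q = 2kp + 1 with q ≡ ±1 (mod 8), and that computation is dominated by modular squarings. As long as q fits in 64 bits, each squaring needs exactly one widening multiply. The sketch below models this in Python, whose integers are unbounded, so the 64-bit limits are enforced with asserts. All function names are hypothetical, and this is not a description of how Prime95's factoring code is written.

```python
MASK64 = (1 << 64) - 1


def mul_64x64_128(a, b):
    """Model a widening multiply: two 64-bit operands -> (high, low) 64-bit halves."""
    assert 0 <= a <= MASK64 and 0 <= b <= MASK64
    product = a * b
    return product >> 64, product & MASK64


def mulmod64(a, b, q):
    """(a * b) mod q using a single 64x64=128-bit product."""
    assert 0 < q <= MASK64
    high, low = mul_64x64_128(a, b)
    return ((high << 64) | low) % q


def pow2_mod(p, q):
    """2**p mod q by left-to-right binary exponentiation (square, then maybe double)."""
    result = 1 % q
    for bit in bin(p)[2:]:
        result = mulmod64(result, result, q)  # one widening multiply per bit of p
        if bit == "1":
            result = (result << 1) % q        # multiplying by 2 is just a shift
    return result


def find_small_factor(p, max_k):
    """Try q = 2*k*p + 1 for k = 1..max_k; return the first q dividing 2**p - 1."""
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):   # factors of 2**p - 1 are 1 or 7 mod 8
            continue
        if pow2_mod(p, q) == 1:   # q divides 2**p - 1
            return q
    return None


print(find_small_factor(11, 10))   # 2**11 - 1 = 2047 = 23 * 89
print(find_small_factor(29, 10))   # 2**29 - 1 has the factor 233
```

On a 32-bit CPU the same product has to be assembled from several 32x32=64-bit multiplies plus carry handling, which is where a native 64-bit multiply might help. Whether that outweighs other costs is not something this sketch can show.
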

2003-04-26, 17:08   #11
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

8,167 Posts

Quote:
 Originally Posted by gbvalor This AMD64 L2 cache disadvantage can be partially reduced using its additional 8 registers. I mean we can retain in registers some more critical data and so we don't need to store and read again to cache. George, how many load/store and clock cycles could you have saved with 8 more registers?
If I were coding from scratch, 8 more registers would be of some help. I could do three levels of the FFT while in registers instead of just two. This would reduce the L2 cache reads and writes by up to 50%. I don't know what that would translate into in terms of a per-iteration speed improvement.

When I wrote the P4 SSE2 code, my tests on small snippets of assembly code showed that I could completely ignore the L1 cache. That is, the P4's L2 bandwidth and its ability to schedule reads far enough in advance let prime95 run out of the L2 cache nearly as fast as running out of the L1 cache (which is a good thing, since the L1 cache is so small).

I don't remember what the Opteron's L1 cache size is, but if it is a decent size you could further reduce the L2 cache reads and writes by juggling the assembly code to process more data while it is in the L1 cache. The downside is code complexity goes up a little bit and it may be hard to avoid store-forwarding penalties.
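
The "three levels instead of two" remark can be pictured with a very simplified model: if every trip through the cache reads and writes the data once, and each trip completes a fixed number of FFT levels in registers, then the number of trips is the level count divided by the levels per trip, rounded up. The Python sketch below is only that model. The function names are invented for illustration, and the real pass structure of prime95's FFT is not described in this thread.

```python
import math


def cache_passes(total_levels, levels_per_pass):
    """Trips through the cache needed if each trip completes `levels_per_pass` FFT levels."""
    return math.ceil(total_levels / levels_per_pass)


def traffic_reduction(total_levels, old_levels=2, new_levels=3):
    """Fraction of cache reads/writes saved by doing more levels per trip."""
    before = cache_passes(total_levels, old_levels)
    after = cache_passes(total_levels, new_levels)
    return 1 - after / before


for levels in (3, 6, 18, 20):
    print(
        f"{levels:2d} levels: "
        f"{cache_passes(levels, 2)} -> {cache_passes(levels, 3)} passes, "
        f"saving {traffic_reduction(levels):.0%}"
    )
```

In this model the saving depends on how the level count divides up: it is about a third for long transforms and reaches 50% only in the best case, which fits the "up to" in the post.
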

