mersenneforum.org  

2003-04-19, 11:59   #1
TauCeti
Mar 2003
Braunschweig, Germany

2·113 Posts
Opteron 244 (1.8 GHz) benched with Prime95 22.12

Some Opteron news...

I just got my paper copy of the German magazine c't. Andreas Stiller ran some benchmarks on an Opteron 244 system.

Prime95 v22.12 (W32) on the 1.8 GHz Opteron showed about half the performance of a 3.06 GHz Pentium 4. There was also a hint that George Woltman is already working on some Opteron optimizations.

I am still undecided whether I should wait for the Atestosteron 64 ;) The Optimisteron is far too expensive for me *g*

Tau

2003-04-19, 15:52   #2
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

8,167 Posts

The only optimization I plan is to prefetch 64-byte cache lines instead of 128-byte cache lines. This will be a modest improvement, but it looks like the P4 will be the CPU of choice for some time to come.
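Why the line size matters for software prefetching can be sketched with a toy model (illustrative only, not Prime95's code; the function names and the 4 KB buffer are hypothetical, and hardware prefetchers are ignored): if prefetches are issued once every 128 bytes but the CPU fetches 64-byte lines, only every other line is requested ahead of time.

```python
def lines_prefetched(buffer_bytes, line_bytes, stride_bytes):
    """Cache-line indices touched by issuing one prefetch every stride_bytes bytes."""
    return {addr // line_bytes for addr in range(0, buffer_bytes, stride_bytes)}


def prefetch_coverage(buffer_bytes, line_bytes, stride_bytes):
    """Fraction of the buffer's cache lines that get prefetched."""
    total_lines = -(-buffer_bytes // line_bytes)  # ceiling division
    touched = lines_prefetched(buffer_bytes, line_bytes, stride_bytes)
    return len(touched) / total_lines


BUFFER = 4096  # hypothetical 4 KB block of FFT data

# 128-byte lines, one prefetch every 128 bytes: every line is covered.
print(prefetch_coverage(BUFFER, 128, 128))  # 1.0
# 64-byte lines with the same 128-byte stride: half the lines are never prefetched.
print(prefetch_coverage(BUFFER, 64, 128))   # 0.5
# 64-byte lines with a 64-byte stride: full coverage again.
print(prefetch_coverage(BUFFER, 64, 64))    # 1.0
```

The price of the smaller stride is twice as many prefetch instructions for the same amount of data.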

2003-04-19, 16:57   #3
gbvalor
Aug 2002

3×37 Posts
Prime95 on Opterons

Hi,

George, I'm now writing some SSE2 code for Glucas with Opterons in mind. I see I still have a chance to narrow the gap between Glucas and Prime95 if you don't touch your code too much ;).

Regards.

Guillermo.

2003-04-20, 02:47   #4
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

1111111100111₂ Posts

An excellent review at http://www.xbitlabs.com/articles/cpu/display/athlon64.html does not bode well for the Opteron. From the conclusion:


Athlon 64 is not very successful in traditional calculating tasks, such as scientific calculations

2003-04-20, 08:46   #5
gbvalor
Aug 2002

3×37 Posts

Hi,

> http://www.xbitlabs.com/articles/cpu/display/athlon64.html does not bode well for the Opteron.

I'm afraid that most of these tests were run using 32-bit software optimized for the Pentium 4. Yes, the low core frequency is a problem, but the Opteron also has 8 additional 128-bit registers, which could reduce register pressure (not to mention its 16 64-bit integer registers). I think that when software begins to use all these advantages, the gap will shrink drastically.

IMHO, AMD urgently needs a good compiler.

Guillermo

2003-04-23, 08:37   #6
BranMuffin
Dec 2002

2×5 Posts
Testing the Code on the Hardware

I might be able to come up with some actual hammer hardware to test code optimizations.

2003-04-23, 17:12   #7
gbvalor
Aug 2002

111₁₀ Posts

Hi,

Quote:
I might be able to come up with some actual hammer hardware to test code optimizations.
Then, if you like, I'd ask you to test the new beta Glucas code I'm currently writing for Opterons. Which OS and compiler are you talking about?

Guillermo.

2003-04-25, 12:46   #8
Dresdenboy
Apr 2003
Berlin, Germany

192 Posts

Quote:
Originally Posted by Prime95
An excellent review at http://www.xbitlabs.com/articles/cpu.../athlon64.html does not bode well for the Opteron. From the conclusion:


Athlon 64 is not very successful in traditional calculating tasks, such as scientific calculations
I don't know what changed between xbit's three-month-old engineering sample (week 01/2003) and the current Opteron release, but I'm sure there are at least small differences. A bigger difference is the Athlon 64's single memory channel (versus the Opteron's dual-channel controller), although that doesn't matter in this review, as they wisely used the same memory configuration for all computers.

The cause of the Athlon 64's lower performance in this review is the code. An Athlon's schedulers can do a bit of "peephole" optimization on the P3/P4 code fed into them, but that can't work wonders if the available resources aren't used efficiently. We know that running P3 code on a P4, and vice versa, can cause significant differences in performance; this was one reason for the P4's poor performance in its first months on the market.

I'd also recommend these reviews for their detailed information on FPU, FFT, and SSE2 speed:
Ace's Hardware has a good review (look at page 14 for a compiled C FFT and other benchmarks):
http://www.aceshardware.com/read.jsp?id=55000251
and tecchannel (in German) also looks closely at SSE2, SMP, and more:
http://www.tecchannel.de/hardware/1164/index.html

Regards,
DDB

BTW, 1.4 GHz Opterons are around $280, and prices will go down over the next few months. But overclocked Athlon XP 1700+ chips ($50-$60) running at 2.2 GHz or more (at 1.5-1.6 V) are also welcome for computation ;)

2003-04-25, 18:09   #9
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

17747₈ Posts

From the same Ace's Hardware article, the graph on this page http://www.aceshardware.com/read.jsp?id=55000253 shows a huge difference in L2 cache bandwidth. Prime95 reads and writes a lot of data to the L2 cache, so this could have a significant impact.

Without an Opteron on hand, I think the biggest problem the Opteron has competing with a P4 running prime95 is raw clock speed. Both have a theoretical throughput of one FPU mul and one FPU add per clock cycle. Since both do a pretty good job of approaching this theoretical limit (a P4 reaches about 55% of its theoretical maximum FPU throughput), the Opteron cannot overcome the 3.06 vs. 1.8 GHz clock speed disadvantage.
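Putting the figures in this post into numbers (a back-of-the-envelope check that takes the stated two floating-point operations per cycle for both chips and ignores cache and memory effects): for a 1.8 GHz Opteron just to tie a 3.06 GHz P4 running at 55% of peak, it would have to sustain roughly 93.5% of its own theoretical peak.

```latex
\begin{aligned}
P_{\mathrm{P4}}         &= 2\,\tfrac{\mathrm{flop}}{\mathrm{cycle}} \times 3.06\,\mathrm{GHz} = 6.12\,\mathrm{GFLOP/s} \\
0.55\,P_{\mathrm{P4}}   &\approx 3.37\,\mathrm{GFLOP/s} \\
P_{\mathrm{Opteron}}    &= 2\,\tfrac{\mathrm{flop}}{\mathrm{cycle}} \times 1.8\,\mathrm{GHz} = 3.6\,\mathrm{GFLOP/s} \\
\eta_{\mathrm{Opteron}} &\ge \frac{0.55\,P_{\mathrm{P4}}}{P_{\mathrm{Opteron}}} = \frac{0.55 \times 3.06}{1.8} \approx 0.935
\end{aligned}
```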

2003-04-26, 10:10   #10
gbvalor
Aug 2002

3×37 Posts

Quote:
Originally Posted by Prime95
From the same Ace's Hardware article, the graph on this page http://www.aceshardware.com/read.jsp?id=55000253 shows a huge difference in L2 cache bandwidth. Prime95 reads and writes a lot of data to the L2 cache, so this could have a significant impact.
This AMD64 L2 cache disadvantage can be partially offset by its 8 additional registers. I mean, we can keep more of the critical data in registers, so we don't need to store it and read it back from the cache.

George, how many loads/stores and clock cycles could you have saved with 8 more registers?
Quote:
Without having an Opteron on hand, I think the biggest problem the Opteron has competing with a P4 running prime95 is raw clock speed. Both have a theoretical throughput of one FPU mul and one FPU add per clock cycle. Since both are doing a pretty good job of approaching this theoretical limit (a P4 reaches about 55% of the theoretical maximum FPU throughput), the Opteron cannot overcome the 3.06 vs 1.8 GHz speed disadvantage.
I agree with that.

BTW, is there any advantage to the 64-bit integer multiply for factoring work? The Opteron takes only 4 clocks for a 64x64=128-bit multiply, with a throughput of one IMUL per clock.
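For context on the factoring question: a basic trial-factoring test checks whether a candidate q divides 2^p - 1 by computing 2^p mod q, and each squaring step needs the full 64x64 = 128-bit product once q approaches 64 bits. The sketch below is an illustrative model with hypothetical names (not Prime95's factoring code): it builds that product from four 32x32-bit partial products, as a 32-bit CPU typically has to, whereas a single 64-bit MUL on AMD64 returns the whole 128-bit result. The final reduction mod q uses Python's big integers for brevity.

```python
MASK32 = (1 << 32) - 1


def mul_64x64_via_32(a, b):
    """Full 128-bit product of two 64-bit values, built from four 32x32->64
    partial products. Returns (high 64 bits, low 64 bits)."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    ll, lh = a_lo * b_lo, a_lo * b_hi
    hl, hh = a_hi * b_lo, a_hi * b_hi
    # Middle 32-bit column: carry out of ll plus the low halves of the cross terms.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)
    low = ((mid & MASK32) << 32) | (ll & MASK32)
    high = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return high, low


def mulmod(a, b, q):
    """(a * b) mod q for 0 <= a, b < 2**64."""
    high, low = mul_64x64_via_32(a, b)
    return ((high << 64) | low) % q


def pow2_mod(p, q):
    """2**p mod q by left-to-right binary exponentiation."""
    result = 1 % q
    for bit in bin(p)[2:]:
        result = mulmod(result, result, q)   # square
        if bit == "1":
            result = mulmod(result, 2, q)    # multiply by the base, 2
    return result


def divides_mersenne(p, q):
    """q divides 2**p - 1 exactly when 2**p is congruent to 1 mod q."""
    return pow2_mod(p, q) == 1


print(divides_mersenne(11, 23))            # True: 2**11 - 1 = 23 * 89
print(divides_mersenne(67, 193707721))     # True: Cole's factor of 2**67 - 1
print(divides_mersenne(67, 761838257287))  # True: the cofactor
print(divides_mersenne(11, 47))            # False
```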

2003-04-26, 17:08   #11
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

8,167 Posts

Quote:
Originally Posted by gbvalor
This AMD64 L2 cache disadvantage can be partially offset by its 8 additional registers. I mean, we can keep more of the critical data in registers, so we don't need to store it and read it back from the cache.

George, how many loads/stores and clock cycles could you have saved with 8 more registers?
If I were starting the code from scratch, 8 more registers would be of some help. I could do three levels of the FFT while in registers instead of just two. This would reduce the L2 cache reads and writes by up to 50%. I don't know what that would translate into in terms of per-iteration speed improvement.
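One simplified way to see how an "up to 50%" figure can arise (a rough model, not George's actual analysis; the function names are hypothetical): assume each sweep over the data reads and writes every value in L2 once and applies k FFT levels while the values sit in registers. The number of sweeps is then ceil(levels / k), so going from k = 2 to k = 3 saves between 0% and 50% of the L2 traffic, depending on how many levels are grouped together.

```python
import math


def sweeps(levels, levels_per_sweep):
    """Read+write sweeps needed when each sweep applies levels_per_sweep FFT
    levels while the values are held in registers."""
    return math.ceil(levels / levels_per_sweep)


def l2_traffic_reduction(levels, old_k=2, new_k=3):
    """Fractional drop in L2 reads/writes when going from old_k to new_k
    levels per sweep (traffic assumed proportional to the sweep count)."""
    return 1 - sweeps(levels, new_k) / sweeps(levels, old_k)


# Columns: grouped levels, sweeps with 2 levels/sweep, sweeps with 3, reduction.
for levels in (3, 4, 6, 9):
    print(levels, sweeps(levels, 2), sweeps(levels, 3),
          round(l2_traffic_reduction(levels), 3))
```

With three grouped levels the sweep count halves (2 to 1); with six it drops by a third (3 to 2); with four it does not change at all.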

When I wrote the P4 SSE2 code, my tests on small snippets of assembly code showed that I could completely ignore the L1 cache. That is, the P4's L2 bandwidth and its ability to schedule reads far enough in advance let prime95 run out of the L2 cache nearly as fast as out of the L1 cache (which is a good thing, since the L1 cache is so small).

I don't remember the Opteron's L1 cache size, but if it is a decent size, you could further reduce the L2 cache reads and writes by juggling the assembly code to process more data while it is in the L1 cache. The downside is that code complexity goes up a little, and it may be hard to avoid store-forwarding penalties.
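Processing more data while it is still in L1 is essentially cache blocking: apply several passes to one L1-sized chunk before moving on to the next. A toy fully associative LRU model (hypothetical sizes and names; it ignores associativity conflicts, prefetching, and the store-forwarding issue mentioned above) shows the effect on L2 traffic: two full passes over a buffer larger than L1 miss on every access, while the blocked order misses only once per line.

```python
from collections import OrderedDict


def count_misses(trace, capacity_lines):
    """Misses for a fully associative LRU cache holding capacity_lines lines."""
    cache = OrderedDict()
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = None
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def unblocked_trace(n_lines, passes=2):
    """Each pass walks the whole buffer before the next pass starts."""
    return [line for _ in range(passes) for line in range(n_lines)]


def blocked_trace(n_lines, block_lines, passes=2):
    """All passes are applied to one L1-sized block before moving on."""
    trace = []
    for start in range(0, n_lines, block_lines):
        block = range(start, min(start + block_lines, n_lines))
        for _ in range(passes):
            trace.extend(block)
    return trace


L1_LINES = 64        # hypothetical L1 capacity, in cache lines
BUFFER_LINES = 1024  # hypothetical working set, larger than L1

print(count_misses(unblocked_trace(BUFFER_LINES), L1_LINES))         # 2048
print(count_misses(blocked_trace(BUFFER_LINES, L1_LINES), L1_LINES))  # 1024
```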

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
