|
|
#23 | |
|
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts |
Quote:
The Opteron/Athlon64 has a throughput of one add AND one multiply every clock cycle for both x86 and SSE2. In fact, your everyday Athlon has the same theoretical throughput using x86 instructions. Since P4/Athlon/Opteron/Athlon64 all have the same theoretical throughput, the one with the highest clock speed wins (P4). Also, it seems the P4 may have other advantages in that it gets closer to the theoretical throughput than the AMD chips. This may be due to memory-to-L2 bandwidth, L2-to-L1 bandwidth, or something else. This is a long way of saying don't expect great timings out of the AMD line until they can get their raw clock speed higher - or come out with a chip that has higher theoretical FPU throughput. In short, SSE2 FFTs on the Opteron/Athlon64 may be a little faster than the x86 FFTs, and any Opteron/Athlon64 specific optimizations may or may not improve timings significantly. |
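As a rough illustration of the throughput argument above (a sketch, not a benchmark): if every chip retires at most one add and one multiply per cycle, the theoretical peak depends on clock speed alone. The clock speeds below are hypothetical placeholders, not measured figures, and `peak_mflops` is an illustrative helper name.

```python
def peak_mflops(clock_mhz, flops_per_cycle=2):
    """Theoretical peak in MFLOPS, assuming one add plus one multiply per cycle."""
    return clock_mhz * flops_per_cycle


# Hypothetical clock speeds, chosen only to show the scaling.
for name, mhz in [("chip at 2000 MHz", 2000), ("chip at 3000 MHz", 3000)]:
    print(name, peak_mflops(mhz), "MFLOPS peak")
```

Real timings fall short of this peak by different amounts on different chips, which is the memory and cache bandwidth point made above.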
|
|
|
|
|
|
#24 |
|
Oct 2002
Lost in the hills of Iowa
448₁₀ Posts |
I remember seeing someone run Prime benchmarks on an Opteron.
IIRC, it clocked significantly faster than most current Athlons (fast Bartons and very fast Thoroughbreds might be a little faster), but nowhere near what current P-IVs do (140ish ms at 1792k comes to mind - I don't pay attention to the lower FFTs) - and noticeably faster *per clock* than any current Athlon. I believe this was on a 1.4 GHz Opteron - but it's been a while and I don't remember all the details. (edit) I found a ref to it: http://www.aceshardware.com/forum?read=95030015 |
|
|
|
|
|
#25 | |
|
Apr 2003
Berlin, Germany
192 Posts |
Quote:
In the next step I'll write a Perl script which puts the constants locally right before the loops (to get 8-byte pointers) and additionally aligns the loops to 16-byte boundaries. Unfortunately there is no fast way of using MMX or SSE code to modify the FP registers, although they are in the same register file. That would allow an optimization similar to your SSE2 version of mul by 2 by adding 1 to the exponents. There is an interesting thing in the optimization manuals: for optimal decoding, the FP instructions should fit into 8-byte windows. This can be done by adding 0x66 prefixes to some of the instructions, which has no effect besides adding one byte. I don't know if this has side effects on Intel CPUs. For example, a block of repeated fld, fadd, fmul would usually yield 2 ops/cycle, but aligned to the 8-byte decoder windows it yields the full 3 ops/cycle. Because of this, and also some branch penalties caused by "misalignment", I'll try to create a more complex script which tries to guess the final opcode size (should be easy for fp-only stuff) and inserts pad bytes for 8-byte and branch-target alignment. |
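The padding idea described above could be sketched roughly as follows. This is only a simplified model written in Python, not the script from the post: it assumes "fitting into 8-byte windows" means no instruction may straddle a window boundary, and the function names and instruction sizes are hypothetical.

```python
WINDOW = 8       # decode window size in bytes, as described in the post
LOOP_ALIGN = 16  # loop entry alignment in bytes, as described in the post


def align_pad(offset, boundary=LOOP_ALIGN):
    """Bytes of padding needed to move offset up to the next boundary."""
    return (-offset) % boundary


def pad_for_windows(sizes, window=WINDOW):
    """For each instruction size, return the pad bytes to place before it
    so that the instruction does not straddle a window boundary."""
    offset = 0
    pads = []
    for size in sizes:
        if size > window:
            raise ValueError("instruction longer than the decode window")
        room = window - (offset % window)  # bytes left in the current window
        pad = room if size > room else 0   # skip to the next window if it won't fit
        pads.append(pad)
        offset += pad + size
    return pads


# Hypothetical instruction sizes in bytes; real sizes depend on the encoding.
print(pad_for_windows([2, 2, 2, 6]))
print(align_pad(14))
```

In real code the pad bytes would be realized as 0x66 prefixes on neighbouring instructions rather than as separate filler, so the script would also have to decide which instructions receive them.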
|
|
|
|
|
|
#26 | |
|
Aug 2002
3×37 Posts |
Quote:
Prime95, Dresdenboy, I already have the first working Glucas version using SSE2 (also multithreaded if you like). It could be compiled on an AMD64 machine if I had access to one. Do you have such access? It would be interesting to compare how it runs using x86 with the Intel C++ compiler versus x86_64 / x86 with the GCC compiler. Guillermo. |
|
|
|
|
|
|
#27 | |
|
Oct 2002
2×13 Posts |
Quote:
Paying attention to the 16000+ Athlon machines running Prime95 would be beneficial. When Athlon-specific optimizations happen, I will reconsider putting the eight Athlon machines back to work on Prime95. SALEM |
|
|
|
|
|
|
#28 | |
|
Apr 2003
Berlin, Germany
192 Posts |
Quote:
We should also try PGCC because the latest numbers I saw were very promising (have a look here: http://www.aceshardware.com/forum?read=105021452 and here: http://www.pgroup.com/images/pg50vpg41.jpg, http://www.pgroup.com/images/pgf90vg77.jpg - more description on http://www.pgroup.com .. we are coming close to a URL overflow error ;)) AFAIR there is a 14-day beta licence for the compiler - the executables would also refuse to work after that time. But that should be OK for testing. Matthias |
|
|
|
|
|
|
#29 | |
|
Aug 2002
Dawn of the Dead
235₁₀ Posts |
The average time, 149 ms for the 1792K FFT, is slightly faster than this 2000 MHz tbred - 0.160 s. So, the 1800 MHz Opteron can do what a ~2060 MHz current core can. The Xeon score is not representative and must have been done with an old client - a 2240 MHz Northwood runs the same in 0.087 s.
Quote:
|
|
|
|
|
|
|
#30 | |
|
Aug 2002
Dawn of the Dead
5·47 Posts |
A few weeks ago an AMD-optimized version was released, netting a three to eight percent improvement. I gained five percent ... get v23.5 ...
Quote:
|
|
|
|
|
|
|
#31 | |
|
Aug 2002
CA₁₆ Posts |
Quote:
I get the sense you're saying, "You can't have my machines until the Athlon code is as fast as the P4 code is," and not understanding that this may not be possible. |
|
|
|
|
|
|
#32 | ||
|
Oct 2002
2·13 Posts |
Quote:
The Athlon code optimizations have been largely ignored in favor of (or out of bias toward) the P4. I have been reading the release notes with each new client. Any gains the Athlon gets seem to come because some optimization for the P4 slightly improves Athlon performance. The last major improvement was the 20% jump when prefetch was implemented. And that also improved the P3. I responded in this thread to a gentleman called dresdenman who got an 8.5% improvement in speed with just a few simple optimizations. He also mentioned that other code in the Prime95 client could be easily changed for further improvement. Now why is it that an OUTSIDER who is using the publicly available Athlon optimization guide could get these kinds of performance improvements? My take is that the Athlon is an afterthought, or more likely a no-thought. If the bias against even trying to improve Athlon performance continues, I will pull the plug on all Prime95 clients. Now what would happen if the other 16000+ Athlon Prime95 users decided the same? An 8.5% Athlon speed improvement is waiting to be implemented. I will be watching. SALEM |
||
|
|
|
|
|
#33 |
|
Oct 2002
2·13 Posts |
After posting I noticed the speed improvement was 1.85% and the poster was dresdenboy.
If these changes for the Athlon are easy to implement, they should be implemented. Again, I’m just looking for the Athlon and the AMD64 processors to be taken seriously. SALEM |
|
|
|