mersenneforum.org  

2003-07-19, 00:21   #23
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7,537 Posts

Quote:
Originally Posted by powervolume
Would the P4 optimizations made in version 23 make a difference for Athlon 64 and Opteron since they also have SSE2? Or are the optimizations more architecture-specific than just having SSE2?

SSE2 was a BIG deal to P4s because, using standard x86 FPU instructions, the FPU was limited to a throughput of one add OR one multiply every clock cycle. Using SSE2 you can get a throughput of one add AND one multiply every clock cycle.

The Opteron/Athlon64 has a throughput of one add AND one multiply every clock cycle for both x86 and SSE2. In fact, your everyday Athlon has the same theoretical throughput using x86 instructions.

Since P4/Athlon/Opteron/Athlon64 all have the same theoretical throughput the one with the highest clock speed wins (P4). Also it seems the P4 may have other advantages in that it gets closer to the theoretical throughput than the AMD chips. This may be due to memory-to-L2 bandwidth, L2-to-L1 bandwidth, or something else.

This is a long way of saying don't expect great timings out of the AMD line until they can get their raw clock speed higher - or come out with a chip that has higher theoretical FPU throughput.

It is also a long-winded way of saying that SSE2 FFTs on the Opteron/Athlon64 may be a little faster than the x86 FFTs, and any Opteron/Athlon64 specific optimizations may or may not improve timings significantly.
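
The throughput argument above is just arithmetic: theoretical peak = floating-point ops per clock × clock speed. A minimal Python sketch of that calculation follows. The ops-per-clock figures are the ones stated in this post; the clock speeds and the name `peak_mflops` are assumptions picked purely for illustration, not measurements from this thread.

```python
# Theoretical peak FP throughput = ops per clock * clock speed.
# Ops-per-clock values follow the post above; "x86" means the standard
# x86 FPU instructions. Clock speeds below are ASSUMED example values.

FLOPS_PER_CLOCK = {
    ("P4", "x86"): 1,        # one add OR one multiply per clock
    ("P4", "SSE2"): 2,       # one add AND one multiply per clock
    ("Athlon", "x86"): 2,    # one add AND one multiply per clock
    ("Opteron", "x86"): 2,
    ("Opteron", "SSE2"): 2,
}


def peak_mflops(cpu, isa, clock_mhz):
    """Return the theoretical peak in MFLOPS for a CPU/instruction-set pair."""
    return FLOPS_PER_CLOCK[(cpu, isa)] * clock_mhz


# Hypothetical clock speeds, for illustration only.
examples = [
    ("P4", "x86", 3000),
    ("P4", "SSE2", 3000),
    ("Athlon", "x86", 2200),
    ("Opteron", "SSE2", 1800),
]

for cpu, isa, mhz in examples:
    print(f"{cpu:8s} {isa:5s} {mhz} MHz -> {peak_mflops(cpu, isa, mhz)} MFLOPS peak")
```

With equal ops per clock, the higher clock wins on paper; real timings also depend on how close each chip gets to its peak.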

2003-07-19, 17:48   #24
QuintLeo

Oct 2002
Lost in the hills of Iowa

2⁶·7 Posts

I remember seeing someone run Prime benchmarks on an Opteron.

IIRC, it clocked significantly faster than most current Athlons (fast Bartons and very fast Thoroughbreds might be a little faster), but nowhere near what current P-IVs do (140ish ms at 1792k comes to mind - I don't pay attention to the lower FFTs) - and noticeably faster *per clock* than any current Athlon.

I believe this was on a 1.4 GHz Opteron - but it's been a while and I don't remember all the details.


(edit) I found a ref to it.

http://www.aceshardware.com/forum?read=95030015

2003-07-21, 09:24   #25
Dresdenboy

Apr 2003
Berlin, Germany

361₁₀ Posts

Quote:
Originally Posted by Prime95
As for Athlon64 and Opteron there are three optimizations I can think of.
One: use the extra xmm registers. This will gain maybe a percent. Not too hard to implement. Two: prefetch 64 byte cache lines instead of the P4's 128 byte cache lines. I built a special prime95 that did this and surprisingly it did not make any difference. Three: use 64-bit instructions for factoring. This is likely to be tedious to code and, while making factoring faster, does not speed up GIMPS throughput greatly because the bulk of the CPU time goes to LL testing.

I found a quick optimization which gives about a 1.85% improvement on an Athlon XP 1800+. I got the available source23.zip and replaced most fadd st, st with fmul XMM_TWO (a real Athlon specific optimization). But that still has a disadvantage because the 32bit pointer throws some code into the next decoding window.

In the next step I'll write a Perl script which puts the constant locally right before the loops (to get 8byte pointers) and additionally aligns the loops to 16byte boundaries. Unfortunately there is no fast way of using MMX or SSE code to modify the FP registers, although they are in the same register file. That would allow an optimization similar to your SSE2 version of mul by 2 by adding 1 to the exponents.
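
To make the "mul by 2 by adding 1 to the exponents" idea concrete, here is a small Python sketch of the bit trick on an IEEE 754 double. It only illustrates the principle with integer arithmetic on the raw bits; it is not the Prime95 SSE2 code, and the function name `times_two_via_exponent` is made up for this example.

```python
import struct


def times_two_via_exponent(x):
    """Double x by adding 1 to the biased exponent field of its IEEE 754 bits.

    Only valid for normal, finite, nonzero doubles whose exponent can grow
    by one without overflowing; anything else is rejected.
    """
    # Reinterpret the 64-bit double as an unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF  # 11-bit biased exponent
    if exponent == 0 or exponent >= 0x7FE:
        raise ValueError("zero/subnormal, inf/NaN, or result would overflow")
    bits += 1 << 52  # integer add of 1 to the exponent field
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


print(times_two_via_exponent(3.5))
print(times_two_via_exponent(-0.75))
```

The sign and mantissa bits are untouched, so the result is exact whenever the input passes the range check.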

There is an interesting point in the optimization manuals: for optimal decoding, the FP instructions should fit into 8byte windows. This can be done by adding 0x66 prefixes to some of the instructions, which has no effect besides adding one byte. I don't know if this has side effects on Intel CPUs.

For example a block of repeated fld, fadd, fmul would usually yield 2 ops/cycle, but aligned to the 8 byte decoder windows it yields the full 3 ops/cycle. Because of this, and also some branch penalties caused by "misalignment", I'll try to create a more complex script which tries to guess the final opcode size (should be easy for fp-only stuff) and inserts pad bytes for 8byte and branch-target alignment.
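
A rough sketch of the padding calculation such a script would need, written in Python rather than Perl. Given guessed instruction sizes, it works out how many pad bytes must precede each instruction so that none straddles an 8-byte window. The function name `pad_to_windows` and the sizes in the example are invented for illustration, and how the pad bytes get encoded (for instance as 0x66 prefixes on earlier instructions in the same window) is left out.

```python
WINDOW = 8  # decode window size in bytes, as described in the post above


def pad_to_windows(sizes, start=0):
    """Return the pad bytes needed before each instruction.

    sizes: guessed instruction lengths in bytes, in program order.
    start: byte offset of the first instruction.
    An instruction that would cross a WINDOW-byte boundary is pushed to
    the start of the next window.
    """
    pads = []
    offset = start
    for size in sizes:
        if size > WINDOW:
            raise ValueError("instruction longer than one decode window")
        room = WINDOW - offset % WINDOW  # bytes left in the current window (1..8)
        pad = room if size > room else 0
        pads.append(pad)
        offset += pad + size
    return pads


# Made-up instruction sizes, for illustration only.
print(pad_to_windows([2, 6, 2, 3, 6]))
```

In this example only the last instruction needs padding: it would start at offset 13 with 3 bytes left in the window, so 3 pad bytes move it to offset 16.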

2003-07-21, 15:59   #26
gbvalor

Aug 2002

3×37 Posts

Quote:
SSE2 was a BIG deal to P4s because using standard x86 FPU instructions the FPU was limited to a throughput of one add OR one multiply every clock cycle. Using SSE2 you can get a throughput of one add AND one multiply every clock cycle.

The Opteron/Athlon64 has a throughput of one add AND one multiply every clock cycle for both x86 and SSE2. In fact, your everyday Athlon has the same theoretical throughput using x86 instructions.

Prime95, Dresdenboy, I already have the first working Glucas version using SSE2 (also multithreaded if you like). It could be compiled on an AMD64 machine if I had access to one. Do you have such access?

It would be interesting to compare how it runs using x86 with Intel C++ compiler versus x86_64 / x86 and GCC compiler.

Guillermo.

2003-07-21, 16:22   #27
SalemTheCat100

Oct 2002

26₁₀ Posts

Quote:
Originally Posted by Dresdenboy
I found a quick optimization which gives about a 1.85% improvement on an Athlon XP 1800+. I got the available source23.zip and replaced most fadd st, st with fmul XMM_TWO (a real Athlon specific optimization). But that still has a disadvantage because the 32bit pointer throws some code into the next decoding window.

In the next step I'll write a Perl script which puts the constant locally right before the loops (to get 8byte pointers) and additionally aligns the loops to 16byte boundaries. Unfortunately there is no fast way of using MMX or SSE code to modify the FP registers, although they are in the same register file. That would allow an optimization similar to your SSE2 version of mul by 2 by adding 1 to the exponents.

There is an interesting point in the optimization manuals: for optimal decoding, the FP instructions should fit into 8byte windows. This can be done by adding 0x66 prefixes to some of the instructions, which has no effect besides adding one byte. I don't know if this has side effects on Intel CPUs.

For example a block of repeated fld, fadd, fmul would usually yield 2 ops/cycle, but aligned to the 8 byte decoder windows it yields the full 3 ops/cycle. Because of this, and also some branch penalties caused by "misalignment", I'll try to create a more complex script which tries to guess the final opcode size (should be easy for fp-only stuff) and inserts pad bytes for 8byte and branch-target alignment.

Great work!!! :( Now if only we could have these optimizations added to the next client release? :(

Paying attention to the 16000+ Athlon machines running Prime95 would be beneficial.

When Athlon specific optimizations happen I will reconsider putting the eight Athlon machines back to work on Prime95.

SALEM

2003-07-21, 18:32   #28
Dresdenboy

Apr 2003
Berlin, Germany

19² Posts

Quote:
Originally Posted by gbvalor
Prime95, Dresdenboy, I already have the first working Glucas version using SSE2 (also multithreaded if you like). It could be compiled on an AMD64 machine if I had access to one. Do you have such access?

It would be interesting to compare how it runs using x86 with Intel C++ compiler versus x86_64 / x86 and GCC compiler.

Guillermo.

I'll try to find someone who could let you try it out.

We should also try PGCC because the latest numbers I saw were very promising (have a look here:
http://www.aceshardware.com/forum?read=105021452 and here: http://www.pgroup.com/images/pg50vpg41.jpg, http://www.pgroup.com/images/pgf90vg77.jpg - more description on http://www.pgroup.com .. we are coming close to a URL overflow error ;)) AFAIR there is a beta licence for the compiler for 14 days - the executables would also refuse to work after that time. But that should be ok for testing.

Matthias

2003-07-21, 23:56   #29
PageFault

Aug 2002
Dawn of the Dead

5×47 Posts

The average time, 149 ms for the 1792K FFT, is slightly faster than this 2000 MHz Tbred at 0.160 s. So, the 1800 MHz Opteron can do what a ~2060 MHz current core can. The Xeon score is not representative and must have been done with an old client - a 2240 MHz Northwood runs the same in 0.087 s.

Quote:
Originally Posted by QuintLeo
I believe this was on a 1.4 GHz Opteron - but it's been a while and I don't remember all the details.


2003-07-21, 23:59   #30
PageFault

Aug 2002
Dawn of the Dead

5·47 Posts

A few weeks ago an AMD optimized version was released, netting from three to eight percent improvement. I gained five percent ... get v23.5 ...

Quote:
Originally Posted by SalemTheCat100
Now if only we could have these optimizations added to the next client release? :(

Paying attention to the 16000+ Athlon machines running Prime95 would be beneficial.

When Athlon specific optimizations happen I will reconsider putting the eight Athlon machines back to work on Prime95.

SALEM

2003-07-22, 03:36   #31
trif

Aug 2002

11001010₂ Posts

Quote:
Originally Posted by SalemTheCat100
When Athlon specific optimizations happen I will reconsider putting the eight Athlon machines back to work on Prime95.

You haven't been paying attention. There already are Athlon specific optimizations in. Yes, it's quite likely there could be more, but that's true of the P4 code as well.

I get the sense you're saying, "You can't have my machines until the Athlon code is as fast as the P4 code is." And not understanding that this may not be possible.

2003-07-22, 04:13   #32
SalemTheCat100

Oct 2002

2·13 Posts

Quote:
Originally Posted by trif
Quote:
Originally Posted by SalemTheCat100
When Athlon specific optimizations happen I will reconsider putting the eight Athlon machines back to work on Prime95.
You haven't been paying attention. There already are Athlon specific optimizations in. Yes, it's quite likely there could be more, but that's true of the P4 code as well.

I get the sense you're saying, "You can't have my machines until the Athlon code is as fast as the P4 code is." And not understanding that this may not be possible.

No, it is you who isn't paying attention. And while you're at it, stop trying to guess people's motives, as you aren't very good at it.

The Athlon code optimizations have been largely ignored in favor of (or out of bias toward) the P4. I have been reading the release notes with each new client. Any gains the Athlon gets seem to come about because some optimization for the P4 slightly improves Athlon performance. The last major improvement was the 20% jump when prefetch was implemented. And that also improved the P3.

I responded in this thread to a gentleman called dresdenman who got an 8.5% improvement in speed with just a few simple optimizations. He also mentioned that other code in the Prime95 client could be easily changed for further improvement.

Now why is it that an OUTSIDER who is using the publicly available Athlon optimization guide could get these kinds of performance improvements? My take is that the Athlon is an afterthought, or more likely a no-thought.

If the bias against even trying to improve Athlon performance continues, I will pull the plug on all Prime95 clients.

Now what would happen if the other 16000+ Athlon Prime95 users decide the same?

An 8.5% Athlon speed improvement is waiting to be implemented.

I will be watching.

SALEM

2003-07-22, 04:25   #33
SalemTheCat100

Oct 2002

2×13 Posts

After posting I noticed the speed improvement was 1.85% and the poster was dresdenboy.

If these changes for the Athlon are easy to implement they should be implemented.

Again, I’m just looking for the Athlon and the AMD64 processors to be taken seriously.


SALEM

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.