mersenneforum.org  

Old 2003-12-18, 05:25   #1
GP2
 
Sep 2003

A1E₁₆ Posts
Default Creative ways to achieve Athlon 64 / Opteron optimization

Currently Prime95 doesn't yet work well with Athlon 64 / Opteron.

If I recall correctly, the major stumbling block is that data can't be gotten into and out of the FPU nearly as fast as it ought to based on the specs.

In this thread, maybe we can think outside the box to come up with possible solutions.

I don't mean suggestions for assembly language coding... I don't know any Intel assembler and neither do most of us. I mean, rather than putting the entire burden on George (and maybe a few others like Dresdenboy), can we find some way to get third-party experts to help out?

Some suggestions below, feel free to add your own.
Old 2003-12-18, 05:37   #2
GP2
 
Sep 2003

2·5·7·37 Posts
Default

One possibility is to try to throw a little money at the problem.

There are some sites like Google Answers where you can ask questions and offer a cash bounty for answers. [Does anyone know any other similar sites?]

If we could formulate a very specific question (e.g., why doesn't this snippet of code get data to and from the FPU as fast as AMD claims it should?), we could try posting it to Google Answers and see if there is any response. I'd personally be willing to contribute $100 to a cash bounty for solving whatever is currently the major stumbling block for Prime95 running efficiently on Athlon64.

One problem with Google Answers, though, is that it's a general forum. You might not find hardcore assembly language experts there. Perhaps there are some other specifically programming-oriented sites for this? I vaguely remember one, but don't recall the URL.

Another possibility is a site like RentACoder.com.

Any other suggestions?
Old 2003-12-18, 05:52   #3
GP2
 
Sep 2003

2·5·7·37 Posts
Default

Another idea would be to purchase support from AMD.

Somewhere on AMD's web site there must be a board similar to this one, where you can ask questions and get answers directly from AMD hardware gurus. But presumably it's password-protected and you need to purchase a subscription or something.

For instance, the AMD Developer Center page specifically mentions "Code Optimizations: FPU through-put, SSE, and SSE2 optimizations"

AMD Developer Center is here:
http://www.developwithamd.com/apppar...fm?action=home

I think this is the form to fill in for this:
http://www.developwithamd.com/apppar...=DevCenterHome

Sorry if I'm mentioning stuff that is already well-known... the extra twist here would be, if a "premium" level exists for AMD developers for a couple hundred bucks a year, where you can actually get answers from a knowledgeable live person, perhaps we could buy into it.
Old 2003-12-18, 08:00   #4
Erix
 
Aug 2003
Turkey

1000₂ Posts
Default

Before spending money maybe you can try http://forums.amd.com/
There is an Opteron section.
Old 2003-12-18, 08:50   #5
GP2
 
Sep 2003

2×5×7×37 Posts
Default

Quote:
Originally posted by Erix
Before spending money maybe you can try http://forums.amd.com/
There is an Opteron section.
Erix, if you're a member on that board, perhaps you could volunteer to ask a question there, provided that George or Dresdenboy can come up with a precise formulation for the question (including a code snippet).

George or Dresdenboy, can you post such a precisely formulated question in this thread?

However...
I'm guessing that that particular board is not unlike this one... general discussion by "laypersons". Just looking at the thread subjects, there doesn't seem to be any discussion at all of programming... just threads about what memory or motherboards to use, and so forth. Looks like a board for people buying or building Athlon64 boxes... not a developer board.

I think we want direct access to some of the folks who work for AMD and actually designed the chip and know.

Knowing how these things work, they probably charge "strategic software partners" for access, if only to filter out the thousands of random enthusiasts who would otherwise pester their key employees. That's why I think we might need to buy into this kind of access... and the Opteron fundraising showed that we can do this.


For the AMD Developer Center, they specifically promise help with: "Code Optimizations: FPU through-put, SSE, and SSE2 optimizations". That looks like precisely, exactly what we want.

Last fiddled with by GP2 on 2003-12-18 at 08:51
Old 2003-12-18, 09:43   #6
Erix
 
Aug 2003
Turkey

2³ Posts
Default

Quote:
Originally posted by GP2
Erix, if you're a member on that board, perhaps you could volunteer to ask a question there, provided that George or Dresdenboy can come up with a precise formulation for the question (including a code snippet).
Sure. I would like to help as much as I can. Just let me know the question.

As you mentioned, there are not many technical questions on that board, but we have nothing to lose by trying.
Old 2003-12-18, 21:39   #7
Ethan (EO)
 
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996

2×7² Posts
Default

I would suggest posting very specific questions (I recall George mentioning that he's unable to do as many FP loads per second as he should be able to) and code snippets to the comp.arch newsgroup. Terje Mathisen and a number of other processor/assembly folks read that group and I've seen many constructive discussions come out of "this assembly fragment is not performing as I'd expect" type postings.

http://groups.google.com/groups?q=te...ro.com&rnum=30 for one example thread.


Ethan O'Connor
Old 2003-12-19, 00:06   #8
Dresdenboy
 
Apr 2003
Berlin, Germany

19² Posts
Default

We found a workaround very soon after this problem was identified.

The only bottleneck of the currently available AMD64 CPUs is that an SSE2 load instruction manages to load only one half of an SSE2 register (1 double) per cycle, although the maximum possible rate is two 64-bit values per cycle, which can be achieved by using MMX, 64-bit int loads. I don't remember if x87 loads have the same bandwidth limitation. The optimization manual states that MOVAPD (the instruction used) can be issued to the FADD/FMUL/FSTOR units (the same is the case for FLD), which somehow implies either that one MOVAPD could load 2 doubles at once or that at least 2 MOVAPDs could be executed in parallel (each of them loading its register halves serially).

However, the full rate is only available when memory operands are used. That means they are not explicitly loaded into a register but just used as an operand (which translates to one load and one execute instruction without lowering the decode and issue bandwidth).
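
For readers who don't write assembler, the difference between an explicit load and a memory operand can be sketched in C. This is only an illustration under stated assumptions, not code from Prime95: the function names (`sum_explicit_load`, `sum_memory_operand`, `add_from_memory`) are made up, the input is assumed to be 16-byte aligned with an even element count, and the inline-assembly path assumes a GCC- or Clang-style compiler. With plain intrinsics the compiler is free to pick either encoding, which is why the second variant uses inline assembly to pin the memory-operand form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_load_pd, ... */

/* Hypothetical helper: adds the two packed doubles at p to acc, with the
   load folded into ADDPD as a memory operand (no separate MOVAPD). */
static __m128d add_from_memory(__m128d acc, const double *p)
{
#if defined(__GNUC__)
    /* GCC/Clang extended asm, AT&T syntax: ADDPD with a memory source. */
    __asm__("addpd %1, %0" : "+x"(acc) : "m"(*(const __m128d *)p));
    return acc;
#else
    /* Other compilers: the encoding is left to the compiler. */
    return _mm_add_pd(acc, _mm_load_pd(p));
#endif
}

/* Adds the two lanes of a packed-double register. */
static double horizontal_sum(__m128d v)
{
    double lanes[2];
    _mm_storeu_pd(lanes, v);
    return lanes[0] + lanes[1];
}

/* Variant A: load into a register first, then add register to register.
   A compiler usually emits MOVAPD + ADDPD here, but may fuse the two. */
double sum_explicit_load(const double *a, size_t n)
{
    assert(n % 2 == 0 && (uintptr_t)a % 16 == 0);
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 2) {
        __m128d tmp = _mm_load_pd(a + i);   /* aligned 128-bit load */
        acc = _mm_add_pd(acc, tmp);
    }
    return horizontal_sum(acc);
}

/* Variant B: the data is consumed directly as a memory operand. */
double sum_memory_operand(const double *a, size_t n)
{
    assert(n % 2 == 0 && (uintptr_t)a % 16 == 0);
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 2)
        acc = add_from_memory(acc, a + i);
    return horizontal_sum(acc);
}

#else  /* No SSE2 available: scalar fallback so the sketch still builds. */

double sum_explicit_load(const double *a, size_t n)
{
    assert(n % 2 == 0);
    double lo = 0.0, hi = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        lo += a[i];
        hi += a[i + 1];
    }
    return lo + hi;
}

double sum_memory_operand(const double *a, size_t n)
{
    return sum_explicit_load(a, n);
}

#endif
```

Both variants compute the same sum; any difference would show up only in timing, and only on hardware with the load behaviour described above. Checking the compiler's assembly output is the only way to be sure which form variant A actually ends up using.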

It is not that easy to modify tons of code to apply such a scheme.

And there are other ways to make use of free CPU resources. More on that can be found in different threads.

Regards,
Matthias
Old 2003-12-19, 01:31   #9
GP2
 
Sep 2003

2×5×7×37 Posts
Default

Quote:
Originally posted by Dresdenboy
The only bottleneck of the current available AMD64 CPUs is, that a SSE2 load instruction only manages to load 1 half SSE2 register (1 double) per cycle although the max possible rate is 2 64bit values/cycle
OK, perhaps this is where we could use some live-person third-party expertise, to fully understand under what circumstances the maximum possible rate can be achieved, rather than relying solely on the optimization manual... documentation doesn't always match the actual behavior of software or hardware.

Ethan, can you perhaps try to inquire about this on comp.arch ? And post a Google Groups link to the thread there if you start one...

Is this known to be a limitation of the architecture, for all Athlon64s and Opterons, or is it by any chance just a limitation of certain early steppings... wild guesses here, I really don't know much about CPUs.


Quote:

It is not that easy to modify tons of code to apply such a scheme.
Is there by any chance any part of such a modification that would be merely tedious and time-consuming and relatively mechanical and straightforward, rather than requiring a lot of creative thought to rewrite? If so, we could try subcontracting that part out (maybe RentACoder.com, or some volunteers here who know assembler).

Rather than modifying code for all FFTs, could we consider just modifying it for one or two FFT lengths (the ones where most testing is currently being done)...

Is there any chance that such a code modification would be more generally useful (applicable to a future Intel x86-64 chip for instance) rather than just a workaround for the current version of the Athlon64?

Once again, I'm not particularly familiar with CPUs and assembler, so I'm not sure if any of those questions make sense...
Old 2003-12-19, 14:58   #10
gbvalor
 
Aug 2002

3×37 Posts
Default

Quote:
However - the full rate is only available when memory operands are used - that means, they are not explicitly loaded into a register but just used as an operand (which translates to one load and one execute instruction without lowering the decode and issue bandwidth).
That's very interesting! It could explain some strange timings I got when trying to optimize Glucas.

It is hard to force the compiler to do what one would do in assembler when using plain C and calls to the Intel intrinsics library. Sometimes I was sure about a modification and expected a better timing, and the result was just the opposite. And sometimes I expected no gain and, surprisingly, got better results.
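
Since the compiler's choices make such timings hard to predict, measuring each variant is about the only reliable guide. Below is a minimal sketch of a timing harness in plain C. It is not taken from Glucas: the names (`kernel_fn`, `sum_plain`, `sum_two_accumulators`, `best_time_seconds`) are hypothetical, and `clock()` reports processor time at coarse resolution, so real measurements would need large arrays and many repetitions.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature shared by the kernels being compared (hypothetical). */
typedef double (*kernel_fn)(const double *data, size_t n);

/* Straightforward summation with a single accumulator. */
double sum_plain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same sum with two independent accumulators, a typical manual
   "optimization" whose benefit depends on what the compiler emits. */
double sum_two_accumulators(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        s0 += a[i];
    return s0 + s1;
}

/* Runs the kernel `reps` times and returns the smallest processor time
   observed, in seconds. The kernel's last return value is stored in
   *result so the calls cannot be discarded as dead code. */
double best_time_seconds(kernel_fn f, const double *a, size_t n,
                         int reps, double *result)
{
    assert(reps > 0 && result != NULL);
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        clock_t t0 = clock();
        double v = f(a, n);
        clock_t t1 = clock();
        double dt = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || dt < best)
            best = dt;
        *result = v;
    }
    return best;
}
```

Taking the minimum over several runs filters out some interference from other processes. Comparing the stored results of both kernels before comparing their times also guards against a "faster" variant that is simply wrong.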

Guillermo.
Old 2004-01-19, 09:51   #11
Dresdenboy
 
Apr 2003
Berlin, Germany

19² Posts
Default

http://www.amd.com/us-en/assets/cont...C_2003_pdf.pdf, page 13 also mentions the MOVAPD problem - with different reasons for this behaviour of the K8 chips.

It is understandable that MOVAPD will use the FMUL/FADD pipelines if FSTOR is already busy and thus could take away some FMUL/FADD issue slots. But I also observed the one 64-bit load per cycle behaviour while using only MOVAPDs.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
