|
|
#1 |
|
Sep 2003
A1E₁₆ Posts |
Prime95 doesn't yet work well on the Athlon 64 / Opteron.
If I recall correctly, the major stumbling block is that data can't be moved into and out of the FPU nearly as fast as the specs say it should be. In this thread, maybe we can think outside the box to come up with possible solutions. I don't mean suggestions for assembly language coding... I don't know any Intel assembler and neither do most of us. I mean, rather than putting the entire burden on George (and maybe a few others like Dresdenboy), can we find some way to get third-party experts to help out? Some suggestions are below; feel free to add your own. |
|
|
|
|
|
#2 |
|
Sep 2003
2·5·7·37 Posts |
One possibility is to try to throw a little money at the problem.
There are some sites like Google Answers where you can ask questions and offer a cash bounty for answers. [Does anyone know any other similar sites?] If we could formulate a very specific question (i.e., why doesn't this snippet of code get data to and from the FPU as fast as AMD claims it should?), we could try posting it to Google Answers and see if there is any response. I'd personally be willing to contribute $100 to a cash bounty for solving whatever is currently the major stumbling block for Prime95 running efficiently on Athlon64. One problem with Google Answers, though, is that it's a general forum. You might not find hardcore assembly language experts there. Perhaps there are some other specifically programming-oriented sites for this? I vaguely remember one, but don't recall the URL. Another possibility is a site like RentACoder.com. Any other suggestions? |
|
|
|
|
|
#3 |
|
Sep 2003
2·5·7·37 Posts |
Another idea would be to purchase support from AMD.
Somewhere on AMD's web site there must be a board similar to this one, where you can ask questions and get answers directly from AMD hardware gurus. But presumably it's password-protected and you need to purchase a subscription or something. For instance, the AMD Developer Center page specifically mentions "Code Optimizations: FPU through-put, SSE, and SSE2 optimizations". The AMD Developer Center is here: http://www.developwithamd.com/apppar...fm?action=home and I think this is the form to fill in for it: http://www.developwithamd.com/apppar...=DevCenterHome Sorry if I'm mentioning stuff that is already well known... the extra twist here would be that if a "premium" level exists for AMD developers for a couple hundred bucks a year, where you can actually get answers from a knowledgeable live person, perhaps we could buy into it. |
|
|
|
|
|
#4 |
|
Aug 2003
Turkey
1000₂ Posts |
Before spending money maybe you can try http://forums.amd.com/
There is an Opteron section. |
|
|
|
|
|
#5 | |
|
Sep 2003
2×5×7×37 Posts |
Quote:
George or Dresdenboy, can you post such a precisely formulated question in this thread? However... I'm guessing that that particular board is not unlike this one... general discussion by "laypersons". Just looking at the thread subjects, there doesn't seem to be any discussion at all of programming... just threads about what memory or motherboards to use, and so forth. Looks like a board for people buying or building Athlon64 boxes... not a developer board. I think we want direct access to some of the folks who work for AMD and actually designed the chip and know. Knowing how these things work, they probably charge "strategic software partners" for access, if only to filter out the thousands of random enthusiasts who would otherwise pester their key employees. That's why I think we might need to buy into this kind of access... and the Opteron fundraising showed that we can do this. For the AMD Developer Center, they specifically promise help with: "Code Optimizations: FPU through-put, SSE, and SSE2 optimizations". That looks like precisely, exactly what we want. Last fiddled with by GP2 on 2003-12-18 at 08:51 |
|
|
|
|
|
|
#6 | |
|
Aug 2003
Turkey
2³ Posts |
Quote:
As you mentioned, there aren't many technical questions on that board, but we have nothing to lose by trying. |
|
|
|
|
|
|
#7 |
|
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
2×72 Posts |
I would suggest posting very specific questions (I recall George mentioning that he's unable to do as many FP loads per second as he should be able to) and code snippets to the comp.arch newsgroup. Terje Mathisen and a number of other processor/assembly folks read that group and I've seen many constructive discussions come out of "this assembly fragment is not performing as I'd expect" type postings.
http://groups.google.com/groups?q=te...ro.com&rnum=30 for one example thread. Ethan O'Connor |
|
|
|
|
|
#8 |
|
Apr 2003
Berlin, Germany
192 Posts |
We already found a workaround very soon after this problem had been identified.
The only bottleneck of the currently available AMD64 CPUs is that an SSE2 load instruction only manages to load half an SSE2 register (1 double) per cycle, although the maximum possible rate is two 64-bit values per cycle, which can be achieved by using MMX, 64-bit int loads. I don't remember whether x87 loads have the same bandwidth limitation. The optimization manual states that MOVAPD (the instruction used) can be issued to the FADD/FMUL/FSTOR units (the same is the case for FLD), which implies that either one MOVAPD could load 2 doubles at once or at least 2 MOVAPDs could be executed in parallel (each of them loading its register halves serially). However, the full rate is only available when memory operands are used; that is, the values are not explicitly loaded into a register but just used as an operand (which translates to one load and one execute instruction without lowering the decode and issue bandwidth). It is not that easy to modify tons of code to apply such a scheme. And there are other ways to make use of free CPU resources. More on that can be found in other threads. Regards, Matthias |
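To make the two load styles described in the post above concrete, here is a minimal sketch in C with GCC/Clang inline assembly (AT&T syntax). It is an illustration only, not Prime95 code: the function names, the `PAIR_AT` macro, and the sample data are all hypothetical, and it assumes an x86-64 target with SSE2 and 16-byte-aligned input. Form A loads with an explicit MOVAPD and then adds register to register; Form B lets ADDPD read its source directly from memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_setzero_pd, _mm_storeu_pd */

/* Hypothetical helper: view the i-th pair of doubles at p as one 16-byte operand. */
#define PAIR_AT(p, i) (*(const __m128d *)((p) + 2 * (i)))

/* Form A: explicit MOVAPD load into a register, then register-register ADDPD. */
static __m128d sum_explicit_load(const double *p, size_t n_pairs)
{
    assert(((uintptr_t)p & 15) == 0);  /* MOVAPD requires 16-byte alignment */
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n_pairs; i++) {
        __m128d tmp;
        __asm__("movapd %1, %0" : "=x"(tmp) : "m"(PAIR_AT(p, i)));
        __asm__("addpd %1, %0" : "+x"(acc) : "x"(tmp));
    }
    return acc;
}

/* Form B: ADDPD takes its source straight from memory (no separate load). */
static __m128d sum_memory_operand(const double *p, size_t n_pairs)
{
    assert(((uintptr_t)p & 15) == 0);  /* legacy SSE memory operands need alignment too */
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n_pairs; i++) {
        __asm__("addpd %1, %0" : "+x"(acc) : "m"(PAIR_AT(p, i)));
    }
    return acc;
}

/* Runs both forms on sample data, checks they agree, returns the grand total. */
static double demo_total(void)
{
    static const double data[8] __attribute__((aligned(16))) =
        {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
    double a[2], b[2];
    _mm_storeu_pd(a, sum_explicit_load(data, 4));
    _mm_storeu_pd(b, sum_memory_operand(data, 4));
    assert(a[0] == b[0] && a[1] == b[1]);
    return a[0] + a[1];
}
```

Both forms compute the same result; any speed difference between them on a given CPU would have to be measured, and a real FFT kernel would of course be far more involved than this loop.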
|
|
|
|
|
#9 | ||
|
Sep 2003
2×5×7×37 Posts |
Quote:
Ethan, can you perhaps try to inquire about this on comp.arch ? And post a Google Groups link to the thread there if you start one... Is this known to be a limitation of the architecture, for all Athlon64s and Opterons, or is it by any chance just a limitation of certain early steppings... wild guesses here, I really don't know much about CPUs. Quote:
Rather than modifying code for all FFTs, could we consider just modifying it for one or two FFT lengths (the ones where most testing is currently being done)... Is there any chance that such a code modification would be more generally useful (applicable to a future Intel x86-64 chip for instance) rather than just a workaround for the current version of the Athlon64? Once again, I'm not particularly familiar with CPUs and assembler, so I'm not sure if any of those questions make sense... |
||
|
|
|
|
|
#10 | |
|
Aug 2002
3×37 Posts |
Quote:
It is hard to force the compiler to do what one would do in assembler when using plain C and calls to the Intel intrinsics library. Sometimes I was sure about some modification: I expected a better timing and the result was just the opposite. And sometimes I was expecting no gain and, surprisingly, it got better results. Guillermo. |
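As a small illustration of the point about intrinsics, here is a sketch of a packed-double sum written in plain C with SSE2 intrinsics. The function names and sample data are hypothetical. Note that the source only says "load, then add"; whether the compiler emits a separate MOVAPD or folds the load into ADDPD's memory operand is its own decision and can change with compiler version and flags, which is one reason timings can be surprising.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum n_pairs pairs of doubles; p must be 16-byte aligned for _mm_load_pd. */
static __m128d sum_pairs_intrinsics(const double *p, size_t n_pairs)
{
    assert(((uintptr_t)p & 15) == 0);
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n_pairs; i++) {
        /* The compiler is free to emit either MOVAPD + ADDPD or ADDPD mem here. */
        acc = _mm_add_pd(acc, _mm_load_pd(p + 2 * i));
    }
    return acc;
}

/* Collapse the two lanes into a single scalar total. */
static double sum_all(const double *p, size_t n_pairs)
{
    double lanes[2];
    _mm_storeu_pd(lanes, sum_pairs_intrinsics(p, n_pairs));
    return lanes[0] + lanes[1];
}

/* Tiny demo on hypothetical sample data. */
static double demo_intrinsics_total(void)
{
    static const double data[4] __attribute__((aligned(16))) =
        {1.0, 2.0, 3.0, 4.0};
    return sum_all(data, 2);
}
```

To see what the compiler actually chose, one can inspect the generated assembly (for example with `gcc -O2 -S`) rather than guess from the C source.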
|
|
|
|
|
|
#11 |
|
Apr 2003
Berlin, Germany
192 Posts |
http://www.amd.com/us-en/assets/cont...C_2003_pdf.pdf, page 13 also mentions the MOVAPD problem - with different reasons for this behaviour of the K8 chips.
It is understandable that MOVAPD will use the FMUL/FADD pipelines if FSTOR is already busy and thus could take away some FMUL/FADD issue slots. But I also observed the behaviour of one 64-bit load per cycle when using MOVAPDs only. |
|
|
|