|
|
#176 | |
|
P90 years forever!
Aug 2002
Yeehaw, FL
19·397 Posts |
Quote:
|
|
|
|
|
|
|
#177 |
|
"Mike"
Aug 2002
5×17×97 Posts |
Whew!
|
|
|
|
|
|
#178 |
|
"Mike"
Aug 2002
5·17·97 Posts |
$945.13 in donations, $835.32 spent, and $109.81 left...
|
|
|
|
|
|
#179 |
|
Apr 2003
Berlin, Germany
19² Posts |
It's nice to hear that the revisions don't matter for mprime and Glucas. It also shows that the compiler isn't producing bad code for SSE2 (although it isn't perfectly optimized). :)
Well, now it's time to explore what we can do with 32 doubles held in registers at once... |
|
|
|
|
|
#180 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
19·397 Posts |
Here are the Opteron B timings. Note that tests 1001 and 1004 are significantly faster on the C. These tests read 16-bit values from the L1 cache to an MMX register. So, something was improved in the core.
Tests 1012 and 1013 are about the same. These are 10000 iterations of the typical prime95 code chunk we talked about in a different thread.
[code:1]
Test 0: 0.000 sec. (137 clocks), avg: 0.000 sec. (149 clocks)
Test 1: 0.000 sec. (2156 clocks), avg: 0.000 sec. (2162 clocks)
Test 2: 0.000 sec. (654156 clocks), avg: 0.000 sec. (654249 clocks)
Test 3: 0.002 sec. (2752185 clocks), avg: 0.002 sec. (2753265 clocks)
Test 4: 0.002 sec. (3144129 clocks), avg: 0.002 sec. (3163605 clocks)
Test 1000: 0.000 sec. (528168 clocks), avg: 0.000 sec. (529374 clocks)
Test 1001: 0.002 sec. (2122592 clocks), avg: 0.002 sec. (2122969 clocks)
Test 1002: 0.002 sec. (2279096 clocks), avg: 0.002 sec. (2298604 clocks)
Test 1003: 0.000 sec. (525161 clocks), avg: 0.000 sec. (525214 clocks)
Test 1004: 0.002 sec. (2460430 clocks), avg: 0.002 sec. (2463085 clocks)
Test 1005: 0.003 sec. (4385117 clocks), avg: 0.003 sec. (4578506 clocks)
Test 1006: 0.000 sec. (528167 clocks), avg: 0.000 sec. (528226 clocks)
Test 1007: 0.002 sec. (2846988 clocks), avg: 0.002 sec. (2854400 clocks)
Test 1008: 0.003 sec. (4238748 clocks), avg: 0.003 sec. (4401877 clocks)
Test 1009: 0.001 sec. (1040154 clocks), avg: 0.001 sec. (1040256 clocks)
Test 1010: 0.004 sec. (4928441 clocks), avg: 0.004 sec. (5036041 clocks)
Test 1011: 0.004 sec. (5346607 clocks), avg: 0.004 sec. (5672540 clocks)
Test 1012: 0.000 sec. (504970 clocks), avg: 0.000 sec. (505043 clocks)
Test 1013: 0.001 sec. (780852 clocks), avg: 0.001 sec. (780980 clocks)
Test 1014: 0.001 sec. (1120793 clocks), avg: 0.001 sec. (1122172 clocks)
Test 1015: 0.001 sec. (1561666 clocks), avg: 0.001 sec. (1563385 clocks)
Test 1016: 0.001 sec. (1290787 clocks), avg: 0.001 sec. (1290936 clocks)
Test 1017: 0.001 sec. (1892635 clocks), avg: 0.001 sec. (1892894 clocks)
Test 1018: 0.001 sec. (1152665 clocks), avg: 0.001 sec. (1152785 clocks)
Test 1019: 0.001 sec. (1997701 clocks), avg: 0.001 sec. (1997987 clocks)
[/code:1]
If you want I can email you the code for the tests above along with a P4's timings. |
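As a rough way to read the listing, the raw clock counts for tests 1012 and 1013 can be divided by the 10000 iterations mentioned in the post. This is only a sketch: the function name is made up for illustration, and it assumes the counts cover nothing but the timed loop (any timer or loop overhead is ignored).

```python
# Clock counts copied from the Opteron B listing (first column, not the averages).
ITERATIONS = 10_000  # the post states that tests 1012 and 1013 run 10000 iterations

clocks = {1012: 504_970, 1013: 780_852}


def clocks_per_iteration(total_clocks: int, iterations: int = ITERATIONS) -> float:
    """Average number of clocks spent on one pass of the timed code chunk."""
    return total_clocks / iterations


for test, total in clocks.items():
    print(f"Test {test}: {clocks_per_iteration(total):.1f} clocks per iteration")
```

That works out to roughly 50 clocks per iteration for test 1012 and roughly 78 for test 1013 on the revision B part.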
|
|
|
|
|
#181 |
|
Sep 2003
A19₁₆ Posts |
Xyzzy, here's a silly question...
Did you disable the on-board video on that Tyan S2850 motherboard? I'm sure you did, cause I think I see a PCI video card in this photo: http://www.mersenneforum.org/viewtopic.php?p=10367#10367 But when you mentioned the system components in this thread I don't think you ever actually mentioned the video card, so I'm just being paranoid about the benchmarks. |
|
|
|
|
|
#182 |
|
"Mike"
Aug 2002
5×17×97 Posts |
The card I added was a network card...
Most onboard video solutions suck, but in this case it is okay because it is a real video card... It just happens to be on the motherboard... It has its own memory... Besides, we don't use it... All logins are via SSH... |
|
|
|
|
|
#183 | |
|
Apr 2003
Berlin, Germany
361₁₀ Posts |
Quote:
Over the weekend I didn't find the time to test different scheduling tricks on the Opteron, but I will during this week. |
|
|
|
|
|
|
#184 |
|
"Mike"
Aug 2002
5·17·97 Posts |
$970.13 in donations, $835.32 spent, and $134.81 left...
The CD-ROM arrived a few days ago... As soon as we reboot I'll toss it in... |
|
|
|
|
|
#185 |
|
Apr 2003
Berlin, Germany
551₈ Posts |
Could someone with a revision C Opteron (Bok?) try a small test under Linux in 64-bit mode?
Please download this ZIP file and run the included "readtest" (source code is included too): http://www.informatik.uni-rostock.de/~mw212/readtst.zip It does a read loop using MMX and SSE2 floating point and integer instructions. Thanks. |
|
|
|
|
|
#186 |
|
I ♥ BOINC!
Oct 2002
Glendale, AZ. (USA)
3×7×53 Posts |
Hopefully not too far off topic...
If I were wanting to purchase an Opteron now, which is the right one to get for single cpu? 940 pin? 939 pin? 740 pin? Which motherboard? Do we need to wait another month ? TIA |
|
|
|
|
|
#187 |
|
Apr 2003
Berlin, Germany
361₁₀ Posts |
@Ironbits:
Well, you can already buy an Opteron. In the future they will still have 940 pins. Only the Athlon 64 FX will have 939 pins (no SMP, but dual channel), and the Athlon 64 will have 754 pins (single channel). There are a lot of motherboards around. Just check which socket they have and buy the CPU accordingly. BTW, the 754-pin Athlon 64 is only a few percent slower than an equally clocked Athlon 64 FX. In most cases the single channel has little to no impact, except on the price, which will be significantly lower. The Athlon 64s won't be one month late. And look out for MoBo+CPU combos. There will be special offers where you can save a lot. |
|
|
|
|
|
#188 | |
|
Aug 2003
5² Posts |
Quote:
Bok |
|
|
|
|
|
|
#189 |
|
Apr 2003
Berlin, Germany
19² Posts |
No prob, thanks.
Here is also a win32 version of the program: http://www.informatik.uni-rostock.de...test_win32.zip (should be run in the console to see the results) Regards, DB/Matthias |
|
|
|
|
|
#190 |
|
Aug 2003
5² Posts |
ok,
I've got it running XP Pro at the moment anyway, so I'll do that one first. Or do you need a 64bit Win 2003 for the test ?? Bok |
|
|
|
|
|
#191 |
|
Apr 2003
Berlin, Germany
19² Posts |
I made different versions because the result is OS-independent. Providing versions for different OSes just makes it easier for people to run the test quickly.
|
|
|
|
|
|
#192 |
|
Aug 2003
5² Posts |
DresdenBoy, these are the results from XP Pro running on the Opteron 1.8.
Running MMX/SSE/SSE2 reading speed tests...
78bf9ff
Time for reading 32kB from cache using MOVQ :2593l (12.64 Bytes per cycle)
Time for reading 4kB from cache using MOVQ :341l (12.01 Bytes per cycle)
Time for reading 32kB cache using MOVAPD :4169l (7.86 Bytes per cycle)
Time for reading 4kB from cache using MOVAPD :533l (7.68 Bytes per cycle)
Time for reading 32kB from cache using MOVDQA :4130l (7.93 Bytes per cycle)
Time for reading 4kB from cache using MOVDQA :533l (7.68 Bytes per cycle)
Time for reading 32kB from cache using MOVAPS :4117l (7.96 Bytes per cycle)
Time for reading 4kB from cache using MOVAPS :533l (7.68 Bytes per cycle) |
|
|
|
|
|
#193 |
|
Aug 2003
25₁₀ Posts |
And results from Suse EL8.2 64-bit are
Opteron64:~/readtst # ./readtest
Running MMX/SSE2 reading speed tests...
Time for reading 32k from L1 cache using MMX :2239 (14.64 Bytes per cycle)
Time for reading 32k from L1 cache using SSE2 :4116 (7.96 Bytes per cycle)
Time for reading 32k from L1 cache using IntSSE2 :4116 (7.96 Bytes per cycle)
Bok |
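The "Bytes per cycle" column in both listings can be cross-checked by dividing the number of bytes read by the reported cycle count. A minimal sketch, assuming "32kB" and "32k" mean 32 × 1024 bytes; the helper name is purely illustrative, and the cycle counts are copied from the XP and Linux results in this thread.

```python
KB = 1024  # assumption: the test program counts kilobytes as 1024 bytes


def bytes_per_cycle(bytes_read: int, cycles: int) -> float:
    """Read throughput in bytes per CPU cycle."""
    return bytes_read / cycles


# (label, bytes read, reported cycle count)
results = [
    ("XP    MOVQ   32kB", 32 * KB, 2593),
    ("XP    MOVQ    4kB", 4 * KB, 341),
    ("XP    MOVAPD 32kB", 32 * KB, 4169),
    ("XP    MOVAPD  4kB", 4 * KB, 533),
    ("Linux MMX    32k ", 32 * KB, 2239),
    ("Linux SSE2   32k ", 32 * KB, 4116),
]

for label, size, cycles in results:
    print(f"{label}: {bytes_per_cycle(size, cycles):.2f} bytes per cycle")
```

With 1024-byte kilobytes the quotients round to the figures the program printed, for example 32768 / 2593 ≈ 12.64 for MOVQ and 32768 / 4116 ≈ 7.96 for the SSE2 loads.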
|
|
|
|
|
#194 |
|
Apr 2003
Berlin, Germany
19² Posts |
Thanks. The results show that nothing has changed for these instructions in the revision C core. And please ignore the l's after the cycle counts in the win version; they are leftovers from the 64-bit code.
So we have to look for other alternatives to use in the code. BTW, according to AMD, they want to sell a lot of Athlon 64s in the next months. I hope we can win over many of them if they know that they could get a "tuned" Prime95 client for Opteron. And before I forget to post it: Detailed Architecture of AMDs 64bit Core. That should answer all remaining questions. BTW, today is the official release of Athlon 64 and Athlon FX (maybe also Athlon 64 M). But I can't follow the events since my colleagues and I have to attend a project workshop in Frankfurt over the next few days. That needs all our time. So my Opteron research will stall for some days. Regards, Matthias/DB |
|
|
|
|
|
#195 | |
|
P90 years forever!
Aug 2002
Yeehaw, FL
1D77₁₆ Posts |
From your article:
Quote:
|
|
|
|
|
|
|
#196 |
|
Apr 2003
Berlin, Germany
19² Posts |
If you go back to the main page (Chip-Architect), you'll find the articles where Hans de Vries analyzed the Prescott core.
|
|
|
|
|
|
#197 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
19×397 Posts |
The Chip-architect article says we should be able to load 16 bytes per cycle. Your tests indicate this is not happening.
Prime95 is slow on the Opteron because the FPU is starved for data. We need to figure out why your test gets only half the expected bandwidth. It would also be nice to run your tests reading data from the L2 cache. We may well have two separate problems. Everything I've tried to resolve this data bandwidth problem has failed. |
|
|
|
|
|
#198 |
|
Apr 2003
Berlin, Germany
19² Posts |
I included the source code in both versions, so you can change it to suit your needs. I will also extend and update the sources to do more tests.
The Chip-Architect article also gives reasons why it's not wise to use instructions on XMM registers which expect a different format. It also states that memory operands for FP operations are fetched by the Int units and delivered to the FPU. Did you also have a look at http://www.digit-life.com/articles2/...ily2-add0.html, http://www.digit-life.com/articles2/...ily2-add1.html and http://www.digit-life.com/articles2/...ily2-add2.html? The second of these has a lot of details about the behaviour of the L1/L2 caches on K8. Recently I looked at an old preview of the Hammer by www.tecchannel.de. They ran their TecMem benchmark on the CPU (an engineering sample running at 800MHz) and reached up to 16 bytes/cycle using MOVDQA. The same tests on Opteron (using a newer version of TecMem) achieved only up to 8 bytes/cycle for 128-bit accesses, while 64-bit accesses read 16 bytes/cycle. But that could have a different reason: MOVDQA has the same opcode as MOVQ, but extended by a 0x66 prefix. If a CPU doesn't understand the SSE2 instruction (like my Athlon XP), it will execute it as the MMX equivalent. In that case the bandwidth actually achieved is just half of the expected value. That could be the case for this engineering sample, because SSE2 could have been disabled for it. |
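The prefix relationship described in this post can be illustrated with the raw instruction bytes. The encodings below are the standard 32-bit forms of `movq mm0, [eax]` and `movdqa xmm0, [eax]`; the helper names and the "reported versus actual" rescaling are illustrative assumptions, and whether a particular CPU really ignores the prefix is the claim made in the post, not something this sketch verifies.

```python
SSE2_PREFIX = 0x66  # the prefix byte that turns the MOVQ opcode into MOVDQA

MOVQ_MM0_EAX = bytes([0x0F, 0x6F, 0x00])           # movq   mm0, [eax]  (8-byte load)
MOVDQA_XMM0_EAX = bytes([0x66, 0x0F, 0x6F, 0x00])  # movdqa xmm0, [eax] (16-byte load)


def without_prefix(encoding: bytes) -> bytes:
    """Drop a leading 0x66 byte, mimicking a CPU that ignores the prefix."""
    if encoding and encoding[0] == SSE2_PREFIX:
        return encoding[1:]
    return encoding


def actual_bytes_per_cycle(reported: float, assumed_width: int, real_width: int) -> float:
    """Rescale a benchmark figure when each load moved real_width bytes
    although the benchmark counted assumed_width bytes per load."""
    return reported * real_width / assumed_width


# Without the prefix, the MOVDQA encoding is byte-for-byte the MOVQ encoding.
print(without_prefix(MOVDQA_XMM0_EAX) == MOVQ_MM0_EAX)

# A reported 16 bytes/cycle would shrink if every "16-byte" load moved only 8 bytes.
print(actual_bytes_per_cycle(16.0, assumed_width=16, real_width=8))
```

Under that assumption, the 16 bytes/cycle reported for the engineering sample would correspond to 8 bytes/cycle of real traffic, which is in line with the roughly 8 bytes/cycle seen for 128-bit loads on the Opteron.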
|
|
|
|
|
#199 | |
|
Aug 2002
Rovereto (Italy)
3×53 Posts |
Quote:
Thanks in advance. Guido |
|
|
|
|
|
|
#200 | |
|
Sep 2003
5·11·47 Posts |
Quote:
http://www.chip-architect.com/news/2...4bit_Core.html Maybe the mods for this forum could just edit Dresdenboy's original post to fix this. [Edit: well it's not Dresdenboy's fault: it's the board! It introduced the same garbage characters for me as it did for him.] Try: http:// www.chip-architect.com/news/ 2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Last fiddled with by GP2 on 2003-09-29 at 16:27 |
|
|
|
|