Old 2003-09-06, 20:33   #176
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19·397 Posts

Quote:
Originally Posted by gbvalor
It seems there is no significant advantage using Revision C for Glucas.
From the timer values I see no significant advantage for rev C for prime95 either.
Old 2003-09-07, 04:23   #177
Xyzzy
 
"Mike"
Aug 2002

2035₁₆ Posts

Whew!
Old 2003-09-07, 11:08   #178
Xyzzy
 
"Mike"
Aug 2002

5·17·97 Posts

$945.13 in donations, $835.32 spent, and $109.81 left...
Old 2003-09-07, 15:47   #179
Dresdenboy
 
Apr 2003
Berlin, Germany

361₁₀ Posts

It's nice to hear that the revisions don't matter for mprime and Glucas. It also shows that the compiler isn't producing bad code for SSE2 (although it is not perfectly optimized). :)

Well, now it's time to explore what we can do with 32 doubles held in registers at once...
Old 2003-09-07, 20:38   #180
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19·397 Posts

Here are the Opteron rev B timings. Note that tests 1001 and 1004 are significantly faster on the rev C. These tests read 16-bit values from the L1 cache into an MMX register. So, something was improved in the core.

Tests 1012 and 1013 are about the same. These are 10000 iterations of the typical prime95 code chunk we talked about in a different thread.

[code:1]Test 0: 0.000 sec. (137 clocks), avg: 0.000 sec. (149 clocks)
Test 1: 0.000 sec. (2156 clocks), avg: 0.000 sec. (2162 clocks)
Test 2: 0.000 sec. (654156 clocks), avg: 0.000 sec. (654249 clocks)
Test 3: 0.002 sec. (2752185 clocks), avg: 0.002 sec. (2753265 clocks)
Test 4: 0.002 sec. (3144129 clocks), avg: 0.002 sec. (3163605 clocks)
Test 1000: 0.000 sec. (528168 clocks), avg: 0.000 sec. (529374 clocks)
Test 1001: 0.002 sec. (2122592 clocks), avg: 0.002 sec. (2122969 clocks)
Test 1002: 0.002 sec. (2279096 clocks), avg: 0.002 sec. (2298604 clocks)
Test 1003: 0.000 sec. (525161 clocks), avg: 0.000 sec. (525214 clocks)
Test 1004: 0.002 sec. (2460430 clocks), avg: 0.002 sec. (2463085 clocks)
Test 1005: 0.003 sec. (4385117 clocks), avg: 0.003 sec. (4578506 clocks)
Test 1006: 0.000 sec. (528167 clocks), avg: 0.000 sec. (528226 clocks)
Test 1007: 0.002 sec. (2846988 clocks), avg: 0.002 sec. (2854400 clocks)
Test 1008: 0.003 sec. (4238748 clocks), avg: 0.003 sec. (4401877 clocks)
Test 1009: 0.001 sec. (1040154 clocks), avg: 0.001 sec. (1040256 clocks)
Test 1010: 0.004 sec. (4928441 clocks), avg: 0.004 sec. (5036041 clocks)
Test 1011: 0.004 sec. (5346607 clocks), avg: 0.004 sec. (5672540 clocks)
Test 1012: 0.000 sec. (504970 clocks), avg: 0.000 sec. (505043 clocks)
Test 1013: 0.001 sec. (780852 clocks), avg: 0.001 sec. (780980 clocks)
Test 1014: 0.001 sec. (1120793 clocks), avg: 0.001 sec. (1122172 clocks)
Test 1015: 0.001 sec. (1561666 clocks), avg: 0.001 sec. (1563385 clocks)
Test 1016: 0.001 sec. (1290787 clocks), avg: 0.001 sec. (1290936 clocks)
Test 1017: 0.001 sec. (1892635 clocks), avg: 0.001 sec. (1892894 clocks)
Test 1018: 0.001 sec. (1152665 clocks), avg: 0.001 sec. (1152785 clocks)
Test 1019: 0.001 sec. (1997701 clocks), avg: 0.001 sec. (1997987 clocks)[/code:1]

If you want I can email you the code for the tests above along with a P4's timings.
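
Editorial sketch, not part of the original post: assuming the log format shown above, the clock counts can be extracted with a regular expression. For tests 1012 and 1013, which the post says run 10000 iterations each, dividing by the iteration count gives a rough clocks-per-iteration figure. The function names and the regex are illustrative assumptions, not anything from the prime95 source.

```python
import re

# Matches lines such as:
# "Test 1012: 0.000 sec. (504970 clocks), avg: 0.000 sec. (505043 clocks)"
LINE_RE = re.compile(
    r"Test (\d+): [\d.]+ sec\. \((\d+) clocks\), avg: [\d.]+ sec\. \((\d+) clocks\)"
)

def parse_timings(text):
    """Return {test_number: (best_clocks, avg_clocks)} for every matching line."""
    return {int(t): (int(best), int(avg)) for t, best, avg in LINE_RE.findall(text)}

def clocks_per_iteration(clocks, iterations=10000):
    """Average clocks per loop iteration (10000 iterations, per the post)."""
    return clocks / iterations

# Two lines copied from the rev B listing above.
sample = """\
Test 1012: 0.000 sec. (504970 clocks), avg: 0.000 sec. (505043 clocks)
Test 1013: 0.001 sec. (780852 clocks), avg: 0.001 sec. (780980 clocks)
"""
timings = parse_timings(sample)
for test, (best, avg) in sorted(timings.items()):
    print(test, round(clocks_per_iteration(best), 1))
```

On the two sample lines this works out to roughly 50.5 and 78.1 clocks per iteration.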
Old 2003-09-07, 21:40   #181
GP2
 
Sep 2003

5·11·47 Posts

Xyzzy, here's a silly question...

Did you disable the on-board video on that Tyan S2850 motherboard?
I'm sure you did, because I think I see a PCI video card in this photo:

http://www.mersenneforum.org/viewtopic.php?p=10367#10367

But when you mentioned the system components in this thread I don't think you ever actually mentioned the video card, so I'm just being paranoid about the benchmarks.
Old 2003-09-07, 22:09   #182
Xyzzy
 
"Mike"
Aug 2002

8245₁₀ Posts

The card I added was a network card...

Most onboard video solutions suck, but in this case it is okay because it is a real video card... It just happens to be on the motherboard... It has its own memory...

Besides, we don't use it... All logins are via SSH...
Old 2003-09-08, 07:09   #183
Dresdenboy
 
Apr 2003
Berlin, Germany

19² Posts

Quote:
Originally Posted by Prime95
If you want I can email you the code for the tests above along with a P4's timings.
That would be nice. Also it would be interesting to see how different code blocks behave on Opteron.

Over the weekend I didn't find the time to test different scheduling tricks on the Opteron, but I will do so during this week.
Old 2003-09-13, 06:20   #184
Xyzzy
 
"Mike"
Aug 2002

8245₁₀ Posts

$970.13 in donations, $835.32 spent, and $134.81 left...

The CD-ROM arrived a few days ago... As soon as we reboot I'll toss it in...
Old 2003-09-15, 13:52   #185
Dresdenboy
 
Apr 2003
Berlin, Germany

169₁₆ Posts

Could someone with a revision C Opteron (Bok?) try a small test under Linux in 64-bit mode?

Please download this ZIP file and run the included "readtest" (source code is included too): http://www.informatik.uni-rostock.de/~mw212/readtst.zip

It does a read loop using MMX and SSE2 floating point and integer instructions.

Thanks.
Old 2003-09-15, 15:56   #186
IronBits
I ♥ BOINC!
 
Oct 2002
Glendale, AZ. (USA)

3×7×53 Posts

Hopefully not too far off topic...
If I were wanting to purchase an Opteron now, which is the right one to get for single cpu?
940 pin?
939 pin?
740 pin?
Which motherboard?
Do we need to wait another month ?
TIA
Old 2003-09-15, 16:04   #187
Dresdenboy
 
Apr 2003
Berlin, Germany

19² Posts

@Ironbits:

Well, you can already buy an Opteron. In the future they will still have 940 pins. Only the Athlon 64 FX will have 939 pins (no SMP, but dual channel), and the Athlon 64 will have 754 pins (single channel). There are a lot of motherboards around.

Just look at which socket they have and buy the CPU accordingly. BTW, the 754-pin Athlon 64 is only a few percent slower than an equally clocked Athlon 64 FX. In most cases the single channel has little to no impact, except on the price, which will be significantly lower.

The Athlon 64s won't be a month late. And look out for MoBo+CPU combos. There will be special offers where you can save a lot.
Old 2003-09-22, 14:17   #188
Bok
 
Aug 2003

5² Posts

Quote:
Originally posted by Dresdenboy
Could someone with a revision C Opteron (Bok?) try a small test under Linux in 64-bit mode?

Please download this ZIP file and run the included "readtest" (source code is included too): http://www.informatik.uni-rostock.de/~mw212/readtst.zip

It does a read loop using MMX and SSE2 floating point and integer instructions.

Thanks.
Sorry, been on vacation. I'll try that test later tonight.

Bok
Old 2003-09-22, 14:41   #189
Dresdenboy
 
Apr 2003
Berlin, Germany

19² Posts

No prob, thanks.

Here is also a win32 version of the program:
http://www.informatik.uni-rostock.de...test_win32.zip (should be run in the console to see the results)

Regards,
DB/Matthias
Old 2003-09-22, 15:06   #190
Bok
 
Aug 2003

5² Posts

ok,

I've got it running XP Pro at the moment anyway, so I'll do that one first. Or do you need 64-bit Win 2003 for the test?

Bok
Old 2003-09-22, 16:59   #191
Dresdenboy
 
Apr 2003
Berlin, Germany

19² Posts

I made different versions because the result is OS-independent. Providing versions for different OSes makes it easier for people to run the test quickly.
Old 2003-09-22, 21:55   #192
Bok
 
Aug 2003

5² Posts

Dresdenboy, these are the results from XP Pro running on the Opteron 1.8.

Running MMX/SSE/SSE2 reading speed tests... 78bf9ff

Time for reading 32kB from cache using MOVQ :2593l (12.64 Bytes per cycle)
Time for reading 4kB from cache using MOVQ :341l (12.01 Bytes per cycle)
Time for reading 32kB cache using MOVAPD :4169l (7.86 Bytes per cycle)
Time for reading 4kB from cache using MOVAPD :533l (7.68 Bytes per cycle)
Time for reading 32kB from cache using MOVDQA :4130l (7.93 Bytes per cycle)
Time for reading 4kB from cache using MOVDQA :533l (7.68 Bytes per cycle)
Time for reading 32kB from cache using MOVAPS :4117l (7.96 Bytes per cycle)
Time for reading 4kB from cache using MOVAPS :533l (7.68 Bytes per cycle)
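
Editorial sketch, not part of the original post: the "Bytes per cycle" figures above appear to be simply the buffer size divided by the cycle count. The snippet below reproduces a few of them, assuming "32kB" means 32 * 1024 bytes; the helper name is hypothetical and is not taken from the readtest source.

```python
def bytes_per_cycle(num_bytes, cycles):
    """Throughput of a read loop: bytes read divided by elapsed CPU cycles."""
    return num_bytes / cycles

KB = 1024  # assumption: the test's "kB" means 1024 bytes

# (instruction, buffer size in bytes, cycle count) copied from the results above
results = [
    ("MOVQ",   32 * KB, 2593),
    ("MOVQ",    4 * KB,  341),
    ("MOVAPD", 32 * KB, 4169),
    ("MOVDQA", 32 * KB, 4130),
]
for name, size, cycles in results:
    print(f"{name:7s} {size // KB:2d} kB: {bytes_per_cycle(size, cycles):.2f} bytes/cycle")
```

With these inputs the computed values round to 12.64, 12.01, 7.86 and 7.93, matching the program output, which supports the 1024-byte reading of "kB".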
Old 2003-09-23, 01:09   #193
Bok
 
Aug 2003

5² Posts

And results from Suse EL8.2 64-bit are

Opteron64:~/readtst # ./readtest
Running MMX/SSE2 reading speed tests...

Time for reading 32k from L1 cache using MMX :2239 (14.64 Bytes per cycle)
Time for reading 32k from L1 cache using SSE2 :4116 (7.96 Bytes per cycle)
Time for reading 32k from L1 cache using IntSSE2 :4116 (7.96 Bytes per cycle)


Bok
Old 2003-09-23, 06:09   #194
Dresdenboy
 
Apr 2003
Berlin, Germany

19² Posts

Thanks. The results show that nothing has changed for these instructions in the revision C core. And please ignore the l's after the cycle counts in the Windows version; they are relics of the 64-bit code.

So we have to look for other alternatives to use in the code.

BTW, according to AMD, they want to sell a lot of Athlon 64s in the coming months. I hope we can win over many of the buyers once they know that they could get a "tuned" Prime95 client for Opteron.

And before I forget to post it:
Detailed Architecture of AMDs 64bit Core
That should answer all remaining questions.

BTW, today is the official release of the Athlon 64 and Athlon 64 FX (maybe also the Athlon 64 M). But I can't follow the events, since my colleagues and I have to attend a project workshop in Frankfurt over the next few days. That will take all our time, so my Opteron research will stall for a few days.

Regards,
Matthias/DB
Old 2003-09-23, 06:41   #195
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19×397 Posts

From your article:

Quote:
The next Pentium, code-named Prescott has an extra Floating Point Multiplier and Adder as we could reveal to you here. We now think that the extra FP units are used for single port but full 128 bit operation. This would bring back the SSE2 latencies for Add and Multiply to 5 and 7 cycles, beneficial for single thread programs. It would double the Floating Point bandwidth which is mainly interesting for Hyper Threading performance.
This is potentially BIG news. If true, it DOUBLES the theoretical FPU throughput of the next generation P4!!! Decoder, data bandwidth, and register dependency limitations will prevent a doubling of prime95 speed. Still... interesting if true.
Old 2003-09-23, 06:49   #196
Dresdenboy
 
Apr 2003
Berlin, Germany

19² Posts

If you go back to the main page (Chip-Architect), you'll find the articles in which Hans de Vries analyzed the Prescott core.
Old 2003-09-23, 07:27   #197
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19·397 Posts

The Chip-Architect article says we should be able to load 16 bytes per cycle. Your tests indicate this is not happening.

Prime95 is slow on the Opteron because the FPU is starved for data. We need to figure out why your test gets only half the expected bandwidth. It would also be nice to run your tests reading data from the L2 cache. We may well have two separate problems.

Everything I've tried to resolve this data bandwidth problem has failed.
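
Editorial sketch, not part of the original post: to put the gap in numbers, the measured bytes-per-cycle figures can be compared with the 16 bytes/cycle quoted from the Chip-Architect article and converted to GB/s. The 1.8 GHz clock is an assumption based on Bok's "opteron 1.8"; the function names are illustrative.

```python
PEAK_BYTES_PER_CYCLE = 16.0  # figure quoted from the Chip-Architect article
CLOCK_HZ = 1.8e9             # assumption: Bok's "opteron 1.8" means 1.8 GHz

def fraction_of_peak(measured, peak=PEAK_BYTES_PER_CYCLE):
    """Measured L1 read rate as a fraction of the quoted peak."""
    return measured / peak

def gb_per_second(bytes_per_cycle, clock_hz=CLOCK_HZ):
    """Convert bytes per cycle to GB/s (10**9 bytes) at the given clock."""
    return bytes_per_cycle * clock_hz / 1e9

# Measured values taken from Bok's 64-bit Linux run earlier in the thread.
for name, measured in [("MMX", 14.64), ("SSE2", 7.96)]:
    print(f"{name}: {fraction_of_peak(measured):.3f} of peak, "
          f"{gb_per_second(measured):.1f} GB/s")
```

Under these assumptions the SSE2 loop lands at about half of the quoted peak, while the MMX loop gets close to it.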
Old 2003-09-23, 08:01   #198
Dresdenboy
 
Apr 2003
Berlin, Germany

169₁₆ Posts

I included the source code in both versions, so you can adapt it to your needs. I will also extend and update the sources to do more tests.

The Chip-Architect article also gives reasons why it's not wise to use instructions on XMM registers which expect a different format. It also states that memory operands for FP operations are fetched by the Int units and delivered to the FPU.

Did you also have a look at
http://www.digit-life.com/articles2/...ily2-add0.html,
http://www.digit-life.com/articles2/...ily2-add1.html and
http://www.digit-life.com/articles2/...ily2-add2.html?

The second of these has a lot of details about the behaviour of the L1/L2 caches on K8.

Recently I looked at an old preview of the Hammer by www.tecchannel.de. They ran their TecMem benchmark on the CPU (an engineering sample running at 800 MHz) and reached up to 16 bytes/cycle using MOVDQA. The same tests on Opteron (using a newer version of TecMem) achieved only up to 8 bytes/cycle for 128-bit accesses, while 64-bit accesses read 16 bytes/cycle.

But that could have a different reason: MOVDQA has the same opcode as MOVQ, but extended by a 0x66 prefix. If a CPU doesn't understand the SSE2 instruction (like my Athlon XP), it will execute it as the MMX equivalent. In that case the bandwidth actually achieved is just half of the expected. That could be the case for this engineering sample, because SSE2 could have been disabled for it.
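
Editorial sketch, not part of the original post: the prefix relationship can be shown with the raw encodings of the two load forms (0F 6F for MOVQ mm, m64 and 66 0F 6F for MOVDQA xmm, m128). The model of a CPU that simply ignores the 0x66 prefix follows the behaviour described in the post and is an assumption here, as are the helper names.

```python
# Encodings for a load from [eax] in 32-bit mode (ModRM byte 0x00).
MOVQ_MM0_EAX    = bytes([0x0F, 0x6F, 0x00])        # movq   mm0,  [eax]  (8 bytes)
MOVDQA_XMM0_EAX = bytes([0x66, 0x0F, 0x6F, 0x00])  # movdqa xmm0, [eax]  (16 bytes)

def strip_operand_size_prefix(code):
    """Model a CPU that ignores the 0x66 prefix, as the post describes."""
    return code[1:] if code[:1] == b"\x66" else code

def reported_vs_actual(cycles_per_load, assumed_bytes=16, actual_bytes=8):
    """Bandwidth a benchmark would report vs. bytes really read per cycle."""
    return assumed_bytes / cycles_per_load, actual_bytes / cycles_per_load

print(strip_operand_size_prefix(MOVDQA_XMM0_EAX) == MOVQ_MM0_EAX)  # True
print(reported_vs_actual(1.0))  # (16.0, 8.0)
```

If the benchmark counts 16 bytes per load while the core only performs the 8-byte MMX load, the reported figure would be twice the real one, which is the scenario suggested for the engineering sample.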
Old 2003-09-29, 16:08   #199
guido72
 
Aug 2002
Rovereto (Italy)

159₁₀ Posts

Quote:
Originally posted by Dresdenboy
And before I forget to post it:
Detailed Architecture of AMDs 64bit Core
Hello DB! The link above is "kaputt"... Have you got another one?
Thanks in advance.
Guido
Old 2003-09-29, 16:22   #200
GP2
 
Sep 2003

5·11·47 Posts

Quote:
Originally posted by guido72
Hello DB! The link above is "kaputt"... Have you got another one?
Thanks in advance.
Guido
Here's the good link, the other one just had some space and "<br/>" mixed in...

http://www.chip-architect.com/news/2...4bit_Core.html

Maybe the mods for this forum could just edit Dresdenboy's original post to fix this.


[Edit: well it's not Dresdenboy's fault: it's the board! It introduced the same garbage characters for me as it did for him. ]

Try:

http://

www.chip-architect.com/news/

2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

Last fiddled with by GP2 on 2003-09-29 at 16:27
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.