mersenneforum.org  

2003-09-22, 14:17   #188
Bok
Aug 2003
52 Posts

Quote:
Originally posted by Dresdenboy
Could someone with revision C Opteron (Bok?) try a small test under Linux in 64bit mode?

Please download this ZIP file and run the included "readtest" (source code is included too): http://www.informatik.uni-rostock.de/~mw212/readtst.zip

It does a read loop using MMX and SSE2 floating point and integer instructions.

Thanks.

Sorry, been on vacation. I'll try that test later tonight.

Bok

2003-09-22, 14:41   #189
Dresdenboy
Apr 2003
Berlin, Germany
192 Posts

No prob, thanks.

Here is also a win32 version of the program:
http://www.informatik.uni-rostock.de...test_win32.zip (should be run in the console to see the results)

Regards,
DB/Matthias

2003-09-22, 15:06   #190
Bok
Aug 2003
52 Posts

ok,

I've got it running XP Pro at the moment anyway, so I'll do that one first. Or do you need 64-bit Windows 2003 for the test?

Bok

2003-09-22, 16:59   #191
Dresdenboy
Apr 2003
Berlin, Germany
192 Posts

I made several versions because the result is OS-independent; providing builds for different OSes just makes it easier for people to run the test quickly.

2003-09-22, 21:55   #192
Bok
Aug 2003
52 Posts

Dresdenboy, these are the results from XP Pro running on the Opteron 1.8.

Running MMX/SSE/SSE2 reading speed tests... 78bf9ff

Time for reading 32kB from cache using MOVQ :2593l (12.64 Bytes per cycle)
Time for reading 4kB from cache using MOVQ :341l (12.01 Bytes per cycle)
Time for reading 32kB cache using MOVAPD :4169l (7.86 Bytes per cycle)
Time for reading 4kB from cache using MOVAPD :533l (7.68 Bytes per cycle)
Time for reading 32kB from cache using MOVDQA :4130l (7.93 Bytes per cycle)
Time for reading 4kB from cache using MOVDQA :533l (7.68 Bytes per cycle)
Time for reading 32kB from cache using MOVAPS :4117l (7.96 Bytes per cycle)
Time for reading 4kB from cache using MOVAPS :533l (7.68 Bytes per cycle)
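
The "Bytes per cycle" figures in this output can be reproduced from the printed cycle counts. The following is a minimal sketch of that arithmetic, assuming "32kB" means 32 × 1024 bytes; the function name `bytes_per_cycle` is only illustrative and is not taken from the readtest source.

```c
#include <stddef.h>

/* Bytes read per CPU cycle, given the size of the block that was read
   and the cycle count the test printed for it.
   Returns 0.0 for a zero cycle count instead of dividing by zero. */
static double bytes_per_cycle(size_t bytes_read, unsigned long cycles)
{
    if (cycles == 0)
        return 0.0;
    return (double)bytes_read / (double)cycles;
}
```

For example, `bytes_per_cycle(32 * 1024, 2593)` gives about 12.64 (the MOVQ line) and `bytes_per_cycle(32 * 1024, 4117)` gives about 7.96 (the MOVAPS line).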

2003-09-23, 01:09   #193
Bok
Aug 2003
52 Posts

And the results from Suse EL8.2 64-bit are:

Opteron64:~/readtst # ./readtest
Running MMX/SSE2 reading speed tests...

Time for reading 32k from L1 cache using MMX :2239 (14.64 Bytes per cycle)
Time for reading 32k from L1 cache using SSE2 :4116 (7.96 Bytes per cycle)
Time for reading 32k from L1 cache using IntSSE2 :4116 (7.96 Bytes per cycle)


Bok

2003-09-23, 06:09   #194
Dresdenboy
Apr 2003
Berlin, Germany
5518 Posts

Thanks. The results show that nothing has changed for these instructions in the revision C core. And please ignore the l's after the cycle counts in the Windows version; they are relics of the 64-bit code.

So we have to look for alternatives to use in the code.

BTW, according to AMD, they want to sell a lot of Athlon 64s in the next months. I hope we can win over many of the buyers if they know that they could get a "tuned" Prime95 client for Opteron.

And before I forget to post it:
Detailed Architecture of AMDs 64bit Core
That should answer all remaining questions

BTW, today is the official release of the Athlon 64 and Athlon FX (maybe also the Athlon 64 M). But I can't follow the events, since my colleagues and I have to attend a project workshop in Frankfurt over the next few days. That will take all our time, so my Opteron research will stall for a few days.

Regards,
Matthias/DB

2003-09-23, 06:41   #195
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts

From your article:

Quote:
The next Pentium, code-named Prescott has an extra Floating Point Multiplier and Adder as we could reveal to you here. We now think that the extra FP units are used for single port but full 128 bit operation. This would bring back the SSE2 latencies for Add and Multiply to 5 and 7 cycles, beneficial for single thread programs. It would double the Floating Point bandwidth which is mainly interesting for Hyper Threading performance.

This is potentially BIG news. If true, it DOUBLES the theoretical FPU throughput of the next generation P4!!! Decoder, data bandwidth, and register dependency limitations will prevent a doubling of prime95 speed. Still... interesting if true.

2003-09-23, 06:49   #196
Dresdenboy
Apr 2003
Berlin, Germany
192 Posts

If you go back to the main page (Chip-Architect), you'll find the articles where Hans de Vries analyzed the Prescott core.

2003-09-23, 07:27   #197
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts

The Chip-architect article says we should be able to load 16 bytes per cycle. Your tests indicate this is not happening.

Prime95 is slow on the Opteron because the FPU is starved for data. We need to figure out why your test gets only half the expected bandwidth. It would also be nice to run your tests reading data from the L2 cache. We may well have two separate problems.

Everything I've tried to resolve this data bandwidth problem has failed.
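
A read test that covers both L1 and L2 could look roughly like the sketch below. This is not the actual readtest source: the names `read_pass` and `measure_read` are hypothetical, the code assumes GCC or Clang on x86 with C11 (`aligned_alloc`, `__rdtsc` from `x86intrin.h`), and the working-set size is the only knob. Something like 32 kB stays inside the 64 kB L1 data cache of the K8, while a few hundred kB spills into the Opteron's 1 MB L2.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_load_si128: aligned 128-bit load (MOVDQA) */
#include <x86intrin.h>  /* __rdtsc (GCC/Clang) */
#define HAVE_TSC 1
#else
#define HAVE_TSC 0
#endif

/* XOR all 64-bit words of buf together. buf must be 16-byte aligned and
   bytes a multiple of 16. The XOR only keeps the loads from being
   optimized away; the value itself is not interesting. */
static uint64_t read_pass(const unsigned char *buf, size_t bytes)
{
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16)
        acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i)));
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] ^ lanes[1];
#else
    /* Portable fallback so the sketch still builds on non-x86 targets. */
    uint64_t acc = 0;
    for (size_t i = 0; i < bytes; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w);
        acc ^= w;
    }
    return acc;
#endif
}

/* Bytes read per time-stamp-counter tick for a working set of `bytes`,
   or -1.0 if no cycle counter is available or the allocation fails. */
static double measure_read(size_t bytes, int repeats)
{
#if HAVE_TSC
    unsigned char *buf = aligned_alloc(16, bytes);
    if (buf == NULL)
        return -1.0;
    memset(buf, 0x5A, bytes);
    volatile uint64_t sink = read_pass(buf, bytes);  /* warm the cache */
    uint64_t start = __rdtsc();
    for (int r = 0; r < repeats; r++) {
        /* Compiler barrier: forces the buffer to be re-read every pass. */
        __asm__ volatile("" : : "r"(buf) : "memory");
        sink ^= read_pass(buf, bytes);
    }
    uint64_t ticks = __rdtsc() - start;
    free(buf);
    (void)sink;
    return ticks ? (double)bytes * repeats / (double)ticks : -1.0;
#else
    (void)bytes;
    (void)repeats;
    return -1.0;
#endif
}
```

Comparing `measure_read(32 * 1024, 1000)` with, say, `measure_read(512 * 1024, 1000)` would separate the L1 figure from the L2 figure. Note that on current CPUs the time-stamp counter ticks at a fixed reference rate rather than the core clock, so the result is only a true bytes-per-cycle number when the two match, and compiler-generated loops will not be as tightly scheduled as hand-written assembly.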

2003-09-23, 08:01   #198
Dresdenboy
Apr 2003
Berlin, Germany
16916 Posts

I included the source code in both versions, so you can adapt it to your needs. I will also extend and update the sources to do more tests.

The Chip-Architect article also gives reasons why it's unwise to operate on an XMM register with instructions that expect a different data format than the one it holds. It also states that memory operands for FP operations are fetched by the integer units and delivered to the FPU.

Did you also have a look at
http://www.digit-life.com/articles2/...ily2-add0.html,
http://www.digit-life.com/articles2/...ily2-add1.html and
http://www.digit-life.com/articles2/...ily2-add2.html?

The second of these has a lot of details about the behaviour of the L1/L2 caches on K8.

Recently I looked at an old preview of the Hammer by www.tecchannel.de. They ran their TecMem benchmark on the CPU (an engineering sample running at 800 MHz) and reached up to 16 bytes/cycle using MOVDQA. The same tests on Opteron (using a newer version of TecMem) achieved only up to 8 bytes/cycle for 128-bit accesses, while 64-bit accesses read 16 bytes/cycle.

But that could have a different reason: MOVDQA has the same opcode as MOVQ, but extended by a 0x66 prefix. If a CPU doesn't understand the SSE2 instruction (like my Athlon XP), it will execute it as the MMX equivalent. In that case the bandwidth actually achieved is just half of what is expected. That could be the case for this engineering sample, because SSE2 could have been disabled on it.
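
To make the encoding point concrete: the MMX load `MOVQ mm, m64` is encoded as `0F 6F /r`, and the SSE2 load `MOVDQA xmm, m128` as `66 0F 6F /r`. The sketch below only models the behaviour described above, under the assumption stated in this post that a CPU without SSE2 ignores the 0x66 prefix; the names `load_kind` and `classify_load` are made up for illustration.

```c
#include <stddef.h>

/* The enum values double as the number of bytes each load moves. */
enum load_kind { LOAD_UNKNOWN = 0, LOAD_MOVQ = 8, LOAD_MOVDQA = 16 };

/* Classify a load starting at code[0].
   cpu_has_sse2 == 0 models a CPU that drops the 0x66 prefix and so
   treats the instruction as the 8-byte MMX MOVQ (an assumption taken
   from the post, not a documented guarantee). */
static enum load_kind classify_load(const unsigned char *code, size_t len,
                                    int cpu_has_sse2)
{
    size_t i = 0;
    int has_66 = 0;

    if (len > 0 && code[0] == 0x66) {  /* operand-size prefix */
        has_66 = 1;
        i = 1;
    }
    if (len < i + 2 || code[i] != 0x0F || code[i + 1] != 0x6F)
        return LOAD_UNKNOWN;           /* not the 0F 6F load opcode */

    return (has_66 && cpu_has_sse2) ? LOAD_MOVDQA : LOAD_MOVQ;
}
```

With the byte sequence `66 0F 6F 00`, the function returns `LOAD_MOVDQA` (16 bytes) when `cpu_has_sse2` is 1 and `LOAD_MOVQ` (8 bytes) when it is 0, so a benchmark that counts 16 bytes per instruction would overstate the data actually moved by a factor of two in the second case.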
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.