mersenneforum.org  

Old 2015-10-04, 05:04   #540
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

7·11·43 Posts

Quote:
Originally Posted by Prime95
Yes, results are disappointing. I don't know if it is the chip or Apple's OS hampering it in some way.

My Haswell at 4GHz, DDR3-2400 does 4 2880K FFTs at 14.55 ms/iter. A 2880K uses 2880K * 8 bytes of memory, which is read twice and written twice plus about 5MB of sin/cos/weights data. That is, 8 * 2880K * 4 + 5 = 95MB, bandwidth = 95MB/iter * 4 workers / (.01455 s/iter) = 26.12GB/sec.

The Macbook Pro 2.3GHz, DDR3-1600 does 4 2560K FFTs in 16.8 ms/iter. Bandwidth = (8*2560K*4+5) * 4 / .0168 = 20.23GB/s.

One would expect the DDR3-1600 to deliver 1600/2400 * 26.12GB/s = 17.4GB/s. So the L4 cache does help, but much less than I hoped for.
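The estimate above can be reproduced with a short script. This is only a sketch of the arithmetic in the quoted post, not anything taken from prime95 itself: the function names are made up for illustration, and the unit handling (1 MB = 1024 KB for the FFT data, 1 GB = 1000 MB for the final figure) is an assumption chosen because it matches the posted 95MB and 26.12GB/s numbers.

```python
def fft_traffic_mb(fft_len_k, passes=4, bytes_per_elem=8, extra_mb=5.0):
    """Approximate memory traffic per iteration, in MB.

    fft_len_k: FFT length in K elements (e.g. 2880 for a 2880K FFT).
    passes: the data is read twice and written twice, so 4 passes.
    extra_mb: the "about 5MB" of sin/cos/weights data from the post.
    """
    return fft_len_k * bytes_per_elem * passes / 1024 + extra_mb


def bandwidth_gb_s(fft_len_k, workers, sec_per_iter):
    """Aggregate bandwidth across all workers, in GB/s (1 GB = 1000 MB, assumed)."""
    return fft_traffic_mb(fft_len_k) * workers / sec_per_iter / 1000


haswell = bandwidth_gb_s(2880, workers=4, sec_per_iter=0.01455)
macbook = bandwidth_gb_s(2560, workers=4, sec_per_iter=0.0168)
# Naive expectation: scale the Haswell figure by the memory clock ratio.
expected = haswell * 1600 / 2400

print(f"Haswell DDR3-2400: {haswell:.2f} GB/s")
print(f"MacBook DDR3-1600: {macbook:.2f} GB/s")
print(f"Expected from clock ratio alone: {expected:.2f} GB/s")
```

With these assumptions the MacBook figure rounds to 20.24 rather than the 20.23 quoted, which looks like truncation in the original post rather than a different formula.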

Oh, and yes at the slow 2.3GHz clock speed the chip is bandwidth limited:
Code:
Timings for 2560K FFT length (1 cpu, 1 worker): 12.62 ms.  Throughput: 79.24 iter/sec.
Timings for 2560K FFT length (2 cpus, 2 workers): 12.49, 12.41 ms.  Throughput: 160.66 iter/sec.
Timings for 2560K FFT length (3 cpus, 3 workers): 14.14, 14.10, 14.17 ms.  Throughput: 212.27 iter/sec.
Timings for 2560K FFT length (4 cpus, 4 workers): 16.46, 16.03, 17.33, 17.24 ms.  Throughput: 238.83 iter/sec.
Timings for 2560K FFT length (4 cpus hyperthreaded, 4 workers): 16.85, 16.78, 16.82, 16.76 ms.  Throughput: 238.05 iter/sec.
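To make the bandwidth-limited claim easier to see, the timings above can be turned into a per-worker scaling efficiency. This is a rough sketch, assuming throughput is simply the sum of 1000 / ms over the workers. That assumption reproduces the reported iter/sec figures only to within rounding of the printed timings. The `timings_ms` table and `throughput` helper are illustrative names, not part of prime95.

```python
# Per-worker iteration times (ms) copied from the benchmark output above.
timings_ms = {
    1: [12.62],
    2: [12.49, 12.41],
    3: [14.14, 14.10, 14.17],
    4: [16.46, 16.03, 17.33, 17.24],
}


def throughput(times_ms):
    """Total iterations per second across workers, given per-worker ms/iter."""
    return sum(1000.0 / t for t in times_ms)


base = throughput(timings_ms[1])  # single-worker baseline
for n, times in timings_ms.items():
    tp = throughput(times)
    # Efficiency = actual throughput / (n times the single-worker throughput).
    print(f"{n} workers: {tp:.2f} iter/sec, efficiency {tp / base / n:.0%}")
```

Two workers scale almost perfectly, while four workers land at roughly three quarters of ideal, which is the pattern one would expect if memory bandwidth rather than the cores is the limit.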
I'm visiting some old ground here... I was just reading about Crystalwell and the L4 cache, etc.

It *seems* to me that the L4 cache should help, but only if an LL test is memory limited and not CPU limited.

Has anyone run a benchmark and ramped up the # of threads in a single worker to see how it scales? The issue I always see on many-core systems is that the 2nd core might come close to halving iteration times, but beyond that you start to see smaller and smaller gains. My working assumption is that memory starts to be the bottleneck.

With a 128MB L4 cache and the higher bandwidth it offers, it seems like you could have 4 threads in a single worker and get closer to 4 times the performance of a single core?

I was trying to look up the specs on Crystalwell... 1600 MHz, 128-bit path, so it's not super fast or anything (not as fast as the L2/L3) but still decent bandwidth compared to main memory, I gather?

The one thing that stuck out to me was that the eDRAM is still shared with the GPU, so if you really expect to use *ALL* of that for the CPU, you need to have a discrete GPU installed on the system so the L4 can be just for CPU.

I was mostly just curious about the technology... I don't have plans to buy a Haswell with that feature, but I ran across something about it and it caught my eye.
Old 2015-10-04, 20:42   #541
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2·53·71 Posts

I have a Crystalwell, bandwidth benchmarks attached (running in Windows using VMware). Highlights:

L3 (3rd Level) Data/Unified Cache : 91.27GB/s
L4 (4th Level) Data/Unified Cache : 33GB/s
256MB Data Set : 10.88GB/s (8.7GB/s - 10.88GB/s) (DDR3-1600 memory)


Alas, the L4 cache did not substantially change prime95's performance. I still get more throughput running one worker per core rather than 1 or 2 multithreaded workers. This may be because it is a Mac laptop -- Apple has been known to do weird things with OS tuning, and users are not allowed to tamper with any BIOS configurations.

Also, does anyone know how to measure or estimate delays introduced by maintaining cache coherence? I'm wondering if that could be one of the problems with prime95 multi-thread performance.
Attached Files
File Type: txt bench.txt (3.9 KB, 400 views)

Last fiddled with by Prime95 on 2015-10-04 at 20:45
Old 2015-10-04, 22:36   #542
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

2²×733 Posts

Quote:
Originally Posted by Prime95
Also, does anyone know how to measure or estimate delays introduced by maintaining cache coherence? I'm wondering if that could be one of the problems with prime95 multi-thread performance.
It depends on what exactly you're doing with the cache. You may find Ulrich Drepper's "What Every Programmer Should Know About Memory" (PDF) insightful, specifically sections 3, 6, and 7. I'm sure you already know most of what is discussed, but I got a lot out of it. The author used to be the chief developer of glibc.
Old 2015-10-05, 02:17   #543
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

9611₁₀ Posts

That is a brilliant document sir! Thanks for sharing it!

edit: To contribute a bit: this paper (about floating-point numbers) is referred to a few times (and given in the citations). I googled it and it is very interesting to read too. It explains what the floats are, what they do, why they do it, and if their parents know... It is one of the first links Google gives; I assume that a deeper Google dig can find a better format (this format is a bit difficult to read because almost all the text is underlined, at least in my PDF viewer).

Last fiddled with by LaurV on 2015-10-05 at 02:33
Old 2015-10-05, 05:28   #544
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

1110000110101₂ Posts

Quote:
Originally Posted by Mark Rose
It depends on what exactly you're doing with the cache. You may find Ulrich Drepper's "What Every Programmer Should Know About Memory" (PDF) insightful, specifically sections 3, 6, and 7. I'm sure you already know most of what is discussed, but I got a lot out of it. The author used to be the chief developer of glibc.
Quite the link indeed. Thanks for sharing.




Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.