mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Haswell Preview Benchmark (https://www.mersenneforum.org/showthread.php?t=17982)

Prime95 2014-03-11 22:36

[QUOTE=Ken_g6;368788]Has anybody tried a system with Iris Pro/Crystal Well 128MB cache? [/QUOTE]

Yes, results are disappointing. I don't know if it is the chip or Apple's OS hampering it in some way.

My Haswell at 4GHz with DDR3-2400 runs four 2880K FFTs at 14.55 ms/iter. A 2880K FFT uses 2880K * 8 bytes of memory, which is read twice and written twice, plus about 5MB of sin/cos/weights data. That is, 8 * 2880K * 4 + 5 = 95MB per iteration, so bandwidth = 95MB/iter * 4 workers / (.01455 s/iter) = 26.12GB/sec.

The Macbook Pro 2.3GHz, DDR3-1600 does 4 2560K FFTs in 16.8 ms/iter. Bandwidth = (8*2560K*4+5) * 4 / .0168 = 20.23GB/s.

One would expect the DDR3-1600 to deliver 1600/2400 * 26.12GB/s = 17.4GB/s. So the L4 cache does help, but much less than I hoped for.
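
The arithmetic above can be reproduced with a short Python sketch. The function names are made up for illustration, and the model (8 bytes per element, four passes over the data, a flat 5MB of sin/cos/weights data, 1GB = 1000MB) is only the estimate described in this post, not anything measured by Prime95 itself.

```python
def traffic_mb(fft_len_k, passes=4, overhead_mb=5.0):
    """Estimated memory traffic per iteration, in MB.

    fft_len_k: FFT length in K (1K = 1024 elements), 8 bytes per element.
    passes: data is read twice and written twice, so 4 passes in total.
    overhead_mb: the roughly 5MB of sin/cos/weights data (an estimate).
    """
    data_mb = fft_len_k * 8 / 1024  # K elements * 8 bytes = KB, then KB -> MB
    return data_mb * passes + overhead_mb


def bandwidth_gb_s(fft_len_k, workers, ms_per_iter):
    """Aggregate bandwidth in GB/s, taking 1GB = 1000MB as the post does."""
    mb_per_s = traffic_mb(fft_len_k) * workers / (ms_per_iter / 1000.0)
    return mb_per_s / 1000.0


haswell = bandwidth_gb_s(2880, workers=4, ms_per_iter=14.55)
macbook = bandwidth_gb_s(2560, workers=4, ms_per_iter=16.8)
scaled = haswell * 1600 / 2400  # naive scaling by memory clock only

print(f"Haswell, DDR3-2400:       {haswell:.2f} GB/s")
print(f"MacBook Pro, DDR3-1600:   {macbook:.2f} GB/s")
print(f"Naive DDR3-1600 estimate: {scaled:.2f} GB/s")
```

Rounded to two decimals the MacBook Pro figure comes out as 20.24 rather than 20.23; the difference is only truncation.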

Oh, and yes, even at the slow 2.3GHz clock speed the chip is bandwidth limited:
[CODE]Timings for 2560K FFT length (1 cpu, 1 worker): 12.62 ms. Throughput: 79.24 iter/sec.
Timings for 2560K FFT length (2 cpus, 2 workers): 12.49, 12.41 ms. Throughput: 160.66 iter/sec.
Timings for 2560K FFT length (3 cpus, 3 workers): 14.14, 14.10, 14.17 ms. Throughput: 212.27 iter/sec.
Timings for 2560K FFT length (4 cpus, 4 workers): 16.46, 16.03, 17.33, 17.24 ms. Throughput: 238.83 iter/sec.
Timings for 2560K FFT length (4 cpus hyperthreaded, 4 workers): 16.85, 16.78, 16.82, 16.76 ms. Throughput: 238.05 iter/sec.
[/CODE]
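
One way to read the table above is to compare each run against perfect scaling from the single-worker result. The Python sketch below does that; the helper name is hypothetical, and throughput is recomputed from the rounded per-worker timings, so it can differ from the printed figures in the last digits.

```python
def throughput(ms_per_worker):
    """Total iterations per second for workers running in parallel."""
    return sum(1000.0 / ms for ms in ms_per_worker)


# Per-worker timings in ms, copied from the non-hyperthreaded runs above.
runs = {
    1: [12.62],
    2: [12.49, 12.41],
    3: [14.14, 14.10, 14.17],
    4: [16.46, 16.03, 17.33, 17.24],
}

single = throughput(runs[1])
for workers, timings in runs.items():
    total = throughput(timings)
    # Efficiency of 1.0 would mean every added worker runs at full speed.
    efficiency = total / (workers * single)
    print(f"{workers} worker(s): {total:7.2f} iter/sec, "
          f"scaling efficiency {efficiency:.0%}")
```

Two workers scale almost perfectly, while four workers reach only about three quarters of ideal scaling, which is the pattern you would expect from a memory-bound workload.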

henryzz 2014-03-14 02:01

I assume that the problem is that 128MB is too small.
What results do you get if you run a smaller FFT length? 768K is small enough that the data for 4 cores should fit, I think. For LL that corresponds to roughly a 15M exponent, and for LLR with larger k to roughly a 10M exponent.
The vast majority of work done by pfgw/LLR should be sped up massively, I would guess.

Prime95 2014-03-14 02:29

[QUOTE=henryzz;368923]I assume that the problem is that 128MB is too small.[/QUOTE]

No. Four 2560K FFTs fit in 25MB each. Similar timings occur for 1024K FFTs which should easily fit in 128MB.
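
A rough footprint check, as a Python sketch. It assumes 8 bytes per FFT element plus a flat 5MB per worker for sin/cos/weights data; that 5MB is the estimate given earlier for the larger FFTs and is reused unchanged here for the smaller ones, which is an assumption. The function name is hypothetical.

```python
L4_CACHE_MB = 128  # Crystal Well eDRAM size discussed in the thread


def working_set_mb(fft_len_k, workers, overhead_mb=5.0):
    """Total footprint in MB for `workers` independent FFTs.

    Assumes 8 bytes per element plus a flat per-worker overhead
    (the 5MB default is an assumption carried over from the 2880K estimate).
    """
    per_worker_mb = fft_len_k * 8 / 1024 + overhead_mb
    return per_worker_mb * workers


for fft_len_k in (2560, 1024, 768):
    total = working_set_mb(fft_len_k, workers=4)
    verdict = "fits in" if total <= L4_CACHE_MB else "exceeds"
    print(f"4 x {fft_len_k}K: {total:.0f}MB, {verdict} {L4_CACHE_MB}MB")
```

Under these assumptions four 2560K FFTs need about 100MB and four 1024K FFTs about 52MB, so both sit below 128MB.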

henryzz 2014-03-14 12:24

[QUOTE=Prime95;368925]No. Four 2560K FFTs fit in 25MB each. Similar timings occur for 1024K FFTs which should easily fit in 128MB.[/QUOTE]

Sorry, I misread your post.
It seems strange that this is so different from the bandwidth gains shown in [url]http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3[/url]
I noticed that all the graphs skip from bandwidth at 64MB to 256MB. Why is that, when 128MB seems to be the critical point? Is some of the cache being filled by the GPU (are you using the integrated GPU?), meaning that you never actually get to use the full 128MB?
It would be nice to work out what is causing the difference between the tests, as they get much better bandwidth at 64MB than we are seeing here at ~100MB.

ldesnogu 2014-03-14 13:10

[QUOTE=Prime95;368793]The Macbook Pro 2.3GHz, DDR3-1600 does 4 2560K FFTs in 16.8 ms/iter. Bandwidth = (8*2560K*4+5) * 4 / .0168 = 20.23GB/s.

One would expect the DDR3-1600 to deliver 1600/2400 * 26.12GB/s = 17.4GB/s. So the L4 cache does help, but much less than I hoped for.[/QUOTE]
I'm not sure you can compute the expected bandwidth this way; there are too many other variables.

So this could also mean you get no speedup at all from the 128MB eDRAM, perhaps because Apple's driver reserves 100% of it for the GPU.

kracker 2014-03-15 08:49

I'm kinda not sure if FMA3 is worth it for computers with low memory bandwidth..

2048K FFT:
SSE: 210 iter/sec
AVX: 337 iter/sec
FMA3: 346 iter/sec

2400 MHz Dual Channel
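
To put the three figures above in relative terms, here is a small Python sketch. It only uses the iter/sec numbers from this post; the helper name is made up for illustration.

```python
# Throughput at 2048K FFT in iter/sec, copied from the post above.
rates = {"SSE": 210, "AVX": 337, "FMA3": 346}


def gain_percent(new, old):
    """Relative throughput gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


avx_over_sse = gain_percent(rates["AVX"], rates["SSE"])
fma3_over_avx = gain_percent(rates["FMA3"], rates["AVX"])

print(f"AVX over SSE:  +{avx_over_sse:.1f}%")
print(f"FMA3 over AVX: +{fma3_over_avx:.1f}%")
```

AVX is roughly 60% faster than SSE here, while FMA3 adds under 3% on top of AVX.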

TheJudger 2014-03-15 19:42

[QUOTE=kracker;369011]I'm kinda not sure if FMA3 is worth it for computers with low memory bandwidth..

2048K FFT:
SSE: 210 iter/sec
AVX: 337 iter/sec
FMA3: 346 iter/sec

2400 MHz Dual Channel[/QUOTE]

(Assuming quad-core) What about "4 cores AVX" vs. "3 cores FMA3 + one core idle"?
I'm thinking about[LIST][*]power efficiency (performance per watt)[*]GIMPS throughput[*]system usability while running GIMPS[/LIST]
Oliver

kracker 2014-03-15 21:55

[QUOTE=TheJudger;369031](Assuming quad-core) What about "4 cores AVX" vs. "3 cores FMA3 + one core idle"?
I'm thinking about[LIST][*]power efficiency (performance per watt)[*]GIMPS throughput[*]system usability while running GIMPS[/LIST]
Oliver[/QUOTE]
FMA3, 3 cores only: 75W, 306 iter/sec
AVX, all cores: 83W, 336 iter/sec
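
For the performance-per-watt question, the two measurements can be compared with a short Python sketch. The post does not say how the wattage was measured (wall, package or otherwise), so the result is only as good as those two readings; the names below are hypothetical.

```python
# Power draw and throughput, copied from the two measurements above.
configs = {
    "FMA3, 3 cores": {"watts": 75, "iter_per_sec": 306},
    "AVX, all cores": {"watts": 83, "iter_per_sec": 336},
}


def iters_per_joule(cfg):
    """Iterations per joule: (iter/sec) / (joule/sec)."""
    return cfg["iter_per_sec"] / cfg["watts"]


for name, cfg in configs.items():
    print(f"{name}: {iters_per_joule(cfg):.3f} iter/J")

fma3 = iters_per_joule(configs["FMA3, 3 cores"])
avx = iters_per_joule(configs["AVX, all cores"])
print(f"FMA3 on 3 cores is {(fma3 / avx - 1) * 100:.1f}% more efficient")
```

On these numbers the efficiency gap is under 1%, while total throughput drops by about 9% when one core is left idle.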

axn 2014-03-17 05:24

[QUOTE=kracker;369011]2400 MHz Dual Channel[/QUOTE]

Are you sure that the memory is actually running @ 2400?

kracker 2014-03-17 06:51

[QUOTE=axn;369168]Are you sure that the memory is actually running @ 2400?[/QUOTE]

Yes.

kracker 2014-03-17 08:12

Is 3360K an efficient FFT length? I'm running a 64M exponent at 9.6 ms/iter, while the benchmark says 7.8 ms for the 3584K FFT.
(4670K, two threads)


All times are UTC.