mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Haswell Preview Benchmark (https://www.mersenneforum.org/showthread.php?t=17982)

ewmayer 2013-08-13 02:56

[QUOTE=kracker;349331]I finally decided to go with [URL="http://www.newegg.com/Product/Product.aspx?Item=N82E16820231474"]this[/URL], as I don't exactly have a huge money tree... Well, all I'm hoping for is decent performance compared to the 2.3 GHz Core2 Duo I had before.[/QUOTE]

That costs a smidge more per-byte than the [url=http://www.newegg.com/Product/Product.aspx?Item=N82E16820231587]ddr3 2400[/url] George & I got ... is the issue that the 2400 is not available [or not as cheap per-byte] in a 4GB option?

TheMawn 2013-08-13 04:19

I don't think there is much engineering put into new 2GB sticks. They are becoming a bit of a thing of the past... Hooray for the future.

Also, a budget-build CPU isn't going to be able to handle 2400MHz RAM. All of ASUS's boards can handle up to 2800MHz (in the Z77 lineup, anyway), but they could only "guarantee" that with an i7-3770K, not even with an i5-3570K. If you're going to run something like an i3, I can't see 2400MHz memory working. Nor would it be necessary.

I've gone on a bit of an editing spree since I'm spewing nonsense all over the place. Haswells look to be costing as much as Ivy Bridge so go fourth gen all the way if you ask me. The i3's aren't actually out yet but you're getting two cores with i3 and four cores with i5 for about 50% more money.

You can get a CPU for $180, a board for $150, the RAM you picked out is $50, and you can get pretty cost-effective GPUs for $200 apiece. On the other hand, I have a $250 CPU, a $250 board, $150 memory and a $400 GPU, and I don't think my system is going to beat yours by the 1050-to-580 ratio the prices would suggest. I'm kind of liking the sound of a budget PC, to be frank.

ewmayer 2013-08-14 00:03

[QUOTE=ewmayer;348624]Took a close look at the offending code here again the past several days and finally found the problem - briefly, I constructed the AVX-based Mersenne-mod carry macros by fusing the fancy-indexing footwork of the legacy SSE2 mersenne-mod-DWT carry macros and the AVX data-permute aspects of the AVX-based Fermat-mod carry macros - the result ran incredibly, awfully, unbelievably slowly in the initial implementation, which came online just before Haswell hit the market. I have traced the problem back to the mixing of legacy SSE instructions (using xmm-form registers) in the indexing-computation portions of the code with AVX instructions used for weights and carries in the new AVX code.[/QUOTE]

OK, I propagated the fixed-up carry macro code to all of my SIMD-optimized fused final-iFFT-pass/carry/initial-fFFT-pass routines - I have such for DFT-pass radices 16,20,24,28,32,36,40,44,52 and 60 - and after the expected round of debugging work, just ran a set of benchmarks on my Haswell quad to gauge the impact of using the "true AVX" carry math on Mersenne-mod arithmetic. Summary table follows.
[b]
# Test #5: AVX mode [now also including Mers-mod carry step] on 3.4 GHz Haswell quad, DDR3 2400 SDRAM (PC3 19200); times in ms/iteration:
[/b][code]
              Mersenne-mod:               Fermat-mod:
FFT len   #threads [1 thread/core]
(Kdbl)       1       2       4         1       2       4
------   -----   -----   -----     -----   -----   -----
   896     9.3     4.9     2.9       8.3     4.4     2.6
   960    10.1     5.2     2.9       9.0     4.7     2.6
  1024    10.4     5.5     3.0       9.7     5.0     2.9
  1152    12.2     6.4     3.8
  1280    14.2     7.4     4.1
  1408    16.1     8.4     4.9
  1536    17.4     9.8     5.3
  1664    18.7     9.7     5.5
  1792    21.3    11.1     6.6      19.2    10.1     5.8
  1920    22.0    11.5     6.6      19.1    10.4     5.7
  2048    23.9    12.4     7.1      22.2    11.6     6.6
  2304    27.5    14.5     8.6
  2560    32.2    16.8     9.5
  2816    36.5    18.9    11.1
  3072    35.8    19.3    12.2
  3328    42.2    21.9    12.5
  3584    43.4    22.9    14.6      38.6    20.5    12.9
  3840    49.5    25.5    14.8      45.1    23.4    13.1
  4096    49.1    26.2    16.0      44.5    23.6    15.0
  4608    57.3    30.0    18.8
  5120    65.4    34.5    21.0
  5632    74.4    38.9    23.5
  6144    68.2    38.1    27.6
  6656    87.2    45.3    27.3
  7168    83.0    45.6    33.7      74.8    41.2    30.8
  7680   102.3    52.9    31.9      91.9    48.0    29.0
  8192   104.7    52.2    37.3      86.9    47.4    34.4
--------------------------------------------------------
Avg || Scaling: 1.903x  3.040x      ----  1.875x  2.908x
Avg runtime ratio for Mersenne-mod vs Fermat-mod for FFT lengths supporting both kinds of arithmetic:
        1.127x  1.102x  1.099x
[/code][b]
Notes:
[/b]
[b]0.[/b] I simply did 1000-iteration timings for all these, and made no effort to account for initialization overhead, thus the times are likely a few % pessimistic;

[b]1.[/b] The 10-15% slowness of Mersenne-mod relative to Fermat-mod is expected for my code: Unlike George's, which uses an optimized real-vector transform that is ideal for the real-signal Mersenne-mod IBDWT, my code is written around a more-general-purpose complex-signal FFT. Thus it is really more geared toward Fermat-mod arithmetic, where the negacyclic-transform-effecting DWT involves complex-valued weights and yields a so-called "right-angle" transform, which is ideal for handling via complex-signal FFT. For real-signal inputs such as those in Mersenne-mod arithmetic we need to wrap the dyadic-squaring step occurring between the forward and inverse FFT in a complex/real/complex wrapper step, which typically results in a 10-20% runtime hit, depending on runlength and platform.

[b]2.[/b] 3072K is the optimal runlength [among this menu of choices] for the most recent M-prime. My target for such official verifies is typically to get the per-iteration time below 10ms [8.64 ms translates to 10 Miters/day, by way of handy rule of thumb]. For the verify of M57885161 Serge Batalov used an SSE2 build of Mlucas and found that the best [in terms of absolute throughput, not per-core efficiency] option on the 32-core Xeon [pre-Sandy-Bridge, i.e. no AVX option] cluster he had access to was to run at the next-higher available FFT length of 3328K, using 32 threads [precisely speaking, a combination of 26 and 32 threads, corresponding to the 2 distinct modmul phases Mlucas arranges things in]. He was able to get right around 8.6 ms/iter that way - so now we are within spitting distance of that total throughput using just 4 Haswell cores. [If I OC'ed aggressively like George does on his system I could probably get the 3072K 4-thread timing down to right around 10 ms/iter]. It will be interesting to see what kind of parallel scalings we can get on Haswell-based [or Ivy Bridge] systems with more than 4 cores. There is usually a big dropoff in parallelism beyond 4 cores ... for M-prime verifies we are usually elated to get *any* added total-throughput boost on > 4 cores.
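
For anyone who wants to play with the numbers in note 2, here is a minimal Python sketch of the rule of thumb and the scaling figures. It only uses values from the table above (the 3072K row); the function names are made up for illustration, and the run-time estimate assumes a plain Lucas-Lehmer test of p-2 squarings with no restarts or overhead.

```python
SECONDS_PER_DAY = 86_400


def iters_per_day(ms_per_iter: float) -> float:
    """Iterations completed in 24 hours at a given per-iteration time."""
    return SECONDS_PER_DAY * 1000.0 / ms_per_iter


def days_for_ll_test(exponent: int, ms_per_iter: float) -> float:
    """Rough wall-clock days for a Lucas-Lehmer test of M(exponent).

    Assumes exactly exponent - 2 squarings and ignores all overhead.
    """
    return (exponent - 2) * ms_per_iter / 1000.0 / SECONDS_PER_DAY


def parallel_speedup(t_single_ms: float, t_multi_ms: float) -> float:
    """Speedup of a multi-threaded timing over the 1-thread timing."""
    return t_single_ms / t_multi_ms


if __name__ == "__main__":
    # Rule of thumb from note 2: 8.64 ms/iter is 10 million iterations/day.
    print(f"{iters_per_day(8.64):,.0f} iters/day at 8.64 ms/iter")

    # 3072K Mersenne-mod row of the table: 35.8 / 19.3 / 12.2 ms at 1/2/4 threads.
    print(f"2-thread speedup: {parallel_speedup(35.8, 19.3):.2f}x")
    print(f"4-thread speedup: {parallel_speedup(35.8, 12.2):.2f}x")

    # Estimated time to test M57885161 at the 4-thread 3072K timing.
    print(f"{days_for_ll_test(57885161, 12.2):.1f} days at 12.2 ms/iter")
```

At 12.2 ms/iter this comes out at a little over 8 days for M57885161, versus a little under 6 days at the 8.6 ms/iter quoted for the 32-core run.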

kracker 2013-08-17 23:03

Hmm.. maybe I should have gotten faster memory. I just tried some tests (my Haswell CPU is still in the box) on my dual-core (i3-3220) Ivy Bridge with a single 4GB stick of 1600 MHz RAM, this is what I get..

One thread: 11 ms.
Two threads: 17 ms each.
(DC tests)

Duh.

EDIT: Is P-1 less memory intensive than LL btw?

Prime95 2013-08-18 01:52

[QUOTE=kracker;349976]Hmm.. maybe I should have gotten faster memory. I just tried some tests (my Haswell CPU is still in the box) on my dual-core (i3-3220) Ivy Bridge with a single 4GB stick of 1600 MHz RAM, this is what I get..

One thread: 11 ms.
Two threads: 17 ms each.
[/QUOTE]

You have to use two sticks of memory to take advantage of dual-channel memory. In essence, you are running your memory subsystem at half of its capabilities.
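
As a rough illustration of the "half of its capabilities" point, here is a small Python sketch of theoretical peak bandwidth. It assumes the standard 64-bit (8-byte) DDR3 channel width; the function name is made up, and real sustained bandwidth will be lower than these peak figures.

```python
def peak_bandwidth_mb_s(transfers_mt_s: int, channels: int = 1,
                        bus_bytes: int = 8) -> int:
    """Theoretical peak memory bandwidth in MB/s.

    transfers_mt_s: data rate in mega-transfers per second (e.g. 1600 for DDR3-1600)
    channels:       number of populated memory channels
    bus_bytes:      bytes moved per transfer per channel (8 for a 64-bit DDR3 channel)
    """
    return transfers_mt_s * bus_bytes * channels


if __name__ == "__main__":
    for label, rate, channels in [
        ("DDR3-1600, one stick (single channel)", 1600, 1),
        ("DDR3-1600, two sticks (dual channel)", 1600, 2),
        ("DDR3-2400, two sticks (dual channel)", 2400, 2),
    ]:
        print(f"{label}: {peak_bandwidth_mb_s(rate, channels)} MB/s peak")
```

The per-channel figure is where module names like PC3-12800 (DDR3-1600) and PC3-19200 (DDR3-2400) come from.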

kracker 2013-08-22 15:42

On my quad Haswell with dual channel 1600:

1 thread: 9 ms
2 threads: 10 ms
3 threads: 11 ms
4 threads: 14 ms

Is P-1 as memory-bandwidth limited as LL, or not?
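
To see how much the extra threads actually buy, here is a quick Python sketch that turns the timings above into total throughput. It assumes each figure is the per-iteration time of every worker when that many independent tests run at once (as with the "17 ms each" numbers earlier); the function name is made up for illustration.

```python
# Per-worker iteration times from the post above, keyed by number of workers.
TIMINGS_MS = {1: 9.0, 2: 10.0, 3: 11.0, 4: 14.0}


def aggregate_iters_per_sec(workers: int, ms_per_iter: float) -> float:
    """Total iterations per second summed over all workers."""
    return workers * 1000.0 / ms_per_iter


if __name__ == "__main__":
    previous = 0.0
    for workers, ms in sorted(TIMINGS_MS.items()):
        total = aggregate_iters_per_sec(workers, ms)
        print(f"{workers} worker(s): {total:6.1f} iters/s total "
              f"(+{total - previous:.1f} from the last worker added)")
        previous = total
```

Under that assumption the fourth worker adds only about 13 iters/s on top of roughly 273, which is the kind of flattening you would expect once memory bandwidth is the limit.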

Prime95 2013-08-24 01:56

Version 28.1 preview
 
[B]For any Haswell owners[/B] that are interested, an evaluation version 28.1 is available. I have some ideas to improve it further, but they will take some time to implement.

I sure hope it works correctly because I've started using it on my Haswell box.

Download link: [url]http://www.sendspace.com/file/k66yc4[/url]

NBtarheel_33 2013-08-27 06:30

[QUOTE=Prime95;350674]I sure hope it works correctly because I've started using it on my Haswell box.[/QUOTE]

My rule-of-thumb is that if it runs a successful double-check, then all must be well.

Any benefits to non-Haswell adopters of this version?

And any plans to change the name of the secret forum to "David Haswellhoff"?

TheJudger 2013-08-31 21:22

[QUOTE=Prime95;350674][B]For any Haswell owners[/B] that are interested, an evaluation version 28.1 is available. I have some ideas to improve it further, but they will take some time to implement.

I sure hope it works correctly because I've started using it on my Haswell box.

Download link: [url]http://www.sendspace.com/file/k66yc4[/url][/QUOTE]

Any timeframe for a Linux v28.x?

[QUOTE=TheJudger;345339]my 4770k - continued

I see two options:[LIST=1][*]I'm too stupid to mount the heatsink properly [B]and[/B] I'm too stupid to run Prime95 (mprime)[*]"Others" don't stress their CPUs as hard as I do[/LIST]
i7 4770k + Gigabyte Z87X-UD3H + 2x 8GiB DDR3-2133 1.50V + 1x SATA HDD + 1x SATA SSD + Thermalright HR-02 Macho + Noctua NF-P12 @full speed, [I]80minus[/I] power supply
temporary build open on table, ambient temperature ~22°C
BIOS settings: Gigabytes BIOS defaults, voltages set to "normal", hyperthreading disabled, memory set to "XMP Profile 1"
OS: openSUSE 12.3, 64bits of course

Optimized HPL (Linpack) making heavy usage of AVX+FMA: 210W measured on AC, CPU reports ~120W, CPU temperatures 92-95°C
Prime95 (mprime v27.9), "blend test": 150-170W measured on AC, CPU reports 70-90W, CPU temperatures 60-72°C
Prime95 out indicates that it is using AVX FFTs, power and temperatures varies over different FFT lengths while HPL power consumption is very stable.

Oliver[/QUOTE]

Seems that I had bad luck twice:[LIST=1][*]I guess I know how to put some stress on CPU[*]I got a [B]really, really, really bad[/B] 4770k :sad: minimum vCore (25mV steps tested) not causing a total system lockup within 15 minutes:[LIST][*]4000MHz + mprime v27.9: 1.15V[*]4000MHz + HPL: 1.20V[*]4100MHz + mprime v27.9: 1.20V (CPU reports package power of up to 93W, CPU temperatures below 70°C)[*]4100MHz + HPL: 1.25V (CPU hits 100°C on core #2 and throttles sometimes, CPU reports a package power of up to 139W)[/LIST][/LIST]
I managed to improve the CPU temperatures by 3-4°C by placing the heatsink a few mm off-center. The Haswell die isn't located in the middle of the CPU package...

Oliver

ldesnogu 2013-09-01 02:00

On my stock 4770K (HT enabled, RAM @2400) with Noctua NH-U14S (with a second fan), LinX makes the CPU go up to ~93°C with 8 threads (one core reached 97°C) and 2°C less with 4 threads (one core reached 97°C too). I get about 165 GFLOPS in both cases (this is way above the overclocked results I found; I guess the latest version is faster).

This is hot for sure. I think I'll have to play with VCORE to reduce it too...

Prime95 2013-09-01 04:55

[QUOTE=TheJudger;351510]Any timeframe for a Linux v28.x?[/QUOTE]

Try [url]http://www.sendspace.com/file/l4k4bk[/url]

It is untested. I installed Ubuntu 10.04 in a VirtualBox VM, but mprime doesn't recognize the FMA feature. No idea if this is an Ubuntu, VirtualBox, or mprime problem.


All times are UTC.