mersenneforum.org  

Old 2013-08-13, 02:56   #232
ewmayer
2ω=0
 
Sep 2002
República de California

19·613 Posts

Quote:
Originally Posted by kracker
I finally decided to go with this, as I don't exactly have a huge money tree... Well, all I'm hoping for is decent performance compared to the 2.3 GHz Core2 Duo I had before.

That costs a smidge more per byte than the DDR3-2400 George & I got ... is the issue that the 2400 is not available [or not as cheap per byte] in a 4GB option?
Old 2013-08-13, 04:19   #233
TheMawn

May 2013
East. Always East.

11×157 Posts

I don't think there is much engineering put into new 2GB sticks. They are becoming a bit of a thing of the past... Hooray for the future.

Also, a budget build CPU isn't going to be able to handle 2400MHz RAM. All of ASUS's boards can handle up to 2800MHz (in the Z77 lineup, anyway) but they only "guarantee" that with an i7-3770K, and not even with an i5-3570K. If you're going to run something like an i3, I can't see 2400MHz memory working. Nor would it be necessary.

I've gone on a bit of an editing spree since I'm spewing nonsense all over the place. Haswells look to be costing as much as Ivy Bridge so go fourth gen all the way if you ask me. The i3's aren't actually out yet but you're getting two cores with i3 and four cores with i5 for about 50% more money.

You can get a CPU for $180, a board for $150, the RAM you picked out is $50, and you can get pretty cost-effective GPUs for $200 apiece. On the other hand, I have a $250 CPU, a $250 board, $150 memory and a $400 GPU, and I don't think my system is going to beat yours in proportion to the cost, $1050 to $580. I'm kind of liking the sound of a budget PC, to be frank.

Last fiddled with by TheMawn on 2013-08-13 at 04:29
Old 2013-08-14, 00:03   #234
ewmayer
2ω=0

Sep 2002
República de California

19·613 Posts

Quote:
Originally Posted by ewmayer
Took a close look at the offending code here again the past several days and finally found the problem - briefly, I constructed the AVX-based Mersenne-mod carry macros by fusing the fancy-indexing footwork of the legacy SSE2 mersenne-mod-DWT carry macros and the AVX data-permute aspects of the AVX-based Fermat-mod carry macros - the result ran incredibly, awfully, unbelievably slowly in the initial implementation, which came online just before Haswell hit the market. I have traced the problem back to the mixing of legacy SSE instructions (using xmm-form registers) in the indexing-computation portions of the code with AVX instructions used for weights and carries in the new AVX code.
OK, I propagated the fixed-up carry macro code to all of my SIMD-optimized fused final-iFFT-pass/carry/initial-fFFT-pass routines - I have such for DFT-pass radices 16,20,24,28,32,36,40,44,52 and 60 - and after the expected round of debugging work, just ran a set of benchmarks on my Haswell quad to gauge the impact of using the "true AVX" carry math on Mersenne-mod arithmetic. Summary table follows.

# Test #5: AVX mode [now also including Mers-mod carry step] on 3.4 GHz Haswell quad, DDR3 2400 SDRAM (PC3 19200); times in ms/iteration:
Code:
             Mersenne-mod:                  Fermat-mod:
FFT len   #threads [1 thread/core]
(Kdbl)    1       2       4             1       2       4
----    -----   -----   -----         -----   -----   -----
 896      9.3     4.9     2.9           8.3     4.4     2.6
 960     10.1     5.2     2.9           9.0     4.7     2.6
1024     10.4     5.5     3.0           9.7     5.0     2.9
1152     12.2     6.4     3.8
1280     14.2     7.4     4.1
1408     16.1     8.4     4.9
1536     17.4     9.8     5.3
1664     18.7     9.7     5.5
1792     21.3    11.1     6.6           19.2   10.1     5.8
1920     22.0    11.5     6.6           19.1   10.4     5.7
2048     23.9    12.4     7.1           22.2   11.6     6.6
2304     27.5    14.5     8.6
2560     32.2    16.8     9.5
2816     36.5    18.9    11.1
3072     35.8    19.3    12.2
3328     42.2    21.9    12.5
3584     43.4    22.9    14.6           38.6   20.5    12.9
3840     49.5    25.5    14.8           45.1   23.4    13.1
4096     49.1    26.2    16.0           44.5   23.6    15.0
4608     57.3    30.0    18.8
5120     65.4    34.5    21.0
5632     74.4    38.9    23.5
6144     68.2    38.1    27.6
6656     87.2    45.3    27.3
7168     83.0    45.6    33.7           74.8   41.2    30.8
7680    102.3    52.9    31.9           91.9   48.0    29.0
8192    104.7    52.2    37.3           86.9   47.4    34.4
-----------------------------------------------------------
Avg || Scaling: 1.903x  3.040x          ----  1.875x  2.908x
Avg runtime ratio for Mersenne-mod vs Fermat-mod for FFT lengths supporting both kinds of arithmetic:
        1.127x  1.102x  1.099x
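The per-row figures behind the scaling and Mersenne-vs-Fermat summary lines can be recomputed along the following lines. This is a minimal sketch using three sample rows copied from the table; the helper names `speedup` and `mers_vs_fermat` are made up for illustration, and since the post does not say exactly how the averages were formed, the sketch only shows per-row values and does not try to reproduce the averages.

```python
# Timings (ms/iter) copied from the table above for three sample
# FFT lengths (Kdbl); inner keys are thread counts.
mersenne = {
    1024: {1: 10.4, 2: 5.5, 4: 3.0},
    2048: {1: 23.9, 2: 12.4, 4: 7.1},
    4096: {1: 49.1, 2: 26.2, 4: 16.0},
}
fermat = {
    1024: {1: 9.7, 2: 5.0, 4: 2.9},
    2048: {1: 22.2, 2: 11.6, 4: 6.6},
    4096: {1: 44.5, 2: 23.6, 4: 15.0},
}


def speedup(times, threads):
    """Parallel speedup: 1-thread time divided by n-thread time."""
    return times[1] / times[threads]


def mers_vs_fermat(fft_len, threads):
    """Runtime ratio Mersenne-mod / Fermat-mod at one FFT length."""
    return mersenne[fft_len][threads] / fermat[fft_len][threads]


for n in sorted(mersenne):
    row = mersenne[n]
    print(
        f"{n}K: 2-thread {speedup(row, 2):.2f}x, "
        f"4-thread {speedup(row, 4):.2f}x, "
        f"Mers/Fermat (1 thread) {mers_vs_fermat(n, 1):.3f}x"
    )
```

Extending the two dictionaries to every row of the table and averaging the per-row ratios is one plausible way to arrive at summary numbers of the kind shown, but that is an assumption about the method.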

Notes:

0. I simply did 1000-iteration timings for all these, and made no effort to account for initialization overhead, thus the times are likely a few % pessimistic;

1. The 10-15% slowness of Mersenne-mod relative to Fermat-mod is expected for my code: unlike George's, which uses an optimized real-vector transform that is ideal for the real-signal Mersenne-mod IBDWT, my code is written around a more-general-purpose complex-signal FFT. Thus it is really more geared toward Fermat-mod arithmetic, where the negacyclic-transform-effecting DWT involves complex-valued weights and yields a so-called "right-angle" transform, which is ideal for handling via complex-signal FFT. For real-signal inputs such as those in Mersenne-mod arithmetic we need to wrap the dyadic-squaring step occurring between the forward and inverse FFT in a complex/real/complex wrapper step, which typically results in a 10-20% runtime hit, depending on runlength and platform.

2. 3072K is the optimal runlength [among this menu of choices] for the most recent M-prime. My target for such official verifies is typically to get the per-iteration time below 10ms [8.64 ms translates to 10 Miters/day, by way of handy rule of thumb]. For the verify of M57885161 Serge Batalov used an SSE2 build of Mlucas and found that the best [in terms of absolute throughput, not per-core efficiency] option on the 32-core Xeon [pre-Sandy-Bridge, i.e. no AVX option] cluster he had access to was to run at the next-higher available FFT length of 3328K, using 32 threads [precisely speaking, a combination of 26 and 32-threads, corresponding to the 2 distinct modmul phases Mlucas arranges things in]. He was able to get right around 8.6 ms/iter that way - so now we are within spitting distance of that total throughput using just 4 Haswell cores. [If I OC'ed aggressively like George does on his system I could probably get the 3072K 4-thread timing down to right around 10 ms/iter]. It will be interesting to see what kind of parallel scalings we can get on Haswell-based [or Ivy Bridge] systems with more than 4 cores. There is usually a big dropoff in parallelism beyond 4 cores ... for M-prime verifies we are usually elated to get *any* added total-throughput boost on > 4 cores.
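The rule of thumb in note 2 is plain unit conversion, and the same arithmetic gives a rough wall-clock estimate for a full test. A minimal sketch, assuming the verify is a Lucas-Lehmer test (which needs p - 2 modular squarings for exponent p) and ignoring startup and checkpointing overhead; the function names are invented for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in a day


def iters_per_day(ms_per_iter):
    """Iterations completed in 24 hours at a fixed per-iteration time."""
    return MS_PER_DAY / ms_per_iter


def ll_test_days(exponent, ms_per_iter):
    """Days for a Lucas-Lehmer test of M(exponent): exponent - 2 squarings."""
    return (exponent - 2) * ms_per_iter / MS_PER_DAY


# 8.64 ms/iter is the "10 Miters/day" figure from the note.
print(f"{iters_per_day(8.64):.0f} iterations/day at 8.64 ms/iter")

# M57885161 at the ~8.6 ms/iter quoted for the 32-core cluster run, and at
# the 12.2 ms/iter 4-thread 3072K timing from the table.
print(f"{ll_test_days(57885161, 8.6):.2f} days at 8.6 ms/iter")
print(f"{ll_test_days(57885161, 12.2):.2f} days at 12.2 ms/iter")
```

On these assumptions the exponent works out to a bit under 6 days at 8.6 ms/iter and a bit over 8 days at 12.2 ms/iter.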
Old 2013-08-17, 23:03   #235
kracker

"Mr. Meeseeks"
Jan 2012
California, USA

23·271 Posts

Hmm.. maybe I should have gotten faster memory. I just tried some tests (my Haswell CPU is still in the box) on my dual-core Ivy Bridge (i3-3220) with a single 4GB stick of 1600 MHz RAM; this is what I get:

One thread: 11 ms.
Two threads: 17 ms each.
(DC tests)

Duh.

EDIT: Is P-1 less memory intensive than LL btw?

Last fiddled with by kracker on 2013-08-17 at 23:08
Old 2013-08-18, 01:52   #236
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1D77₁₆ Posts

Quote:
Originally Posted by kracker
Hmm.. maybe I should have gotten faster memory. I just tried some tests (my Haswell CPU is still in the box) on my dual-core Ivy Bridge (i3-3220) with a single 4GB stick of 1600 MHz RAM; this is what I get:

One thread: 11 ms.
Two threads: 17 ms each.
You have to use two sticks of memory to take advantage of dual-channel memory. In essence, you are running your memory subsystem at half of its capabilities.
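To put numbers on the "half of its capabilities" point: theoretical peak DDR3 bandwidth is the transfer rate times 8 bytes per 64-bit channel times the number of channels. The sketch below is only that textbook formula with an invented function name; sustained bandwidth in real workloads is lower, and nothing here is measured.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s.

    mega_transfers_per_s: the DDR3 speed grade (e.g. 1600 for DDR3-1600).
    bus_bytes: 8 bytes for a standard 64-bit memory channel.
    """
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


configs = [
    ("DDR3-1600, one stick (single channel)", 1600, 1),
    ("DDR3-1600, two sticks (dual channel)", 1600, 2),
    ("DDR3-2400, two sticks (dual channel)", 2400, 2),
]
for label, rate, channels in configs:
    print(f"{label}: {peak_bandwidth_gb_s(rate, channels):.1f} GB/s")
```

The per-channel figure for DDR3-2400 (19.2 GB/s) is where the "PC3 19200" label in the benchmark post earlier in the thread comes from.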
Old 2013-08-22, 15:42   #237
kracker

"Mr. Meeseeks"
Jan 2012
California, USA

23·271 Posts

On my quad Haswell with dual-channel 1600:

1 thread: 9 ms
2 threads: 10 ms
3 threads: 11 ms
4 threads: 14 ms

Is P-1 as memory-bandwidth limited as LL, or not?
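One way to read these timings is as total throughput. The sketch below assumes each figure is the per-worker time per iteration with that many independent workers running at once (the earlier post says "17 ms each", but this one does not say so explicitly); the function name is made up for illustration.

```python
# ms per iteration reported for 1-4 threads on the quad Haswell.
# Assumption: each value is per worker, with all workers running at once.
timings_ms = {1: 9, 2: 10, 3: 11, 4: 14}


def total_iters_per_sec(workers, ms_per_iter):
    """Combined iterations per second across all workers."""
    return workers * 1000 / ms_per_iter


previous = 0.0
for workers in sorted(timings_ms):
    total = total_iters_per_sec(workers, timings_ms[workers])
    print(f"{workers} worker(s): {total:6.1f} iter/s total (+{total - previous:.1f})")
    previous = total
```

Under that reading, going from three to four workers adds only about 13 iter/s on top of roughly 273, which is the pattern one would expect once memory bandwidth is the limit.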
Old 2013-08-24, 01:56   #238
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

19×397 Posts
Version 28.1 preview

For any Haswell owners that are interested, an evaluation version 28.1 is available. I have some ideas to improve it further, but they will take some time to implement.

I sure hope it works correctly because I've started using it on my Haswell box.

Download link: http://www.sendspace.com/file/k66yc4
Old 2013-08-27, 06:30   #239
NBtarheel_33

"Nathan"
Jul 2008
Maryland, USA

10001011011₂ Posts

Quote:
Originally Posted by Prime95
I sure hope it works correctly because I've started using it on my Haswell box.
My rule-of-thumb is that if it runs a successful double-check, then all must be well.

Any benefits to non-Haswell adopters of this version?

And any plans to change the name of the secret forum to "David Haswellhoff"?
Old 2013-08-31, 21:22   #240
TheJudger

"Oliver"
Mar 2005
Germany

11·101 Posts

Quote:
Originally Posted by Prime95
For any Haswell owners that are interested, an evaluation version 28.1 is available. I have some ideas to improve it further, but they will take some time to implement.

I sure hope it works correctly because I've started using it on my Haswell box.

Download link: http://www.sendspace.com/file/k66yc4
Any timeframe for a Linux v28.x?

Quote:
Originally Posted by TheJudger
my 4770k - continued

I see two options:
  1. I'm too stupid to mount the heatsink properly and I'm too stupid to run Prime95 (mprime)
  2. "Others" don't stress their CPUs as hard as I do

i7 4770k + Gigabyte Z87X-UD3H + 2x 8GiB DDR3-2133 1.50V + 1x SATA HDD + 1x SATA SSD + Thermalright HR-02 Macho + Noctua NF-P12 @full speed, 80minus power supply
temporary build open on table, ambient temperature ~22°C
BIOS settings: Gigabyte's BIOS defaults, voltages set to "normal", hyperthreading disabled, memory set to "XMP Profile 1"
OS: openSUSE 12.3, 64bits of course

Optimized HPL (Linpack) making heavy usage of AVX+FMA: 210W measured on AC, CPU reports ~120W, CPU temperatures 92-95°C
Prime95 (mprime v27.9), "blend test": 150-170W measured on AC, CPU reports 70-90W, CPU temperatures 60-72°C
Prime95 output indicates that it is using AVX FFTs; power and temperatures vary over different FFT lengths, while HPL power consumption is very stable.

Oliver
Seems that I had bad luck twice:
  1. I guess I know how to put some stress on a CPU
  2. I got a really, really, really bad 4770k

Minimum vCore (tested in 25mV steps) not causing a total system lockup within 15 minutes:
    • 4000MHz + mprime v27.9: 1.15V
    • 4000MHz + HPL: 1.20V
    • 4100MHz + mprime v27.9: 1.20V (CPU reports package power of up to 93W, CPU temperatures below 70°C)
    • 4100MHz + HPL: 1.25V (CPU hits 100°C on core #2 and throttles sometimes, CPU reports a package power of up to 139W)

I managed to improve the CPU temperatures by 3-4°C by placing the heatsink a few mm off-center. The Haswell die isn't located in the middle of the CPU package...

Oliver
Old 2013-09-01, 02:00   #241
ldesnogu

Jan 2008
France

2×5²×11 Posts

On my stock 4770K (HT enabled, RAM @2400) with Noctua NH-U14S (with a second fan), LinX makes the CPU go up to ~93°C with 8 threads (one core reached 97°C) and 2°C less with 4 threads (one core reached 97°C too). I get about 165 GFLOPS in both cases (this is way above the overclocked results I found; I guess the latest version is faster).

This is hot for sure. I think I'll have to play with VCORE to reduce it too...
Old 2013-09-01, 04:55   #242
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

16567₈ Posts

Quote:
Originally Posted by TheJudger
Any timeframe for a Linux v28.x?
Try http://www.sendspace.com/file/l4k4bk

It is untested. I installed Ubuntu 10.04 in a VirtualBox VM, but mprime doesn't recognize the FMA feature. No idea if this is an Ubuntu, VirtualBox, or mprime problem.
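One quick way to narrow this down is to check whether the guest kernel itself sees the FMA flag. On x86 Linux the `flags` line of `/proc/cpuinfo` lists `fma` (and `avx2`) when the CPU features are exposed to the system. The sketch below only parses that file; the helper names are made up, and it says nothing about how mprime does its own detection.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of
    /proc/cpuinfo-style text, or an empty set if there is none."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_fma(cpuinfo_text):
    """True if the 'fma' flag is listed in the given cpuinfo text."""
    return "fma" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    flags = cpu_flags(text)
    print("fma:", "fma" in flags, "avx2:", "avx2" in flags)
```

If the flag is missing inside the VM but present on the host, the hypervisor is not passing the feature through, which would point away from mprime.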

Last fiddled with by Prime95 on 2013-09-01 at 04:56

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.