mersenneforum.org  

Old 2013-06-30, 17:54   #155
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA

23×271 Posts
Default

+1

One more thing: make sure the heatsink is *properly* seated. I've had that happen before, when it was not completely secured... And make sure to use good thermal grease for it.
Old 2013-06-30, 21:45   #156
TheJudger
 
"Oliver"
Mar 2005
Germany

11·101 Posts
Default

kladner: those AMDs have a bigger die and can thus move heat out of the silicon more easily (315 mm² vs. 177 mm²). Of course I have double-checked the heatsink.

I'm afraid that I'm very good at putting load on the CPU... In some overclocking forums I've noticed people who claim they can run their 4770k @4.5GHz, 1.4V easily on air while running LinX (Linpack for Windows)... well, they used an old LinX which (I guess) only does SSE. At 4.5GHz the screenshot revealed ~60GFLOPS...
Back to my issue: it seems that the PCU (Power Control Unit, part of the CPU) and the BIOS (OK, OK, EFI) aren't on my side. With default BIOS settings the system runs the 3.9GHz 4-core turbo under heavy load (easily exceeding the TDP)... The CPU should do at most the 3.7GHz 4-core turbo to stay within spec. For each step above the non-turbo multiplier the PCU adds some voltage, and there are comments on the web that it adds even more for AVX code. I've measured ~1.26V under load (~1.1V default vCore). With the voltage manually set to 1.100V at 4GHz I was able to keep the CPU temperatures at ~75°C while running Linpack.

Oliver
Old 2013-06-30, 22:36   #157
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

2×3×1,693 Posts
Default

Thanks for the details on voltage and temperature. I know you would have checked the heatsinking, but over 90 C is in the borderlands even for Intel: startling, I would call it, even under extreme loads with anything but a stock cooler. To mention it is on the order of asking "Is the power cord connected?" The Linpack version you reference must be one mean mofo!
Old 2013-06-30, 22:59   #158
TheJudger
 
"Oliver"
Mar 2005
Germany

2127₈ Posts
Default

I'll continue on this next weekend; perhaps (for comparison) I should check the temperatures while running mprime. A finely tuned Linpack (hpl-2.0 + Intel MKL 11.someversion, properly chosen parameters for HPL, and process pinning) is my worst-case scenario for temperature and power consumption. If a system can handle this I feel pretty comfortable with real-world applications. Linpack makes heavy use of the new dual-FMA capability of the Haswell chips (16 DP ops per clock and core).

Edit: the Windows "LinX" isn't that bad if you choose the right version (AVX-capable; check the performance; for comparison, I can do 200GFLOPS with 4 cores @3.9GHz on my system). If you want to give it a try, it's much easier than compiling the whole stuff yourself.
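
For reference, the figures above can be put in relation to the theoretical peak. The following is a minimal sketch in C, not anything from Oliver's setup; the function names `peak_gflops` and `efficiency` are made up for illustration. It takes the 16 DP ops per clock and core quoted in the post and computes the peak for a given core count and clock, then the fraction of that peak a measured result reaches.

```c
#include <assert.h>

/* Theoretical double-precision peak in GFLOPS:
 * cores * clock (GHz) * DP flops per core per clock. */
static double peak_gflops(int cores, double clock_ghz, int flops_per_clock)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_clock > 0);
    return cores * clock_ghz * flops_per_clock;
}

/* Fraction of the theoretical peak that a measured result reaches. */
static double efficiency(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

With the numbers from the post, `peak_gflops(4, 3.9, 16)` gives 249.6 GFLOPS, so 200 GFLOPS is roughly 80% of peak. Assuming 4 DP flops per clock and core for an SSE-only code path (an assumption, not stated in the thread), `peak_gflops(4, 4.5, 4)` gives 72 GFLOPS, which would be consistent with the guess that a ~60 GFLOPS LinX result came from an SSE-only build.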

Oliver

Last fiddled with by TheJudger on 2013-06-30 at 23:02
Old 2013-07-01, 03:30   #159
ewmayer
2ω=0
 
Sep 2002
República de California

11647₁₀ Posts
Default

Quote:
Originally Posted by ewmayer View Post
After getting your e-mail a couple days ago in which you first described some of the above issues (and with advance knowledge that ADD/SUB are inherently limited to just 1 of the 2 issue ports), I came to the same conclusion - been spending the last 2 days restructuring (so far just the scalar-data C-code version of) one of my FFT-core building blocks, the radix-16 DIF DFT-with-twiddles macro, to use all FMA-based arithmetic.
Above radix-16 macro fully C-prototyped - first using FMA4 as the model for simplicity, then converting that to use FMA3 (in which the 'c' in FMA3(a,b,c) = a*b + c gets overwritten by the result), keeping in mind the 16-register constraint of the Intel CPUs. Assembly coding begins tomorrow...
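
To illustrate what "all FMA-based arithmetic" can look like, here is a minimal sketch of a much smaller building block than the radix-16 macro: a radix-2 butterfly with one twiddle multiply. It is not ewmayer's code; the type `cplx` and the helpers `fmadd` and `butterfly_fma` are hypothetical names, and `fmadd` is a plain-C stand-in for one hardware FMA.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;

/* Stand-in for one hardware FMA: a*b + c. Written as plain C so the sketch
 * has no library dependency; fma() from <math.h> is the fused equivalent. */
static inline double fmadd(double a, double b, double c) { return a * b + c; }

/* Radix-2 butterfly with twiddle w: out0 = x + w*y, out1 = x - w*y.
 * Every arithmetic step is a multiply-add, so none is a bare ADD/SUB.
 * In assembly the negated products would map to the FNMADD forms. */
static void butterfly_fma(cplx x, cplx y, cplx w, cplx *out0, cplx *out1)
{
    assert(out0 && out1);
    double pr = fmadd( w.re, y.re, x.re);   /* x.re + w.re*y.re */
    double pi = fmadd( w.re, y.im, x.im);   /* x.im + w.re*y.im */
    double mr = fmadd(-w.re, y.re, x.re);   /* x.re - w.re*y.re */
    double mi = fmadd(-w.re, y.im, x.im);   /* x.im - w.re*y.im */
    out0->re = fmadd(-w.im, y.im, pr);      /* x.re + Re(w*y) */
    out0->im = fmadd( w.im, y.re, pi);      /* x.im + Im(w*y) */
    out1->re = fmadd( w.im, y.im, mr);      /* x.re - Re(w*y) */
    out1->im = fmadd(-w.im, y.re, mi);      /* x.im - Im(w*y) */
}
```

The textbook form of the same butterfly uses 4 MULs and 6 ADD/SUBs; this version uses 8 FMAs and no standalone ADD/SUB, which is the kind of trade the restructuring is after when ADD/SUB can issue on only one port.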
Old 2013-07-02, 18:31   #160
ewmayer
2ω=0
 
Sep 2002
República de California

19·613 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Above radix-16 macro fully C-prototyped - first using FMA4 as the model for simplicity, then converting that to use FMA3 (in which the 'c' in FMA3(a,b,c) = a*b + c gets overwritten by the result), keeping in mind the 16-register constraint of the Intel CPUs. Assembly coding begins tomorrow...
IAAI [i am an idiot] - in the Intel FMA3 model it's 'a' [i.e. the first of the 2 multiplicands] which is also used to store the result. Led astray by the opposite-operand-ordering of Intel vs AT&T syntax once again...

Anyhoo, rejiggering the prototype code shouldn't be too hard, just a lot of swapping out what goes into various register-copy temporaries. Still annoyed @myself for wasting my own time, though. Will try to use the extra work to also do some 2nd-pass optimization, so as to not make it feel entirely redundant - save a few register copies and improve the instruction scheduling to better hide latency.
Old 2013-07-02, 19:55   #161
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

1D77₁₆ Posts
Default

Quote:
Originally Posted by ewmayer View Post
IAAI [i am an idiot] - in the Intel FMA3 model it's 'a' [i.e. the first of the 2 multiplicands] which is also used to store the result.
You can overwrite either a, b, or c. To overwrite c, use vfmadd231.

I wrote a MASM macro that takes 4 args and outputs the optional register copy and the appropriate 132, 231, 213 version of the FMA instruction.
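
The idea of picking the form from the destination register can be sketched as follows. This is a hypothetical C helper, not George's MASM macro: the name `emit_fma`, its signature, and the choice of `ymm` registers with packed doubles are all assumptions made for illustration. It relies only on the documented operand roles of the three FMA3 forms, listed in the first comment.

```c
#include <assert.h>
#include <stdio.h>

/* Intel-syntax operand roles for vfmaddNNNpd dst, src2, src3:
 *   132: dst = dst  * src3 + src2
 *   213: dst = src2 * dst  + src3
 *   231: dst = src2 * src3 + dst
 */

/* Hypothetical helper (not George's macro): write the instruction(s) that
 * compute d = a*b + c into buf, where d, a, b, c are ymm register numbers.
 * Returns how many instructions were emitted. */
static int emit_fma(char *buf, size_t n, int d, int a, int b, int c)
{
    assert(buf != NULL && n > 0);
    if (d == c) {   /* destination is the addend */
        snprintf(buf, n, "vfmadd231pd ymm%d, ymm%d, ymm%d", d, a, b);
        return 1;
    }
    if (d == a) {   /* destination is the first multiplicand */
        snprintf(buf, n, "vfmadd213pd ymm%d, ymm%d, ymm%d", d, b, c);
        return 1;
    }
    if (d == b) {   /* destination is the second multiplicand */
        snprintf(buf, n, "vfmadd213pd ymm%d, ymm%d, ymm%d", d, a, c);
        return 1;
    }
    /* Destination is a fourth register: copy the addend, then overwrite it. */
    snprintf(buf, n, "vmovapd ymm%d, ymm%d\nvfmadd231pd ymm%d, ymm%d, ymm%d",
             d, c, d, a, b);
    return 2;
}
```

For example, `emit_fma(buf, sizeof buf, 2, 0, 1, 2)` writes `vfmadd231pd ymm2, ymm0, ymm1`. The 132 form never appears because the sketch handles register operands only; it typically matters when the third operand, the only one that may come from memory, has to be a multiplicand while the destination holds the other multiplicand.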
Old 2013-07-02, 20:50   #162
ewmayer
2ω=0
 
Sep 2002
República de California

19×613 Posts
Default

Ah, very good - nice of Intel to at least provide some options here, given that they don't (yet) support the desired FMA4 syntax.
Old 2013-07-05, 19:17   #163
TheJudger
 
"Oliver"
Mar 2005
Germany

2127₈ Posts
Default

my 4770k - continued

I see two options:
  1. I'm too stupid to mount the heatsink properly and I'm too stupid to run Prime95 (mprime)
  2. "Others" don't stress their CPUs as hard as I do

i7 4770k + Gigabyte Z87X-UD3H + 2x 8GiB DDR3-2133 1.50V + 1x SATA HDD + 1x SATA SSD + Thermalright HR-02 Macho + Noctua NF-P12 @full speed, 80minus power supply
temporary build open on table, ambient temperature ~22°C
BIOS settings: Gigabyte's BIOS defaults, voltages set to "normal", hyperthreading disabled, memory set to "XMP Profile 1"
OS: openSUSE 12.3, 64bits of course

Optimized HPL (Linpack) making heavy usage of AVX+FMA: 210W measured on AC, CPU reports ~120W, CPU temperatures 92-95°C
Prime95 (mprime v27.9), "blend test": 150-170W measured on AC, CPU reports 70-90W, CPU temperatures 60-72°C
Prime95's output indicates that it is using AVX FFTs. Power and temperatures vary over the different FFT lengths, while HPL power consumption is very stable.

Oliver

Last fiddled with by TheJudger on 2013-07-05 at 19:18
Old 2013-07-05, 19:35   #164
ewmayer
2ω=0
 
Sep 2002
República de California

19×613 Posts
Default

Quote:
Originally Posted by TheJudger View Post
Optimized HPL (Linpack) making heavy usage of AVX+FMA: 210W measured on AC, CPU reports ~120W, CPU temperatures 92-95°C
Prime95 (mprime v27.9), "blend test": 150-170W measured on AC, CPU reports 70-90W, CPU temperatures 60-72°C
Since these are on the same setup, it appears there is a significant load-dependent aspect.

You note that Linpack is making heavy use of AVX+FMA. That is one obvious difference between it and Prime95: George is busy adding FMA usage, but your version uses just FMA-less AVX. The dual FMA units effectively double the floating-MUL bandwidth (also ADD, but the MUL is the biggie here), and those MULs generate a lot of heat.

Also, linear algebra tends to be able to use the FPU at much closer to max. theoretical capacity than FFTs, because the data access patterns are much simpler and the arithmetic mix is more favorable in the sense that optimized FFTs are ADD-dominated.
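
The effect of the arithmetic mix can be made concrete with a toy issue model. This is a sketch under stated assumptions, not a description of the real scheduler: it assumes two issue ports where MUL and FMA can go to either port but ADD/SUB only to one of them, and it ignores latency, loads, stores and register pressure. The function name `min_cycles` is made up for illustration.

```c
#include <assert.h>

/* Toy throughput model: lower bound on issue cycles for a block of
 * independent floating-point ops, assuming two issue ports where MUL and
 * FMA can use either port but ADD/SUB can use only one of them. */
static int min_cycles(int muls, int adds, int fmas)
{
    assert(muls >= 0 && adds >= 0 && fmas >= 0);
    int total = muls + adds + fmas;
    int by_width = (total + 1) / 2;            /* at best two ops per cycle */
    return adds > by_width ? adds : by_width;  /* ADDs serialize on one port */
}
```

Under this model a textbook radix-2 butterfly with a twiddle multiply (4 MUL + 6 ADD/SUB, an assumed op count) needs `min_cycles(4, 6, 0)` = 6 cycles because the ADDs queue on one port, while the same work recast as 8 FMAs needs `min_cycles(0, 0, 8)` = 4. A matrix-multiply inner loop is multiply-accumulate throughout, so it can keep both ports busy every cycle, which fits the higher power draw seen with HPL.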

George, have you noticed any temperature impact from using FMA in your development code?
Old 2013-07-05, 19:50   #165
TheJudger
 
"Oliver"
Mar 2005
Germany

1111₁₀ Posts
Default

Hi Ernst,

yepp, HPL is a power virus (but not the worst I can imagine*).
This perfectly explains why I have so much trouble cooling my CPU while others say that they can handle the heat of a Haswell.
In the past (a few years ago) Prime95's power consumption/heat generation was close to Linpack's, but today...

*I guess running only DGEMM (BLAS) on reasonably sized inputs is even worse than Linpack. Linpack spends much of its time in this standard function but not all of it; there are other calls to the BLAS library and communication between processes/threads as well. I'm using Intel MKL as the BLAS implementation; those functions are designed for optimal performance (not for maximum power consumption/heat generation). So I guess Intel won't say "don't run this code on our CPUs, it is just a stupid power virus".

Oliver

P.S. I've just improved my HPL settings: 1-2W more, 203GFLOPS @default clock

Last fiddled with by TheJudger on 2013-07-05 at 19:59
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.