mersenneforum.org  

Old 2013-06-23, 04:38   #144
TheMawn
 
May 2013
East. Always East.

11×157 Posts

Quote:
Originally Posted by ewmayer View Post
Funny thing happened last night - I was in bed but not yet asleep, when I heard the Haswell CPU fan - case is sitting open on my desk about 4 feet away from my head - suddenly slow down for about 1 second, before resuming back to normal speed. That got me wondering, so I listened for another half-hour, same thing happened several more times. Then I noticed that this was happening every 10 minutes, almost to the minute - and my per-iteration time for 4-threaded F28 run is currently 0.0615 seconds. It's the savefile writes, which occur every 10000 iterations ... I use an SSD so there's no disk-write noise, but the brief interval in which multithreaded crunching stops and the current floating-point residue gets converted to bytewise endian-independent form and written to disk is enough to cause an audible "hiccup" in CPU fan speed.
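As a quick check of the numbers in the quote: 10000 iterations at 0.0615 s each comes to 615 s, a little over 10 minutes between savefile writes. A minimal Python sketch of that arithmetic (the function name is only for illustration):

```python
def checkpoint_interval_seconds(iters_per_save, seconds_per_iter):
    """Wall-clock time between savefile writes, ignoring the write itself."""
    return iters_per_save * seconds_per_iter


# Figures taken from the quoted post: 10000 iterations per save, 0.0615 s each.
interval = checkpoint_interval_seconds(10000, 0.0615)
minutes, seconds = divmod(round(interval), 60)
print(f"{interval:.0f} s between saves = {minutes} min {seconds} s")
```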

"I can hear it working" (or 'not working', in this case).

If you run something like CPUID Hardware Monitor, you will notice that your current CPU temperature will be a few degrees below the max recorded temperature but WELL above the lowest recorded. For example, as we speak, CURRENT 72C, MAX 77C, MIN 30C, for me.

The audible hiccups in the fan speed are an indication that your cooling solution is running properly and that the chip doesn't have much thermal inertia. My chip is at roughly 72C at the moment. If I take off the load for even two seconds, the temperature drops to 40C. Another two seconds and it's into the low 30s.
Old 2013-06-23, 06:09   #145
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

22677₈ Posts

Quote:
Originally Posted by TheMawn View Post
The audible hiccups in the fan speed are an indication that your cooling solution is running properly and that the chip doesn't have much thermal inertia.
+1 to this (I wanted to say something similar, but you were faster)

Last fiddled with by LaurV on 2013-06-23 at 06:09
Old 2013-06-23, 18:44   #146
TheMawn
 
May 2013
East. Always East.

11·157 Posts

Load up your GPU, then stop it for a while and watch as the temperature slowly, slowly crawls its way down to idle. It can take as long as a minute.
Old 2013-06-23, 19:55   #147
ewmayer
2ω=0
 
Sep 2002
República de California

2D7F₁₆ Posts

Quote:
Originally Posted by TheMawn View Post
The audible hiccups in the fan speed are an indication that your cooling solution is running properly and that the chip doesn't have much thermal inertia.
The audible hiccups in the fan speed are an indication that I need to multithread and SIMDize my residue conversion routine. :)
Old 2013-06-24, 04:06   #148
TheMawn
 
May 2013
East. Always East.

11·157 Posts

Quote:
Originally Posted by ewmayer View Post
The audible hiccups in the fan speed are an indication that I need to multithread and SIMDize my residue conversion routine. :)
Well, if you can stagger the saves for each worker, then 7 of 8 workers (or 3 of 4, I can't remember which processor you have) can keep chugging along while the one worker waits for its save to complete. That could be kind of cool.
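One way to picture the staggering idea is to give each worker a fixed offset within the save interval, so that no two workers write on the same iteration. A hypothetical Python sketch follows; the function name, the 0-based worker numbering, and the assumption that all workers advance through iterations in lockstep are illustrative and not taken from any actual program:

```python
def save_iterations(worker, num_workers, interval, total_iters):
    """Iterations at which `worker` (0-based) writes its savefile.

    Every worker keeps the same save interval but is shifted by a
    fraction of it, so the writes are spread evenly over the interval.
    """
    offset = worker * interval // num_workers
    first = offset if offset > 0 else interval
    return list(range(first, total_iters + 1, interval))


# Four workers, a save every 10000 iterations, over 20000 iterations.
schedule = {w: save_iterations(w, 4, 10000, 20000) for w in range(4)}
for w, iters in schedule.items():
    print(f"worker {w}: saves at {iters}")
```

With these inputs, at most one worker is paused for a write at any given iteration.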
Old 2013-06-24, 06:47   #149
ewmayer
2ω=0
 
Sep 2002
República de California

11647₁₀ Posts

Good point - probably best to just copy the residue array and spin the existing savefile-write stuff off into a separate thread - no reason to keep the crunching threads waiting.
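The copy-then-write-in-background idea could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the residue is modelled as a list of integer words, the 8-byte little-endian encoding is an arbitrary choice, and `save_checkpoint_async` and `demo` are hypothetical names, not part of any existing program.

```python
import os
import tempfile
import threading


def save_checkpoint_async(residue, path):
    """Snapshot `residue` and write it to `path` on a background thread.

    Returns the started thread so the caller can join() it later.
    """
    snapshot = list(residue)  # copy first; the caller may now keep mutating `residue`

    def _write():
        # Byte-wise, endian-independent form (fixed little-endian, 8 bytes per word).
        payload = b"".join(int(w).to_bytes(8, "little", signed=True) for w in snapshot)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(payload)
        os.replace(tmp, path)  # swap the finished file into place

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


def demo():
    residue = [1, -2, 3, 4]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "save.bin")
        writer = save_checkpoint_async(residue, path)
        residue[0] = 999  # "crunching" continues while the write is in flight
        writer.join()
        with open(path, "rb") as f:
            data = f.read()
    # Decode the file again to show it holds the snapshot, not the mutated array.
    return [int.from_bytes(data[i:i + 8], "little", signed=True)
            for i in range(0, len(data), 8)]


print(demo())
```

The only work left on the crunching side is the array copy; the conversion and the disk write happen off to the side.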
Old 2013-06-26, 19:22   #150
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

1110101110111₂ Posts

I've been working on figuring out why the key building block macro won't run in the theoretically possible 13 clocks. I mentioned earlier that this macro is taking 15 clocks.

I believe I have discovered the causes of the 2-clock delay. It is actually a combination of factors.

1) Since Intel ditched the 4-operand FMA instruction (A = B * C + D) in favor of a 3-operand version (one must overwrite B, C, or D), my macro is forced to do a number of register copies. These register copies have zero latency, but have a cost as we will see later.
2) If you use both add and subtract instructions as well as FMA (or MUL) instructions on port 1, then you will encounter "dispatch bubbles". This is because the add and sub instructions take 3 clocks while the FMA takes 5. For example, if FMAs are scheduled for clocks 1-5 and 2-6, then an add or subtract cannot be dispatched for clocks 3-5 or 4-6, because you cannot have two instructions both end on the same clock (5 or 6 here). The add must be delayed until clocks 5-7: a two-clock dispatch bubble.
3) To avoid this 3 vs 5 clock dispatch bubble, a macro must contain exactly 25% or 50% add and subtract instructions that execute every other clock cycle or every clock cycle respectively. This is a pretty severe coding restriction.
4) Because of the restrictions in 3, it is best to always use FMA. This means that calculating A+B and A-B (a common FFT operation) requires 3 instructions: a register copy and two FMA instructions rather than 2 instructions: an add and subtract instruction.
5) The chip can retire only 4 instructions per clock cycle. The building block macro does 8 data loads, 4 sin/cos loads, 8 stores, 4 muls, 22 FMAs, and 11 register copies. That is 57 instructions; at 4 retires per clock, you get a 14.25-clock minimum.
6) [retracted]
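Points 2 and 5 can be checked with a toy model in Python. This is a simplified sketch, not a description of the real scheduler: it assumes only the rule stated in point 2 (two instructions on the same port may not complete on the same clock) and the instruction counts from point 5. The function names are hypothetical.

```python
FMA_LATENCY = 5  # clocks, as stated in the post
ADD_LATENCY = 3  # clocks, as stated in the post


def earliest_dispatch(busy_completions, want_start, latency):
    """Earliest start clock >= want_start whose completion clock is free.

    An instruction started on clock s with latency n occupies clocks
    s..s+n-1 and completes on clock s+n-1.
    """
    start = want_start
    while start + latency - 1 in busy_completions:
        start += 1
    return start


def retire_bound(instruction_counts, retire_width=4):
    """Lower bound on clocks imposed by the retirement width alone."""
    return sum(instruction_counts.values()) / retire_width


# Point 2: FMAs dispatched on clocks 1 and 2 complete on clocks 5 and 6.
fma_completions = {s + FMA_LATENCY - 1 for s in (1, 2)}
add_start = earliest_dispatch(fma_completions, 3, ADD_LATENCY)
bubble = add_start - 3
print(f"add dispatches on clock {add_start}, bubble of {bubble} clocks")

# Point 5: instruction mix of the building block macro.
macro = {"data loads": 8, "sin/cos loads": 4, "stores": 8,
         "muls": 4, "FMAs": 22, "register copies": 11}
min_clocks = retire_bound(macro)
print(f"{sum(macro.values())} instructions -> at least {min_clocks} clocks")
```

In this model the add wanted on clock 3 slips to clock 5, and the retirement limit alone rules out the 13-clock target.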


What does all this mean for prime95? Probably not a lot. If I could achieve 13 clocks, the single worker case (or Haswell-E multi-worker case) would be a few percent faster. The 4 worker case will still be bandwidth limited.

Last fiddled with by Prime95 on 2013-06-27 at 16:04
Old 2013-06-26, 20:37   #151
ewmayer
2ω=0
 
Sep 2002
República de California

19×613 Posts

Quote:
Originally Posted by Prime95 View Post
4) Because of the restrictions in 3, it is best to always use FMA. This means that calculating A+B and A-B (a common FFT operation) requires 3 instructions: a register copy and two FMA instructions rather than 2 instructions: an add and subtract instruction.
After getting your e-mail a couple days ago in which you first described some of the above issues (and with advance knowledge that ADD/SUB are inherently limited to just 1 of the 2 issue ports), I came to the same conclusion - been spending the last 2 days restructuring (so far just the scalar-data C-code version of) one of my FFT-core building blocks, the radix-16 DIF DFT-with-twiddles macro, to use all FMA-based arithmetic. Just as playing Tetris would be greatly simplified if all the bricks were the same size and shape, it's far easier to schedule code like this if all the arithmetic instructions are the same (in terms of latency/issuability - we actually use a mix of +-a*b +- c instructions in the form of the 4 corresponding fma/fms/fnma/fnms instructions) and can issue from either port, even if their latency is greater than ADD/SUB. But I'm glad I'm only looking at the 64-bit-OS (16-simd-register) case, because doing this kind of exercise with just 8 registers - even with one FMA input able to be read-from-memory, would be ugly.

Going from ADD/SUB -> FMA for typically-ADD/SUB-dominated DFT macros theoretically doubles our throughput because the resulting number of FMAs should match the previous ADD/SUB total, but 2 FMA can issue per cycle versus just 1 ADD/SUB. It will be interesting to see how much of that theoretical gain is realizable.
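The "theoretically doubles" estimate is just an issue-port count. A small Python sketch, assuming independent operations, one operation per port per cycle, and a made-up operation count of 144 (not a figure from any actual macro):

```python
import math


def issue_cycles(num_ops, ports):
    """Minimum cycles to issue `num_ops` independent ops over `ports` ports."""
    return math.ceil(num_ops / ports)


num_ops = 144  # hypothetical ADD/SUB count for a DFT macro
add_sub_cycles = issue_cycles(num_ops, ports=1)  # ADD/SUB: one issue port
fma_cycles = issue_cycles(num_ops, ports=2)      # FMA: two issue ports
print(add_sub_cycles, fma_cycles, add_sub_cycles / fma_cycles)
```

This ignores latency chains, register copies, and retirement limits, which is why the realizable gain is expected to be smaller.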
Old 2013-06-27, 16:03   #152
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19·397 Posts

Quote:
Originally Posted by Prime95 View Post
6) I also discovered that while an FMA takes 5 clocks, the result can only be used immediately on the same FPU port. If you want to use the result on the other FPU port, the latency is 6 clocks.
Correction. This is not the case - there was a bug in my test case code.
Old 2013-06-29, 22:00   #153
TheJudger
 
"Oliver"
Mar 2005
Germany

11×101 Posts

Hi,

Quote:
Originally Posted by Prime95 View Post
We seem to be stable at 4.2GHz, 1.2V, DDR3-2400. Temps in the mid-70s running large FFTs and 80 running small FFTs. When I tried running 4.4GHz at 1.25V, as some overclockers have done, the large torture test seemed to be OK with temps in the low 80s. I then tried the small FFT torture test. Instantly, temps spiked to 100 and throttling began.
I've jumped on the Haswell wagon, too. Temperatures are... really bad for me.
i7 4770k, Gigabyte Z87X-UD3H, 2x 8GiB DDR3-2133 (1.5V), Thermalright "HR-02 Macho Rev A", Noctua NF-P12 fan running at ~1250rpm, open on table (without chassis):
I've set voltage to "normal" instead of "auto". I'm running a self-compiled and configured HPL (MPI parallel Linpack), which has a higher efficiency (and power consumption) than LinX (which Windows users might know). At default clock rates (3.5GHz, turbo enabled, Hyper-Threading disabled) and memory @DDR3-2133 I get temperatures a little bit above 90°C and 201GFLOPS. OK, prime95 temperatures will be below that, but this doesn't feel comfortable.

Oliver
Old 2013-06-30, 03:47   #154
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

27AE₁₆ Posts

Quote:
Originally Posted by TheJudger View Post
Hi,

I've jumped on the Haswell wagon, too. Temperatures are... really bad for me.
i7 4770k, Gigabyte Z87X-UD3H, 2x 8GiB DDR3-2133 (1.5V), Thermalright "HR-02 Macho Rev A", Noctua NF-P12 fan running at ~1250rpm, open on table (without chassis):
I've set voltage to "normal" instead of "auto". I'm running a self-compiled and configured HPL (MPI parallel Linpack), which has a higher efficiency (and power consumption) than LinX (which Windows users might know). At default clock rates (3.5GHz, turbo enabled, Hyper-Threading disabled) and memory @DDR3-2133 I get temperatures a little bit above 90°C and 201GFLOPS. OK, prime95 temperatures will be below that, but this doesn't feel comfortable.

Oliver
I really have to suppose that there is something wrong in the heatsink interface, be it water block or air cooler. Another possibility is that your chip has worse than average Thermal Interface Material under the integrated heat spreader.

Also, what voltage does the "Normal" setting result in? Don't late-model Intel chips run in the 1.2 volt range?

EDIT: My FX-8350 (@ stock 4 GHz) draws a LOT more power than a Haswell chip, and stays in the mid-50s C on air cooling, on a warm day, with two substantial GPUs in the case. The CPU is running P-1 x8, and both GPUs are running mfaktc. I can't see how you would reach such scary temps if things were working correctly.

Last fiddled with by kladner on 2013-06-30 at 03:57

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.