mersenneforum.org  

Old 2016-09-16, 04:35   #122
VBCurtis
 

Post deleted; the Mlucas author answered better than I did.

Last fiddled with by VBCurtis on 2016-09-16 at 04:36
Old 2016-09-16, 04:52   #123
xathor
 

Quote:
Originally Posted by ewmayer
Mlucas currently only supports as high as AVX2 - the main point of getting the KNL dev-system was to allow us folks keen to add AVX-512 support to our codes a place to do that.

If you just auto-build the summer 2015 release using the simple instructions, you should end up with a working AVX2 binary. At that point see my previous note above for the cmd-line flag used to control/limit the threadcount. Suggest you try -nthread values 1,2,4,8,16,32,64, all at just the 4096K FFT length for now.

I'm still trying to work out an ssh-access issue ... David says I should not need a password on initial login, but I keep getting prompted for one. I asked him to simply create a temp-password for me to login and reset, but unlike us crazies he appears to keep sane 'no internet after ***pm' hours. :)
I'm doing that right now. I'm not primarily a programmer; I'm the lead systems admin for a couple of HPC clusters. I purchased a KNL development box as soon as they were available to see whether it would be a good upgrade for the cluster, but it neither scaled nor performed anywhere near as well as Intel claimed. The KNL system just sits idle now and I mess around with it occasionally.

As for the key issue, use ssh -i ~/.ssh/id_rsa (or whichever key you created). Your id_rsa.pub needs to be appended to ~/.ssh/authorized_keys for your login user on the remote machine. If you want to run an 'ssh -vvv' and PM me the output, I can take a look at it for you.
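For reference, a minimal sketch of that key setup with stock OpenSSH; 'user' and 'knl-box' below are placeholder names, not the real account or host:

# Install the public key on the remote machine (ssh-copy-id creates ~/.ssh and
# appends to authorized_keys with the right permissions); names are placeholders.
ssh-copy-id -i ~/.ssh/id_rsa.pub user@knl-box

# Or do the same by hand:
cat ~/.ssh/id_rsa.pub | ssh user@knl-box \
  'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'

# Then log in with the matching private key; -vvv prints the full debug trace.
ssh -vvv -i ~/.ssh/id_rsa user@knl-box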





100 iterations of M77597293 with FFT length 4194304 = 4096 K
Res64: 8CC30E314BF3E556. AvgMaxErr = 0.293024554. MaxErr = 0.328125000. Program: E14.1
Res mod 2^36 = 5569242454
Res mod 2^35 - 1 = 22305398329
Res mod 2^36 - 1 = 64001568053
Clocks = 00:00:02.610

/ **************************************************************************** /


Done ...


Edit: Missed a 0

1000 iterations of M77597293 with FFT length 4194304 = 4096 K
Res64: 5F87421FA9DD8F1F. AvgMaxErr = 0.292703043. MaxErr = 0.343750000. Program: E14.1
Res mod 2^36 = 67274379039
Res mod 2^35 - 1 = 26302807323
Res mod 2^36 - 1 = 54919604018
Clocks = 00:00:23.097

Last fiddled with by xathor on 2016-09-16 at 05:13
Old 2016-09-16, 05:46   #124
ewmayer

@xathor:

Thanks - still waiting to hear from the sysadmin before trying other ssh stuff.

Good, you got a build - you definitely want to use 1000 (or even 10000) iterations with multiple threads. You'll probably need the enhanced affinity-selection stuff I just added to my dev-branch code for decent parallel scaling, but it'll be interesting to see what the dumb-affinity mode of the current release does on KNL, anyway.
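As an aside, a sketch of that sweep as a shell loop. The flag names here (-fftlen in K, -iters, -nthread) are what I recall from the Mlucas self-test usage, so treat this as an assumption and verify them against your build's README/help output:

# Hypothetical thread-count sweep at the 4096K FFT length, 1000 iterations each;
# double-check the exact flag spellings against your Mlucas build before running.
for t in 1 2 4 8 16 32 64; do
    ./Mlucas -fftlen 4096 -iters 1000 -nthread $t | tee mlucas_nthread_${t}.log
done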
Old 2016-09-16, 10:05   #125
Lorenzo
 

Quote:
Originally Posted by xathor
1000 iterations of M77597293 with FFT length 4194304 = 4096 K
Res64: 5F87421FA9DD8F1F. AvgMaxErr = 0.292703043. MaxErr = 0.343750000. Program: E14.1
Res mod 2^36 = 67274379039
Res mod 2^35 - 1 = 26302807323
Res mod 2^36 - 1 = 54919604018
Clocks = 00:00:23.097
Ohhh, that is ~23 ms per iteration. The speed looks like 1 core of a Haswell CPU (but running mprime).

Could you please do trial factoring? It would be very interesting to see the result. As far as I know, mprime uses AVX for the TF job?!
Old 2016-09-16, 13:14   #126
airsquirrels
 

Sorry guys, I had some obligations yesterday evening that prevented me from debugging the SSH access. I should have that resolved today and the rest of the accounts set up. It sounds like we have another system we can test with now as well, which is great.

To clarify, I have not seen the thread flip-flop issue manifest itself on KNL, at least with an mprime workload. My summary of the performance is that it is realistically a nice, fast system, but not quite as game-changing as we thought it might be. At least not without software work.
Old 2016-09-16, 14:23   #127
xathor
 

Quote:
Originally Posted by Lorenzo
Ohhh, that is ~23 ms per iteration. The speed looks like 1 core of a Haswell CPU (but running mprime).

Could you please do trial factoring? It would be very interesting to see the result. As far as I know, mprime uses AVX for the TF job?!
I'm not sure what you guys mean by trial factoring.

Here is a Haswell (dual E5-2670v3 24c AVX2) for comparison:
1000 iterations of M77597293 with FFT length 4194304 = 4096 K
Res64: 5F87421FA9DD8F1F. AvgMaxErr = 0.292735259. MaxErr = 0.343750000. Program: E14.1
Res mod 2^36 = 67274379039
Res mod 2^35 - 1 = 26302807323
Res mod 2^36 - 1 = 54919604018
Clocks = 00:00:09.605

/ **************************************************************************** /


Done ...


Here is an Ivy Bridge (dual E5-2670v2 20c AVX) for comparison:

1000 iterations of M77597293 with FFT length 4194304 = 4096 K
Res64: 5F87421FA9DD8F1F. AvgMaxErr = 0.249028471. MaxErr = 0.312500000. Program: E14.1
Res mod 2^36 = 67274379039
Res mod 2^35 - 1 = 26302807323
Res mod 2^36 - 1 = 54919604018
Clocks = 00:00:08.712

/ **************************************************************************** /


Done ...

Last fiddled with by xathor on 2016-09-16 at 14:23
Old 2016-09-16, 14:29   #128
science_man_88
 

Quote:
Originally Posted by xathor
I'm not sure what you guys mean by trial factoring.

[...]
http://www.mersenne.org/various/math...rial_factoring
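The gist of that page: a factor q of M_p = 2^p - 1 must have the form q = 2kp + 1 (and q ≡ ±1 mod 8), and testing a candidate is just a modular exponentiation — q divides M_p exactly when 2^p mod q = 1. A toy shell/bc sketch using the tiny, known example M11 = 2047 = 23 × 89 (real TF uses powering-mod rather than computing 2^p in full):

# Candidate factor of M11 with k = 1: q = 2*k*p + 1 = 23.
p=11; k=1; q=$((2*k*p + 1))
echo "2^$p % $q" | bc    # prints 1, so q = 23 divides 2^11 - 1 = 2047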
Old 2016-09-16, 15:35   #129
Madpoo

Quote:
Originally Posted by xathor
From my understanding and reading the Colfax documentation, this isn't true.

The processor forces itself to flip between two of the four threads constantly, every clock cycle. Running across 64 cores (with CPU affinity bound to each physical core) will only be about 50% efficient, as the processor will sleep because no task is assigned to the other thread. Unlike other Xeon processors, which dynamically assign tasks to threads, the Knights Landing architecture FORCES a core to flip between two threads.
If that's correct, I don't know how that's entirely different from hyperthreading in the traditional sense. For example, a Xeon hyperthread means you have a physical core with an extra (mainly integer operation) pipeline that presents itself to the OS as another CPU. Under the hood, it shares the same cache, registers (?) and other things. The advantage is being able to use it to do certain operations at the same time.

With Xeon Phi x200 you have the physical cores (64 on the lower-end models we're talking about). What you really have are 32 "tiles", each with 2 of those Atom cores, and each tile has its own L2 cache that the 2 cores share.

Meanwhile, each of those 64 cores has a pair of vector processing units (VPUs), which I think is where we're getting the "128 cores" notion. In my mind I see that as analogous to the hyperthread integer pipeline in old Xeons... it's just more sophisticated now and can do floating-point/AVX/SSE things too.

In theory (and hopefully in practice) it shouldn't matter which VPU on the core is handling the work, just like it doesn't matter currently which physical or HT core you're affining to on other chips, because it's really the same thing, just a different pipeline.

What *would* matter is how the CPUs map to the operating system. It might be better to have the 2 cores on the same tile, and the 4 VPUs on that same tile, working on the same worker (if it's a multithreaded worker). There's probably very little data the 4 threads would need to share with each other, but if, as you say, the CPU itself makes its own decisions on which VPU will handle things, at least if they're sharing an L2 cache then it shouldn't matter.

For that, it would be best to turn off the all-to-all L2 coherence mode, since we'd be deliberately using affinity to keep core consistency during any operation.

FYI, each core can handle 4 threads, so it may appear to the OS as 256 cores, but there are really only 128 VPUs, which is what matters most for Mersenne prime hunting. I don't know what the capabilities of the 4 threads per core are... I'm guessing those are mainly integer, just like current hyperthreading. Useful for some things, not for others.

Making sure Prime95 worker threads map properly to cores with their own VPU, and keeping them affined to it, seems crucial. In your testing, were you able to do that successfully? It seems like that would be very dependent on the program itself... if the OS is handling CPU loads or letting the CPU itself assign cores to requests, it's probably not doing as good a job as if you micromanaged that aspect.
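On the micromanagement point, a Linux-side sketch using standard tools; the logical CPU numbers and the ./worker binary are placeholders, so check how the kernel enumerates the KNL's hardware threads first:

# See which logical CPUs are hardware-thread siblings on the same physical core.
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# Pin a hypothetical 4-thread worker to the 4 hardware threads of one core;
# replace the CPU list with whatever the sibling list above reports.
taskset -c 0,64,128,192 ./worker

# Or bind a worker to a range of physical cores and let it spawn its threads there.
numactl --physcpubind=0-3 ./worker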
Old 2016-09-16, 16:17   #130
retina

Quote:
Originally Posted by Madpoo
For example, a Xeon hyperthread means you have a physical core with an extra (mainly integer operation) pipeline that presents itself to the OS as another CPU.
Not quite. It is just another CPU state: an extra set of RAX-R15, RIP, RFLAGS, etc. All the caches and computing resources are completely shared; it is just the register set, flags and current state that are duplicated. Not too dissimilar from standard software threading, just done at the hardware level instead.
Old 2016-09-16, 16:21   #131
xathor
 

Quote:
Originally Posted by Madpoo
If that's correct, I don't know how that's entirely different from hyperthreading in the traditional sense. For example, a Xeon hyperthread means you have a physical core with an extra (mainly integer operation) pipeline that presents itself to the OS as another CPU. Under the hood, it shares the same cache, registers (?) and other things. The advantage is being able to use it to do certain operations at the same time.

[...]
Traditional hyperthreading is out of order and only used when needed. KNL hyperthreading is round robin in order.
Old 2016-09-16, 16:31   #132
ldesnogu
 

Quote:
Originally Posted by xathor
Traditional hyperthreading is out of order and only used when needed. KNL hyperthreading is round robin in order.
Knights Corner was round-robin for sure. Are you sure that is still the case? In a way it would make sense, since the Silvermont cores used in KNL were not designed with HT in mind.