mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2016-09-21, 15:32   #188
xathor
 
Sep 2016

238 Posts

I've got my KNL box idle again, is there an updated repo where I can test out your code branch?

I've also got a machine with four K20's idle.

Old 2016-09-21, 16:50   #189
ATH
Einyen

Dec 2003
Denmark

2·17·101 Posts

You should do some trial factoring on those K20's, or even some LL testing. I believe they have decent double-precision performance?

You can get trial factoring assignments through GPU72: https://www.gpu72.com

Old 2016-09-21, 21:35   #190
ewmayer
2ω=0

Sep 2002
República de California

5·2,351 Posts

Quote:
Originally Posted by xathor View Post
I've got my KNL box idle again, is there an updated repo where I can test out your code branch?
To whose code are you referring?

Old 2016-09-22, 03:10   #191
ewmayer
2ω=0

Sep 2002
República de California

5·2,351 Posts

I didn't want to do any lengthy runs on the shared-dev KNL so as not to interfere with any timings my fellow developers may be doing, but I do need to do some multi-hour runs to test various aspects of optimal configuration for production-mode LL testing. If any other users of the system need to do some unloaded timings, please let me know (or ask David s. to kill my stuff).

Some further notes on LL testing, based on my experiments with some DC exponents, all slightly larger than 40M, i.e. using FFT length 2304K:

1. I got no gain from trying to pin threads to virtual cores with index stagger > 1; e.g., for a 16-threaded run, setting affinity to the default cores 0:15 is as fast as or faster than anything else I tried. Timings get worse at larger staggers: using cores 0:255:16 (i.e. cores 0,16,32,...,240) runs only half as fast as 0:15.
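For anyone unfamiliar with the lo:hi[:stride] core-list notation used above, it expands like this (a hypothetical helper, not part of Mlucas, just illustrating the syntax as described):

```python
def parse_cpu_spec(spec):
    """Expand an Mlucas-style -cpu core spec 'lo:hi[:stride]' into a
    list of logical core indices. hi is inclusive; stride defaults to 1."""
    parts = [int(p) for p in spec.split(":")]
    lo, hi = parts[0], parts[1]
    stride = parts[2] if len(parts) > 2 else 1
    return list(range(lo, hi + 1, stride))

print(parse_cpu_spec("0:15"))      # the default 16-core set: 0, 1, ..., 15
print(parse_cpu_spec("0:255:16"))  # stagger-16 set: 0, 16, 32, ..., 240
```

Both specs name 16 cores; the second just spreads them across all 256 virtual cores, which is what makes the timings worse here.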

2. Running just one job 16-threaded with default affinity (cores 0:15), 'top' shows ~1200% utilization; upping that to 4 such jobs drops each job's utilization to ~300%. To make sure this wasn't some OS quirk I let all 4 same-affinity-set jobs run to the next checkpoint write, and indeed the timings quadrupled relative to the single job's 0.0090 sec/iter.

3. At that point I killed jobs 2-4, left #1 running on cores 0:15, and restarted 2-4 on core sets 16:31, 32:47, and 48:63, respectively, using the new -cpu affinity-setting option of the dev-branch Mlucas code. Now the utilizations look as hoped for, with each job running at around the same ~1200% as the single 16-thread job. In other words, the enhanced affinity-setting option is mandatory for such multiple multithreaded jobs.
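For anyone wanting to reproduce this pinning effect outside Mlucas: on Linux the same per-job core restriction can be done from Python via os.sched_setaffinity (a sketch, assuming a Linux box; it splits whatever cores the OS currently grants the process rather than hard-coding core numbers):

```python
import os

def pin_to(cores):
    """Pin the calling process to the given logical cores (Linux only),
    analogous in spirit to the Mlucas -cpu option described above."""
    os.sched_setaffinity(0, cores)
    return sorted(os.sched_getaffinity(0))

# Split the cores the OS currently allows us in half and pin this process
# to the first half; a second job would be given the other half, so the
# two jobs' core sets do not overlap.
allowed = sorted(os.sched_getaffinity(0))
first_half = allowed[: max(1, len(allowed) // 2)]
print(pin_to(first_half) == first_half)   # True: affinity now restricted
```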

4. Here is the first checkpoint output line from the first job, running all by itself:

[Sep 21 21:07:13] M40****** Iter# = 100000 [ 0.24% complete] [ 0.0090 sec/iter]

And here with all 4 jobs running on the above nonoverlapping core 16-tets:

[Sep 21 22:36:40] M40****** Iter# = 400000 [ 0.98% complete] [ 0.0091 sec/iter]
[Sep 21 22:26:58] M40****** Iter# = 200000 [ 0.49% complete] [ 0.0090 sec/iter]
[Sep 21 22:29:49] M40****** Iter# = 200000 [ 0.49% complete] [ 0.0091 sec/iter]
[Sep 21 22:30:55] M40****** Iter# = 200000 [ 0.49% complete] [ 0.0092 sec/iter]

5. Next I killed all 4 jobs and restarted with each running 32-threaded, i.e. a total of 128 threads on 64 physical cores. Note that a quirk (polite term for 'bug') in the code's options handling requires the user to supply an exponent matching the one in the topmost line of the local worktodo.ini file in order to also allow the core setting:

cd run0 && nice ../Mlucas -m 40****** -cpu 0:31 &
cd run1 && nice ../Mlucas -m 40****** -cpu 32:63 &
cd run2 && nice ../Mlucas -m 40****** -cpu 64:95 &
cd run3 && nice ../Mlucas -m 40****** -cpu 96:127 &

'top' now shows each job running at ~1800%, but the next-checkpoint summary outputs show things running ~5% slower for each job. So I reverted to 4 x 16-thread and will let those DCs run overnight as a stability-under-load test. Each run is proceeding at ~10 million iters/day, thus with 4 of them going I'm getting around one DC-per-day throughput. Need to at least double that!
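The throughput arithmetic above checks out (assuming, as for any LL test, roughly p iterations for an exponent p of ~40M):

```python
SEC_PER_DAY = 86_400
sec_per_iter = 0.0090     # per-job timing from the checkpoint lines above
exponent = 40_000_000     # DC exponents slightly above 40M; an LL test is ~p iterations

iters_per_day = SEC_PER_DAY / sec_per_iter   # ~9.6M, matching "~10 million iters/day"
days_per_test = exponent / iters_per_day     # ~4.2 days for one DC per job
throughput = 4 / days_per_test               # ~0.96 DC/day across the 4 jobs
print(iters_per_day, days_per_test, throughput)
```

So "around one DC-per-day" is right on the nose, and doubling it means roughly halving sec/iter or doubling the number of effective jobs.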

Old 2016-09-22, 21:26   #192
Madpoo
Serpentine Vermin Jar

Jul 2014

5×677 Posts

Quote:
Originally Posted by ewmayer View Post
... Need to at least double that!
I'm pinning a lot of hope on AVX-512 helping out a lot... maybe I shouldn't get my hopes up too much, but it will be nice to see some numbers start coming out for that part of it. Apples-to-apples, single core, single worker: how does AVX compare to AVX-512, that kind of thing.

Old 2016-09-23, 03:53   #193
bsquared
"Ben"
Feb 2007

7×13×41 Posts

Quote:
Originally Posted by bsquared View Post
Yafu's QS code runs about 4x faster on a Xeon Phi card (KNC) compared to a CPU... and that is comparing generic C code on the Phi with hand-optimized AVX2 assembly code on a Haswell CPU. I hope to make it 20x or better with vector intrinsics on a Phi (and possibly tuning of the block size and other parameters). KNL will be much better yet. These cards promise to be massive sieving engines.
Ok, so 20x was pushing it, but I have managed to get yafu to run about 6x faster on a Phi versus a turbo-clocked Haswell core (3.6 GHz). Just completed a C110 in just under 3 hours (sieving only) versus ~19 hrs on the Haswell. NFS takes about 5 hrs (all-CPU NFS). It would be really interesting to test on a KNL, but I would need to do a bunch of intrinsics porting first.
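As a quick sanity check on the quoted speedup (sieving times only, taking the two wall-clock figures above at face value):

```python
haswell_hrs = 19.0   # C110 sieving time on the turbo-clocked Haswell core
phi_hrs = 3.0        # same C110 sieving time on the Phi
speedup = haswell_hrs / phi_hrs
print(round(speedup, 1))   # ~6.3, consistent with "about 6x faster"
```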

Old 2016-09-23, 05:18   #194
LaurV
Romulan Interpreter
"name field"
Jun 2011
Thailand

10,273 Posts

Quote:
Originally Posted by xathor View Post
I've also got a machine with four K20's idle.
Linux or Windows? Do you mind giving someone access to it? I may not have any idea about the Phi, but I can squeeze all the juice from a GPU, for sure...

Old 2016-09-23, 05:18   #195
ewmayer
2ω=0

Sep 2002
República de California

5·2,351 Posts

Quote:
Originally Posted by bsquared View Post
Ok, so 20x was pushing it, but I have managed to get yafu to run about 6x faster on a Phi versus a turbo clocked haswell core (3.6 GHz). Just completed a C110 in just under 3 hours (sieving only) versus ~ 19 hrs on the haswell. NFS takes about 5 hrs (all cpu nfs). It would be really interesting to test on a KNL but would need to do a bunch of intrinsics porting first.
You can't just build the AVX2 code and run it on KNL by way of a first-look comparison, figuring at least a 2x gain from using AVX-512?

Old 2016-09-23, 12:57   #196
bsquared
"Ben"
Feb 2007

7×13×41 Posts

Quote:
Originally Posted by ewmayer View Post
You can't just build the AVX2 code and run it on KNL by way of a first-look comparison, figuring at least a 2x gain from using AVX-512?
I could, yes. Just excited to try out VPCONFLICTD/VPSCATTERDD in the context of sieving.
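For context on why those two instructions matter for sieving: a plain vector scatter (VPSCATTERDD) of bucket updates silently drops lanes whose indices collide within one vector (last write wins), and VPCONFLICTD is what lets you detect the colliding lanes and retry them. Here is a scalar model of that conflict-resolved scatter loop, purely illustrative (the function name and data layout are made up, not yafu code):

```python
def scatter_with_conflicts(counts, idx, vals):
    """Scalar model of a VPCONFLICTD + VPSCATTERDD update loop.
    A naive scatter of 'counts[idx[i]] += vals[i]' would lose updates
    whenever two lanes share a bucket index, so each pass applies only
    the conflict-free lanes and masks the rest off for the next pass."""
    pending = list(range(len(idx)))
    while pending:
        seen = {}
        ready, still = [], []
        for i in pending:
            if idx[i] in seen:      # conflict: same bucket as an earlier lane
                still.append(i)     # masked off, retried in the next pass
            else:
                seen[idx[i]] = i
                ready.append(i)
        for i in ready:             # these lanes could scatter in parallel
            counts[idx[i]] += vals[i]
        pending = still
    return counts

# Three of the four lanes target bucket 1; all three updates survive.
print(scatter_with_conflicts([0, 0, 0, 0], [1, 1, 3, 1], [1, 1, 1, 1]))
```

The AVX-512 versions of these steps would be the _mm512_conflict_epi32 and _mm512_i32scatter_epi32 intrinsics, with the per-pass lane selection done via mask registers instead of Python lists.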

Old 2016-09-23, 20:36   #197
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

8150₁₀ Posts

Quote:
Originally Posted by ewmayer View Post
cd run0 && nice ../Mlucas -m 40****** -cpu 0:31 &
cd run1 && nice ../Mlucas -m 40****** -cpu 32:63 &
cd run2 && nice ../Mlucas -m 40****** -cpu 64:95 &
cd run3 && nice ../Mlucas -m 40****** -cpu 96:127 &
Do you mind if I stop/resume these every now and then to run a quick timing test?

Old 2016-09-23, 21:45   #198
ewmayer
2ω=0

Sep 2002
República de California

5·2,351 Posts

Quote:
Originally Posted by Prime95 View Post
Do you mind if I stop/resume these every now and then to run a quick timing test?
No, please do - you have the needed sudo access, yes? I want to run at least one batch of DCs to completion to check the results vs the 1st-time tests.