2019-02-22, 23:44   #11
ewmayer
2ω=0
 
 
Sep 2002
República de California

2D8A₁₆ Posts

Quote:
Originally Posted by ATH View Post
-DCARRY_16_WAY is not needed in v18 right?
Correct - if you open platform.h and search for CARRY_16_WAY you'll see it's now on by default for avx-512 builds.

Quote:
This time all 18 cores was fastest for some reason.
What are your best-radix-set timings for 8 and 9 threads at 4608K? I'm curious how much more ||ism we're getting at the higher 16- and 18-thread counts.

Quote:
From the README.html should this be -cpu 0:n-1 ?
That snippet does need an edit, but of a different kind. The section in question is describing (or attempting to :) the simplest way to maximize total system throughput on most multicore x86 systems: 1 LL test per physical core, with each such job using 2 threads on Intel hyperthreaded CPUs and 1 thread otherwise (Intel non-HT, AMD, ARM, etc.). Because of the way Intel numbers its logical cores, on a system with n physical cores, logical cores j and n+j map to physical core j, for j = 0,...,n-1. So to generate a proper mlucas.cfg file for such a setup, one should use -cpu 0,n (note: comma, not colon), then copy the resulting cfg file to each of the n run directories, each of which will host one such 2-thread-on-1-physical-core job.
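To make the j / n+j pairing concrete, here is a small Python sketch that lists one -cpu argument per physical core. It assumes the Intel-style logical-core numbering described above; the function name ht_cpu_args is made up for illustration and is not part of Mlucas.

```python
def ht_cpu_args(n_physical):
    """Return one '-cpu' argument string per physical core.

    Assumes logical cores j and n+j share physical core j
    (the Intel hyperthreading numbering described in the post).
    """
    if n_physical < 1:
        raise ValueError("need at least one physical core")
    return [f"{j},{n_physical + j}" for j in range(n_physical)]


if __name__ == "__main__":
    # Dual-core HT system: one 2-thread job per physical core.
    for arg in ht_cpu_args(2):
        print("-cpu", arg)
```

For n = 2 this prints -cpu 0,2 and -cpu 1,3. It is worth checking the actual sibling numbering on a given machine before relying on it, since the j / n+j layout is an assumption and not a guarantee.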

From a job-management perspective it's of course easier to just run 1 job using all the physical cores, and as long as n <= 4 one won't sacrifice much total throughput by doing so. So on both my non-HT Intel quad Haswell and my quad-ARM64-core Odroid C2 I use -cpu 0:3. I use the same setting on my HT-enabled dual-core Intel Broadwell NUC: there I want 2 threads per physical core, and a single 4-thread job gives me nearly the same throughput as separate jobs using -cpu 0,2 and -cpu 1,3.
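The difference between the colon and comma forms can be spelled out with a short Python sketch. It models only the two forms used in this post (lo:hi as an inclusive range, a,b as an explicit list) and is not Mlucas's actual argument parser; expand_cpu_arg is a hypothetical helper name.

```python
def expand_cpu_arg(spec):
    """Expand a '-cpu' style string into a list of logical core indices.

    Simplified model, assumed from the examples in the post:
      'lo:hi' -> every core from lo through hi inclusive
      'a,b'   -> exactly the listed cores
    Any other form raises ValueError.
    """
    cores = []
    for part in spec.split(","):
        if ":" in part:
            lo, hi = (int(x) for x in part.split(":"))
            cores.extend(range(lo, hi + 1))
        else:
            cores.append(int(part))
    return cores


if __name__ == "__main__":
    print(expand_cpu_arg("0:3"))  # four logical cores: 0, 1, 2, 3
    print(expand_cpu_arg("0,3"))  # two logical cores: 0 and 3
```

Under this model 0:3 selects four logical cores while 0,3 selects only two, which is why swapping one character changes the thread count and the cores the job lands on.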

I need to carefully re-read the README.html page to catch any remaining comma-versus-colon mixups, because they are easy to overlook.