#144 | |
|
Aug 2002
7×1,237 Posts |
Quote:
#145 |
|
"Oliver"
Sep 2017
Porta Westfalica, DE
23×71 Posts |
I guess no current application will scale nicely on Threadripper or Ryzen 1xxx because of their CCXs. When I try to scale mlucas, Prime95, etc. across more than one CCX on my 1950X, performance and efficiency take a big hit and power consumption drops.
#146 |
|
Aug 2002
7·1,237 Posts |
We had no idea what a CCX was, so we found this: https://www.tomshardware.com/reviews...ined,6338.html
Is there a way to lock a process and its threads to a particular CCX?
#147 | |
|
"Kieren"
Jul 2011
In My Own Galaxy!
2×3×1,693 Posts |
Quote:
#148 | |
|
"Curtis"
Feb 2005
Riverside, CA
2×2,927 Posts |
Quote:
I invoke msieve with:
taskset -c 0-11 ./msieve -t 12 -nc2
That puts the 12 msieve threads on 12 different cores (on my machine, 12-23 are hyperthreads of 0-11). Once you know which threads are part of a single CCX, you can use taskset to hit just those threads. You can also use commas if the threads aren't contiguous: taskset -c 0-2,6-8 will run 6-threaded, if you wish for some reason (e.g. to see whether using all cores + hyperthreads on a single CCX gains you anything).
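To double-check which logical CPUs a list like 0-2,6-8 actually covers before handing it to taskset, a small helper can expand it. This is only a sketch: the function name expand_cpu_list is made up for illustration, and it handles plain numbers and ranges only, not every syntax taskset may accept.

```python
def expand_cpu_list(spec: str) -> list[int]:
    """Expand a taskset-style CPU list such as "0-2,6-8" into sorted CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"bad range: {part}")
            cpus.update(range(low, high + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


if __name__ == "__main__":
    print(expand_cpu_list("0-2,6-8"))    # the 6-thread example above
    print(len(expand_cpu_list("0-11")))  # number of CPUs in the 12-thread example
```

The thread count passed to msieve with -t should normally match the length of the expanded list.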
#149 | ||
|
"Oliver"
Sep 2017
Porta Westfalica, DE
23×71 Posts |
Quote:
When I run a benchmark in Prime95, it writes detailed topology information from the hwloc library into results.bench.txt:
Code:
AMD Ryzen Threadripper 1950X 16-Core Processor
CPU speed: 3432.97 MHz, 16 hyperthreaded cores
CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 16x32 KB, L2 cache size: 16x512 KB, L3 cache size: 4x8 MB
L1 cache line size: 64 bytes, L2 cache line size: 64 bytes
Machine topology as determined by hwloc library:
Machine#0 (total=111520316KB, Backend=Windows, hwlocVersion=2.0.4, ProcessName=prime95.exe)
Package (total=111520316KB, CPUVendor=AuthenticAMD, CPUFamilyNumber=23, CPUModelNumber=1, CPUModel="AMD Ryzen Threadripper 1950X 16-Core Processor ", CPUStepping=1)
Group0 (total=64141560KB)
L3 (size=8192KB, linesize=64, ways=16, Inclusive=0)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00000003)
PU#0 (cpuset: 0x00000001)
PU#1 (cpuset: 0x00000002)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x0000000c)
PU#2 (cpuset: 0x00000004)
PU#3 (cpuset: 0x00000008)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00000030)
PU#4 (cpuset: 0x00000010)
PU#5 (cpuset: 0x00000020)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x000000c0)
PU#6 (cpuset: 0x00000040)
PU#7 (cpuset: 0x00000080)
L3 (size=8192KB, linesize=64, ways=16, Inclusive=0)
// here starts a new CCX!
<snip>
If you then set Prime95 to use workers with a maximum of 4 cores, that should be fine. Interestingly, I get the best throughput with one DC exponent per physical core. That's totally different on my Intel machines. Your 1920X should have 4 CCXs with 3 cores each:
Quote:
#150 |
|
Aug 2002
7×1,237 Posts |
Here is the hwloc info for our CPU:
Code:
AMD Ryzen Threadripper 1920X 12-Core Processor
CPU speed: 3493.43 MHz, 12 hyperthreaded cores
CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 12x32 KB, L2 cache size: 12x512 KB, L3 cache size: 4x8 MB
L1 cache line size: 64 bytes, L2 cache line size: 64 bytes
Machine topology as determined by hwloc library:
Machine#0 (total=7457252KB, Backend=Windows, hwlocVersion=2.0.4, ProcessName=prime95.exe)
Package (total=7457252KB, CPUVendor=AuthenticAMD, CPUFamilyNumber=23, CPUModelNumber=1, CPUModel="AMD Ryzen Threadripper 1920X 12-Core Processor ", CPUStepping=1)
L3 (size=8192KB, linesize=64, ways=16, Inclusive=0)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00000003)
PU#0 (cpuset: 0x00000001)
PU#1 (cpuset: 0x00000002)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x0000000c)
PU#2 (cpuset: 0x00000004)
PU#3 (cpuset: 0x00000008)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00000030)
PU#4 (cpuset: 0x00000010)
PU#5 (cpuset: 0x00000020)
L3 (size=8192KB, linesize=64, ways=16, Inclusive=0)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x000000c0)
PU#6 (cpuset: 0x00000040)
PU#7 (cpuset: 0x00000080)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00000300)
PU#8 (cpuset: 0x00000100)
PU#9 (cpuset: 0x00000200)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00000c00)
PU#10 (cpuset: 0x00000400)
PU#11 (cpuset: 0x00000800)
L3 (size=8192KB, linesize=64, ways=16, Inclusive=0)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00003000)
PU#12 (cpuset: 0x00001000)
PU#13 (cpuset: 0x00002000)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x0000c000)
PU#14 (cpuset: 0x00004000)
PU#15 (cpuset: 0x00008000)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00030000)
PU#16 (cpuset: 0x00010000)
PU#17 (cpuset: 0x00020000)
L3 (size=8192KB, linesize=64, ways=16, Inclusive=0)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x000c0000)
PU#18 (cpuset: 0x00040000)
PU#19 (cpuset: 0x00080000)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00300000)
PU#20 (cpuset: 0x00100000)
PU#21 (cpuset: 0x00200000)
L2 (size=512KB, linesize=64, ways=8, Inclusive=1)
L1d (size=32KB, linesize=64, ways=8, Inclusive=0)
Core (cpuset: 0x00c00000)
PU#22 (cpuset: 0x00400000)
PU#23 (cpuset: 0x00800000)
Code:
╔═════╦══════════╦══════════╗
║ CCX ║ CORE     ║ HT       ║
╠═════╬══════════╬══════════╣
║ 1   ║ 0 2 4    ║ 1 3 5    ║
╠═════╬══════════╬══════════╣
║ 2   ║ 6 8 10   ║ 7 9 11   ║
╠═════╬══════════╬══════════╣
║ 3   ║ 12 14 16 ║ 13 15 17 ║
╠═════╬══════════╬══════════╣
║ 4   ║ 18 20 22 ║ 19 21 23 ║
╚═════╩══════════╩══════════╝
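The same mapping can be generated instead of written out by hand. The sketch below assumes the numbering seen in the hwloc output above, where the two hardware threads of a core get adjacent PU numbers and cores are numbered CCX by CCX. That is not universal (the Linux machine mentioned earlier in the thread numbers hyperthreads 12-23 after cores 0-11), and the function name ccx_layout and its defaults are illustrative only.

```python
def ccx_layout(n_ccx=4, cores_per_ccx=3, threads_per_core=2):
    """Return (ccx_number, first_thread_of_each_core, sibling_threads) per CCX.

    Assumes PUs are numbered consecutively with sibling threads adjacent,
    as in the hwloc listing for the 1920X above.
    """
    layout = []
    pu = 0
    for ccx in range(1, n_ccx + 1):
        cores, siblings = [], []
        for _ in range(cores_per_ccx):
            cores.append(pu)  # first hardware thread of this core
            siblings.extend(range(pu + 1, pu + threads_per_core))
            pu += threads_per_core
        layout.append((ccx, cores, siblings))
    return layout


if __name__ == "__main__":
    for ccx, cores, siblings in ccx_layout():
        core_list = ",".join(map(str, cores))
        ht_list = ",".join(map(str, siblings))
        print(f"CCX {ccx}: taskset -c {core_list}   (HT siblings: {ht_list})")
```

Joining the core lists of two CCXs gives the CPU list for a whole die, for example CCX 3 and CCX 4 together.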
#151 |
|
Aug 2002
7·1,237 Posts |
Possibly useful?
https://www.passmark.com/forum/perfo...d-threadripper
https://blog.michael.kuron-germany.d...-and-htcondor/
More hardware info: https://docs.microsoft.com/en-us/sys...loads/coreinfo
Code:
AMD Ryzen Threadripper 1920X 12-Core Processor
AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
Microcode signature: 00000000
HTT * Multicore
HYPERVISOR - Hypervisor is present
VMX - Supports Intel hardware-assisted virtualization
SVM * Supports AMD hardware-assisted virtualization
X64 * Supports 64-bit mode
SMX - Supports Intel trusted execution
SKINIT * Supports AMD SKINIT
NX * Supports no-execute page protection
SMEP * Supports Supervisor Mode Execution Prevention
SMAP * Supports Supervisor Mode Access Prevention
PAGE1GB * Supports 1 GB large pages
PAE * Supports > 32-bit physical addresses
PAT * Supports Page Attribute Table
PSE * Supports 4 MB pages
PSE36 * Supports > 32-bit address 4 MB pages
PGE * Supports global bit in page tables
SS - Supports bus snooping for cache operations
VME * Supports Virtual-8086 mode
RDWRFSGSBASE * Supports direct GS/FS base access
FPU * Implements i387 floating point instructions
MMX * Supports MMX instruction set
MMXEXT * Implements AMD MMX extensions
3DNOW - Supports 3DNow! instructions
3DNOWEXT - Supports 3DNow! extension instructions
SSE * Supports Streaming SIMD Extensions
SSE2 * Supports Streaming SIMD Extensions 2
SSE3 * Supports Streaming SIMD Extensions 3
SSSE3 * Supports Supplemental SIMD Extensions 3
SSE4a * Supports Streaming SIMDR Extensions 4a
SSE4.1 * Supports Streaming SIMD Extensions 4.1
SSE4.2 * Supports Streaming SIMD Extensions 4.2
AES * Supports AES extensions
AVX * Supports AVX instruction extensions
FMA * Supports FMA extensions using YMM state
MSR * Implements RDMSR/WRMSR instructions
MTRR * Supports Memory Type Range Registers
XSAVE * Supports XSAVE/XRSTOR instructions
OSXSAVE * Supports XSETBV/XGETBV instructions
RDRAND * Supports RDRAND instruction
RDSEED * Supports RDSEED instruction
CMOV * Supports CMOVcc instruction
CLFSH * Supports CLFLUSH instruction
CX8 * Supports compare and exchange 8-byte instructions
CX16 * Supports CMPXCHG16B instruction
BMI1 * Supports bit manipulation extensions 1
BMI2 * Supports bit manipulation extensions 2
ADX * Supports ADCX/ADOX instructions
DCA - Supports prefetch from memory-mapped device
F16C * Supports half-precision instruction
FXSR * Supports FXSAVE/FXSTOR instructions
FFXSR * Supports optimized FXSAVE/FSRSTOR instruction
MONITOR * Supports MONITOR and MWAIT instructions
MOVBE * Supports MOVBE instruction
ERMSB - Supports Enhanced REP MOVSB/STOSB
PCLMULDQ * Supports PCLMULDQ instruction
POPCNT * Supports POPCNT instruction
LZCNT * Supports LZCNT instruction
SEP * Supports fast system call instructions
LAHF-SAHF * Supports LAHF/SAHF instructions in 64-bit mode
HLE - Supports Hardware Lock Elision instructions
RTM - Supports Restricted Transactional Memory instructions
DE * Supports I/O breakpoints including CR4.DE
DTES64 - Can write history of 64-bit branch addresses
DS - Implements memory-resident debug buffer
DS-CPL - Supports Debug Store feature with CPL
PCID - Supports PCIDs and settable CR4.PCIDE
INVPCID - Supports INVPCID instruction
PDCM - Supports Performance Capabilities MSR
RDTSCP * Supports RDTSCP instruction
TSC * Supports RDTSC instruction
TSC-DEADLINE - Local APIC supports one-shot deadline timer
TSC-INVARIANT * TSC runs at constant rate
xTPR - Supports disabling task priority messages
EIST - Supports Enhanced Intel Speedstep
ACPI - Implements MSR for power management
TM - Implements thermal monitor circuitry
TM2 - Implements Thermal Monitor 2 control
APIC * Implements software-accessible local APIC
x2APIC - Supports x2APIC
CNXT-ID - L1 data cache mode adaptive or BIOS
MCE * Supports Machine Check, INT18 and CR4.MCE
MCA * Implements Machine Check Architecture
PBE - Supports use of FERR#/PBE# pin
PSN - Implements 96-bit processor serial number
PREFETCHW * Supports PREFETCHW instruction
Maximum implemented CPUID leaves: 0000000D (Basic), 8000001F (Extended).
Maximum implemented address width: 48 bits (virtual), 48 bits (physical).
Processor signature: 00800F11
Logical to Physical Processor Map:
**---------------------- Physical Processor 0 (Hyperthreaded)
--**-------------------- Physical Processor 1 (Hyperthreaded)
----**------------------ Physical Processor 2 (Hyperthreaded)
------**---------------- Physical Processor 3 (Hyperthreaded)
--------**-------------- Physical Processor 4 (Hyperthreaded)
----------**------------ Physical Processor 5 (Hyperthreaded)
------------**---------- Physical Processor 6 (Hyperthreaded)
--------------**-------- Physical Processor 7 (Hyperthreaded)
----------------**------ Physical Processor 8 (Hyperthreaded)
------------------**---- Physical Processor 9 (Hyperthreaded)
--------------------**-- Physical Processor 10 (Hyperthreaded)
----------------------** Physical Processor 11 (Hyperthreaded)
Logical Processor to Socket Map:
************************ Socket 0
Logical Processor to NUMA Node Map:
************************ NUMA Node 0
- NUMA Node 1
Calculating Cross-NUMA Node Access Cost...
Approximate Cross-NUMA Node Access Cost (relative to fastest):
00 01
00: 1.2 1.0
01: 0.0 0.0
Logical Processor to Cache Map:
**---------------------- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
**---------------------- Instruction Cache 0, Level 1, 64 KB, Assoc 4, LineSize 64
**---------------------- Unified Cache 0, Level 2, 512 KB, Assoc 8, LineSize 64
******------------------ Unified Cache 1, Level 3, 8 MB, Assoc 16, LineSize 64
--**-------------------- Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
--**-------------------- Instruction Cache 1, Level 1, 64 KB, Assoc 4, LineSize 64
--**-------------------- Unified Cache 2, Level 2, 512 KB, Assoc 8, LineSize 64
----**------------------ Data Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
----**------------------ Instruction Cache 2, Level 1, 64 KB, Assoc 4, LineSize 64
----**------------------ Unified Cache 3, Level 2, 512 KB, Assoc 8, LineSize 64
------**---------------- Data Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
------**---------------- Instruction Cache 3, Level 1, 64 KB, Assoc 4, LineSize 64
------**---------------- Unified Cache 4, Level 2, 512 KB, Assoc 8, LineSize 64
------******------------ Unified Cache 5, Level 3, 8 MB, Assoc 16, LineSize 64
--------**-------------- Data Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
--------**-------------- Instruction Cache 4, Level 1, 64 KB, Assoc 4, LineSize 64
--------**-------------- Unified Cache 6, Level 2, 512 KB, Assoc 8, LineSize 64
----------**------------ Data Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
----------**------------ Instruction Cache 5, Level 1, 64 KB, Assoc 4, LineSize 64
----------**------------ Unified Cache 7, Level 2, 512 KB, Assoc 8, LineSize 64
------------**---------- Data Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
------------**---------- Instruction Cache 6, Level 1, 64 KB, Assoc 4, LineSize 64
------------**---------- Unified Cache 8, Level 2, 512 KB, Assoc 8, LineSize 64
------------******------ Unified Cache 9, Level 3, 8 MB, Assoc 16, LineSize 64
--------------**-------- Data Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
--------------**-------- Instruction Cache 7, Level 1, 64 KB, Assoc 4, LineSize 64
--------------**-------- Unified Cache 10, Level 2, 512 KB, Assoc 8, LineSize 64
----------------**------ Data Cache 8, Level 1, 32 KB, Assoc 8, LineSize 64
----------------**------ Instruction Cache 8, Level 1, 64 KB, Assoc 4, LineSize 64
----------------**------ Unified Cache 11, Level 2, 512 KB, Assoc 8, LineSize 64
------------------**---- Data Cache 9, Level 1, 32 KB, Assoc 8, LineSize 64
------------------**---- Instruction Cache 9, Level 1, 64 KB, Assoc 4, LineSize 64
------------------**---- Unified Cache 12, Level 2, 512 KB, Assoc 8, LineSize 64
------------------****** Unified Cache 13, Level 3, 8 MB, Assoc 16, LineSize 64
--------------------**-- Data Cache 10, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------**-- Instruction Cache 10, Level 1, 64 KB, Assoc 4, LineSize 64
--------------------**-- Unified Cache 14, Level 2, 512 KB, Assoc 8, LineSize 64
----------------------** Data Cache 11, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------** Instruction Cache 11, Level 1, 64 KB, Assoc 4, LineSize 64
----------------------** Unified Cache 15, Level 2, 512 KB, Assoc 8, LineSize 64
Logical Processor to Group Map:
************************ Group 0
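Since each CCX has its own L3, the "Level 3" rows of the cache map above are enough to recover the CCX grouping. Here is a rough sketch that reads those rows from Coreinfo-style text; the regular expression is only matched against the line format shown in this post, and the name l3_groups is made up for illustration.

```python
import re

# A few lines copied from the cache map above, used as sample input.
SAMPLE = """\
**---------------------- Unified Cache 0, Level 2, 512 KB, Assoc 8, LineSize 64
******------------------ Unified Cache 1, Level 3, 8 MB, Assoc 16, LineSize 64
------******------------ Unified Cache 5, Level 3, 8 MB, Assoc 16, LineSize 64
------------******------ Unified Cache 9, Level 3, 8 MB, Assoc 16, LineSize 64
------------------****** Unified Cache 13, Level 3, 8 MB, Assoc 16, LineSize 64
"""

# Mask of '*' and '-' characters, then a description containing "Level N".
CACHE_LINE = re.compile(r"^([*-]+)\s+.*\bLevel (\d+)\b")


def l3_groups(text):
    """Return one list of logical processor numbers per L3 cache found in text."""
    groups = []
    for line in text.splitlines():
        match = CACHE_LINE.match(line.strip())
        if match and match.group(2) == "3":
            mask = match.group(1)
            groups.append([i for i, ch in enumerate(mask) if ch == "*"])
    return groups


if __name__ == "__main__":
    for number, cpus in enumerate(l3_groups(SAMPLE), start=1):
        print(f"L3 #{number}: logical processors {cpus[0]}-{cpus[-1]}")
```

Each resulting group is a candidate CPU set for pinning one worker to one CCX.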
#152 |
|
Aug 2002
7×1,237 Posts |
We set the memory interleaving option in the BIOS from "AUTO" to "CHANNEL".
Before:
Code:
Logical Processor to NUMA Node Map:
************************ NUMA Node 0
- NUMA Node 1
Calculating Cross-NUMA Node Access Cost...
Approximate Cross-NUMA Node Access Cost (relative to fastest):
00 01
00: 1.2 1.0
01: 0.0 0.0
After:
Code:
Logical Processor to NUMA Node Map:
************------------ NUMA Node 0
------------************ NUMA Node 1
Calculating Cross-NUMA Node Access Cost...
Approximate Cross-NUMA Node Access Cost (relative to fastest):
00 01
00: 1.0 1.2
01: 1.4 1.3
#153 | |
|
Aug 2002
7×1,237 Posts |
So to put a 6-thread job on only the "real" cores of the second CCD, we are using this command:
taskset -c 12,14,16,18,20,22 nice -19 ../msieve -ncr -t 6 -v target_density=120 -i 107331526897849_17m1.ini
After this job is done we will run the benchmark on 6 cores again. Note that our BIOS offers the following memory interleaving options: Quote:
#154 | |
|
"Oliver"
Sep 2017
Porta Westfalica, DE
23×71 Posts |
Quote:
Since you changed the node interleaving settings after this, hwloc was not able to detect the CCDs. But I think you got it right after that. In contrast to Prime95, y-cruncher runs much better with node interleaving activated.