mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Quad CPU setup question (https://www.mersenneforum.org/showthread.php?t=17783)

tcharron 2013-02-11 22:08

Quad CPU setup question
 
I have a Proliant DL580 G5 with four quad core xeon X7350s in it (16 cores total).

I am running 64 bit ubuntu.

When I run mprime with 16 tasks, I get poor performance. I think this is because each of the four chips has no level 3 cache; instead each has two 4 MB level 2 caches, with each cache shared by a pair of cores.

While the chips are not hyperthreaded, running 8 workers with 2 threads each gets better per-worker speed (about twice as fast, if not better). Running 4 workers with 4 threads each roughly doubles it again.

/proc/cpuinfo has 16 very similar copies to this:
[CODE]
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU X7350 @ 2.93GHz
stepping : 11
microcode : 0xb3
cpu MHz : 2933.353
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant
_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vm
x est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips : 5866.70
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual[/CODE]
and cpuid returns this:
[CODE] eax in eax ebx ecx edx
00000000 0000000a 756e6547 6c65746e 49656e69
00000001 000006fb 10040800 0004e3bd bfebfbff
00000002 05b0b101 005657f0 00000000 2cb43049
00000003 00000000 00000000 00000000 00000000
00000004 0c000122 01c0003f 0000003f 00000001
00000005 00000040 00000040 00000003 00002220
00000006 00000001 00000002 00000001 00000000
00000007 00000000 00000000 00000000 00000000
00000008 00000400 00000000 00000000 00000000
00000009 00000000 00000000 00000000 00000000
0000000a 07280202 00000000 00000000 00000503
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 20100800
80000002 65746e49 2952286c 6f655820 2952286e
80000003 55504320 20202020 20202020 58202020
80000004 30353337 20402020 33392e32 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 10008040 00000000
80000007 00000000 00000000 00000000 00000000
80000008 00003028 00000000 00000000 00000000

Vendor ID: "GenuineIntel"; CPUID level 10

Intel-specific functions:
Version 000006fb:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 15 -
Extended model 0
Stepping 11
Reserved 0

Extended brand string: "Intel(R) Xeon(R) CPU X7350 @ 2.93GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 16
Hyper threading siblings: 4

Feature flags bfebfbff:
FPU Floating Point Unit
VME Virtual 8086 Mode Enhancements
DE Debugging Extensions
PSE Page Size Extensions
TSC Time Stamp Counter
MSR Model Specific Registers
PAE Physical Address Extension
MCE Machine Check Exception
CX8 COMPXCHG8B Instruction
APIC On-chip Advanced Programmable Interrupt Controller present and enabled
SEP Fast System Call
MTRR Memory Type Range Registers
PGE PTE Global Flag
MCA Machine Check Architecture
CMOV Conditional Move and Compare Instructions
FGPAT Page Attribute Table
PSE-36 36-bit Page Size Extension
CLFSH CFLUSH instruction
DS Debug store
ACPI Thermal Monitor and Clock Ctrl
MMX MMX instruction set
FXSR Fast FP/MMX Streaming SIMD Extensions save/restore
SSE Streaming SIMD Extensions instruction set
SSE2 SSE2 extensions
SS Self Snoop
HT Hyper Threading
TM Thermal monitor
31 reserved

TLB and cache info:
b1: unknown TLB/cache descriptor
b0: unknown TLB/cache descriptor
05: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
57: unknown TLB/cache descriptor
56: unknown TLB/cache descriptor
49: unknown TLB/cache descriptor
30: unknown TLB/cache descriptor
b4: unknown TLB/cache descriptor
2c: unknown TLB/cache descriptor
Processor serial: 0000-06FB-0000-0000-0000-0000[/CODE]
I can't see how to tell which CPUs are on which chips, and I wonder if there is a smarter way to allocate them that would improve performance.

When I run mprime with 8 workers, it assigns the workers in pairs (1,2 and 3,4, etc.). When I run mprime with 4 workers, it assigns helpers to "any cpu". I don't want to run 16 workers due to the performance hit, so how do I go about figuring out the best way to set this up?
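
One way I could map logical CPUs to chips is to group the `physical id` and `core id` fields from /proc/cpuinfo. A rough sketch in Python (it assumes each stanza lists `processor` before `physical id` and `core id`, as in the output above):

```python
from collections import defaultdict

def package_map(cpuinfo_text):
    """Group logical processor numbers by (physical id, core id).

    Assumes each /proc/cpuinfo stanza lists 'processor' first,
    with 'physical id' and 'core id' following it.
    """
    topo = defaultdict(list)
    proc = phys = None
    for line in cpuinfo_text.splitlines():
        if ':' not in line:
            continue
        key, _, val = line.partition(':')
        key, val = key.strip(), val.strip()
        if key == 'processor':
            proc = int(val)
        elif key == 'physical id':
            phys = int(val)
        elif key == 'core id':
            # core id closes the fields we need for this stanza
            topo[(phys, int(val))].append(proc)
    return dict(topo)

# On a live system: print(package_map(open('/proc/cpuinfo').read()))
```

Logical CPUs that land in the same (physical id, core id) bucket share a core; same physical id means same chip.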

Dubslow 2013-02-11 22:20

You should be able to tell it to run each worker on a specific core. "any cpu" is fancy for leaving it to the OS, which is generally a Bad Idea. Run mprime -m, though unfortunately I can't remember which option is the useful one. One of them will let you set a specific affinity for each worker, so running 16 workers with specific affinities should work great.
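
If memory serves, this can also be set directly in local.txt with a per-worker Affinity line, something like (the section names and 0-based numbering here are from memory, so check undoc.txt for your version before relying on it):

[CODE][Worker #1]
Affinity=0

[Worker #2]
Affinity=1
[/CODE]

and so on up through [Worker #16]. Restart mprime after editing so it re-reads the file.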

garo 2013-02-11 23:42

How do you define poor performance? And when you say that 8 workers get double the performance, does that mean twice the total throughput or twice the per-worker speed?

tcharron 2013-02-12 00:52

I don't have exact timings, but something like this:
Running 16 workers gives about 0.095 s per iteration (on each of the 16 workers).
Running 8 workers with one helper each gives about 0.047 s per iteration (so a mild improvement in overall throughput).
Running 4 workers with 3 helpers each gives about 0.023 s per iteration.

In all cases I used "smart" affinity.

The 16 workers are each set up by mprime with affinity for a separate core.

With 8 workers, mprime assigns workers affinities of 1, 3, 5, 7, 9, 11, 13, and 15, with helper affinities (respectively) on 2, 4, 6, 8, 10, 12, 14, and 16. It specifically says '1 and 2' are a physical CPU, etc.

With 4 workers it doesn't provide any information about physical CPUs, and just seems to leave affinity to the OS. This is what I don't get: why doesn't it even try to assign affinities in this case?

It is my guess that 16 workers are slower than 8 because the L2 cache is too small per worker, but that is pure speculation. I also wonder if the 4 threads working on each of the 4 workers in the last case are inefficient due to the threads not being assigned to the same physical chip (and hence not sharing any cache).

tcharron 2013-02-12 00:59

I checked the timings again from a log on the system

16 workers * 1 thread each: 0.094-0.099 s/iter
8 workers * 2 threads each: 0.035-0.045 s/iter
4 workers * 4 threads each: 0.020-0.022 s/iter

I should also mention that I have 56 GB of RAM. Perhaps I should be running P-1 instead on some of the cores?

Dubslow 2013-02-12 01:02

[QUOTE=tcharron;329050]I checked the timings again from a log on the system

16 workers * 1 thread each: 0.094-0.099 s/iter
8 workers * 2 threads each: 0.035-0.045 s/iter
4 workers * 4 threads each: 0.020-0.022 s/iter

I should also mention that I have 56 GB of RAM. Perhaps I should be running P-1 instead on some of the cores?[/QUOTE]

YES P-1

It looks like 8x2 has the highest possible efficiency.
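
Taking the midpoints of the ranges you quoted, total throughput (workers times iterations per second) works out roughly like this. A quick Python sanity check, treating the midpoints as representative:

```python
# seconds per iteration, midpoint of each reported range
configs = {
    "16 workers x 1 thread": (16, 0.0965),
    "8 workers x 2 threads": (8,  0.040),
    "4 workers x 4 threads": (4,  0.021),
}

for name, (workers, sec_per_iter) in configs.items():
    total = workers / sec_per_iter  # total iterations/sec across all workers
    print(f"{name}: {total:.1f} iter/s")
```

which puts 8x2 at about 200 iter/s versus roughly 166 for 16x1 and 190 for 4x4.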

I would still be curious how 16 threads do if you set their affinities individually though.

tcharron 2013-02-12 01:14

BLAH. Previous timings are all wrong - there were inconsistent exponents in there and I need to redo my tests. Some were in the 30-million range, others in the 60-million range. Will follow up with more info.

Regarding the idea to force affinity, I'd try that, but I don't know which cores share a cache or a physical CPU. Not sure how to figure that out except by lots of trial and error.

Dubslow 2013-02-12 01:18

I'm not sure that will be an issue if every worker has its own core. Just as long as each of the 16 workers gets a unique affinity.

Prime95 2013-02-12 01:33

[QUOTE=tcharron;329050]I should also mention that I have 56Gig of ram. Perhaps I should be running p-1 instead on some of the cores (?)[/QUOTE]

First, do whatever interests you the most. Yes, GIMPS can use P-1 help, but LL and DC are also very valuable.

P-1 does not multithread quite as well as LL tests.

Mr. P-1 2013-02-12 04:39

[QUOTE=Prime95;329055]First, do whatever interests you the most.[/QUOTE]

The Golden Rule of GIMPS, IMO.

What interests some of us the most, however, is maximising the effect our contributions have toward the project throughput. If tcharron has the same interest, then P-1 is the way to go.

[QUOTE]P-1 does not multithread quite as well as LL tests.[/QUOTE]

Multithreading is less efficient all round.

Prime95 2013-02-12 05:36

[QUOTE=Mr. P-1;329067] If tcharron has the same interest, then P-1 is the way to go.
[/quote]

Not necessarily.

[quote]Multithreading is less efficient all round.[/QUOTE]

P-1 stage 2 does large muls and adds. The adds are not multi-threaded.
LL and DC only do muls. Thus, all LL and DC operations are multi-threaded.
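
In other words, Amdahl's law: if a fraction s of the stage 2 time goes to the serial adds, p threads give at most 1/(s + (1-s)/p) speedup. A toy illustration (the 10% serial fraction below is a made-up number, not a measurement):

```python
def amdahl_speedup(serial_fraction, threads):
    """Amdahl's law: upper bound on speedup with a serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)

# LL/DC work is all multiplies, so essentially fully threaded
print(amdahl_speedup(0.0, 4))   # 4.0 (linear scaling)

# P-1 stage 2 with a hypothetical 10% of time in un-threaded adds
print(amdahl_speedup(0.10, 4))  # ~3.08 (scaling already falls off)
```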

