mersenneforum.org > Great Internet Mersenne Prime Search > Software
Old 2013-02-11, 22:08   #1
tcharron
 
Jan 2013

2²·7 Posts
Quad CPU setup question

I have a Proliant DL580 G5 with four quad core xeon X7350s in it (16 cores total).

I am running 64 bit ubuntu.

When I run mprime with 16 tasks, I get poor performance. I think this is because each of the four chips has no level 3 cache, and instead has two 4 MB level 2 caches (with each 4 MB cache shared by two cores).

While the chips are not hyperthreaded, running 8 workers with 2 threads each gets better performance (about twice as good, if not better). Running 4 workers with 4 threads each gets another improvement of approximately double.

/proc/cpuinfo contains 16 entries very similar to this one:
Code:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X7350  @ 2.93GHz
stepping        : 11
microcode       : 0xb3
cpu MHz         : 2933.353
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips        : 5866.70
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
and cpuid returns this:
Code:
 eax in    eax      ebx      ecx      edx
00000000 0000000a 756e6547 6c65746e 49656e69
00000001 000006fb 10040800 0004e3bd bfebfbff
00000002 05b0b101 005657f0 00000000 2cb43049
00000003 00000000 00000000 00000000 00000000
00000004 0c000122 01c0003f 0000003f 00000001
00000005 00000040 00000040 00000003 00002220
00000006 00000001 00000002 00000001 00000000
00000007 00000000 00000000 00000000 00000000
00000008 00000400 00000000 00000000 00000000
00000009 00000000 00000000 00000000 00000000
0000000a 07280202 00000000 00000000 00000503
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 20100800
80000002 65746e49 2952286c 6f655820 2952286e
80000003 55504320 20202020 20202020 58202020
80000004 30353337 20402020 33392e32 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 10008040 00000000
80000007 00000000 00000000 00000000 00000000
80000008 00003028 00000000 00000000 00000000

Vendor ID: "GenuineIntel"; CPUID level 10

Intel-specific functions:
Version 000006fb:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 15 -
Extended model 0
Stepping 11
Reserved 0

Extended brand string: "Intel(R) Xeon(R) CPU           X7350  @ 2.93GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 16
Hyper threading siblings: 4

Feature flags bfebfbff:
FPU    Floating Point Unit
VME    Virtual 8086 Mode Enhancements
DE     Debugging Extensions
PSE    Page Size Extensions
TSC    Time Stamp Counter
MSR    Model Specific Registers
PAE    Physical Address Extension
MCE    Machine Check Exception
CX8    COMPXCHG8B Instruction
APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
SEP    Fast System Call
MTRR   Memory Type Range Registers
PGE    PTE Global Flag
MCA    Machine Check Architecture
CMOV   Conditional Move and Compare Instructions
FGPAT  Page Attribute Table
PSE-36 36-bit Page Size Extension
CLFSH  CFLUSH instruction
DS     Debug store
ACPI   Thermal Monitor and Clock Ctrl
MMX    MMX instruction set
FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
SSE    Streaming SIMD Extensions instruction set
SSE2   SSE2 extensions
SS     Self Snoop
HT     Hyper Threading
TM     Thermal monitor
31     reserved

TLB and cache info:
b1: unknown TLB/cache descriptor
b0: unknown TLB/cache descriptor
05: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
57: unknown TLB/cache descriptor
56: unknown TLB/cache descriptor
49: unknown TLB/cache descriptor
30: unknown TLB/cache descriptor
b4: unknown TLB/cache descriptor
2c: unknown TLB/cache descriptor
Processor serial: 0000-06FB-0000-0000-0000-0000
I can't see how to tell which CPUs are on which chips, and wonder if there is a smarter way to allocate workers that would improve performance.

When I run mprime with 8 workers, it assigns the workers in pairs (1,2 and 3,4, etc.). When I run mprime with 4 workers, it assigns helpers to "any cpu". I don't want to run 16 workers due to the performance hit, so how do I go about figuring out the best way to set this up?
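For reference, the `physical id` and `core id` fields in the /proc/cpuinfo excerpt above already encode the chip layout: logical processors sharing both values are siblings on one core, and processors sharing only `physical id` sit on the same chip. A minimal grouping sketch (assumes the field layout shown in the excerpt):

```python
import os
from collections import defaultdict

def parse_topology(cpuinfo_text):
    """Group logical processors by (physical id, core id) from /proc/cpuinfo text."""
    topology = defaultdict(list)
    processor = physical = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            processor = int(value)
        elif key == "physical id":
            physical = int(value)
        elif key == "core id":
            # "core id" follows "physical id" in each stanza, so both are known here.
            topology[(physical, int(value))].append(processor)
    return dict(topology)

if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        topo = parse_topology(f.read())
    for (pkg, core), cpus in sorted(topo.items()):
        print(f"chip {pkg} core {core}: logical cpus {cpus}")
```

On a box like the one above this should print 16 lines, one per core, with cores on the same chip sharing a chip number.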
Old 2013-02-11, 22:20   #2
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts

You should be able to tell it to run each worker on a specific core. "any cpu" is fancy for leaving it to the OS, which generally is a Bad Idea. Run mprime -m, though unfortunately I can't remember which option is the useful one. One of them will let you set a specific affinity for each worker, so running 16 workers with specific affinities should work great.
Old 2013-02-11, 23:42   #3
garo
 
 
Aug 2002
Termonfeckin, IE

2²×691 Posts

How do you define poor performance? And when you say that 8 workers get double the performance, does that mean twice the total throughput or twice the speed of each worker?
Old 2013-02-12, 00:52   #4
tcharron
 
Jan 2013

28₁₀ Posts

I don't have exact timings, but something like this:
Running 16 workers gives about .095 sec per iteration (on each of the 16 workers).
Running 8 workers with one helper each gives about .047 sec per iteration (so, a mild improvement in overall throughput).
Running 4 workers with 3 helpers each gives about .023 sec per iteration.

In all cases I used "smart" affinity.

The 16 workers are each set up by mprime with affinity for a separate core.

With 8 workers, mprime assigns workers affinities of 1,3,5,7,9,11,13, and 15, with helper affinities (respectively) on 2,4,6,8,10,12,14, and 16. It specifically says that CPUs 1 and 2 are one physical CPU, etc.

With 4 workers it doesn't provide any information about physical cpus, and just seems to leave affinity to the OS. This is what I don't get... why doesn't it even try to assign affinities in this case?

It is my guess/assertion that 16 workers are slower than 8 because the L2 cache is too small, but that is pure speculation. I also wonder whether the 4 threads working on each of the 4 workers in the last case are inefficient due to the threads not being assigned to the same physical CPU (and hence not sharing any cache).
Old 2013-02-12, 00:59   #5
tcharron
 
Jan 2013

2²×7 Posts

I checked the timings again from a log on the system:

16 workers * 1 thread each: .094-.099
8 workers * 2 threads each: .035-.045
4 workers * 4 threads each: .020-.022

I should also mention that I have 56 GB of RAM. Perhaps I should be running P-1 instead on some of the cores (?)

Last fiddled with by tcharron on 2013-02-12 at 01:00
Old 2013-02-12, 01:02   #6
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts

Quote:
Originally Posted by tcharron View Post
I checked the timings again from a log on the system

16 workers * 1 thread each: .094-.099
8 workers * 2 threads each: .035-.045
4 workers * 4 threads each: .020-.022

I should also mention that I have 56Gig of ram. Perhaps I should be running p-1 instead on some of the cores (?)
YES P-1

It looks like 8x2 has the highest possible efficiency.

I would still be curious how 16 threads do if you set their affinities individually though.
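The efficiency comparison can be made concrete with a little arithmetic: aggregate throughput is workers divided by seconds per iteration. A quick sketch using the midpoints of the ranges quoted above (this assumes all runs used comparable exponents):

```python
# Aggregate throughput = workers / (seconds per iteration).
# Iteration times are midpoints of the ranges reported above;
# the comparison assumes all workers ran comparable exponents.
configs = {
    "16 workers x 1 thread": (16, 0.0965),
    "8 workers x 2 threads": (8, 0.0400),
    "4 workers x 4 threads": (4, 0.0210),
}

for name, (workers, sec_per_iter) in sorted(configs.items()):
    print(f"{name}: {workers / sec_per_iter:.0f} iter/sec total")

best = max(configs, key=lambda k: configs[k][0] / configs[k][1])
print("best:", best)
```

By these numbers, 8×2 does indeed have the highest total throughput (about 200 iter/sec), with 4×4 close behind and 16×1 well back.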
Old 2013-02-12, 01:14   #7
tcharron
 
Jan 2013

2²·7 Posts

BLAH. My previous timings are all wrong - there were inconsistent exponents and I need to redo my tests. Some were in the 30-million range, and others in the 60-million range. Will follow up with more info.

Regarding the idea to force affinity, I'd try that, but I don't know which cores share a cache or a physical CPU. I'm not sure how to figure that out except by lots of trial and error.
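Trial and error isn't actually needed on Linux: sysfs reports, for each cache on each logical CPU, the list of CPUs that share it. A sketch that walks cpu0's caches (the /sys paths are the standard Linux sysfs layout; the `expand_cpu_list` helper just parses the `0-3,8`-style list format):

```python
import glob
import os

def expand_cpu_list(spec):
    """Expand a sysfs cpu list such as '0,2' or '0-3,8' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

if __name__ == "__main__":
    # Each indexN directory describes one cache of cpu0 (L1d, L1i, L2, ...);
    # shared_cpu_list names the logical CPUs that share that cache.
    for index in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cache/index*")):
        with open(os.path.join(index, "level")) as f:
            level = f.read().strip()
        with open(os.path.join(index, "shared_cpu_list")) as f:
            shared = f.read().strip()
        print(f"L{level} cache shared by cpus {sorted(expand_cpu_list(shared))}")
```

On a box like this one, the L2 entry should show which pair of logical CPUs shares each 4 MB cache, which is exactly the pair to hand to each 2-threaded worker.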
Old 2013-02-12, 01:18   #8
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts

I'm not sure that will be an issue if every worker has its own core. Just as long as each of the 16 workers gets a unique affinity.
Old 2013-02-12, 01:33   #9
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2²×1,873 Posts

Quote:
Originally Posted by tcharron View Post
I should also mention that I have 56Gig of ram. Perhaps I should be running p-1 instead on some of the cores (?)
First, do whatever interests you the most. Yes, GIMPS can use P-1 help, but LL and DC are also very valuable.

P-1 does not multithread quite as well as LL tests.
Old 2013-02-12, 04:39   #10
Mr. P-1
 
 
Jun 2003

10010010001₂ Posts

Quote:
Originally Posted by Prime95 View Post
First, do whatever interests you the most.
The Golden Rule of GIMPS, IMO.

What interests some of us the most, however, is maximising the effect our contributions have on project throughput. If tcharron has the same interest, then P-1 is the way to go.

Quote:
P-1 does not multithread quite as well as LL tests.
Multithreading is less efficient all round.

Last fiddled with by Mr. P-1 on 2013-02-12 at 04:41
Old 2013-02-12, 05:36   #11
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

1D44₁₆ Posts

Quote:
Originally Posted by Mr. P-1 View Post
If tcharron has the same interest, then P-1 is the way to go.
Not necessarily.

Quote:
multithread is less efficient all round.
P-1 stage 2 does large muls and adds. The adds are not multi-threaded.
LL and DC only do muls. Thus, all LL and DC operations are multi-threaded.
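The scaling consequence of that unthreaded portion is Amdahl's law: if a fraction p of the work multithreads and 1-p does not, n threads give a speedup of 1/((1-p) + p/n). A toy illustration (the 0.9 fraction below is made up for illustration, not a measured prime95 number):

```python
def amdahl_speedup(parallel_fraction, threads):
    """Amdahl's law: the serial part caps the speedup from adding threads."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / threads)

# Hypothetical 90%-parallel workload (illustrative only, not measured):
for n in (1, 2, 4):
    print(f"{n} threads: {amdahl_speedup(0.9, n):.2f}x")

# Fully parallel work (like LL/DC, where everything is muls) scales ideally:
print(f"4 threads, fully parallel: {amdahl_speedup(1.0, 4):.2f}x")
```

Even a small serial fraction noticeably caps multi-thread scaling, which is why P-1 stage 2 gains less from extra threads than LL or DC.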