
 2013-02-11, 22:08 #1 tcharron   Jan 2013 22·7 Posts

Quad CPU setup question

I have a ProLiant DL580 G5 with four quad-core Xeon X7350s in it (16 cores total), running 64-bit Ubuntu. When I run mprime with 16 workers, I get poor performance. I think this is because each of the four chips has no level 3 cache and instead has two 4 MB level 2 caches (each shared by two cores). While the chip is not hyperthreaded, running 8 workers with 2 threads each gets better performance (about twice as good, if not better), and running 4 workers with 4 threads each gives approximately another doubling.

/proc/cpuinfo has 16 very similar copies of this:

Code:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU X7350 @ 2.93GHz
stepping        : 11
microcode       : 0xb3
cpu MHz         : 2933.353
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips        : 5866.70
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual

and cpuid returns this:

Code:
 eax in   eax      ebx      ecx      edx
00000000 0000000a 756e6547 6c65746e 49656e69
00000001 000006fb 10040800 0004e3bd bfebfbff
00000002 05b0b101 005657f0 00000000 2cb43049
00000003 00000000 00000000 00000000 00000000
00000004 0c000122 01c0003f 0000003f 00000001
00000005 00000040 00000040 00000003 00002220
00000006 00000001 00000002 00000001 00000000
00000007 00000000 00000000 00000000 00000000
00000008 00000400 00000000 00000000 00000000
00000009 00000000 00000000 00000000 00000000
0000000a 07280202 00000000 00000000 00000503
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 20100800
80000002 65746e49 2952286c 6f655820 2952286e
80000003 55504320 20202020 20202020 58202020
80000004 30353337 20402020 33392e32 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 10008040 00000000
80000007 00000000 00000000 00000000 00000000
80000008 00003028 00000000 00000000 00000000

Code:
Vendor ID: "GenuineIntel"; CPUID level 10

Intel-specific functions:
Version 000006fb:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 15 - Extended model 0
Stepping 11
Reserved 0

Extended brand string: "Intel(R) Xeon(R) CPU X7350 @ 2.93GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 16
Hyper threading siblings: 4

Feature flags bfebfbff:
FPU     Floating Point Unit
VME     Virtual 8086 Mode Enhancements
DE      Debugging Extensions
PSE     Page Size Extensions
TSC     Time Stamp Counter
MSR     Model Specific Registers
PAE     Physical Address Extension
MCE     Machine Check Exception
CX8     COMPXCHG8B Instruction
APIC    On-chip Advanced Programmable Interrupt Controller present and enabled
SEP     Fast System Call
MTRR    Memory Type Range Registers
PGE     PTE Global Flag
MCA     Machine Check Architecture
CMOV    Conditional Move and Compare Instructions
FGPAT   Page Attribute Table
PSE-36  36-bit Page Size Extension
CLFSH   CFLUSH instruction
DS      Debug store
ACPI    Thermal Monitor and Clock Ctrl
MMX     MMX instruction set
FXSR    Fast FP/MMX Streaming SIMD Extensions save/restore
SSE     Streaming SIMD Extensions instruction set
SSE2    SSE2 extensions
SS      Self Snoop
HT      Hyper Threading
TM      Thermal monitor
31      reserved

TLB and cache info:
 b1: unknown TLB/cache descriptor
 b0: unknown TLB/cache descriptor
 05: unknown TLB/cache descriptor
 f0: unknown TLB/cache descriptor
 57: unknown TLB/cache descriptor
 56: unknown TLB/cache descriptor
 49: unknown TLB/cache descriptor
 30: unknown TLB/cache descriptor
 b4: unknown TLB/cache descriptor
 2c: unknown TLB/cache descriptor

Processor serial: 0000-06FB-0000-0000-0000-0000

I can't see how to tell which CPUs are on which chips, and wonder if there is a smart way to allocate them that would improve performance. When I run mprime with 8 workers, it assigns the workers in pairs (1,2 and 3,4, etc.). When I run mprime with 4 workers, it assigns helpers to "any cpu". I don't want to run 16 workers due to the performance hit, so how do I go about figuring out the best way to set this up?
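On Linux, which logical CPUs sit on which chip can be read out of /proc/cpuinfo (the "physical id" and "core id" fields) or from /sys/devices/system/cpu/cpuN/topology/. A minimal sketch of the parsing idea, demonstrated on a small made-up two-CPU sample rather than the live file (on the real machine you would feed it `open("/proc/cpuinfo").read()` instead):

```python
# Sketch: group logical CPUs by (physical package, core) using /proc/cpuinfo.
# Assumes the usual field order: "processor" first, then "physical id",
# then "core id" within each stanza.
from collections import defaultdict

def topology(cpuinfo_text):
    """Map (physical id, core id) -> list of logical processor numbers."""
    groups = defaultdict(list)
    proc = phys = None
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, _, val = line.partition(":")
        key, val = key.strip(), val.strip()
        if key == "processor":
            proc = int(val)
        elif key == "physical id":
            phys = int(val)
        elif key == "core id":
            # "core id" closes the stanza's identifying fields
            groups[(phys, int(val))].append(proc)
    return dict(groups)

# Illustrative sample only (not the poster's real machine):
sample = """\
processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 0
core id : 1
"""
print(topology(sample))
```

Two logical CPUs listed under the same (physical id, core id) pair would be hyperthread siblings; the same physical id with different core ids means same chip, different cores.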
 2013-02-11, 22:20 #2 Dubslow Basketry That Evening!     "Bunslow the Bold" Jun 2011 40
 2013-02-11, 23:42 #3 garo     Aug 2002 Termonfeckin, IE 22×691 Posts How do you define poor performance? And when you say that 8 workers get double the performance, does that mean twice the total throughput or twice the speed per worker?
 2013-02-12, 00:52 #4 tcharron   Jan 2013 2810 Posts I don't have exact timings, but something like this:

16 workers: about .095 per iteration (on each of the 16 workers)
8 workers with one helper each: about .047 per iteration (so, a mild improvement in overall throughput)
4 workers with 3 helpers each: about .023 per iteration

In all cases I used "smart" affinity. With 16 workers, mprime sets each worker's affinity to a separate core. With 8 workers, mprime assigns workers affinities of 1, 3, 5, 7, 9, 11, 13, and 15, with helper affinities (respectively) on 2, 4, 6, 8, 10, 12, 14, and 16; it specifically says '1 and 2' are a physical cpu, etc. With 4 workers it doesn't provide any information about physical cpus, and just seems to leave affinity to the OS. This is what I don't get: why doesn't it even try to assign affinities in this case? My guess/assertion is that 16 workers are slower than 8 because the L2 cache is too small, but that is pure speculation. I also wonder if the 4 threads working on each of the 4 workers in the last case are inefficient due to the threads not being assigned to the same physical cpu (and hence not sharing any cache).
 2013-02-12, 00:59 #5 tcharron   Jan 2013 22×7 Posts I checked the timings again from a log on the system:

16 workers × 1 thread each: .094-.099
8 workers × 2 threads each: .035-.045
4 workers × 4 threads each: .020-.022

I should also mention that I have 56 GB of RAM. Perhaps I should be running P-1 instead on some of the cores(?) Last fiddled with by tcharron on 2013-02-12 at 01:00
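Per-iteration times like these compare directly only within one configuration; to compare configurations, convert them to total throughput by dividing the worker count by the per-worker seconds per iteration. A quick sanity check using midpoints of the ranges above (the midpoints are illustrative, not measured values):

```python
# Total throughput = workers / (seconds per iteration per worker).
# Timings are midpoints of the ranges reported in the post above.
configs = {
    "16 x 1": (16, 0.0965),
    "8 x 2":  (8,  0.040),
    "4 x 4":  (4,  0.021),
}
for name, (workers, sec_per_iter) in configs.items():
    print(f"{name}: {workers / sec_per_iter:.0f} iterations/s total")
```

By this measure 8×2 does the most total work (about 200 iterations/s), with 4×4 close behind (about 190) and 16×1 well back (about 166).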
2013-02-12, 01:02   #6
Dubslow

"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts

Quote:
 Originally Posted by tcharron I checked the timings again from a log on the system: 16 workers × 1 thread each: .094-.099; 8 workers × 2 threads each: .035-.045; 4 workers × 4 threads each: .020-.022. I should also mention that I have 56 GB of RAM. Perhaps I should be running P-1 instead on some of the cores(?)
YES P-1

It looks like 8x2 gives the highest total throughput.

I would still be curious how 16 threads do if you set their affinities individually though.

 2013-02-12, 01:14 #7 tcharron   Jan 2013 22·7 Posts BLAH. Previous timings are all wrong - the exponents were inconsistent, so I need to redo my tests. Some were in the 30 million range, and others in the 60 million range. Will follow up with more info. Regarding the idea of forcing affinity, I'd try that, but I don't know which cores are sharing a cache or a physical cpu. Not sure how to figure that out except by lots of trial and error.
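Trial and error isn't actually needed on Linux: each cache instance under /sys/devices/system/cpu/cpuN/cache/indexM/ exposes `level` and `shared_cpu_list` files, and the latter names exactly the logical CPUs that share that cache. A sketch assuming that standard sysfs layout (the walk simply prints nothing on a system without it; the `parse_cpu_list` helper is our own, for expanding sysfs list syntax like "0-1,4"):

```python
import glob
import os

def parse_cpu_list(s):
    """Expand a sysfs CPU list such as '0-1,4' into [0, 1, 4]."""
    cpus = []
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

# Walk the standard Linux sysfs cache topology. For the X7350 described
# above, the L2 entries should show pairs of cores sharing each 4 MB cache.
for cache_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index*")):
    cpu = cache_dir.split("/")[5]  # e.g. "cpu0"
    with open(os.path.join(cache_dir, "level")) as f:
        level = f.read().strip()
    with open(os.path.join(cache_dir, "shared_cpu_list")) as f:
        shared = f.read().strip()
    print(f"{cpu}: L{level} cache shared with CPUs {parse_cpu_list(shared)}")
```

Similarly, /sys/devices/system/cpu/cpuN/topology/physical_package_id tells you which chip each logical CPU is on, which is what you'd want for pinning a 4-thread worker to one package.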
 2013-02-12, 01:18 #8 Dubslow Basketry That Evening!     "Bunslow the Bold" Jun 2011 40
2013-02-12, 01:33   #9
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

22×1,873 Posts

Quote:
 Originally Posted by tcharron I should also mention that I have 56Gig of ram. Perhaps I should be running p-1 instead on some of the cores (?)
First, do whatever interests you the most. Yes, GIMPS can use P-1 help, but LL and DC are also very valuable.

P-1 does not multithread quite as well as LL tests.

2013-02-12, 04:39   #10
Mr. P-1

Jun 2003

100100100012 Posts

Quote:
 Originally Posted by Prime95 First, do whatever interests you the most.
The Golden Rule of GIMPS, IMO.

What interests some of us the most, however, is maximising our contribution to the project's throughput. If tcharron has the same interest, then P-1 is the way to go.

Quote:
 P-1 does not multithread quite as well as LL tests.
Multithreading is less efficient all round.

Last fiddled with by Mr. P-1 on 2013-02-12 at 04:41

2013-02-12, 05:36   #11
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1D4416 Posts

Quote:
 Originally Posted by Mr. P-1 If tcharron has the same interest, then P-1 is the way to go.
Not necessarily.

Quote:
 multithread is less efficient all round.
LL and DC do nothing but muls, and the muls are multi-threaded; thus, all LL and DC operations are multi-threaded.

