#1 |
Jan 2013
2²·7 Posts
I have a ProLiant DL580 G5 with four quad-core Xeon X7350s in it (16 cores total). I am running 64-bit Ubuntu. When I run mprime with 16 tasks, I get poor performance. I think this is because each of the 4 chips has no level-3 cache, and instead has two 4 MB level-2 caches, with each 4 MB cache shared by two cores. While the chip is not hyperthreaded, running 8 workers with 2 threads each gets better performance (about twice as good, if not better). Running 4 workers with 4 threads gets another improvement of approximately double. /proc/cpuinfo has 16 near-identical copies of this:

Code:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU X7350 @ 2.93GHz
stepping        : 11
microcode       : 0xb3
cpu MHz         : 2933.353
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips        : 5866.70
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual

Code:
eax in   eax      ebx      ecx      edx
00000000 0000000a 756e6547 6c65746e 49656e69
00000001 000006fb 10040800 0004e3bd bfebfbff
00000002 05b0b101 005657f0 00000000 2cb43049
00000003 00000000 00000000 00000000 00000000
00000004 0c000122 01c0003f 0000003f 00000001
00000005 00000040 00000040 00000003 00002220
00000006 00000001 00000002 00000001 00000000
00000007 00000000 00000000 00000000 00000000
00000008 00000400 00000000 00000000 00000000
00000009 00000000 00000000 00000000 00000000
0000000a 07280202 00000000 00000000 00000503
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 20100800
80000002 65746e49 2952286c 6f655820 2952286e
80000003 55504320 20202020 20202020 58202020
80000004 30353337 20402020 33392e32 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 10008040 00000000
80000007 00000000 00000000 00000000 00000000
80000008 00003028 00000000 00000000 00000000

Vendor ID: "GenuineIntel"; CPUID level 10

Intel-specific functions:
Version 000006fb:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 15 - Extended model 0
Stepping 11
Reserved 0

Extended brand string: "Intel(R) Xeon(R) CPU X7350 @ 2.93GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 16
Hyper threading siblings: 4

Feature flags bfebfbff:
FPU     Floating Point Unit
VME     Virtual 8086 Mode Enhancements
DE      Debugging Extensions
PSE     Page Size Extensions
TSC     Time Stamp Counter
MSR     Model Specific Registers
PAE     Physical Address Extension
MCE     Machine Check Exception
CX8     COMPXCHG8B Instruction
APIC    On-chip Advanced Programmable Interrupt Controller present and enabled
SEP     Fast System Call
MTRR    Memory Type Range Registers
PGE     PTE Global Flag
MCA     Machine Check Architecture
CMOV    Conditional Move and Compare Instructions
FGPAT   Page Attribute Table
PSE-36  36-bit Page Size Extension
CLFSH   CFLUSH instruction
DS      Debug store
ACPI    Thermal Monitor and Clock Ctrl
MMX     MMX instruction set
FXSR    Fast FP/MMX Streaming SIMD Extensions save/restore
SSE     Streaming SIMD Extensions instruction set
SSE2    SSE2 extensions
SS      Self Snoop
HT      Hyper Threading
TM      Thermal monitor
31      reserved

TLB and cache info:
b1: unknown TLB/cache descriptor
b0: unknown TLB/cache descriptor
05: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
57: unknown TLB/cache descriptor
56: unknown TLB/cache descriptor
49: unknown TLB/cache descriptor
30: unknown TLB/cache descriptor
b4: unknown TLB/cache descriptor
2c: unknown TLB/cache descriptor
Processor serial: 0000-06FB-0000-0000-0000-0000

When I run mprime with 8 workers, it assigns the workers in pairs (1,2 and 3,4, etc.). When I run mprime with 4 workers, it assigns helpers to "any cpu". I don't want to run 16 workers due to the performance hit, so how do I go about figuring out the best way to set this up?
#2 |
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3×29×83 Posts |
You should be able to tell it to run each worker on a specific core. "any cpu" is fancy for leaving it to the OS, which generally is a Bad Idea. Run mprime -m, though unfortunately I can't remember which option is the useful one. One of them will let you set a specific affinity for each worker, so running 16 workers with specific affinities should work great.
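For reference, the per-worker affinity lives in local.txt. A sketch of what that might look like, from memory of the v26-era option names (treat the exact spelling and the special values as assumptions; mprime's readme.txt and undoc.txt for your version are authoritative, and I recall 99 meaning "any CPU" and 100 meaning "smart" in that era):

```
[Worker #1]
Affinity=0

[Worker #2]
Affinity=1

[Worker #3]
Affinity=2

... and so on, one section per worker, up to Affinity=15 for worker 16.
```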
#3 |
Aug 2002
Termonfeckin, IE
2⁴×173 Posts
How do you define poor performance? And when you say that 8 workers get double the performance, does that mean twice the total throughput or twice the speed of each worker?
#4 |
Jan 2013
2²×7 Posts
I don't have exact timings, but something like this:
Running 16 workers gives about .095 s per iteration (on each of the 16 workers).
Running 8 workers with one helper each gives about .047 s per iteration (so, a mild improvement in overall throughput).
Running 4 workers with 3 helpers each gives about .023 s per iteration.

In all cases I used "smart" affinity. The 16 workers are each set up by mprime with affinity for a separate core. With 8 workers, mprime assigns workers affinities of 1,3,5,7,9,11,13, and 15, with helper affinities (respectively) on 2,4,6,8,10,12,14, and 16. It specifically says '1 and 2' are a physical cpu, etc. With 4 workers it doesn't provide any information about physical cpus, and just seems to leave affinity to the OS. This is what I don't get: why doesn't it even try to assign affinities in this case?

It is my guess/assertion that 16 workers are slower than 8 due to too small an L2 cache, but that is pure speculation. I also wonder if the 4 threads working on each of the 4 workers in the last case are inefficient due to the threads not being assigned to the same physical core (and hence not sharing any cache).
#5 |
Jan 2013
11100₂ Posts
I checked the timings again from a log on the system:

16 workers × 1 thread each: .094-.099
8 workers × 2 threads each: .035-.045
4 workers × 4 threads each: .020-.022

I should also mention that I have 56 GB of RAM. Perhaps I should be running P-1 instead on some of the cores(?)

Last fiddled with by tcharron on 2013-02-12 at 01:00
#6 | |
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3·29·83 Posts |
Quote:
It looks like 8×2 gives the best total throughput. I would still be curious how 16 workers do if you set their affinities individually, though.
#7 |
Jan 2013
34₈ Posts
BLAH. The previous timings are all wrong: the tests used inconsistent exponents, so I need to redo them. Some were in the 30 million range and others in the 60 million range. I will follow up with more info.

Regarding the idea of forcing affinity, I'd try that, but I don't know which cores share a cache or a physical CPU. I'm not sure how to figure that out except by lots of trial and error.
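On Linux there is no need for trial and error: the kernel exports the topology through sysfs. A minimal sketch that maps each logical CPU to its package, core, and L2-sharing set (Linux-specific paths; on these Core-microarchitecture Xeons `cache/index2` is the unified L2, and the helper falls back to "?" where a file is absent):

```python
import glob

def read(path):
    """Read one sysfs value, or '?' if the file doesn't exist."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "?"

for cpu in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*"),
                  key=lambda p: int(p.split("cpu")[-1])):
    pkg = read(cpu + "/topology/physical_package_id")
    core = read(cpu + "/topology/core_id")
    l2 = read(cpu + "/cache/index2/shared_cpu_list")
    print(f"{cpu.split('/')[-1]}: package {pkg}, core {core}, shares L2 with CPUs {l2}")
```

Two logical CPUs that report the same `shared_cpu_list` for L2 are the paired cores on one die, which is exactly the grouping a 2-threads-per-worker setup wants to match.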
#8 |
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3·29·83 Posts |
I'm not sure that will be an issue if every worker has its own core. Just as long as each of the 16 workers gets a unique affinity.
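As a cross-check outside mprime: pinning is just an OS-level CPU mask, which Python's stdlib can set and read back on Linux, so you can verify what any affinity setting actually did:

```python
import os

# Pin this process to logical CPU 0, then read the mask back (Linux only).
# Whatever mprime's Affinity option does internally ends up as a mask like this.
os.sched_setaffinity(0, {0})
print(os.sched_getaffinity(0))  # -> {0}
```

From the shell, `taskset -pc <pid>` shows the same mask for a running mprime worker.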
#9 | |
P90 years forever!
Aug 2002
Yeehaw, FL
10000001010111₂ Posts
Quote:
P-1 does not multithread quite as well as LL tests. |
#10 | |
Jun 2003
2221₈ Posts
The Golden Rule of GIMPS, IMO.
What interests some of us the most, however, is maximising the effect our contributions have on the project's throughput. If tcharron has the same interest, then P-1 is the way to go. Quote:
Last fiddled with by Mr. P-1 on 2013-02-12 at 04:41 |
#11 | |
P90 years forever!
Aug 2002
Yeehaw, FL
17·487 Posts |
Not necessarily.
Quote:
LL and DC only do muls. Thus, all LL and DC operations are multi-threaded. |