#56 |
|
Jun 2003
The Texas Hill Country
3²×11² Posts |
I suspect that the scaling becomes limited by the memory architecture. At some point, a saturation effect will cause things to slow down, like the "thrashing" that occurs when you overload a VM system: failure of one part of the system to "keep up" causes other caches to purge prematurely, etc.
|
|
|
|
|
#57 |
|
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
9477₁₀ Posts |
Maybe this dependency is not a sqrt, but a c + log2?
(Yes, yes, I am a pessimist.) This pattern reminds me of the DSM organisation of the SGI servers, which was a hypercube. As a consequence, memory access time from a CPU to its sibling's memory was proportional to the number of edges from node to node (which is <= 4 for 16 units, for example). The Opterons are probably reusing the (now extinct) SGI DSM on a new turn of the evolutionary spiral.
Last fiddled with by Batalov on 2008-06-19 at 02:28 |
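To put a number on the "<= 4 for 16 units" remark, here is a minimal sketch of hop counting in a hypercube. It assumes the usual labelling, where nodes are numbered in binary and two nodes are linked when their labels differ in exactly one bit; the function names are made up for the illustration, and nothing here says the Opteron interconnect is actually wired this way.

```python
def hypercube_hops(a: int, b: int) -> int:
    """Edges on a shortest path between nodes a and b of a hypercube.

    Adjacent nodes differ in exactly one bit of their labels, so the
    distance is the number of differing bits (the Hamming distance).
    """
    return bin(a ^ b).count("1")


def max_hops(num_nodes: int) -> int:
    """Largest hop count between any two nodes.

    Assumes num_nodes is a power of two; by symmetry it is enough to
    measure every node against node 0.
    """
    return max(hypercube_hops(0, n) for n in range(num_nodes))


for n in (2, 4, 8, 16):
    print(n, "nodes: worst case", max_hops(n), "hops")
```

The worst case is log2 of the node count, which is where a "c + log2" shape for the access time would come from under this assumption.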
|
|
|
|
#58 |
|
Tribal Bullet
Oct 2004
3,541 Posts |
So it looks like we're achieving the scaling that experience would predict if everything were properly tuned. What I wonder now is how to analyze whether there's additional room for higher performance or maybe better scaling. Greg, could you run 'numastat' while the LA is in progress and see how many non-local page allocations we're getting? Another possibility is to run many more threads than necessary and reduce the amount of work given to any one thread, hoping to get more cache reuse and less idle time out of threads on a single CPU. I should also spend some time playing with oprofile on my machine (or getting one of you to install a kernel module), because otherwise we'll never get hard numbers about where machine cycles are going.
|
|
|
|
|
#59 | |
|
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
10,753 Posts |
Quote:
Actually, my first experiments were to set up a 32-CPU, 16-system cluster, to run 16, 36 and 64 processes (giving a square grid in each case) and to use 30 CPUs on 15 systems with a 5x6 grid. The aggregate speed was 30 > 36 > 16 > 64. Putting the 16 processes onto 8 systems was faster than splitting them over 16, as might be expected. However, your implementation is quite different, so it's an experiment worth trying to see whether it does lead to an improvement.
Paul |
|
|
|
|
|
#60 | |
|
Nov 2003
2²·5·373 Posts |
Quote:
better use of the processors to simply accept a longer ELAPSED time to do the LA. Run on 1 processor only and use the others to start sieving something else. Total throughput will improve. |
|
|
|
|
|
#61 | ||
|
Jul 2003
So Cal
2·3⁴·13 Posts |
Quote:
Code:
                     node0       node1       node2       node3       node4       node5       node6       node7
numa_hit         274998776   106792182   113193412   267744560   967142980   973378485   971348098  1002440356
numa_miss         16635435   606023602   511580197   298009012    91617054    29327083    24241671   152533229
numa_foreign    1392825242    39423820    28824102    18038450    53407859    64046284    91550068    41851458
interleave_hit       11255       17473       13797       16425       20329       17243       18579       17571
local_node       274830103   106421928   112944257   267412361   966764545   973002107   971009627  1001980370
other_node        16804108   606393856   511829352   298341211    91995489    29703461    24580142   152993215
Quote:
Greg |
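For the question of how many non-local page allocations there are, a rough sketch that turns the two rows that matter into a per-node share. It assumes the usual reading of the numastat counters (local_node: the allocating process ran on the node that supplied the page; other_node: it ran on a different node), and the helper names are invented for the example. The counters are cumulative, so the shares cover everything since they were last reset, not just the current LA run.

```python
# Rows copied from the numastat output above (node0 .. node7).
NUMASTAT = """\
local_node 274830103 106421928 112944257 267412361 966764545 973002107 971009627 1001980370
other_node 16804108 606393856 511829352 298341211 91995489 29703461 24580142 152993215
"""


def parse(text):
    """Turn numastat-style rows into {counter_name: [per-node values]}."""
    rows = {}
    for line in text.strip().splitlines():
        name, *values = line.split()
        rows[name] = [int(v) for v in values]
    return rows


def non_local_share(rows):
    """Fraction of each node's allocations that went to remote processes."""
    return [
        other / (local + other)
        for local, other in zip(rows["local_node"], rows["other_node"])
    ]


shares = non_local_share(parse(NUMASTAT))
for node, share in enumerate(shares):
    print(f"node{node}: {share:6.1%} of allocations were non-local")
```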
||
|
|
|
|
#62 | |
|
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
10,753 Posts |
Quote:
Paul |
|
|
|
|
|
#63 | |
|
Jul 2003
So Cal
83A₁₆ Posts |
Quote:
Code:
oprofile: available events for CPU type "AMD64 family10" CPU_CLK_UNHALTED: (counter: all) Cycles outside of halt state (min count: 3000) DISPATCHED_FPU_OPS: (counter: all) Dispatched FPU ops (min count: 500) Unit masks (default 0x3f) ---------- 0x01: Add pipe ops excluding load ops and SSE move ops 0x02: Multiply pipe ops excluding load ops and SSE move ops 0x04: Store pipe ops excluding load ops and SSE move ops 0x08: Add pipe load ops and SSE move ops 0x10: Multiply pipe load ops and SSE move ops 0x20: Store pipe load ops and SSE move ops 0x3f: all ops CYCLES_FPU_EMPTY: (counter: all) The number of cycles in which the PFU is empty (min count: 500) DISPATCHED_FPU_OPS_FAST_FLAG: (counter: all) The number of FPU operations that use the fast flag interface (min count: 500) RETIRED_SSE_OPS: (counter: all) The number of SSE ops or uops retired (min count: 500) Unit masks (default 0x7f) ---------- 0x01: Single Precision add/subtract ops 0x02: Single precision multiply ops 0x04: Single precision divide/square root ops 0x08: Double precision add/subtract ops 0x10: Double precision multiply ops 0x20: Double precision divide/square root ops 0x40: OP type, 0=uops 1=FLOPS RETIRED_MOVE_OPS: (counter: all) The number of move uops retired (min count: 500) Unit masks (default 0xf) ---------- 0x01: Merging low quadword move uops 0x02: Merging high quadword move uops 0x04: All other merging move uops 0x08: All other move uops RETIRED_SERIALIZING_OPS: (counter: all) The number of serializing uops retired. 
(min count: 500) Unit masks (default 0xf) ---------- 0x01: SSE bottom-executing uops retired 0x02: SSE bottom-serializing uops retired 0x04: x87 bottom-executing uops retired 0x08: x87 bottom-serializing uops retired SERIAL_UOPS_IN_FP_SCHED: (counter: all) Number of cycles a serializing uop is in the FP scheduler (min count: 500) Unit masks (default 0x3) ---------- 0x01: Number of cycles a bottom-execute uops in FP scheduler 0x02: Number of cycles a bottom-serializing uops in FP scheduler SEGMENT_REGISTER_LOADS: (counter: all) Segment register loads (min count: 500) Unit masks (default 0x7f) ---------- 0x01: ES register 0x02: CS register 0x04: SS register 0x08: DS register 0x10: FS register 0x20: GS register 0x40: HS register PIPELINE_RESTART_DUE_TO_SELF_MODIFYING_CODE: (counter: all) Micro-architectural re-sync caused by self modifying code (min count: 500) PIPELINE_RESTART_DUE_TO_PROBE_HIT: (counter: all) Micro-architectural re-sync caused by snoop (min count: 500) LS_BUFFER_2_FULL_CYCLES: (counter: all) Cycles LS Buffer 2 Full (min count: 500) LOCKED_OPS: (counter: all) Locked operations (min count: 500) Unit masks (default 0xf) ---------- 0x01: Number of locked instructions executed 0x02: Cycles in speculative phase 0x04: Cycles in non-speculative phase (including cache miss penalty) 0x08: Cache miss penalty in cycles RETIRED_CLFLUSH: (counter: all) Retired CLFLUSH instructions (min count: 500) RETIRED_CPUID: (counter: all) Retired CPUID instructions (min count: 500) CANCELLED_STORE_TO_LOAD: (counter: all) Counts the number of cancelled store to load forward operations (min count: 500) Unit masks (default 0x7) ---------- 0x01: Address mismatches (starting byte not the same) 0x02: Store is smaller than load 0x04: Misaligned SMIS_RECEIVED: (counter: all) Counts the number of SMI received (min count: 500) DATA_CACHE_ACCESSES: (counter: all) Data cache accesses (min count: 500) DATA_CACHE_MISSES: (counter: all) Data cache misses (min count: 500) 
DATA_CACHE_REFILLS_FROM_L2_OR_NORTHBRIDGE: (counter: all) Data cache refills from L2 or northbridge (min count: 500) Unit masks (default 0x1e) ---------- 0x01: Refill from northbridge 0x02: Shared-state line from L2 0x04: Exclusive-state line from L2 0x08: Owner-state line from L2 0x10: Modified-state line from L2 0x1e: All cache states except refill from northbridge DATA_CACHE_REFILLS_FROM_NORTHBRIDGE: (counter: all) Data cache refills from northbridge (min count: 500) Unit masks (default 0x1f) ---------- 0x10: (M)odified cache state 0x08: (O)wner cache state 0x04: (E)xclusive cache state 0x02: (S)hared cache state 0x01: (I)nvalid cache state 0x1f: All cache states DATA_CACHE_LINES_EVICTED: (counter: all) Data cache lines evicted (min count: 500) Unit masks (default 0x1f) ---------- 0x01: (I)nvalid cache state 0x02: (S)hared cache state 0x04: (E)xclusive cache state 0x08: (O)wner cache state 0x10: (M)odified cache state 0x20: Cache line evict brought by PrefetchNTA 0x40: Cache line evict not brought by PrefetchNTA 0x1f: All cache states except PrefetchNTA L1_DTLB_MISS_AND_L2_DTLB_HIT: (counter: all) L1 DTLB misses and L2 DTLB hits (min count: 500) Unit masks (default 0x3) ---------- 0x01: L2 4K TLB hit 0x02: L2 2M TLB hit L1_DTLB_AND_L2_DTLB_MISS: (counter: all) L1 and L2 DTLB misses (min count: 500) Unit masks (default 0x7) ---------- 0x01: 4K TLB reload 0x02: 2M TLB reload 0x04: 1G TLB reload MISALIGNED_ACCESSES: (counter: all) Misaligned Accesses (min count: 500) MICRO_ARCH_LATE_CANCEL_ACCESS: (counter: all) Microarchitectural late cancel of an access (min count: 500) MICRO_ARCH_EARLY_CANCEL_ACCESS: (counter: all) Microarchitectural early cancel of an access (min count: 500) 1_BIT_ECC_ERRORS: (counter: all) Single-bit ECC errors recorded by scrubber (min count: 500) Unit masks (default 0xf) ---------- 0x01: Scrubber error 0x02: Piggyback scrubber errors 0x04: Load pipe error 0x08: Store write pip error PREFETCH_INSTRUCTIONS_DISPATCHED: (counter: all) The number 
of prefetch instructions dispatched by the decoder (min count: 500) Unit masks (default 0x7) ---------- 0x01: Load (Prefetch, PrefetchT0/T1/T2) 0x02: Store (PrefetchW) 0x04: NTA (PrefetchNTA) LOCKED_INSTRUCTIONS_DCACHE_MISSES: (counter: all) The number of dta cache misses by locked instructions. (min count: 500) Unit masks (default 0x2) ---------- 0x02: Data cache misses by locked instructions L1_DTLB_HIT: (counter: all) L1 DTLB hit (min count: 500) Unit masks (default 0x7) ---------- 0x01: L1 4K TLB hit 0x02: L1 2M TLB hit 0x04: L1 1G TLB hit INEFFECTIVE_SW_PREFETCHES: (counter: all) Number of software prefetches that did not fetch data outside of processor core (min count: 500) Unit masks (default 0x9) ---------- 0x01: Hit in L1 0x08: Hit in L2 GLOBAL_TLB_FLUSHES: (counter: all) The number of global TLB flushes (min count: 500) MEMORY_REQUESTS: (counter: all) Memory Requests by Type (min count: 500) Unit masks (default 0x83) ---------- 0x01: Requests to non-cacheable (UC) memory 0x02: Requests to write-combining (WC) memory or WC buffer flushes to WB memory 0x80: Streaming store (SS) requests DATA_PREFETCHES: (counter: all) Data Prefetcher (min count: 500) Unit masks (default 0x3) ---------- 0x01: Cancelled prefetches 0x02: Prefetch attempts NORTHBRIDGE_READ_RESPONSES: (counter: all) Northbridge Read Responses by Coherency State (min count: 500) Unit masks (default 0x17) ---------- 0x01: Exclusive 0x02: Modified 0x04: Shared 0x10: Data Error OCTWORD_WRITE_TRANSFERS: (counter: all) Octwords Written to System (min count: 500) Unit masks (default 0x1) ---------- 0x01: Quadword write transfer REQUESTS_TO_L2: (counter: all) Requests to L2 Cache (min count: 500) Unit masks (default 0x3f) ---------- 0x01: IC fill 0x02: DC fill 0x04: TLB fill (page table walks) 0x08: Tag snoop request 0x10: Canceled request 0x20: Hardware prefetch from data cache L2_CACHE_MISS: (counter: all) L2 Cache Misses (min count: 500) Unit masks (default 0xf) ---------- 0x01: IC fill 0x02: DC fill 
(includes possible replays) 0x04: TLB page table walk 0x08: Hardwareprefetch from data cache L2_CACHE_FILL_WRITEBACK: (counter: all) L2 Fill/Writeback (min count: 500) Unit masks (default 0x3) ---------- 0x01: L2 fills (victims from L1 caches, TLB page table walks and data prefetches) 0x02: L2 Writebacks to system INSTRUCTION_CACHE_FETCHES: (counter: all) Instruction cache fetches (RevE) (min count: 500) INSTRUCTION_CACHE_MISSES: (counter: all) Instruction cache misses (min count: 500) INSTRUCTION_CACHE_REFILLS_FROM_L2: (counter: all) Instruction Cache Refills from L2 (min count: 500) INSTRUCTION_CACHE_REFILLS_FROM_SYSTEM: (counter: all) Instruction Cache Refills from System (min count: 500) L1_ITLB_MISS_AND_L2_ITLB_HIT: (counter: all) L1 ITLB misses (and L2 ITLB hits) (min count: 500) L1_ITLB_MISS_AND_L2_ITLB_MISS: (counter: all) L1 ITLB Miss, L2 ITLB Miss (min count: 500) Unit masks (default 0x3) ---------- 0x01: Instruction fetches to 4K pages 0x02: Instruction fetches to 2M pages PIPELINE_RESTART_DUE_TO_INSTRUCTION_STREAM_PROBE: (counter: all) Pipeline Restart Due to Instruction Stream Probe (min count: 500) INSTRUCTION_FETCH_STALL: (counter: all) Instruction fetch stall (min count: 500) RETURN_STACK_HITS: (counter: all) Return stack hit (min count: 500) RETURN_STACK_OVERFLOWS: (counter: all) Return stack overflow (min count: 500) INSTRUCTION_CACHE_VICTIMS: (counter: all) Number of instruction cachelines evicticed to L2 (min count: 500) INSTRUCTION_CHCHE_INVALIDATED: (counter: all) Instruction cache lines invalidated (min count: 500) Unit masks (default 0xf) ---------- 0x01: Invalidating probe that did not hit any in-flight instructions 0x02: Invalidating probe that hit one or more in-flight instructions 0x04: SMC that did not hit any in-flight instructions 0x08: SMC that hit one or more in-flight instructions ITLB_RELOADS: (counter: all) The number of ITLB reloads requests (min count: 500) ITLB_RELOADS_ABORTED: (counter: all) The number of ITLB reloads aborted 
(min count: 500) RETIRED_INSTRUCTIONS: (counter: all) Retired instructions (includes exceptions, interrupts, re-syncs) (min count: 3000) RETIRED_UOPS: (counter: all) Retired micro-ops (min count: 500) RETIRED_BRANCH_INSTRUCTIONS: (counter: all) Retired branches (conditional, unconditional, exceptions, interrupts) (min count: 500) RETIRED_MISPREDICTED_BRANCH_INSTRUCTIONS: (counter: all) Retired Mispredicted Branch Instructions (min count: 500) RETIRED_TAKEN_BRANCH_INSTRUCTIONS: (counter: all) Retired taken branch instructions (min count: 500) RETIRED_TAKEN_BRANCH_INSTRUCTIONS_MISPREDICTED: (counter: all) Retired taken branches mispredicted (min count: 500) RETIRED_FAR_CONTROL_TRANSFERS: (counter: all) Retired far control transfers (min count: 500) RETIRED_BRANCH_RESYNCS: (counter: all) Retired branches resyncs (only non-control transfer branches) (min count: 500) RETIRED_NEAR_RETURNS: (counter: all) Retired near returns (min count: 500) RETIRED_NEAR_RETURNS_MISPREDICTED: (counter: all) Retired near returns mispredicted (min count: 500) RETIRED_INDIRECT_BRANCHES_MISPREDICTED: (counter: all) Retired Indirect Branches Mispredicted (min count: 500) RETIRED_MMX_FP_INSTRUCTIONS: (counter: all) Retired MMX/FP instructions (min count: 500) Unit masks (default 0x7) ---------- 0x01: x87 instructions 0x02: MMX & 3DNow instructions 0x04: SSE & SSE2 instructions RETIRED_FASTPATH_DOUBLE_OP_INSTRUCTIONS: (counter: all) Retired FastPath double-op instructions (min count: 500) Unit masks (default 0x7) ---------- 0x01: With low op in position 0 0x02: With low op in position 1 0x04: With low op in position 2 INTERRUPTS_MASKED_CYCLES: (counter: all) Cycles with interrupts masked (IF=0) (min count: 500) INTERRUPTS_MASKED_CYCLES_WITH_INTERRUPT_PENDING: (counter: all) Cycles with interrupts masked while interrupt pending (min count: 500) INTERRUPTS_TAKEN: (counter: all) Number of taken hardware interrupts (min count: 10) DECODER_EMPTY: (counter: all) Nothing to dispatch (decoder empty) 
(min count: 500) DISPATCH_STALLS: (counter: all) Dispatch stalls (min count: 500) DISPATCH_STALL_FOR_BRANCH_ABORT: (counter: all) Dispatch stall from branch abort to retire (min count: 500) DISPATCH_STALL_FOR_SERIALIZATION: (counter: all) Dispatch stall for serialization (min count: 500) DISPATCH_STALL_FOR_SEGMENT_LOAD: (counter: all) Dispatch stall for segment load (min count: 500) DISPATCH_STALL_FOR_REORDER_BUFFER_FULL: (counter: all) Dispatch stall for reorder buffer full (min count: 500) DISPATCH_STALL_FOR_RESERVATION_STATION_FULL: (counter: all) Dispatch stall when reservation stations are full (min count: 500) DISPATCH_STALL_FOR_FPU_FULL: (counter: all) Dispatch stall when FPU is full (min count: 500) DISPATCH_STALL_FOR_LS_FULL: (counter: all) Dispatch stall when LS is full (min count: 500) DISPATCH_STALL_WAITING_FOR_ALL_QUIET: (counter: all) Dispatch stall when waiting for all to be quiet (min count: 500) DISPATCH_STALL_FOR_FAR_TRANSFER_OR_RESYNC: (counter: all) Dispatch Stall for Far Transfer or Resync to Retire (min count: 500) FPU_EXCEPTIONS: (counter: all) FPU exceptions (min count: 1) Unit masks (default 0xf) ---------- 0x01: x87 reclass microfaults 0x02: SSE retype microfaults 0x04: SSE reclass microfaults 0x08: SSE and x87 microtraps DR0_BREAKPOINTS: (counter: all) The number of matches on the address in breakpoint register DR0 (min count: 1) DR1_BREAKPOINTS: (counter: all) The number of matches on the address in breakpoint register DR1 (min count: 1) DR2_BREAKPOINTS: (counter: all) The number of matches on the address in breakpoint register DR2 (min count: 1) DR3_BREAKPOINTS: (counter: all) The number of matches on the address in breakpoint register DR3 (min count: 1) DRAM_ACCESSES: (counter: all) DRAM Accesses (min count: 500) Unit masks (default 0xff) ---------- 0x01: DCT0 Page hit 0x02: DCT0 Page miss 0x04: DCT0 Page conflict 0x08: DCT1 Page hit 0x10: DCT1 Page miss 0x20: DCT1 Page Conflict 0x40: Write request 0x80: Read request 
MEMORY_CONTROLLER_PAGE_TABLE_OVERFLOWS: (counter: all) Memory controller page table overflows (min count: 500) Unit masks (default 0x3) ---------- 0x01: DCT0 Page Table Overflow 0x02: DCT1 Page Table Overflow MEMORY_CONTROLLER_SLOT_MISSED: (counter: all) Memory controller DRAM command slots missed (min count: 500) Unit masks (default 0x3) ---------- 0x01: DCT0 Command slots missed 0x02: DCT2 Command slots missed MEMORY_CONTROLLER_TURNAROUNDS: (counter: all) Memory controller turnarounds (min count: 500) Unit masks (default 0x3f) ---------- 0x01: DCT0 DIMM (chip select) turnaround 0x02: DCT0 Read to write turnaround 0x04: DCT0 Write to read turnaround 0x08: DCT1 DIMM (chip select) turnaround 0x10: DCT1 Read to write turnaround 0x20: DCT1 Write to read turnaround MEMORY_CONTROLLER_BYPASS_COUNTER_SATURATION: (counter: all) Memory controller bypass saturation (min count: 500) Unit masks (default 0xf) ---------- 0x01: Memory controller high priority bypass 0x02: Memory controller medium priority bypass 0x04: DCT0 DCQ bypass 0x08: DCT1 DCQ bypass THERMAL_STATUS: (counter: all) Thermal status (min count: 500) Unit masks (default 0x7c) ---------- 0x04: Number of times the HTC trip point is crossed 0x08: Number of clocks when STC trip point active 0x10: Number of times the STC trip point is crossed 0x20: Number of clocks HTC P-state is inactive 0x40: Number of clocks HTC P-state is active CPU_IO_REQUESTS_TO_MEMORY_IO: (counter: all) CPU/IO Requests to Memory/IO (RevE) (min count: 500) Unit masks (default 0x8) ---------- 0x01: IO to IO 0x04: IO to Mem 0x08: CPU to IO 0x10: To remote node 0x20: To local node 0x40: From remote node 0x80: From local node CACHE_BLOCK_COMMANDS: (counter: all) Cache Block Commands (RevE) (min count: 500) Unit masks (default 0x3d) ---------- 0x01: Victim Block (Writeback) 0x04: Read Block (Dcache load miss refill) 0x08: Read Block Shared (Icache refill) 0x10: Read Block Modified (Dcache store miss refill) 0x20: Change to Dirty (first store to clean 
block already in cache) SIZED_COMMANDS: (counter: all) Sized Commands (min count: 500) Unit masks (default 0x3f) ---------- 0x01: non-posted write byte (1-32 bytes) 0x02: non-posted write dword (1-16 dwords) 0x04: posted write byte (1-32 bytes) 0x08: posted write dword (1-16 dwords) 0x10: read byte (4 bytes) 0x20: read dword (1-16 dwords) PROBE_RESPONSES_AND_UPSTREAM_REQUESTS: (counter: all) Probe Responses and Upstream Requests (min count: 500) Unit masks (default 0xff) ---------- 0x01: Probe miss 0x02: Probe hit clean 0x04: Probe hit dirty without memory cancel 0x08: Probe hit dirty with memory cancel 0x10: Upstream display refresh/ISOC reads 0x20: Upstream non-display refresh reads 0x40: Upstream ISOC writes 0x80: Upstream non-ISOC writes GART_EVENTS: (counter: all) GART Events (min count: 500) Unit masks (default 0xff) ---------- 0x01: GART aperture hit on access from CPU 0x02: GART aperture hit on access from I/O 0x04: GART miss 0x08: GART/DEV Request hit table walk in progress 0x10: DEV hit 0x20: DEV miss 0x40: DEV error 0x80: GART/DEV multiple table walk in progress MEMORY_CONTROLLER_REQUESTS: (counter: all) Sized Read/Write activity. 
(min count: 500) Unit masks (default 0x78) ---------- 0x01: Write requests 0x02: Read Requests including Prefetch 0x04: Prefetch Request 0x08: 32 Bytes Sized Writes 0x10: 64 Bytes Sized Writes 0x20: 32 Bytes Sized Reads 0x40: 64 Byte Sized Reads 0x80: Read Requests while writes pending in DCQ CPU_DRAM_REQUEST_TO_NODE: (counter: all) CPU to DRAM requests to target node (min count: 500) Unit masks (default 0xff) ---------- 0x01: From local node to node 0 0x02: From local node to node 1 0x04: From local node to node 2 0x08: From local node to node 3 0x10: From local node to node 4 0x20: From local node to node 5 0x40: From local node to node 6 0x80: From local node to node 7 IO_DRAM_REQUEST_TO_NODE: (counter: all) IO to DRAM requests to target node (min count: 500) Unit masks (default 0xff) ---------- 0x01: From local node to node 0 0x02: From local node to node 1 0x04: From local node to node 2 0x08: From local node to node 3 0x10: From local node to node 4 0x20: From local node to node 5 0x40: From local node to node 6 0x80: From local node to node 7 CPU_READ_COMMAND_LATENCY_NODE_0_3: (counter: all) Latency between the local node and remote node (min count: 500) Unit masks (default 0xff) ---------- 0x01: Read block 0x02: Read block shared 0x04: Read block modified 0x08: Change to dirty 0x10: From local node to node 0 0x20: From local node to node 1 0x40: From local node to node 2 0x80: From local node to node 3 CPU_READ_COMMAND_REQUEST_NODE_0_3: (counter: all) Number of requests that a latency measurment is made for Event 0x1E2 (min count: 500) Unit masks (default 0xff) ---------- 0x01: Read block 0x02: Read block shared 0x04: Read block modified 0x08: Change to dirty 0x10: From local node to node 0 0x20: From local node to node 1 0x40: From local node to node 2 0x80: From local node to node 3 CPU_READ_COMMAND_LATENCY_NODE_4_7: (counter: all) Latency between the local node and remote node (min count: 500) Unit masks (default 0xff) ---------- 0x01: Read block 0x02: 
Read block shared 0x04: Read block modified 0x08: Change to dirty 0x10: From local node to node 4 0x20: From local node to node 5 0x40: From local node to node 6 0x80: From local node to node 7 CPU_READ_COMMAND_REQUEST_NODE_4_7: (counter: all) Number of requests that a latency measurment is made for Event 0x1E2 (min count: 500) Unit masks (default 0xff) ---------- 0x01: Read block 0x02: Read block shared 0x04: Read block modified 0x08: Change to dirty 0x10: From local node to node 4 0x20: From local node to node 5 0x40: From local node to node 6 0x80: From local node to node 7 CPU_COMMAND_LATENCY_TARGET: (counter: all) Determine latency between the local node and a remote node. (min count: 500) Unit masks (default 0xf7) ---------- 0x01: Read sized 0x02: Write sized 0x04: Victim block 0x08: Node group select. 0=Nodes 0-3. 1=Nodes 4-7 0x10: From local node to node 0/4 0x20: From local node to node 1/5 0x40: From local node to node 2/6 0x80: From local node to node 3/7 CPU_REQUEST_TARGET: (counter: all) Number of requests that a latency measurement is made for Event 0x1E6 (min count: 500) Unit masks (default 0xf7) ---------- 0x01: Read sized 0x02: Write sized 0x04: Victim block 0x08: Node group select. 0=Nodes 0-3. 
1=Nodes 4-7 0x10: From local node to node 0/4 0x20: From local node to node 1/5 0x40: From local node to node 2/6 0x80: From local node to node 3/7 HYPERTRANSPORT_LINK0_TRANSMIT_BANDWIDTH: (counter: all) HyperTransport(tm) link 0 transmit bandwidth (min count: 500) Unit masks (default 0xbf) ---------- 0x01: Command DWORD sent 0x02: DWORD sent 0x04: Buffer release DWORD sent 0x08: Nop DW sent (idle) 0x10: Address extension DWORD sent 0x20: Per packet CRC sent 0x80: SubLink Mask HYPERTRANSPORT_LINK1_TRANSMIT_BANDWIDTH: (counter: all) HyperTransport(tm) link 1 transmit bandwidth (min count: 500) Unit masks (default 0xbf) ---------- 0x01: Command DWORD sent 0x02: DWORD sent 0x04: Buffer release DWORD sent 0x08: Nop DW sent (idle) 0x10: Address extension DWORD sent 0x20: Per packet CRC sent 0x80: SubLink Mask HYPERTRANSPORT_LINK2_TRANSMIT_BANDWIDTH: (counter: all) HyperTransport(tm) link 2 transmit bandwidth (min count: 500) Unit masks (default 0xbf) ---------- 0x01: Command DWORD sent 0x02: DWORD sent 0x04: Buffer release DWORD sent 0x08: Nop DW sent (idle) 0x10: Address extension DWORD sent 0x20: Per packet CRC sent 0x80: SubLink Mask HYPERTRANSPORT_LINK3_TRANSMIT_BANDWIDTH: (counter: all) HyperTransport(tm) link 3 transmit bandwidth (min count: 500) Unit masks (default 0xbf) ---------- 0x01: Command DWORD sent 0x02: DWORD sent 0x04: Buffer release DWORD sent 0x08: Nop DW sent (idle) 0x10: Address extension DWORD sent 0x20: Per packet CRC sent 0x80: SubLink Mask READ_REQUEST_L3_CACHE: (counter: all) Tracks the red requests from each core to L3 cache (min count: 500) Unit masks (default 0xf7) ---------- 0x01: Read Block Exclusive (Data cache read) 0x02: Read Block Shared (Instruciton cache read) 0x04: Read Block Modify 0x10: Core 0 Select 0x20: Core 1 Select 0x40: Core 2 Select 0x80: Core 3 Select L3_CACHE_MISSES: (counter: all) Tracks the L3 cache misses from each core (min count: 500) Unit masks (default 0xf7) ---------- 0x01: Read Block Exclusive (Data cache read) 
0x02: Read Block Shared (Instruciton cache read) 0x04: Read Block Modify 0x10: Core 0 Select 0x20: Core 1 Select 0x40: Core 2 Select 0x80: Core 3 Select L3_FILLS_CAUSED_BY_L2_EVICTIONS: (counter: all) Tracks the L3 fills caused by L2 evictions per core (min count: 500) Unit masks (default 0xff) ---------- 0x01: Shared 0x02: Exclusive 0x04: Owned 0x08: Modified 0x10: Core 0 Select 0x20: Core 1 Select 0x40: Core 2 Select 0x80: Core 3 Select L3_EVICTIONS: (counter: all) Tracks the state of the L3 line when it was evicted (min count: 500) Unit masks (default 0xf) ---------- 0x01: Shared 0x02: Exclusive 0x04: Owned 0x08: Modified |
|
|
|
|
|
#64 | |
|
Jul 2003
So Cal
2×3⁴×13 Posts |
Quote:
Greg |
|
|
|
|
|
#65 | |
|
Tribal Bullet
Oct 2004
3,541 Posts |
Quote:
Regarding the NUMA stats, it looks like the counters mix in a lot of allocations from previous jobs. Maybe the counters can be zeroed first...
Last fiddled with by jasonp on 2008-06-20 at 15:30 |
|
|
|
|
|
#66 |
|
Jul 2003
So Cal
2·3⁴·13 Posts |
Googling didn't reveal a way to do that, other than rebooting the computer. Perhaps looking at the differences from yesterday would work? Here are the stats from today, followed by the differences from yesterday's numbers. Since yesterday, the large run on nodes 0 and 1 simply continued. I restarted the run on nodes 2 and 3 to use 8 threads instead of 7. ECM and PFGW in Wine have been running on nodes 4-7.
Code:
                     node0       node1       node2       node3       node4       node5       node6       node7
numa_hit         276846730   110012018   117206662   269579970  1138760079  1140979275  1121827928  1170021262
numa_miss         16635435   606023602   511580197   298322163    93107362    29327083    24241671   152533229
numa_foreign    1392825242    39423820    29192402    19473609    53407859    64046284    91550068    41851458
interleave_hit       12216       18867       15296       16494       22148       18429       19921       18818
local_node       276651789   109620572   116936346   269238614  1138359298  1140585367  1121470224  1169540179
other_node        16830376   606415048   511850513   298663519    93508143    29720991    24599375   153014312
Code:
                     node0       node1       node2       node3       node4       node5       node6       node7
numa_hit           1847954     3219836     4013250     1835410   171617099   167600790   150479830   167580906
numa_miss                0           0           0      313151     1490308           0           0           0
numa_foreign             0           0      368300     1435159           0           0           0           0
interleave_hit         961        1394        1499          69        1819        1186        1342        1247
local_node         1821686     3198644     3992089     1826253   171594753   167583260   150460597   167559809
other_node           26268       21192       21161      322308     1512654       17530       19233       21097
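Since the counters can't be zeroed without a reboot, differencing two snapshots is the workaround. A small sketch of that, using yesterday's and today's tables as posted in this thread; the function and variable names are made up for the example. Subtracting row by row gives the per-node change over the interval, which is what the second table above shows.

```python
YESTERDAY = """\
numa_hit 274998776 106792182 113193412 267744560 967142980 973378485 971348098 1002440356
numa_miss 16635435 606023602 511580197 298009012 91617054 29327083 24241671 152533229
numa_foreign 1392825242 39423820 28824102 18038450 53407859 64046284 91550068 41851458
interleave_hit 11255 17473 13797 16425 20329 17243 18579 17571
local_node 274830103 106421928 112944257 267412361 966764545 973002107 971009627 1001980370
other_node 16804108 606393856 511829352 298341211 91995489 29703461 24580142 152993215
"""

TODAY = """\
numa_hit 276846730 110012018 117206662 269579970 1138760079 1140979275 1121827928 1170021262
numa_miss 16635435 606023602 511580197 298322163 93107362 29327083 24241671 152533229
numa_foreign 1392825242 39423820 29192402 19473609 53407859 64046284 91550068 41851458
interleave_hit 12216 18867 15296 16494 22148 18429 19921 18818
local_node 276651789 109620572 116936346 269238614 1138359298 1140585367 1121470224 1169540179
other_node 16830376 606415048 511850513 298663519 93508143 29720991 24599375 153014312
"""


def parse(text):
    """Turn numastat-style rows into {counter_name: [per-node values]}."""
    rows = {}
    for line in text.strip().splitlines():
        name, *values = line.split()
        rows[name] = [int(v) for v in values]
    return rows


def snapshot_delta(before, after):
    """Per-node change in every counter between two snapshots."""
    return {
        name: [new - old for old, new in zip(before[name], after[name])]
        for name in after
    }


delta = snapshot_delta(parse(YESTERDAY), parse(TODAY))
for name, values in delta.items():
    print(f"{name:<14}" + "".join(f"{v:>12}" for v in values))
```

Over this interval only node3 and node4 picked up any numa_miss counts, which is much easier to see in the differences than in the raw totals.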