mersenneforum.org > Other Stuff > Archived Projects > NFSNET Discussion
2008-06-19, 02:12   #56
Wacky (Jun 2003, The Texas Hill Country)

I suspect that the scaling becomes limited by the memory architecture. At some point, a saturation effect will cause things to slow down, like the "thrashing" that occurs when you overload a VM system: failure of one part of the system to "keep up" causes other caches to purge prematurely, and so on.
2008-06-19, 02:18   #57
Batalov ("Serge", Mar 2008, Phi(4,2^7658614+1)/2)

Maybe this dependency is not a sqrt, but a c + log2?
(yes, yes, I am a pessimist)

This pattern reminds me of the DSM organisation of the SGI servers, which was a hypercube; as a consequence, memory access time from a CPU to its sibling's memory was proportional to the number of edges between the nodes (which is <= 4 for 16 units, for example). The Opterons are probably reusing the (now extinct) SGI DSM design on a new coil of evolution.
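As an editorial illustration of the hypercube point above (a sketch, not any actual SGI routing code): in a hypercube, the number of edges on the shortest path between two nodes is the Hamming distance of their binary node IDs, so with 16 nodes (a 4-dimensional hypercube) the worst case is 4 hops.

```python
# Sketch: hop count between nodes in a hypercube topology.
# Adjacent nodes differ in exactly one ID bit, so the shortest-path
# length between two nodes is the Hamming distance of their IDs.

def hypercube_hops(a: int, b: int) -> int:
    """Shortest-path length between nodes a and b in a hypercube."""
    return bin(a ^ b).count("1")

# For a 16-node (4-dimensional) hypercube the worst case is 4 hops:
worst = max(hypercube_hops(a, b) for a in range(16) for b in range(16))
print(worst)  # 4
```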

2008-06-19, 15:47   #58
jasonp (Tribal Bullet, Oct 2004)

So it looks like we're achieving the scaling that experience would predict if everything were properly tuned. What I wonder now is how to analyze whether there's additional room for higher performance, or maybe better scaling. Greg, could you run 'numastat' while the LA is in progress and see how many non-local page allocations we're getting? Another possibility is to run many more threads than necessary and reduce the amount of work given to any one thread, hoping to get more cache reuse and less idle time out of the threads on a single CPU. I should also spend some time playing with oprofile on my machine (or getting one of you to install a kernel module), because otherwise we'll never get hard numbers about where machine cycles are going.
2008-06-19, 16:21   #59
xilman (Bamboozled!, May 2003, Down not across)

Quote:
Originally Posted by jasonp
Another possibility is to run many more threads than necessary and reduce the amount of work given to any one thread, hoping to get more cache reuse and less idle time out of threads on a single CPU.
When I tried that on the MPI version running at MSR, I found it slowed things down, even when all the processes were on the same system and no messages were sent over the network.

Actually, my first experiments were to set up a 32-CPU, 16-system cluster, to run 16, 36 and 64 processes (giving a square grid in each case), and to use 30 CPUs on 15 systems with a 5x6 grid. The aggregate speed was 30 > 36 > 16 > 64. Putting the 16 processes onto 8 systems was faster than splitting them over 16, as might be expected.

However, your implementation is quite different, so it's an experiment worth trying to see whether it does lead to an improvement.

Paul
2008-06-19, 17:15   #60
R.D. Silverman (Nov 2003)

Quote:
Originally Posted by jasonp
So it looks like we're achieving the scaling that experience would predict if everything was properly tuned. What I wonder now, is how to analyze whether there's additional room for higher performance or maybe better scaling. Greg, could you run 'numastat' while the LA is in progress and see how many non-local page allocations we're getting? Another possibility is to run many more threads than necessary and reduce the amount of work given to any one thread, hoping to get more cache reuse and less idle time out of threads on a single CPU. I should also spend some time playing with oprofile on my machine (or getting one of you to install a kernel module), because otherwise we'll never get hard numbers about where machine cycles are going.
Since the LA does not scale at anything close to linear, it is a much better use of the processors to simply accept a longer ELAPSED time to do the LA. Run on 1 processor only and use the others to start sieving something else. Total throughput will improve.
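The throughput argument can be made concrete with a toy model (the speedup figure below is an illustrative assumption, not a measurement): if LA on all 8 cores runs only 3x faster than on 1 core, then dedicating 1 core to LA and leaving 7 sieving accomplishes more total sieving over the same wall-clock window.

```python
# Toy throughput model (illustrative numbers, not measurements).
# Suppose the LA parallelizes poorly: 8 cores give only a 3x speedup.
cores = 8
la_speedup_all = 3.0            # assumed sublinear LA scaling

la_time_all = 1.0 / la_speedup_all   # normalized LA work = 1 unit
la_time_one = 1.0                    # single-core LA elapsed time

# Core-hours of sieving done over the window option B needs for its LA:
# Option A: all 8 cores do LA first, then all 8 sieve until B finishes.
sieve_a = cores * (la_time_one - la_time_all)
# Option B: 1 core does LA the whole time, 7 cores sieve throughout.
sieve_b = (cores - 1) * la_time_one

print(sieve_a < sieve_b)  # True: option B gets more sieving done
```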
2008-06-19, 17:58   #61
frmky (Jul 2003, So Cal)

Quote:
Originally Posted by jasonp
Greg, could you run 'numastat' while the LA is in progress and see how many non-local page allocations we're getting?
Here's the output. When this was run, nodes 0 and 1 were running the LA (8 threads) on 2,949+ (15875755 x 15876003 (4555.3 MB) with weight 1114396697 (70.19/col)), nodes 2 and 3 were running the smaller LA (7 threads with one core idle ... I need to restart it soon) on 10,241- (9170126 x 9170374 (2655.8 MB) with weight 662977955 (72.30/col)), and nodes 4-7 were idle. Each node has 8 GB of memory.

Code:
                           node0           node1           node2           node3           node4           node5           node6           node7
numa_hit               274998776       106792182       113193412       267744560       967142980       973378485       971348098      1002440356
numa_miss               16635435       606023602       511580197       298009012        91617054        29327083        24241671       152533229
numa_foreign          1392825242        39423820        28824102        18038450        53407859        64046284        91550068        41851458
interleave_hit             11255           17473           13797           16425           20329           17243           18579           17571
local_node             274830103       106421928       112944257       267412361       966764545       973002107       971009627      1001980370
other_node              16804108       606393856       511829352       298341211        91995489        29703461        24580142       152993215
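To read these counters at a glance, one could compute the fraction of each node's allocations that were misses (a hypothetical helper, not part of numastat; a numa_miss on a node counts pages that a process wanted elsewhere but that landed on this node). Using the node1 column above, roughly 85% of the pages placed on node1 were spillover:

```python
# Hypothetical helper: fraction of pages placed on a node that were
# "misses", i.e. allocations preferred on some other node, from
# numastat's numa_hit / numa_miss counters.

def miss_fraction(numa_hit: int, numa_miss: int) -> float:
    """numa_miss / (numa_hit + numa_miss) for one node."""
    return numa_miss / (numa_hit + numa_miss)

# node1 from the table above:
frac = miss_fraction(106792182, 606023602)
print(round(frac, 2))  # 0.85
```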
Quote:
I should also spend some time playing with oprofile on my machine (or getting one of you to install a kernel module), because otherwise we'll never get hard numbers about where machine cycles are going.
I'm the only one using the computer now, so I can help. Just let me know what to do. The computer is running CentOS pretending to be RHEL5.

Greg
2008-06-19, 18:04   #62
xilman (Bamboozled!, May 2003, Down not across)

Quote:
Originally Posted by R.D. Silverman
Since the LA does not scale at anything close to linear, it is a much better use of the processors to simply accept a longer ELAPSED time to do the LA. Run on 1 processor only and use the others to start sieving something else. Total throughput will improve.
True, if you have sufficient memory. If the machine is memory bound, one LA and N-1 sievers will spend so much time thrashing that it's better to get the LA out of the way ASAP.


Paul
2008-06-19, 19:07   #63
frmky (Jul 2003, So Cal)

Quote:
Originally Posted by jasonp
I should also spend some time playing with oprofile on my machine (or getting one of you to install a kernel module), because otherwise we'll never get hard numbers about where machine cycles are going.
Just checked. oprofile seems to be installed and presumably working on this computer. Here's the output of ophelp:
Code:
oprofile: available events for CPU type "AMD64 family10"

CPU_CLK_UNHALTED: (counter: all)
	Cycles outside of halt state (min count: 3000)
DISPATCHED_FPU_OPS: (counter: all)
	Dispatched FPU ops (min count: 500)
	Unit masks (default 0x3f)
	----------
	0x01: Add pipe ops excluding load ops and SSE move ops
	0x02: Multiply pipe ops excluding load ops and SSE move ops
	0x04: Store pipe ops excluding load ops and SSE move ops 
	0x08: Add pipe load ops and SSE move ops
	0x10: Multiply pipe load ops and SSE move ops
	0x20: Store pipe load ops and SSE move ops
	0x3f: all ops
CYCLES_FPU_EMPTY: (counter: all)
	The number of cycles in which the PFU is empty (min count: 500)
DISPATCHED_FPU_OPS_FAST_FLAG: (counter: all)
	The number of FPU operations that use the fast flag interface (min count: 500)
RETIRED_SSE_OPS: (counter: all)
	The number of SSE ops or uops retired (min count: 500)
	Unit masks (default 0x7f)
	----------
	0x01: Single Precision add/subtract ops
	0x02: Single precision multiply ops
	0x04: Single precision divide/square root ops
	0x08: Double precision add/subtract ops
	0x10: Double precision multiply ops
	0x20: Double precision divide/square root ops
	0x40: OP type, 0=uops 1=FLOPS
RETIRED_MOVE_OPS: (counter: all)
	The number of move uops retired (min count: 500)
	Unit masks (default 0xf)
	----------
	0x01: Merging low quadword move uops
	0x02: Merging high quadword move uops
	0x04: All other merging move uops
	0x08: All other move uops
RETIRED_SERIALIZING_OPS: (counter: all)
	The number of serializing uops retired. (min count: 500)
	Unit masks (default 0xf)
	----------
	0x01: SSE bottom-executing uops retired
	0x02: SSE bottom-serializing uops retired
	0x04: x87 bottom-executing uops retired
	0x08: x87 bottom-serializing uops retired
SERIAL_UOPS_IN_FP_SCHED: (counter: all)
	Number of cycles a serializing uop is in the FP scheduler (min count: 500)
	Unit masks (default 0x3)
	----------
	0x01: Number of cycles a bottom-execute uops in FP scheduler
	0x02: Number of cycles a bottom-serializing uops in FP scheduler
SEGMENT_REGISTER_LOADS: (counter: all)
	Segment register loads (min count: 500)
	Unit masks (default 0x7f)
	----------
	0x01: ES register
	0x02: CS register
	0x04: SS register
	0x08: DS register
	0x10: FS register
	0x20: GS register
	0x40: HS register
PIPELINE_RESTART_DUE_TO_SELF_MODIFYING_CODE: (counter: all)
	Micro-architectural re-sync caused by self modifying code (min count: 500)
PIPELINE_RESTART_DUE_TO_PROBE_HIT: (counter: all)
	Micro-architectural re-sync caused by snoop (min count: 500)
LS_BUFFER_2_FULL_CYCLES: (counter: all)
	Cycles LS Buffer 2 Full (min count: 500)
LOCKED_OPS: (counter: all)
	Locked operations (min count: 500)
	Unit masks (default 0xf)
	----------
	0x01: Number of locked instructions executed
	0x02: Cycles in speculative phase
	0x04: Cycles in non-speculative phase (including cache miss penalty)
	0x08: Cache miss penalty in cycles 
RETIRED_CLFLUSH: (counter: all)
	Retired CLFLUSH instructions (min count: 500)
RETIRED_CPUID: (counter: all)
	Retired CPUID instructions (min count: 500)
CANCELLED_STORE_TO_LOAD: (counter: all)
	Counts the number of cancelled store to load forward operations (min count: 500)
	Unit masks (default 0x7)
	----------
	0x01: Address mismatches (starting byte not the same)
	0x02: Store is smaller than load
	0x04: Misaligned
SMIS_RECEIVED: (counter: all)
	Counts the number of SMI received (min count: 500)
DATA_CACHE_ACCESSES: (counter: all)
	Data cache accesses (min count: 500)
DATA_CACHE_MISSES: (counter: all)
	Data cache misses (min count: 500)
DATA_CACHE_REFILLS_FROM_L2_OR_NORTHBRIDGE: (counter: all)
	Data cache refills from L2 or northbridge (min count: 500)
	Unit masks (default 0x1e)
	----------
	0x01: Refill from northbridge
	0x02: Shared-state line from L2
	0x04: Exclusive-state line from L2
	0x08: Owner-state line from L2
	0x10: Modified-state line from L2
	0x1e: All cache states except refill from northbridge
DATA_CACHE_REFILLS_FROM_NORTHBRIDGE: (counter: all)
	Data cache refills from northbridge (min count: 500)
	Unit masks (default 0x1f)
	----------
	0x10: (M)odified cache state
	0x08: (O)wner cache state
	0x04: (E)xclusive cache state
	0x02: (S)hared cache state
	0x01: (I)nvalid cache state
	0x1f: All cache states
DATA_CACHE_LINES_EVICTED: (counter: all)
	Data cache lines evicted (min count: 500)
	Unit masks (default 0x1f)
	----------
	0x01: (I)nvalid cache state
	0x02: (S)hared cache state
	0x04: (E)xclusive cache state
	0x08: (O)wner cache state
	0x10: (M)odified cache state
	0x20: Cache line evict brought by PrefetchNTA
	0x40: Cache line evict not brought by PrefetchNTA
	0x1f: All cache states except PrefetchNTA
L1_DTLB_MISS_AND_L2_DTLB_HIT: (counter: all)
	L1 DTLB misses and L2 DTLB hits (min count: 500)
	Unit masks (default 0x3)
	----------
	0x01: L2 4K TLB hit
	0x02: L2 2M TLB hit
L1_DTLB_AND_L2_DTLB_MISS: (counter: all)
	L1 and L2 DTLB misses (min count: 500)
	Unit masks (default 0x7)
	----------
	0x01: 4K TLB reload
	0x02: 2M TLB reload
	0x04: 1G TLB reload
MISALIGNED_ACCESSES: (counter: all)
	Misaligned Accesses (min count: 500)
MICRO_ARCH_LATE_CANCEL_ACCESS: (counter: all)
	Microarchitectural late cancel of an access (min count: 500)
MICRO_ARCH_EARLY_CANCEL_ACCESS: (counter: all)
	Microarchitectural early cancel of an access (min count: 500)
1_BIT_ECC_ERRORS: (counter: all)
	Single-bit ECC errors recorded by scrubber (min count: 500)
	Unit masks (default 0xf)
	----------
	0x01: Scrubber error
	0x02: Piggyback scrubber errors
	0x04: Load pipe error
	0x08: Store write pip error
PREFETCH_INSTRUCTIONS_DISPATCHED: (counter: all)
	The number of prefetch instructions dispatched by the decoder  (min count: 500)
	Unit masks (default 0x7)
	----------
	0x01: Load (Prefetch, PrefetchT0/T1/T2)
	0x02: Store (PrefetchW)
	0x04: NTA (PrefetchNTA)
LOCKED_INSTRUCTIONS_DCACHE_MISSES: (counter: all)
	The number of dta cache misses by locked instructions. (min count: 500)
	Unit masks (default 0x2)
	----------
	0x02: Data cache misses by locked instructions
L1_DTLB_HIT: (counter: all)
	L1 DTLB hit (min count: 500)
	Unit masks (default 0x7)
	----------
	0x01: L1 4K TLB hit
	0x02: L1 2M TLB hit
	0x04: L1 1G TLB hit
INEFFECTIVE_SW_PREFETCHES: (counter: all)
	Number of software prefetches that did not fetch data outside of processor core (min count: 500)
	Unit masks (default 0x9)
	----------
	0x01: Hit in L1
	0x08: Hit in L2
GLOBAL_TLB_FLUSHES: (counter: all)
	The number of global TLB flushes (min count: 500)
MEMORY_REQUESTS: (counter: all)
	Memory Requests by Type (min count: 500)
	Unit masks (default 0x83)
	----------
	0x01: Requests to non-cacheable (UC) memory
	0x02: Requests to write-combining (WC) memory or WC buffer flushes to WB memory
	0x80: Streaming store (SS) requests
DATA_PREFETCHES: (counter: all)
	Data Prefetcher (min count: 500)
	Unit masks (default 0x3)
	----------
	0x01: Cancelled prefetches
	0x02: Prefetch attempts
NORTHBRIDGE_READ_RESPONSES: (counter: all)
	Northbridge Read Responses by Coherency State (min count: 500)
	Unit masks (default 0x17)
	----------
	0x01: Exclusive
	0x02: Modified
	0x04: Shared
	0x10: Data Error
OCTWORD_WRITE_TRANSFERS: (counter: all)
	Octwords Written to System (min count: 500)
	Unit masks (default 0x1)
	----------
	0x01: Quadword write transfer
REQUESTS_TO_L2: (counter: all)
	Requests to L2 Cache (min count: 500)
	Unit masks (default 0x3f)
	----------
	0x01: IC fill
	0x02: DC fill
	0x04: TLB fill (page table walks)
	0x08: Tag snoop request
	0x10: Canceled request
	0x20: Hardware prefetch from data cache
L2_CACHE_MISS: (counter: all)
	L2 Cache Misses (min count: 500)
	Unit masks (default 0xf)
	----------
	0x01: IC fill
	0x02: DC fill (includes possible replays)
	0x04: TLB page table walk
	0x08: Hardwareprefetch from data cache
L2_CACHE_FILL_WRITEBACK: (counter: all)
	L2 Fill/Writeback (min count: 500)
	Unit masks (default 0x3)
	----------
	0x01: L2 fills (victims from L1 caches, TLB page table walks and data prefetches)
	0x02: L2 Writebacks to system
INSTRUCTION_CACHE_FETCHES: (counter: all)
	Instruction cache fetches (RevE) (min count: 500)
INSTRUCTION_CACHE_MISSES: (counter: all)
	Instruction cache misses (min count: 500)
INSTRUCTION_CACHE_REFILLS_FROM_L2: (counter: all)
	Instruction Cache Refills from L2 (min count: 500)
INSTRUCTION_CACHE_REFILLS_FROM_SYSTEM: (counter: all)
	Instruction Cache Refills from System (min count: 500)
L1_ITLB_MISS_AND_L2_ITLB_HIT: (counter: all)
	L1 ITLB misses (and L2 ITLB hits) (min count: 500)
L1_ITLB_MISS_AND_L2_ITLB_MISS: (counter: all)
	L1 ITLB Miss, L2 ITLB Miss (min count: 500)
	Unit masks (default 0x3)
	----------
	0x01: Instruction fetches to 4K pages
	0x02: Instruction fetches to 2M pages	
PIPELINE_RESTART_DUE_TO_INSTRUCTION_STREAM_PROBE: (counter: all)
	Pipeline Restart Due to Instruction Stream Probe (min count: 500)
INSTRUCTION_FETCH_STALL: (counter: all)
	Instruction fetch stall (min count: 500)
RETURN_STACK_HITS: (counter: all)
	Return stack hit (min count: 500)
RETURN_STACK_OVERFLOWS: (counter: all)
	Return stack overflow (min count: 500)
INSTRUCTION_CACHE_VICTIMS: (counter: all)
	Number of instruction cachelines evicticed to L2 (min count: 500)
INSTRUCTION_CHCHE_INVALIDATED: (counter: all)
	Instruction cache lines invalidated (min count: 500)
	Unit masks (default 0xf)
	----------
	0x01: Invalidating probe that did not hit any in-flight instructions
	0x02: Invalidating probe that hit one or more in-flight instructions
	0x04: SMC that did not hit any in-flight instructions
	0x08: SMC that hit one or more in-flight instructions
ITLB_RELOADS: (counter: all)
	The number of ITLB reloads requests (min count: 500)
ITLB_RELOADS_ABORTED: (counter: all)
	The number of ITLB reloads aborted (min count: 500)
RETIRED_INSTRUCTIONS: (counter: all)
	Retired instructions (includes exceptions, interrupts, re-syncs) (min count: 3000)
RETIRED_UOPS: (counter: all)
	Retired micro-ops (min count: 500)
RETIRED_BRANCH_INSTRUCTIONS: (counter: all)
	Retired branches (conditional, unconditional, exceptions, interrupts) (min count: 500)
RETIRED_MISPREDICTED_BRANCH_INSTRUCTIONS: (counter: all)
	Retired Mispredicted Branch Instructions (min count: 500)
RETIRED_TAKEN_BRANCH_INSTRUCTIONS: (counter: all)
	Retired taken branch instructions (min count: 500)
RETIRED_TAKEN_BRANCH_INSTRUCTIONS_MISPREDICTED: (counter: all)
	Retired taken branches mispredicted (min count: 500)
RETIRED_FAR_CONTROL_TRANSFERS: (counter: all)
	Retired far control transfers (min count: 500)
RETIRED_BRANCH_RESYNCS: (counter: all)
	Retired branches resyncs (only non-control transfer branches) (min count: 500)
RETIRED_NEAR_RETURNS: (counter: all)
	Retired near returns (min count: 500)
RETIRED_NEAR_RETURNS_MISPREDICTED: (counter: all)
	Retired near returns mispredicted (min count: 500)
RETIRED_INDIRECT_BRANCHES_MISPREDICTED: (counter: all)
	Retired Indirect Branches Mispredicted (min count: 500)
RETIRED_MMX_FP_INSTRUCTIONS: (counter: all)
	Retired MMX/FP instructions (min count: 500)
	Unit masks (default 0x7)
	----------
	0x01: x87 instructions
	0x02: MMX & 3DNow instructions
	0x04: SSE & SSE2 instructions
RETIRED_FASTPATH_DOUBLE_OP_INSTRUCTIONS: (counter: all)
	Retired FastPath double-op instructions (min count: 500)
	Unit masks (default 0x7)
	----------
	0x01: With low op in position 0
	0x02: With low op in position 1
	0x04: With low op in position 2
INTERRUPTS_MASKED_CYCLES: (counter: all)
	Cycles with interrupts masked (IF=0) (min count: 500)
INTERRUPTS_MASKED_CYCLES_WITH_INTERRUPT_PENDING: (counter: all)
	Cycles with interrupts masked while interrupt pending (min count: 500)
INTERRUPTS_TAKEN: (counter: all)
	Number of taken hardware interrupts (min count: 10)
DECODER_EMPTY: (counter: all)
	Nothing to dispatch (decoder empty) (min count: 500)
DISPATCH_STALLS: (counter: all)
	Dispatch stalls (min count: 500)
DISPATCH_STALL_FOR_BRANCH_ABORT: (counter: all)
	Dispatch stall from branch abort to retire (min count: 500)
DISPATCH_STALL_FOR_SERIALIZATION: (counter: all)
	Dispatch stall for serialization (min count: 500)
DISPATCH_STALL_FOR_SEGMENT_LOAD: (counter: all)
	Dispatch stall for segment load (min count: 500)
DISPATCH_STALL_FOR_REORDER_BUFFER_FULL: (counter: all)
	Dispatch stall for reorder buffer full (min count: 500)
DISPATCH_STALL_FOR_RESERVATION_STATION_FULL: (counter: all)
	Dispatch stall when reservation stations are full (min count: 500)
DISPATCH_STALL_FOR_FPU_FULL: (counter: all)
	Dispatch stall when FPU is full (min count: 500)
DISPATCH_STALL_FOR_LS_FULL: (counter: all)
	Dispatch stall when LS is full (min count: 500)
DISPATCH_STALL_WAITING_FOR_ALL_QUIET: (counter: all)
	Dispatch stall when waiting for all to be quiet (min count: 500)
DISPATCH_STALL_FOR_FAR_TRANSFER_OR_RESYNC: (counter: all)
	Dispatch Stall for Far Transfer or Resync to Retire (min count: 500)
FPU_EXCEPTIONS: (counter: all)
	FPU exceptions (min count: 1)
	Unit masks (default 0xf)
	----------
	0x01: x87 reclass microfaults
	0x02: SSE retype microfaults
	0x04: SSE reclass microfaults
	0x08: SSE and x87 microtraps
DR0_BREAKPOINTS: (counter: all)
	The number of matches on the address in breakpoint register DR0 (min count: 1)
DR1_BREAKPOINTS: (counter: all)
	The number of matches on the address in breakpoint register DR1 (min count: 1)
DR2_BREAKPOINTS: (counter: all)
	The number of matches on the address in breakpoint register DR2 (min count: 1)
DR3_BREAKPOINTS: (counter: all)
	The number of matches on the address in breakpoint register DR3 (min count: 1)
DRAM_ACCESSES: (counter: all)
	DRAM Accesses (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: DCT0 Page hit
	0x02: DCT0 Page miss
	0x04: DCT0 Page conflict
	0x08: DCT1 Page hit
	0x10: DCT1 Page miss
	0x20: DCT1 Page Conflict
	0x40: Write request
	0x80: Read request
MEMORY_CONTROLLER_PAGE_TABLE_OVERFLOWS: (counter: all)
	Memory controller page table overflows (min count: 500)
	Unit masks (default 0x3)
	----------
	0x01: DCT0 Page Table Overflow
	0x02: DCT1 Page Table Overflow
MEMORY_CONTROLLER_SLOT_MISSED: (counter: all)
	Memory controller DRAM command slots missed (min count: 500)
	Unit masks (default 0x3)
	----------
	0x01: DCT0 Command slots missed
	0x02: DCT2 Command slots missed
MEMORY_CONTROLLER_TURNAROUNDS: (counter: all)
	Memory controller turnarounds (min count: 500)
	Unit masks (default 0x3f)
	----------
	0x01: DCT0 DIMM (chip select) turnaround
	0x02: DCT0 Read to write turnaround
	0x04: DCT0 Write to read turnaround
	0x08: DCT1 DIMM (chip select) turnaround
	0x10: DCT1 Read to write turnaround
	0x20: DCT1 Write to read turnaround
MEMORY_CONTROLLER_BYPASS_COUNTER_SATURATION: (counter: all)
	Memory controller bypass saturation (min count: 500)
	Unit masks (default 0xf)
	----------
	0x01: Memory controller high priority bypass
	0x02: Memory controller medium priority bypass
	0x04: DCT0 DCQ bypass
	0x08: DCT1 DCQ bypass
THERMAL_STATUS: (counter: all)
	Thermal status (min count: 500)
	Unit masks (default 0x7c)
	----------
	0x04: Number of times the HTC trip point is crossed
	0x08: Number of clocks when STC trip point active
	0x10: Number of times the STC trip point is crossed
	0x20: Number of clocks HTC P-state is inactive
	0x40: Number of clocks HTC P-state is active
CPU_IO_REQUESTS_TO_MEMORY_IO: (counter: all)
	CPU/IO Requests to Memory/IO (RevE) (min count: 500)
	Unit masks (default 0x8)
	----------
	0x01: IO to IO
	0x04: IO to Mem 
	0x08: CPU to IO 
	0x10: To remote node
	0x20: To local node
	0x40: From remote node
	0x80: From local node
CACHE_BLOCK_COMMANDS: (counter: all)
	Cache Block Commands (RevE) (min count: 500)
	Unit masks (default 0x3d)
	----------
	0x01: Victim Block (Writeback)
	0x04: Read Block (Dcache load miss refill)
	0x08: Read Block Shared (Icache refill)
	0x10: Read Block Modified (Dcache store miss refill)
	0x20: Change to Dirty (first store to clean block already in cache)
SIZED_COMMANDS: (counter: all)
	Sized Commands (min count: 500)
	Unit masks (default 0x3f)
	----------
	0x01: non-posted write byte (1-32 bytes)
	0x02: non-posted write dword (1-16 dwords)
	0x04: posted write byte (1-32 bytes)
	0x08: posted write dword (1-16 dwords)
	0x10: read byte (4 bytes)
	0x20: read dword (1-16 dwords)
PROBE_RESPONSES_AND_UPSTREAM_REQUESTS: (counter: all)
	Probe Responses and Upstream Requests (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: Probe miss
	0x02: Probe hit clean
	0x04: Probe hit dirty without memory cancel
	0x08: Probe hit dirty with memory cancel
	0x10: Upstream display refresh/ISOC reads
	0x20: Upstream non-display refresh reads
	0x40: Upstream ISOC writes
	0x80: Upstream non-ISOC writes
GART_EVENTS: (counter: all)
	GART Events (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: GART aperture hit on access from CPU
	0x02: GART aperture hit on access from I/O
	0x04: GART miss
	0x08: GART/DEV Request hit table walk in progress
	0x10: DEV hit
	0x20: DEV miss
	0x40: DEV error
	0x80: GART/DEV multiple table walk in progress
MEMORY_CONTROLLER_REQUESTS: (counter: all)
	Sized Read/Write activity. (min count: 500)
	Unit masks (default 0x78)
	----------
	0x01: Write requests
	0x02: Read Requests including Prefetch
	0x04: Prefetch Request
	0x08: 32 Bytes Sized Writes
	0x10: 64 Bytes Sized Writes
	0x20: 32 Bytes Sized Reads
	0x40: 64 Byte Sized Reads
	0x80: Read Requests while writes pending in DCQ
CPU_DRAM_REQUEST_TO_NODE: (counter: all)
	CPU to DRAM requests to target node (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: From local node to node 0
	0x02: From local node to node 1
	0x04: From local node to node 2
	0x08: From local node to node 3
	0x10: From local node to node 4
	0x20: From local node to node 5
	0x40: From local node to node 6
	0x80: From local node to node 7
IO_DRAM_REQUEST_TO_NODE: (counter: all)
	IO to DRAM requests to target node (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: From local node to node 0
	0x02: From local node to node 1
	0x04: From local node to node 2
	0x08: From local node to node 3
	0x10: From local node to node 4
	0x20: From local node to node 5
	0x40: From local node to node 6
	0x80: From local node to node 7
CPU_READ_COMMAND_LATENCY_NODE_0_3: (counter: all)
	Latency between the local node and remote node (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: Read block 
	0x02: Read block shared
	0x04: Read block modified
	0x08: Change to dirty
	0x10: From local node to node 0
	0x20: From local node to node 1
	0x40: From local node to node 2
	0x80: From local node to node 3
CPU_READ_COMMAND_REQUEST_NODE_0_3: (counter: all)
	Number of requests that a latency measurment is made for Event 0x1E2 (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: Read block 
	0x02: Read block shared
	0x04: Read block modified
	0x08: Change to dirty
	0x10: From local node to node 0
	0x20: From local node to node 1
	0x40: From local node to node 2
	0x80: From local node to node 3
CPU_READ_COMMAND_LATENCY_NODE_4_7: (counter: all)
	Latency between the local node and remote node (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: Read block 
	0x02: Read block shared
	0x04: Read block modified
	0x08: Change to dirty
	0x10: From local node to node 4
	0x20: From local node to node 5
	0x40: From local node to node 6
	0x80: From local node to node 7
CPU_READ_COMMAND_REQUEST_NODE_4_7: (counter: all)
	Number of requests that a latency measurment is made for Event 0x1E2 (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: Read block 
	0x02: Read block shared
	0x04: Read block modified
	0x08: Change to dirty
	0x10: From local node to node 4
	0x20: From local node to node 5
	0x40: From local node to node 6
	0x80: From local node to node 7
CPU_COMMAND_LATENCY_TARGET: (counter: all)
	Determine latency between the local node and a remote node. (min count: 500)
	Unit masks (default 0xf7)
	----------
	0x01: Read sized
	0x02: Write sized
	0x04: Victim block
	0x08: Node group select. 0=Nodes 0-3. 1=Nodes 4-7
	0x10: From local node to node 0/4
	0x20: From local node to node 1/5
	0x40: From local node to node 2/6
	0x80: From local node to node 3/7
CPU_REQUEST_TARGET: (counter: all)
	Number of requests that a latency measurement is made for Event 0x1E6 (min count: 500)
	Unit masks (default 0xf7)
	----------
	0x01: Read sized
	0x02: Write sized
	0x04: Victim block
	0x08: Node group select. 0=Nodes 0-3. 1=Nodes 4-7
	0x10: From local node to node 0/4
	0x20: From local node to node 1/5
	0x40: From local node to node 2/6
	0x80: From local node to node 3/7
HYPERTRANSPORT_LINK0_TRANSMIT_BANDWIDTH: (counter: all)
	HyperTransport(tm) link 0 transmit bandwidth (min count: 500)
	Unit masks (default 0xbf)
	----------
	0x01: Command DWORD sent
	0x02: DWORD sent
	0x04: Buffer release DWORD sent
	0x08: Nop DW sent (idle)
	0x10: Address extension DWORD sent
	0x20: Per packet CRC sent
	0x80: SubLink Mask
HYPERTRANSPORT_LINK1_TRANSMIT_BANDWIDTH: (counter: all)
	HyperTransport(tm) link 1 transmit bandwidth (min count: 500)
	Unit masks (default 0xbf)
	----------
	0x01: Command DWORD sent
	0x02: DWORD sent
	0x04: Buffer release DWORD sent
	0x08: Nop DW sent (idle)
	0x10: Address extension DWORD sent
	0x20: Per packet CRC sent
	0x80: SubLink Mask
HYPERTRANSPORT_LINK2_TRANSMIT_BANDWIDTH: (counter: all)
	HyperTransport(tm) link 2 transmit bandwidth (min count: 500)
	Unit masks (default 0xbf)
	----------
	0x01: Command DWORD sent
	0x02: DWORD sent
	0x04: Buffer release DWORD sent
	0x08: Nop DW sent (idle)
	0x10: Address extension DWORD sent
	0x20: Per packet CRC sent
	0x80: SubLink Mask
HYPERTRANSPORT_LINK3_TRANSMIT_BANDWIDTH: (counter: all)
	HyperTransport(tm) link 3 transmit bandwidth (min count: 500)
	Unit masks (default 0xbf)
	----------
	0x01: Command DWORD sent
	0x02: DWORD sent
	0x04: Buffer release DWORD sent
	0x08: Nop DW sent (idle)
	0x10: Address extension DWORD sent
	0x20: Per packet CRC sent
	0x80: SubLink Mask
READ_REQUEST_L3_CACHE: (counter: all)
	Tracks the red requests from each core to L3 cache (min count: 500)
	Unit masks (default 0xf7)
	----------
	0x01: Read Block Exclusive (Data cache read)
	0x02: Read Block Shared (Instruciton cache read)
	0x04: Read Block Modify
	0x10: Core 0 Select
	0x20: Core 1 Select
	0x40: Core 2 Select
	0x80: Core 3 Select
L3_CACHE_MISSES: (counter: all)
	Tracks the L3 cache misses from each core (min count: 500)
	Unit masks (default 0xf7)
	----------
	0x01: Read Block Exclusive (Data cache read)
	0x02: Read Block Shared (Instruciton cache read)
	0x04: Read Block Modify
	0x10: Core 0 Select
	0x20: Core 1 Select
	0x40: Core 2 Select
	0x80: Core 3 Select
L3_FILLS_CAUSED_BY_L2_EVICTIONS: (counter: all)
	Tracks the L3 fills caused by L2 evictions per core (min count: 500)
	Unit masks (default 0xff)
	----------
	0x01: Shared
	0x02: Exclusive
	0x04: Owned
	0x08: Modified
	0x10: Core 0 Select
	0x20: Core 1 Select
	0x40: Core 2 Select
	0x80: Core 3 Select
L3_EVICTIONS: (counter: all)
	Tracks the state of the L3 line when it was evicted (min count: 500)
	Unit masks (default 0xf)
	----------
	0x01: Shared
	0x02: Exclusive
	0x04: Owned
	0x08: Modified
Greg
2008-06-19, 19:14   #64
frmky (Jul 2003, So Cal)

Quote:
Originally Posted by xilman
True, if you have sufficient memory. If the machine is memory bound, one LA and N-1 sievers will spend so much time thrashing that it's better to get the LA out of the way ASAP.
Paul
This computer is a cog in a larger network. Through NFSNet and the Condor network Bruce has access to, we have lots of sieving cycles. We need to get the LA out of the way quickly to keep up. Although it would increase throughput, there is definitely insufficient memory to run 1 thread of LA on each core.

Greg
2008-06-20, 15:27   #65
jasonp (Tribal Bullet, Oct 2004)

Quote:
Originally Posted by frmky
Just checked. oprofile seems to be installed and presumably working on this computer. Here's the output of ophelp:
That's a rather huge number of performance events to monitor. I'll have to get back to you on how to use them.

Regarding the NUMA stats, it looks like there's been a lot of mixing due to previous jobs. Maybe the counters can be zeroed first...

2008-06-21, 00:39   #66
frmky (Jul 2003, So Cal)

Quote:
Originally Posted by jasonp
Maybe the counters can be zeroed first...
Googling didn't reveal a way to do that, other than rebooting the computer. Perhaps looking at the differences from yesterday? Here are the stats from today. Since yesterday, the large run on nodes 0 and 1 simply continued. I restarted the run on nodes 2 and 3 to use 8 threads instead of 7. ECM and PFGW in Wine have been running on nodes 4-7.

Code:
                           node0           node1           node2           node3           node4           node5           node6           node7
numa_hit               276846730       110012018       117206662       269579970      1138760079      1140979275      1121827928      1170021262
numa_miss               16635435       606023602       511580197       298322163        93107362        29327083        24241671       152533229
numa_foreign          1392825242        39423820        29192402        19473609        53407859        64046284        91550068        41851458
interleave_hit             12216           18867           15296           16494           22148           18429           19921           18818
local_node             276651789       109620572       116936346       269238614      1138359298      1140585367      1121470224      1169540179
other_node              16830376       606415048       511850513       298663519        93508143        29720991        24599375       153014312
So the differences between today and yesterday are

Code:
                           node0           node1           node2           node3           node4           node5           node6           node7
numa_hit                 1847954         3219836         4013250         1835410       171617099       167600790       150479830       167580906
numa_miss                      0               0               0          313151         1490308               0               0               0
numa_foreign                   0               0          368300         1435159               0               0               0               0
interleave_hit               961            1394            1499              69            1819            1186            1342            1247
local_node               1821686         3198644         3992089         1826253       171594753       167583260       150460597       167559809
other_node                 26268           21192           21161          322308         1512654           17530           19233           21097
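The day-over-day subtraction above is easy to automate; here is a minimal sketch (the field names mirror numastat's output, and the snapshots below are trimmed to the node0 and node1 columns from the two posts):

```python
# Minimal sketch: subtract two numastat snapshots to get per-node deltas.
# Each snapshot maps a counter name to a list of per-node values.

def numastat_delta(before: dict, after: dict) -> dict:
    """Per-node difference (after - before) of two numastat snapshots."""
    return {field: [b - a for a, b in zip(before[field], after[field])]
            for field in before}

# node0 and node1 columns only, from yesterday's and today's output:
before = {"numa_hit":  [274998776, 106792182],
          "numa_miss": [16635435, 606023602]}
after  = {"numa_hit":  [276846730, 110012018],
          "numa_miss": [16635435, 606023602]}

print(numastat_delta(before, after)["numa_hit"])  # [1847954, 3219836]
```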
Greg