mersenneforum.org  

Old 2021-07-24, 10:57   #34
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

3·149 Posts

Quote:
Originally Posted by axn View Post
On further experimenting, the affinity numbers might be all wrong. Can you do a lstopo-no-graphics and post the output?

EDIT:- lstopo-no-graphics --no-io
Sure. I haven't read the man page for this, so I don't know exactly what it is doing, but the very first line is wrong: 12 x 32 GB = 384 GB, not 377 GB.
Code:
drkirkby@canary:~$ lstopo-no-graphics
Machine (377GB total)
  Package L#0
    NUMANode L#0 (P#0 188GB)
    L3 L#0 (36MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
    HostBridge
      PCI 00:11.5 (SATA)
      PCI 00:16.2 (IDE)
      PCI 00:17.0 (RAID)
        Block(Disk) "sdb"
        Block(Disk) "sda"
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0"
      PCIBridge
        PCI 03:00.0 (Ethernet)
          Net "enp3s0f0"
        PCI 03:00.1 (Ethernet)
          Net "enp3s0f1"
      PCI 00:1f.6 (Ethernet)
        Net "enp0s31f6"
    HostBridge
      PCI 44:05.5 (RAID)
        Block(Disk) "nvme0n1"
    HostBridge
      PCIBridge
        PCI 73:00.0 (VGA)
  Package L#1
    NUMANode L#1 (P#1 189GB)
    L3 L#1 (36MB)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
      L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
      L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
      L2 L#48 (1024KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48 + PU L#48 (P#48)
      L2 L#49 (1024KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49 + PU L#49 (P#49)
      L2 L#50 (1024KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50 + PU L#50 (P#50)
      L2 L#51 (1024KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51 + PU L#51 (P#51)
    HostBridge
      PCI d1:05.5 (RAID)
        Block(Disk) "nvme1n1"
drkirkby@canary:~$
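
A minimal sketch of how output in this format could be summarised per package, assuming the plain-text layout shown above (one `Package` line, then a `NUMANode` line, then lines containing `PU L#n (P#n)`). The `SAMPLE` text is an abbreviated copy of that output and the function name `summarize` is made up for illustration. It also shows that the 377GB on the first line is simply the sum of the two NUMA node sizes that lstopo reports (188GB + 189GB).

```python
import re
from collections import defaultdict

# Abbreviated sample in the same format as the lstopo-no-graphics output above.
SAMPLE = """\
Machine (377GB total)
  Package L#0
    NUMANode L#0 (P#0 188GB)
    L3 L#0 (36MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
  Package L#1
    NUMANode L#1 (P#1 189GB)
    L3 L#1 (36MB)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
"""

def summarize(text):
    """Return ({package: [physical PU numbers]}, {package: NUMA node size in GB})."""
    pus = defaultdict(list)
    mem_gb = {}
    package = None
    for line in text.splitlines():
        m = re.match(r"\s*Package L#(\d+)", line)
        if m:
            package = int(m.group(1))
            continue
        if package is None:
            continue  # skip the "Machine" line and anything before the first package
        m = re.search(r"NUMANode L#\d+ \(P#\d+ (\d+)GB\)", line)
        if m:
            mem_gb[package] = int(m.group(1))
            continue
        # P# is the OS (physical) index of the processing unit
        for pu in re.findall(r"PU L#\d+ \(P#(\d+)\)", line):
            pus[package].append(int(pu))
    return dict(pus), mem_gb

pus, mem_gb = summarize(SAMPLE)
print(pus, mem_gb, sum(mem_gb.values()))
```

Run against the full output, this would list PUs 0 to 25 under package 0 and 26 to 51 under package 1.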

Last fiddled with by drkirkby on 2021-07-24 at 11:02
Old 2021-07-24, 12:52   #35
axn
 
Jun 2003

2²×3²×11×13 Posts

Is HT turned off (or does the CPU have no HT)? It is reporting only 1 logical processor per core, for a total of 52 logical processors.

Taking that at face value, the affinities should be:


local.txt #1, Worker #1
Affinity=0,1,2,3,4,5,6,7,8,9,10,11,12

local.txt #1, Worker #2
Affinity=13,14,15,16,17,18,19,20,21,22,23,24,25


local.txt #2, Worker #1
Affinity=26,27,28,29,30,31,32,33,34,35,36,37,38

local.txt #2, Worker #2
Affinity=39,40,41,42,43,44,45,46,47,48,49,50,51
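
A minimal sketch of how lists like these could be generated, assuming each package's logical processors are numbered contiguously (0 to 25 and 26 to 51, as in the lstopo output) and each instance runs two workers. The helper names `split_evenly` and `affinity_lines` are made up for illustration, and the `Affinity=` line format is copied from the lines above rather than from the Prime95 documentation.

```python
def split_evenly(cpus, workers):
    """Split a list of CPU numbers into `workers` contiguous groups of near-equal size."""
    base, extra = divmod(len(cpus), workers)
    groups, start = [], 0
    for i in range(workers):
        # The first `extra` groups take one additional CPU when the split is uneven.
        size = base + (1 if i < extra else 0)
        groups.append(cpus[start:start + size])
        start += size
    return groups

def affinity_lines(cpus, workers):
    """Format one 'Affinity=' line per worker, in the style used in this post."""
    return ["Affinity=" + ",".join(str(c) for c in group)
            for group in split_evenly(cpus, workers)]

# Instance 1 on package 0, instance 2 on package 1, two workers each.
for first, last in ((0, 25), (26, 51)):
    for line in affinity_lines(list(range(first, last + 1)), 2):
        print(line)
```

With HT enabled, the processor numbering changes, so the input lists would have to be rebuilt from a fresh lstopo run.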

Last fiddled with by axn on 2021-07-24 at 13:08 Reason: Affinity syntax is confusing.
Old 2021-07-24, 14:36   #36
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

3×149 Posts

Quote:
Originally Posted by axn View Post
Is the HT turned off (or is there no HT)? It is only reporting 1 logical processor per core and a total of 52 logical cores.
It's probably turned off. I turned it off to make it easier to sort out the problems of getting this running on the cores I wanted, since I thought HT was just making life more difficult. I thought I had turned it back on again, but I must have overlooked that. The output from

Code:
numactl -H
is in this thread:

https://www.mersenneforum.org/showpo...9&postcount=35
That shows CPUs numbered 0 to 103.
So do I need to use numactl or not now? Of course, whilst 4 workers is optimal with one process running, it may well not be if two processes are running. I'll give that a try.
Old 2021-07-24, 16:05   #37
axn
 
Jun 2003

141C₁₆ Posts

Quote:
Originally Posted by drkirkby View Post
So do I need to use numactl or not now? Of course, whilst 4 workers is optimal with one process running, it may well not be if two processes are running. I'll give that a try.
The HT thing was just an observation. It does materially affect the Affinity setting, though. So, should you decide to turn on HT, it will need a different set of values for Affinity.

We still need numactl to evaluate the impact of running stage 2 with local memory vs non-local memory. If it turns out that using only half the total memory, but all of it local, makes stage 2 much faster, then that is the way to go.

Hopefully with the numactl and Affinity settings, we'll be able to run two instances of P95 with local memory and fully utilizing the cores. If that does work, I have no doubt that you'll get the best performance. It may or may not be significantly better than running a single instance, but that's what we want to find out.
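
A minimal sketch of the kind of launch commands this implies, assuming one instance per NUMA node. `--cpunodebind` and `--membind` are standard numactl options; the `./mprime` path is a placeholder, and how each instance is pointed at its own directory (with its own local.txt) is deliberately left out.

```python
import shlex

def numactl_command(node, program, *args):
    """Build a command that keeps both execution and memory on one NUMA node."""
    # --cpunodebind: run only on the CPUs of this node
    # --membind: allocate memory only from this node
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", program, *args]

for node in (0, 1):
    # "./mprime" is a hypothetical path; substitute the real binary location.
    print(shlex.join(numactl_command(node, "./mprime")))
```

The Affinity settings would still be needed inside each instance to place the workers on specific cores within the node.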
Old 2021-07-24, 17:47   #38
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

3×149 Posts

(kriesel: Caution, next post indicates there was an undisclosed error affecting this post.)


I tried what you said, but performance was not that great. Then I tried running one process with 2 workers, with Affinity set as you said, while benchmarking in another process. The benchmark was run with 24-26 cores and 2-4 workers.

Code:
[Worker #1 Jul 24 16:57] Timing 5760K FFT, 26 cores, 4 workers.  Average times:  8.29,  7.12,  6.24,  5.37 ms.  Total throughput: 607.58 iter/sec.
Since 4 does not divide 26, clearly there must be an unequal number of cores running on each worker.

The 607.58 iter/sec is almost double the throughput one obtains running 4 workers on each of two processes, where the processes are not constrained in any way. Here are the results from running two benchmarks, where nothing is constrained.

Code:
[Worker #1 Jul 24 18:14] Benchmarking multiple workers to measure the impact of memory bandwidth
[Worker #1 Jul 24 18:15] Timing 5760K FFT, 26 cores, 4 workers.  Average times: 13.27, 11.03, 13.31, 11.08 ms.  Total throughput: 331.31 iter/sec.
and
Code:
[Worker #1 Jul 24 18:15] Timing 5760K FFT, 26 cores, 4 workers.  Average times: 12.80, 11.30, 12.78, 11.42 ms.  Total throughput: 332.36 iter/sec.
Total throughput is a dismal 331.31+332.36=663.67 iter/sec.
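
A minimal sketch of how the throughput figures relate to the per-worker times, assuming "Total throughput" is the sum over workers of 1000 / (average ms per iteration). That assumption reproduces the printed figures in this post to within about 0.1 iter/sec, the difference presumably coming from the times being rounded to two decimals. The function name `throughput` is made up for illustration.

```python
def throughput(avg_ms):
    """Sum of per-worker iteration rates in iter/sec, given average ms per iteration."""
    return sum(1000.0 / t for t in avg_ms)

# Average times copied from the benchmark lines in this post.
constrained = throughput([8.29, 7.12, 6.24, 5.37])      # printed as 607.58
first_free = throughput([13.27, 11.03, 13.31, 11.08])   # printed as 331.31
second_free = throughput([12.80, 11.30, 12.78, 11.42])  # printed as 332.36

print(f"{constrained:.2f} {first_free:.2f} {second_free:.2f}")
print(f"two unconstrained processes combined: {first_free + second_free:.2f}")
```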

One does better running one process
Code:
[Worker #1 Jul 24 18:22] Benchmarking multiple workers to measure the impact of memory bandwidth
[Worker #1 Jul 24 18:22] Timing 5760K FFT, 52 cores, 4 workers.  Average times:  3.85,  3.84,  3.86,  3.86 ms.  Total throughput: 1038.20 iter/sec.
I suppose the next thing to try is to run two processes, each with 4 workers. I guess 2x6+2x7=26 would be a reasonable split of cores. It would be nice to think I could get a total throughput of 2*607.58 = 1215.16 iter/sec, but somehow I doubt that will happen.
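
A minimal sketch of the arithmetic behind that split, assuming the 26 cores of one package are shared as evenly as possible among 4 workers. The function name `worker_sizes` is made up for illustration, and how the program itself distributes cores in an uneven case is not something this sketch claims to reproduce.

```python
def worker_sizes(cores, workers):
    """Number of cores per worker when `cores` are shared as evenly as possible."""
    base, extra = divmod(cores, workers)
    # `extra` workers receive one additional core each.
    return [base] * (workers - extra) + [base + 1] * extra

sizes = worker_sizes(26, 4)
print(sizes, sum(sizes))

# Hoped-for total if each of two processes matched the 607.58 iter/sec figure.
print(round(2 * 607.58, 2))
```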

Last fiddled with by kriesel on 2021-07-25 at 00:05 Reason: error indicated in next post
Old 2021-07-24, 20:49   #39
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

3×149 Posts

Ignore my last post (at 2021-07-24 17:47). I found an error.
Old 2021-07-25, 16:36   #40
axn
 
Jun 2003

2²·3²·11·13 Posts

Quote:
Originally Posted by drkirkby View Post
Ignore my last post (at 2021-07-24 17:47). I found an error.
Did you correct the errors from your last trial?
Old 2021-07-25, 16:50   #41
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

447₁₀ Posts

Quote:
Originally Posted by axn View Post
Did you correct the errors from your last trial?
Not yet. I want to finish some P-1 factoring benchmarks I've been running on one exponent for various values of tests_saved and RAM. Then I will look into optimising things once again.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.