mersenneforum.org (https://www.mersenneforum.org/index.php)
-   -   What determine if P-1 factoring is used? (https://www.mersenneforum.org/showthread.php?t=26849)

 drkirkby 2021-07-24 10:57

[QUOTE=axn;583855]On further experimenting, the affinity numbers might be all wrong. Can you do a lstopo-no-graphics and post the output?

EDIT:- [c]lstopo-no-graphics --no-io[/c][/QUOTE]
Sure. I've not read the man page on this, so I don't know exactly what it is reporting, but the very first line is wrong: 12 x 32 = 384, not 377.
[CODE]drkirkby@canary:~$ lstopo-no-graphics
Machine (377GB total)
  Package L#0
    NUMANode L#0 (P#0 188GB)
    L3 L#0 (36MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
    HostBridge
      PCI 00:11.5 (SATA)
      PCI 00:16.2 (IDE)
      PCI 00:17.0 (RAID)
        Block(Disk) "sdb"
        Block(Disk) "sda"
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0"
      PCIBridge
        PCI 03:00.0 (Ethernet)
          Net "enp3s0f0"
        PCI 03:00.1 (Ethernet)
          Net "enp3s0f1"
      PCI 00:1f.6 (Ethernet)
        Net "enp0s31f6"
    HostBridge
      PCI 44:05.5 (RAID)
        Block(Disk) "nvme0n1"
    HostBridge
      PCIBridge
        PCI 73:00.0 (VGA)
  Package L#1
    NUMANode L#1 (P#1 189GB)
    L3 L#1 (36MB)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
      L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
      L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
      L2 L#48 (1024KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48 + PU L#48 (P#48)
      L2 L#49 (1024KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49 + PU L#49 (P#49)
      L2 L#50 (1024KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50 + PU L#50 (P#50)
      L2 L#51 (1024KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51 + PU L#51 (P#51)
    HostBridge
      PCI d1:05.5 (RAID)
        Block(Disk) "nvme1n1"
drkirkby@canary:~$[/CODE]

 axn 2021-07-24 12:52

Is HT turned off (or is there no HT)? lstopo is only reporting 1 logical processor per core, and 52 logical processors in total.

Taking that at face value, the affinities should be:

local.txt #1, Worker #1
Affinity=0,1,2,3,4,5,6,7,8,9,10,11,12

local.txt #1, Worker #2
Affinity=13,14,15,16,17,18,19,20,21,22,23,24,25

local.txt #2, Worker #1
Affinity=26,27,28,29,30,31,32,33,34,35,36,37,38

local.txt #2, Worker #2
Affinity=39,40,41,42,43,44,45,46,47,48,49,50,51
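
If you'd rather generate these lists than type them out, the split is just "divide each package's 26 CPUs into contiguous chunks, one per worker". A minimal sketch in plain Python (nothing here is Prime95-specific, and the function name is mine):

```python
# Split `cores` consecutive logical CPUs, starting at `first_cpu`,
# into contiguous per-worker Affinity lists.  Earlier workers get
# the extra core when the split is uneven.
def affinity_lists(first_cpu, cores, workers):
    base, extra = divmod(cores, workers)
    lists, cpu = [], first_cpu
    for w in range(workers):
        n = base + (1 if w < extra else 0)
        lists.append(list(range(cpu, cpu + n)))
        cpu += n
    return lists

# Package 0 (CPUs 0-25), two workers per instance:
for worker in affinity_lists(0, 26, 2):
    print("Affinity=" + ",".join(map(str, worker)))
```

For package 1, [c]affinity_lists(26, 26, 2)[/c] reproduces the 26-38 and 39-51 lists above.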

 drkirkby 2021-07-24 14:36

[QUOTE=axn;583863]Is the HT turned off (or is there no HT)? It is only reporting 1 logical processor per core and a total of 52 logical cores.[/QUOTE]
It's probably turned off. I disabled it at one point to make it easier to sort out getting work onto the cores I wanted, since HT seemed to be making life more difficult. I thought I had turned it back on again, but I must have overlooked that. The output from

[c]numactl -H[/c] is in this thread:

[url]https://www.mersenneforum.org/showpost.php?p=583289&postcount=35[/url]
That shows CPUs numbered 0 to 103.
So do I still need to use numactl now? Of course, whilst 4 workers is optimal with one process running, it may well not be if two processes are running. I'll give that a try.

 axn 2021-07-24 16:05

[QUOTE=drkirkby;583866]So do I need to use numactl or not now? Of course, whilst 4 workers is optimal with one process running, it may well not be if two processes are running. I'll give that a try.[/QUOTE]

The HT thing was just an observation. It does materially affect the Affinity setting, though. So, should you decide to turn on HT, it will need a different set of values for Affinity.

We still need numactl to evaluate the impact of stage 2 with local memory vs non-local memory. If it turns out that using only half the total memory, but all of it local, makes stage 2 much faster, then that is the way to go.

Hopefully with the numactl and Affinity settings, we'll be able to run two instances of P95 with local memory and fully utilizing the cores. If that does work, I have no doubt that you'll get the best performance. It may or may not be significantly better than running a single instance, but that's what we want to find out.
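
For what it's worth, a sketch of what those two launches might look like, with numactl binding both CPUs and memory to one node per instance. The directory names ([c]instance1/[/c], [c]instance2/[/c]) are made up for illustration, and the script only prints the commands rather than running them:

```shell
#!/bin/sh
# Print the launch command for one mprime instance bound to NUMA
# node $1 (CPUs and memory both local to that node).  Drop the echo
# to actually launch; the trailing & backgrounds each instance.
launch() {
    echo "numactl --cpunodebind=$1 --membind=$1 ./instance$(($1 + 1))/mprime -d &"
}

launch 0
launch 1
```

With [c]--membind[/c], each instance's stage 2 allocations can only come from its own node's ~188GB, which is exactly the "half the total, but local" trade-off being discussed.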

 drkirkby 2021-07-24 17:47

(kriesel: Caution, next post indicates there was an undisclosed error affecting this post.)

I tried what you said, but performance was not that great. Then I tried running one process with 2 workers, with Affinity set as you said, and benchmarking another process. The benchmarking was tried with 24-26 cores and 2-4 workers.

[CODE]
[Worker #1 Jul 24 16:57] Timing 5760K FFT, 26 cores, 4 workers. Average times: 8.29, 7.12, 6.24, 5.37 ms. Total throughput: 607.58 iter/sec.[/CODE]Since 4 does not divide 26, the workers must clearly be running unequal numbers of cores.

The 607.58 iter/sec is [U]almost double[/U] the throughput one obtains running 4 workers on each of two processes, where the processes are [B]not constrained in any way[/B]. Here are the results from running two benchmarks, where nothing is constrained.

[CODE][Worker #1 Jul 24 18:14] Benchmarking multiple workers to measure the impact of memory bandwidth
[Worker #1 Jul 24 18:15] Timing 5760K FFT, 26 cores, 4 workers. Average times: 13.27, 11.03, 13.31, 11.08 ms. Total throughput: 331.31 iter/sec.
[/CODE]and
[CODE][Worker #1 Jul 24 18:15] Timing 5760K FFT, 26 cores, 4 workers. Average times: 12.80, 11.30, 12.78, 11.42 ms. Total throughput: 332.36 iter/sec.[/CODE]Total throughput is a dismal 331.31+332.36=663.67 iter/sec.

One does better running a single process:
[CODE][Worker #1 Jul 24 18:22] Benchmarking multiple workers to measure the impact of memory bandwidth
[Worker #1 Jul 24 18:22] Timing 5760K FFT, 52 cores, 4 workers. Average times: 3.85, 3.84, 3.86, 3.86 ms. Total throughput: 1038.20 iter/sec.
[/CODE]I suppose the next thing to try is to run two processes, each with 4 workers. I guess 2x6+2x7=26 cores per process would be a reasonable split. It would be nice to think I could get a total throughput of 2*607.58 = 1215.16 iter/sec, but somehow I doubt that will happen.
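
Incidentally, the reported total throughput is just the sum over workers of 1000 / (average ms per iteration), so these figures can be cross-checked. A quick sketch (the timings are copied from the benchmark lines above; the helper function is mine):

```python
# Total throughput in iter/sec from per-worker average iteration
# times in milliseconds: each worker contributes 1000/t iter/sec.
def total_throughput(avg_ms):
    return sum(1000.0 / t for t in avg_ms)

print(round(total_throughput([8.29, 7.12, 6.24, 5.37]), 2))  # close to the reported 607.58
print(round(total_throughput([3.85, 3.84, 3.86, 3.86]), 2))  # close to the reported 1038.20
```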

 drkirkby 2021-07-24 20:49

Ignore my last post (at 2021-07-24 17:47); I found an error.

 axn 2021-07-25 16:36

[QUOTE=drkirkby;583892]Ignore my last post (at 2021-07-24 17:47.) I found an error.[/QUOTE]

Did you correct the errors from your last trial?

 drkirkby 2021-07-25 16:50

[QUOTE=axn;583960]Did you correct the errors from your last trial?[/QUOTE]Not yet. I want to finish some benchmarking I've been doing on one exponent, for various values of tests_saved and RAM, on P-1 factoring. Then I will look into optimising things once again.

All times are UTC.