mersenneforum.org What determine if P-1 factoring is used?

2021-07-24, 10:57   #34
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

2^6×7 Posts

Quote:
 Originally Posted by axn On further experimenting, the affinity numbers might be all wrong. Can you do a lstopo-no-graphics and post the output? EDIT:- lstopo-no-graphics --no-io
Sure. I've not read the man page on this, so don't know what it is doing, but the very first line is wrong. 12 x 32 != 377
Code:
drkirkby@canary:~$ lstopo-no-graphics
Machine (377GB total)
  Package L#0
    NUMANode L#0 (P#0 188GB)
    L3 L#0 (36MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
    HostBridge
      PCI 00:11.5 (SATA)
      PCI 00:16.2 (IDE)
      PCI 00:17.0 (RAID)
        Block(Disk) "sdb"
        Block(Disk) "sda"
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0"
      PCIBridge
        PCI 03:00.0 (Ethernet)
          Net "enp3s0f0"
        PCI 03:00.1 (Ethernet)
          Net "enp3s0f1"
      PCI 00:1f.6 (Ethernet)
        Net "enp0s31f6"
    HostBridge
      PCI 44:05.5 (RAID)
        Block(Disk) "nvme0n1"
    HostBridge
      PCIBridge
        PCI 73:00.0 (VGA)
  Package L#1
    NUMANode L#1 (P#1 189GB)
    L3 L#1 (36MB)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
      L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
      L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
      L2 L#48 (1024KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48 + PU L#48 (P#48)
      L2 L#49 (1024KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49 + PU L#49 (P#49)
      L2 L#50 (1024KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50 + PU L#50 (P#50)
      L2 L#51 (1024KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51 + PU L#51 (P#51)
    HostBridge
      PCI d1:05.5 (RAID)
        Block(Disk) "nvme1n1"
drkirkby@canary:~$

Last fiddled with by drkirkby on 2021-07-24 at 11:02
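
Note that the 377GB on the first line matches the sum of the two NUMANode figures (188GB + 189GB), not the nominal 12 x 32 = 384. A minimal Python sketch, assuming the lstopo text format shown above, that extracts those figures and counts the PUs under each Package. The names `summarize` and `SAMPLE` are made up for illustration, and `SAMPLE` is a shortened excerpt (two cores per package), not the full output.

```python
import re

# Shortened excerpt of the lstopo output above: only two cores per package.
SAMPLE = """\
Machine (377GB total)
  Package L#0
    NUMANode L#0 (P#0 188GB)
    L3 L#0 (36MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
  Package L#1
    NUMANode L#1 (P#1 189GB)
    L3 L#1 (36MB)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
"""


def summarize(lstopo_text):
    """Return (machine_gb, node_gbs, pus_per_package) parsed from lstopo text."""
    machine = re.search(r"Machine \((\d+)GB total\)", lstopo_text)
    machine_gb = int(machine.group(1)) if machine else None
    # Memory attached to each NUMA node, in the order the nodes appear.
    node_gbs = [int(gb) for gb in
                re.findall(r"NUMANode L#\d+ \(P#\d+ (\d+)GB\)", lstopo_text)]
    # Text following each "Package L#n" marker belongs to that package.
    chunks = re.split(r"Package L#\d+", lstopo_text)[1:]
    pus_per_package = [len(re.findall(r"PU L#\d+", chunk)) for chunk in chunks]
    return machine_gb, node_gbs, pus_per_package


machine_gb, node_gbs, pus = summarize(SAMPLE)
print("machine total:", machine_gb, "GB; per node:", node_gbs)
print("sum of nodes equals machine total:", sum(node_gbs) == machine_gb)
print("PUs per package:", pus)
```

Run against the full output, the PU count per package would show whether hyperthreading is visible to the OS (one PU per core means it is not).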

2021-07-24, 12:52   #35
axn

Jun 2003

5,197 Posts

Is the HT turned off (or is there no HT)? It is only reporting 1 logical processor per core and a total of 52 logical cores. Taking that at face value, the affinities should be:

local.txt #1, Worker #1 Affinity=0,1,2,3,4,5,6,7,8,9,10,11,12
local.txt #1, Worker #2 Affinity=13,14,15,16,17,18,19,20,21,22,23,24,25
local.txt #2, Worker #1 Affinity=26,27,28,29,30,31,32,33,34,35,36,37,38
local.txt #2, Worker #2 Affinity=39,40,41,42,43,44,45,46,47,48,49,50,51

Last fiddled with by axn on 2021-07-24 at 13:08 Reason: Affinity syntax is confusing.

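
The affinity lists above are just the 52 logical CPUs cut into contiguous blocks, one block per worker, with each process kept on its own package. A rough Python sketch of that split follows. The function name `affinity_lists` is hypothetical, and the printed lines only mimic the notation used in this post; the exact local.txt syntax should be checked against the Prime95/mprime documentation.

```python
def affinity_lists(total_cpus, processes, workers_per_process):
    """Split CPUs 0..total_cpus-1 into contiguous blocks, one per worker.

    Each process gets an equal share of the CPUs (so, with one process per
    package, its workers stay on that package); within a process the CPUs
    are divided as evenly as possible, larger blocks first.
    """
    if total_cpus % processes != 0:
        raise ValueError("CPUs must divide evenly between processes")
    per_process = total_cpus // processes
    base, extra = divmod(per_process, workers_per_process)
    result = []
    start = 0
    for _ in range(processes):
        workers = []
        for w in range(workers_per_process):
            size = base + (1 if w < extra else 0)
            workers.append(list(range(start, start + size)))
            start += size
        result.append(workers)
    return result


# 52 CPUs, 2 processes, 2 workers each: the case discussed in this post.
for p, workers in enumerate(affinity_lists(52, 2, 2), start=1):
    for w, cpus in enumerate(workers, start=1):
        print(f"local.txt #{p}, Worker #{w} Affinity={','.join(map(str, cpus))}")
```

With hyperthreading turned on the CPU numbering changes (numactl reported CPUs 0 to 103 on this machine), so these lists would have to be regenerated from the new topology rather than reused.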
2021-07-24, 14:36   #36
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

111000000_2 Posts

Quote:
 Originally Posted by axn Is the HT turned off (or is there no HT)? It is only reporting 1 logical processor per core and a total of 52 logical cores.
It's probably turned off. I did turn it off in an attempt to make it easier to sort out the problems of getting this on cores I wanted. I thought the HT was just making life more difficult. I thought I had turned it back on again, but I must have overlooked that. The output from

Code:
numactl -H

is in this post: https://www.mersenneforum.org/showpo...9&postcount=35
That shows CPUs numbered 0 to 103.
So do I need to use numactl or not now? Of course, whilst 4 workers is optimal with one process running, it may well not be if two processes are running. I'll give that a try.

2021-07-24, 16:05   #37
axn

Jun 2003

5,197 Posts

Quote:
 Originally Posted by drkirkby So do I need to use numactl or not now? Of course, whilst 4 workers is optimal with one process running, it may well not be if two processes are running. I'll give that a try.
The HT thing was just an observation. It does materially affect the Affinity setting, though. So, should you decide to turn on HT, it will need a different set of values for Affinity.

We still need numactl to evaluate the impact of running stage 2 with local memory vs non-local memory. If it turns out that using only half the total memory, but all of it local, makes stage 2 much faster, then that is the way to go.

Hopefully with the numactl and Affinity settings, we'll be able to run two instances of P95 with local memory and fully utilizing the cores. If that does work, I have no doubt that you'll get the best performance. It may or may not be significantly better than running a single instance, but that's what we want to find out.
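
As a sketch of what "two instances with local memory" could look like, the Python below builds one numactl command line per NUMA node. `--cpunodebind` and `--membind` are numactl options; the program path `./mprime` is a placeholder, and how each instance is pointed at its own working directory and local.txt is deliberately left out because it is not covered in this thread.

```python
import shlex


def numactl_command(node, program_args):
    """Build an argv list that restricts a program's CPUs and memory
    allocation to a single NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}",
            *program_args]


# Placeholder program; substitute the real binary and its own arguments.
PROGRAM = ["./mprime"]

for node in (0, 1):
    print(shlex.join(numactl_command(node, PROGRAM)))
```

The argv lists could be handed to `subprocess.Popen` to start both instances from one script; here they are only printed so the commands can be inspected first.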

2021-07-24, 17:47   #38
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

2^6×7 Posts

(kriesel: Caution, next post indicates there was an undisclosed error affecting this post.)

I tried what you said, but performance was not that great. Then I tried running one process with 2 workers, with Affinity like you said, and benchmarking another process. The benchmarking was tried with 24-26 cores and 2-4 workers.
Code:
[Worker #1 Jul 24 16:57] Timing 5760K FFT, 26 cores, 4 workers. Average times: 8.29, 7.12, 6.24, 5.37 ms. Total throughput: 607.58 iter/sec.
Since 4 does not divide 26, clearly there must be an unequal number of cores running on each worker. The 607.58 iter/sec is almost double the throughput one obtains running 4 workers on each of two processes, where the processes are not constrained in any way. Here are the results from running two benchmarks, where nothing is constrained.
Code:
[Worker #1 Jul 24 18:14] Benchmarking multiple workers to measure the impact of memory bandwidth
[Worker #1 Jul 24 18:15] Timing 5760K FFT, 26 cores, 4 workers. Average times: 13.27, 11.03, 13.31, 11.08 ms. Total throughput: 331.31 iter/sec.
and
Code:
[Worker #1 Jul 24 18:15] Timing 5760K FFT, 26 cores, 4 workers. Average times: 12.80, 11.30, 12.78, 11.42 ms. Total throughput: 332.36 iter/sec.
Total throughput is a dismal 331.31+332.36=663.67 iter/sec. One does better running one process
Code:
[Worker #1 Jul 24 18:22] Benchmarking multiple workers to measure the impact of memory bandwidth
[Worker #1 Jul 24 18:22] Timing 5760K FFT, 52 cores, 4 workers. Average times: 3.85, 3.84, 3.86, 3.86 ms. Total throughput: 1038.20 iter/sec.
I suppose the next thing to try is to run two processes, each with 4 workers. I guess 2x6+2x7=26 would be a reasonable split.

It would be nice to think I could get a total throughput of 2*607.58 = 1215.16 iter/sec, but somehow I doubt that will happen.

Last fiddled with by kriesel on 2021-07-25 at 00:05 Reason: error indicated in next post

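
The "Total throughput" figures in the benchmark lines above appear to be the sum over workers of 1000 / (average ms per iteration). That is an inference from the numbers, not something taken from the program's documentation, and the small differences are presumably rounding of the printed averages. A short Python check, using only the values quoted in this post (the function name `total_throughput` is made up for illustration):

```python
def total_throughput(avg_ms_per_worker):
    """Sum of per-worker iteration rates, in iterations per second."""
    return sum(1000.0 / ms for ms in avg_ms_per_worker)


# (average times in ms, total throughput as logged) from the post above.
RUNS = [
    ([8.29, 7.12, 6.24, 5.37], 607.58),      # 26 cores, 4 workers
    ([13.27, 11.03, 13.31, 11.08], 331.31),  # unconstrained process A
    ([12.80, 11.30, 12.78, 11.42], 332.36),  # unconstrained process B
    ([3.85, 3.84, 3.86, 3.86], 1038.20),     # single process, 52 cores
]

for times, logged in RUNS:
    computed = total_throughput(times)
    print(f"logged {logged:8.2f}  computed {computed:8.2f}  "
          f"difference {computed - logged:+.2f}")
```

The same function makes it easy to compare configurations on equal terms: add the totals for two concurrent processes and set them against the single-process figure.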
2021-07-24, 20:49   #39
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

2^6·7 Posts

Ignore my last post (at 2021-07-24 17:47.) I found an error.

2021-07-25, 16:36   #40
axn

Jun 2003

5,197 Posts

Quote:
 Originally Posted by drkirkby Ignore my last post (at 2021-07-24 17:47.) I found an error.
Did you correct the errors from your last trial?

2021-07-25, 16:50   #41
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

2^6×7 Posts

Quote:
 Originally Posted by axn Did you correct the errors from your last trial?
Not yet. I want to finish some benchmarking I've been doing on one exponent for various values of tests_saved and RAM on P-1 factoring. Then I will look into once again optimising things.
