mersenneforum.org Mlucas 20.1 on Power9 v2 18-core

2021-10-17, 20:23   #1
jas

"Simon Josefsson"
Jan 2020
Stockholm

3·11 Posts
Mlucas 20.1 on Power9 v2 18-core

Hi. I'm trying to get Mlucas 20.1 working on my Raptor Computing Systems Talos II Lite with a POWER9 18-core CPU.

Initial building fails:

Quote:
 cc: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?
I commented out line 141 of makemake.sh to drop -march=native; I'm not sure whether I should put -mcpu=native in there instead. Thoughts?
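On POWER, gcc accepts -mcpu=native rather than -march=native, so swapping the flag is the usual fix. Rather than editing makemake.sh blind, one can probe what the compiler accepts; a sketch (uses cc, as the error message does, and assumes only that a C compiler is on PATH):

```shell
# Probe which arch flag this compiler accepts by compiling an empty program.
# On POWER9 gcc, -march=native is rejected and -mcpu=native accepted;
# on x86 both typically pass (with a deprecation warning for -mcpu).
check_flag() {
    if printf 'int main(void){return 0;}\n' | \
        cc "$1" -x c - -o /dev/null 2>/dev/null
    then
        echo "$1: yes"
    else
        echo "$1: no"
    fi
}
check_flag -march=native
check_flag -mcpu=native
```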

The next problem seems harder to solve though:

Quote:
 ../src/radix20_main_carry_loop.h:476:15: error: ‘p09’ undeclared (first use in this function); did you mean ‘t09’?
Looking at the code, it looks like it is a code path that is not used on most other architectures. What is the best fix here?

Just to get something going, I changed p09 to p08, and it builds eventually and seems to work:

Quote:
 100 iterations of M3888509 with FFT length 196608 = 192 K, final residue shift count = 744463
 Res64: 71E61322CCFB396C. AvgMaxErr = 0.224637277. MaxErr = 0.250000000. Program: E20.1
 Res mod 2^35 - 1 = 29259839105
 Res mod 2^36 - 1 = 50741070790
 Clocks = 00:00:00.747

What more testing can I do here? I'm running an LL DC on it now; we'll see if it finishes correctly.

2021-10-18, 12:22   #2
Xyzzy

Aug 2002

2²·3·701 Posts

Tell us more about your computer!
2021-10-18, 13:38   #3
jas

"Simon Josefsson"
Jan 2020
Stockholm

3·11 Posts

Quote:
 Originally Posted by Xyzzy Tell us more about your computer!
Ok I'll bite :)

Here is some information:

https://wiki.raptorcs.com/wiki/Talos_II
https://wiki.raptorcs.com/wiki/POWER9

My 18-core, 4-threads/core CPU appears to be doing 11 msec/iter with -cpu 0:63:2, which is the fastest config I could find after experimenting. I expected 0:72:4 or 0:72:2 or even 0:72 to be faster, but for some reason 0:63:2 won. I also tried 0:31:4 and various other settings. It draws about 90-130 W, if I can trust the built-in power meter.
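For anyone repeating this experiment, the sweep boils down to timing a fixed-iteration self-test for each candidate thread map. A sketch, assuming Mlucas v20.x flag usage (-cpu takes lo:hi[:stride] over logical CPUs, here 0-71 on an 18-core/SMT4 chip; -fftlen is in K doubles) and illustrative candidate maps; it only prints the commands, since each real run takes minutes:

```shell
# Dry-run sweep of -cpu thread maps at a fixed FFT length; pipe the output
# to sh to actually run the benchmarks and compare msec/iter by hand.
FFT=3328    # the 3.25M FFT length discussed above, in K doubles
sweep() {
    for cpumap in 0:63:2 0:71:4 0:71:2 0:71 0:31:4; do
        echo "./Mlucas -fftlen $FFT -iters 100 -cpu $cpumap"
    done
}
sweep
```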

Maybe further compiler-flag optimizations will help, but I'm fairly happy now; my main worry is whether I can trust the results. I'm not particularly fond of randomly changing FFT source code without understanding it. I'll leave this machine doing LL DCs for a while to see if there are any heating problems.

Quote:
 [2021-10-18 14:23:05] M61906697 Iter# = 4400000 [ 7.11% complete] clocks = 00:18:15.036 [ 10.9504 msec/iter] Res64: 37D6447C18BAF283. AvgMaxErr = 0.127422929. MaxErr = 0.187500000. Residue shift count = 40207388.
 [2021-10-18 14:41:10] M61906697 Iter# = 4500000 [ 7.27% complete] clocks = 00:18:05.212 [ 10.8521 msec/iter] Res64: AFE311CC460B5923. AvgMaxErr = 0.127419839. MaxErr = 0.218750000. Residue shift count = 51519455.
 [2021-10-18 14:59:28] M61906697 Iter# = 4600000 [ 7.43% complete] clocks = 00:18:17.363 [ 10.9736 msec/iter] Res64: 4B4D2992D92D01BE. AvgMaxErr = 0.127391503. MaxErr = 0.187500000. Residue shift count = 15577047.
 [2021-10-18 15:17:50] M61906697 Iter# = 4700000 [ 7.59% complete] clocks = 00:18:22.452 [ 11.0245 msec/iter] Res64: 8CDFA67016398ACB. AvgMaxErr = 0.127430686. MaxErr = 0.187500000. Residue shift count = 37686642.

2021-10-18, 13:48   #4
Xyzzy

Aug 2002

20334₈ Posts

We considered building one a few months ago. Getting info from Raptor was very difficult. How did the build go? Was there anything weird? What is the "BIOS" like? What OS are you using? Is this a hobby computer or do you have specialized work to do with it?
2021-10-18, 13:53   #5
jas

"Simon Josefsson"
Jan 2020
Stockholm

41₈ Posts

Quote:
 Originally Posted by Xyzzy We considered building one a few months ago. Getting info from Raptor was very difficult. How did the build go? Was there anything weird? What is the "BIOS" like? What OS are you using? Is this a hobby computer or do you have specialized work to do with it?

Everything except the price has been uncomplicated. The only "big" problem I had was realizing that the Debian netinst image had its output on the serial port, not on VGA like Ubuntu and Fedora. Once I got help to realize that, it was quickly resolved. There is source code for all of the BIOS, and especially the OpenBMC is nice. It uses petitboot to load the OS. I bought it as an experiment to see if this platform is stable enough to deploy some of my production services on; it will be used for experiments, and I've offered a VM on it as a Guix build server. Having mlucas running in the background at all times is a good way to "burn in" the machine, I think...

/Simon

2021-10-18, 14:32   #6
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7×877 Posts

Quote:
 Originally Posted by jas My 18-core 4threads/core CPU appears to be doing 11msec/iter with -cpu 0:63:2 which appears to be the fastest config I could find after experimenting. I was expected 0:72:4 or 0:72:2 or even 0:72 would be faster, but for some reason 0:63:2 won. I even considered 0:31:4 and various other settings. It draws about 90-130W if I can trust the builtin power meter.
So, likely running a 3.25M FFT length. As a point of comparison, here is an i7-1165g7 (a 4-core, 2×HT CPU) running Mlucas v20.1 on Ubuntu 18.04 atop WSL1 on Windows 10 x64 Home, at -cpu 0:7:
Code:
20.1
2048  msec/iter =    7.45  ROE[avg,max] = [0.188009655, 0.250000000]  radices =  32  8 16 16 16  0  0  0  0  0
2304  msec/iter =    8.41  ROE[avg,max] = [0.187239248, 0.250000000]  radices = 144 32 16 16  0  0  0  0  0  0
2560  msec/iter =    8.34  ROE[avg,max] = [0.219140625, 0.281250000]  radices =  40 32 32 32  0  0  0  0  0  0
2816  msec/iter =    9.77  ROE[avg,max] = [0.235196521, 0.312500000]  radices = 176 16 16 32  0  0  0  0  0  0
3072  msec/iter =    9.66  ROE[avg,max] = [0.180298339, 0.250000000]  radices =  48 32 32 32  0  0  0  0  0  0
3328  msec/iter =   10.45  ROE[avg,max] = [0.188625155, 0.234375000]  radices =  52 32 32 32  0  0  0  0  0  0
3584  msec/iter =   10.97  ROE[avg,max] = [0.189514034, 0.250000000]  radices =  56 32 32 32  0  0  0  0  0  0
3840  msec/iter =   10.71  ROE[avg,max] = [0.237338918, 0.312500000]  radices = 240  8  8  8 16  0  0  0  0  0
4096  msec/iter =   10.58  ROE[avg,max] = [0.199521603, 0.265625000]  radices = 256 32 16 16  0  0  0  0  0  0
and ~17 W in the CPU package per HWMonitor. Or ~6 ms/iter on 59.8M at 3200K FFT length in prime95/Win10. The entire new laptop with 16 GiB RAM, 15.6" display & warranty was under $1k. Is this ~$2k price for the POWER9 chip typical?

Last fiddled with by kriesel on 2021-10-18 at 14:37

2021-10-18, 15:39   #7
jas

"Simon Josefsson"
Jan 2020
Stockholm

3·11 Posts

Quote:
 Originally Posted by kriesel
 So, likely running 3.25M fft length. A point of comparison, an i7-1165g7, Mlucas v20.1 on Ubuntu 18.04 atop WSL1 on Windows10 x64 home, which is a 4-core & x2HT CPU, at -cpu 0:7
 Code:
 3328  msec/iter =   10.45  ROE[avg,max] = [0.188625155, 0.234375000]  radices =  52 32 32 32  0  0  0  0  0  0
 and ~17. W in the cpu package per HWMonitor. Or ~6. ms/iter 59.8M at 3200K fftl in prime95/Win10. The entire new laptop with 16GiB ram, 15.6" display & warranty was under $1k. Is this ~$2k price for the power9 chip typical?

Yes, 3.25M, so the two systems appear comparable in msec/iter at this FFT length:

Code:
3328  msec/iter =   10.78  ROE[avg,max] = [0.204429768, 0.281250000]  radices = 208 32 16 16  0  0  0  0  0  0
However, for iter/W my system seems slower by a factor of 5, and for iter/price it is also around a factor of 5 behind, comparing total cost. But it makes up for that with the fun factor of a non-x86 platform :)

Last fiddled with by jas on 2021-10-18 at 15:39

2021-10-22, 20:32   #8
ewmayer
2ω=0

Sep 2002
República de California

5·2,339 Posts

Hi, Simon - sorry, late to the thread. Been busy with Mlucas v20.1 bugfix work, only had time for my already-subscribed threads this past week.

Yes, it's been a while since anyone built the latest release on PowerPC or POWER. You are building in non-SIMD mode, which is one of the standard prerelease build-and-test paths, but building in that mode on PowerPC triggers the preprocessor flag PFETCH=1 in prefetch.h, whereas on non-PPC PFETCH remains undef'd, explaining why the bug you hit was not seen in my own non-SIMD builds.

You seem to have guessed from the use of the undeclared var p09 that it's just used as a prefetch offset, thus changing it to p08 (or commenting it out) does not affect correctness of the computed results. The proper fix is to replace p09 by p08+p01 - I must've shortened the list of declared double-array offsets from the original p01-p19 to p01,p02,p03,p04,p08,p12,p16 sometime in the past few years, and multiples like p05 and p09 are computed by summing 2 of those.
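The one-line repair described above can be applied mechanically. A sketch, assuming GNU sed (for the \b word boundaries) and the relative path from the error message, i.e. running from the build directory; a .bak backup is kept:

```shell
# Replace the undeclared prefetch offset p09 with the equivalent sum of
# declared offsets, per the fix above. The word boundaries keep identifiers
# such as t09 untouched; the guard makes this a no-op outside the Mlucas tree.
FILE=../src/radix20_main_carry_loop.h
if [ -f "$FILE" ]; then
    sed -i.bak 's/\bp09\b/(p08+p01)/g' "$FILE"
fi
```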

When I force -DPFETCH=1 in a non-SIMD build on my x86 macbook, the radix20_main_carry_loop.h error is the only one I see. Interestingly, in the POWER section of prefetch.h, PFETCH remains unset, implying that on your platform one or more of the following (snip from platform.h) is predefined:
Code:
#elif(defined(__ppc__) || defined(__powerpc__) || defined(__PPC__) || defined(__powerc) || defined(__ppc64__))
My first thought was that I had inadvertently put the POWER-related-defs section of platform.h below the PPC one, but no, it immediately precedes the above #elif, and the comment makes clear why:
Code:
/* IBM Power: Note that I found that only __xlc__ was properly defined on the power5 I used for my tests.
We Deliberately put this #elif ahead of the PowerPC one because the Power/AIX compiler also defines some PowerPC flags: */
#elif(defined(_POWER) || defined(_ARCH_PWR))
So on your platform, neither _POWER nor _ARCH_PWR is predefined by the compiler. Would you be so kind as to send me the predefs.txt file resulting from the command 'gcc -dM -E align.h < /dev/null >& predefs.txt'? Perhaps there are one or more POWER-specific (i.e. non-PPC) flags set which I need to add to the above #elif.
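As an aside, the predefine dump works on an empty stdin too, which makes it easy to filter down to just the interesting macros; a sketch, where the grep pattern is merely a guess at the relevant macro families:

```shell
# Dump the compiler's predefined macros from an empty translation unit and
# keep only POWER/PPC-related names. On a non-POWER host this prints the
# fallback line instead of a macro list.
cc -dM -E - </dev/null | \
    grep -E '_POWER|_ARCH_PWR|__ppc|__PPC|__powerpc|__powerc' \
    || echo "no POWER/PPC predefines on this host"
```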

Re. SIMD and NUMA-ness, found some info here:
https://www.olcf.ornl.gov/wp-content...PUs_Walkup.pdf

Throughput would clearly be hugely improved by using assembly to target the SIMD instructions, but porting all the asm to a new architecture is always a major, half-year-ish effort; it needs a sizable likely user base to make it tempting.

Most important thing near-term is to properly target your CPU's multithread and NUMA topology - I plan to add support for the freeware hwloc library next year in order to automate this sort of thing, but for now we have to figure it out ourselves. Starting with the /proc/cpuinfo file and other related documentation you can find, can you answer the following questions?

1. What is the logical core numbering convention? It is likely either like Intel's (where on an N-physical-core CPU with 4 threads/core, physical core 0 gets threads 0,N,2N,3N; physical core 1 gets threads 1,N+1,2N+1,3N+1, etc) or like AMD's (where on an N-physical-core CPU with 4 threads/core, physical core 0 gets threads 0-3, etc).

2. What can you tell us about the NUMA domains on the CPU?
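On Linux, both questions can be read straight out of sysfs rather than guessed from /proc/cpuinfo. A sketch (assumes a Linux host; lscpu and numactl --hardware, where installed, give friendlier summaries of the same data):

```shell
# Q1: logical-core numbering. Each physical core's SMT siblings share one
# thread_siblings_list: "0-3" style means adjacent numbering (AMD-like),
# while "0,18,36,54" style means strided numbering (Intel-like).
for f in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list; do
    if [ -f "$f" ]; then echo "$f: $(cat "$f")"; fi
done | sort -t: -k2 -u | head -n 8    # one line per physical core

# Q2: NUMA domains. Each node directory lists the logical CPUs it owns.
for n in /sys/devices/system/node/node*/cpulist; do
    if [ -f "$n" ]; then echo "$n: $(cat "$n")"; fi
done
```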

Lastly, a good starting point for the total-throughput-maximization procedure is - with nothing else of consequence running - doing a basic set of 1-thread self-tests, './Mlucas -s m -cpu 0 >& test.log'. If you would be so kind as to attach zipped copies of /proc/cpuinfo and the test.log and mlucas.cfg files resulting from the 1-thread self-test, that would be great.

Thanks,
-Ernst


2021-10-27, 15:45   #9
jas

"Simon Josefsson"
Jan 2020
Stockholm

3·11 Posts

Quote:
 Originally Posted by jas What more testing can I do here? I'm running a LL DC on it now, we'll see if it finishes correctly.

FWIW, it finished the LL DC correctly:

https://www.mersenne.org/report_expo...1906697&full=1

/Simon

2021-10-27, 15:52   #10
jas

"Simon Josefsson"
Jan 2020
Stockholm

3·11 Posts

Quote:
 Originally Posted by ewmayer Hi, Simon - sorry, late to the thread. Been busy with Mlucas v20.1 bugfix work, only had time for my already-subscribed threads this past week.
No worries, meanwhile it finished the LL DC so I've built confidence in both the machine and mlucas.

Quote:
 Originally Posted by ewmayer Yes, it's been a while since anyone built the latest release on PowerPC or POWER. You are building in non-SIMD mode, which is one of the standard prerelease build-and-test paths, but building in that mode on PowerPC triggers the preprocessor flag PFETCH=1 in prefetch.h, whereas on non-PPC PFETCH remains undef'd, explaining why the bug you hit was not seen in my own non-SIMD builds. You seem to have guessed from the use of the undeclared var p09 that it's just used as a prefetch offset, thus changing it to p08 (or commenting it out) does not affect correctness of the computed results. The proper fix is to replace p09 by p08+p01 - I must've shortened the list of declared double-array offsets from the original p01-p19 to p01,p02,p03,p04,p08,p12,p16 sometime in the past few years, and multiples like p05 and p09 are computed by summing 2 of those.
Looking forward to seeing the small fix incorporated in a released version eventually! Is git access available anywhere?
Quote:
 Originally Posted by ewmayer Would you be so kind as to send me the predefs.txt file resulting from the command 'gcc -dM -E align.h < /dev/null >& predefs.txt'? Perhaps there is >= 1 POWER-specific (i.e. non-PPC) flags set which I need to add to the above #elif.
Here it is: https://gist.github.com/jas4711/6667...1966d19123b99b

The OS is vanilla Debian 11 Bullseye with GCC 10.2.1, so a fairly "normal" setup.

As for the other stuff, I can't really answer it right now, but I'll read your post and try to understand what to investigate and how to report it. I can set up SSH access to it if you want.

cpuinfo:
https://gist.github.com/jas4711/a999...7f10b84cff3eef

/Simon

Last fiddled with by jas on 2021-10-27 at 16:10 Reason: add cpuinfo

2021-10-27, 16:48   #11
jas

"Simon Josefsson"
Jan 2020
Stockholm

3×11 Posts

Quote:
 Originally Posted by ewmayer Lastly, a good starting point for the total-throughput-maximization procedure is - with nothing else of consequence running - doing a basic set of 1-thread self-tests, './Mlucas -s m -cpu 0 >& test.log'. If you would be so kind as to attach zipped copies of /proc/cpuinfo and the test.log and mlucas.cfg files resulting from the 1-thread self-test, that would be great.

Running it takes around 38 minutes. Below is the mlucas.cfg and I put test.log here: https://gist.github.com/jas4711/100d...68592ae56e9a12

Code:
2048  msec/iter =   75.75  ROE[avg,max] = [0.161886161, 0.187500000]  radices = 1024 32 32  0  0  0  0  0  0  0
2304  msec/iter =   91.21  ROE[avg,max] = [0.158895438, 0.187500000]  radices =  36 32 32 32  0  0  0  0  0  0
2560  msec/iter =  108.66  ROE[avg,max] = [0.188839286, 0.250000000]  radices =  40 32 32 32  0  0  0  0  0  0
2816  msec/iter =  114.73  ROE[avg,max] = [0.170926339, 0.218750000]  radices =  44 32 32 32  0  0  0  0  0  0
3072  msec/iter =  121.73  ROE[avg,max] = [0.175083705, 0.218750000]  radices =  48 32 32 32  0  0  0  0  0  0
3328  msec/iter =  137.32  ROE[avg,max] = [0.242410714, 0.312500000]  radices =  52 32 32 32  0  0  0  0  0  0
3584  msec/iter =  144.97  ROE[avg,max] = [0.229241071, 0.281250000]  radices =  56 32 32 32  0  0  0  0  0  0
3840  msec/iter =  157.38  ROE[avg,max] = [0.169998605, 0.203125000]  radices =  60 32 32 32  0  0  0  0  0  0
4096  msec/iter =  163.65  ROE[avg,max] = [0.233258929, 0.281250000]  radices = 128 32 32 16  0  0  0  0  0  0
4608  msec/iter =  183.77  ROE[avg,max] = [0.174515206, 0.218750000]  radices = 144 32 32 16  0  0  0  0  0  0
5120  msec/iter =  234.53  ROE[avg,max] = [0.234598214, 0.281250000]  radices = 160 32 32 16  0  0  0  0  0  0
5632  msec/iter =  228.97  ROE[avg,max] = [0.181417411, 0.218750000]  radices = 176 32 32 16  0  0  0  0  0  0
6144  msec/iter =  252.87  ROE[avg,max] = [0.209709821, 0.250000000]  radices = 192 32 32 16  0  0  0  0  0  0
6656  msec/iter =  272.22  ROE[avg,max] = [0.177399554, 0.187500000]  radices = 208 32 32 16  0  0  0  0  0  0
7168  msec/iter =  291.60  ROE[avg,max] = [0.181417411, 0.218750000]  radices = 224 32 32 16  0  0  0  0  0  0
7680  msec/iter =  352.16  ROE[avg,max] = [0.186188616, 0.218750000]  radices = 240 32 32 16  0  0  0  0  0  0

