mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Mlucas (https://www.mersenneforum.org/forumdisplay.php?f=118)
-   -   Mlucas v18 available (https://www.mersenneforum.org/showthread.php?t=24100)

ewmayer 2019-02-21 00:01

Mlucas v18 available
 
[url=http://www.mersenneforum.org/mayer/README.html]Mlucas v18 has gone live[/url]. Use this thread to report bugs, build issues, and for any other related discussion.

moebius 2019-02-21 08:09

I always wanted to try it out, but unfortunately I cannot compile it multithreaded, because I still use Windows 7 Professional. It would be great if someone would upload an .exe file for the AMD K10 architecture.

M344587487 2019-02-21 08:14

Must be my birthday :)

M344587487 2019-02-21 12:03

It compiles and doesn't segfault on the Samsung S7. Well done on fixing the ARM issues; this is great.

moebius 2019-02-21 17:45

[QUOTE=M344587487;509036]It compiles and doesn't segfault on the Samsung S7. Well done on fixing the ARM issues; this is great.[/QUOTE]


Your build could run on my Samsung Galaxy A3 8-Core Smartphone (S7 Architecture), but I'm pessimistic about 88M exponents.....:smile:

ewmayer 2019-02-21 19:29

[QUOTE=M344587487;509036]It compiles and doesn't segfault on the Samsung S7. Well done on fixing the ARM issues; this is great.[/QUOTE]

Thanks for the build - were you previously forced to use the v17.1 precompiled binary on that platform due to your-own-build-crashed issues?

M344587487 2019-02-21 20:24

1 Attachment(s)
[QUOTE=moebius;509057]Your build could run on my Samsung Galaxy A3 8-Core Smartphone (S7 Architecture), but I'm pessimistic about 88M exponents.....:smile:[/QUOTE]



Sounds like your phone has a Snapdragon 415, which is a 28nm 4xA53 + 4xA53 part. It should work, but unfortunately doesn't come close in efficiency to an S7's 14nm 4xM1 + 4xA53. It should handily beat a Raspberry Pi 3's 40nm chip in efficiency and throughput, and slot somewhere behind the 20nm 10-core Helio X25 ( [URL]https://www.mersenneforum.org/showpost.php?p=508368&postcount=83[/URL] ). Attached is the v18 ARM asimd binary from the S7, on the off chance you find it useful. AFAIK you need a rooted phone to run it, and if you have a rooted phone you could easily build Mlucas from source yourself, but there it is.

[QUOTE=ewmayer;509067]Thanks for the build - were you previously forced to use the v17.1 precompiled binary on that platform due to your-own-build-crashed issues?[/QUOTE]
Yes, luckily your c2 binary worked flawlessly on everything ARMv8. Did you do anything special with that build to ensure compatibility, or was it a normal -DUSE_ARM_V8_SIMD -DUSE_THREADS build?


I'll try to create an APK tomorrow; there's a chance it works where the v17.1 build failed, as there were clobber-related error messages like this:
[code]/home/u18/AndroidStudioProjects/MlucasAPK/app/src/main/cpp/mi64.c:813:19: error: unknown register name 'rax' in asm
: "cc","memory","rax","rbx","rcx","rsi","r10","r11" /* Clobbered registers */\[/code]It's just as likely that I was accidentally trying to compile x86 code; it was at that point that it got thrown at a virtual wall, so I never investigated.

ewmayer 2019-02-21 22:00

[QUOTE=M344587487;509070]Yes, luckily your c2 binary worked flawlessly on everything Armv8. Did you do anything special with that build to ensure compatibility or was it a normal -DUSE_ARM_V8_SIMD -DUSE_THREADS?[/quote]
Actually, it was another user who built that posted binary under his Gentoo distro, but yes, just the usual flags. Similar to my Odroid C2 builds, in that the compiler just happened to not use any of the left-off-clobber-list registers or use 64-bit addresses for those erroneous 32-bit loads in the asm. In fact those remaining bugs survived *because* they simply happened to not trigger any errors in my build - all the similar bugs-along-the-way of my ARMv8 development work which did cause runtime errors in my C2 builds obviously were tracked down and fixed prior to the v17.1 release, the first one with ARMv8 assembly support.

[quote]I'll try and create an APK tomorrow, there's a chance it works where the v17.1 failed as there were clobber-related error messages like this:
[code]/home/u18/AndroidStudioProjects/MlucasAPK/app/src/main/cpp/mi64.c:813:19: error: unknown register name 'rax' in asm
: "cc","memory","rax","rbx","rcx","rsi","r10","r11" /* Clobbered registers */\[/code]It's just as likely that I was accidentally trying to compile x86 code, it was at that point it got thrown at a virtual wall so never investigated.[/QUOTE]
Yes, that particular error looks like a piece of x86_64 asm trying to get built. But I will be interested to hear the results of further build attempts on a wider variety of ARM hardware/OS combinations. Thanks for posting a binary for others to use, but I do hope they will also first try a build from source, because that's what I need in order to shake out remaining bugs and portability issues.

moebius 2019-02-22 07:16

[QUOTE=M344587487;509070]Sounds like your phone has a Snapdragon 415 which is a 28nm 4xA53 4xA53.[/QUOTE]No, it is a newer one (Samsung Galaxy A3 (2017): octa-core, 1600 MHz, ARM Cortex-A53, 64-bit, 14 nm). Thanks, I'll have to look for a root tool for it.

ATH 2019-02-22 12:03

Compiled it on the usual c5d.9xlarge with 18 cores and 36 threads:
[CODE]gcc -c -O3 -march=skylake-avx512 -DUSE_AVX512 -DUSE_THREADS ../src/*.c >& build.log
grep -i error build.log
# (assuming the above grep comes up empty)
gcc -o Mlucas *.o -lm -lpthread -lrt[/CODE]


-DCARRY_16_WAY is not needed in v18, right?

This time using all 18 cores was fastest for some reason.

[CODE]18.0
./Mlucas -fftlen 4608 -iters 10000 -nthread 36
4608 msec/iter = 3.24 ROE[avg,max] = [0.246743758, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -nthread 34
4608 msec/iter = 3.18 ROE[avg,max] = [0.246743758, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -nthread 32
4608 msec/iter = 3.15 ROE[avg,max] = [0.246743758, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -nthread 30
4608 msec/iter = 3.07 ROE[avg,max] = [0.246740330, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -nthread 28
4608 msec/iter = 3.03 ROE[avg,max] = [0.246740330, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -nthread 26
4608 msec/iter = 3.08 ROE[avg,max] = [0.246740330, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -cpu 0:17
4608 msec/iter = 2.96 ROE[avg,max] = [0.246740330, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -cpu 0:16
4608 msec/iter = 3.12 ROE[avg,max] = [0.246740330, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -cpu 0:15
4608 msec/iter = 3.09 ROE[avg,max] = [0.246740330, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -cpu 0:14
4608 msec/iter = 4.05 ROE[avg,max] = [0.246727988, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -cpu 0:13
4608 msec/iter = 4.18 ROE[avg,max] = [0.246727988, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -cpu 18:35
4608 msec/iter = 3.00 ROE[avg,max] = [0.246740330, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107

./Mlucas -fftlen 4608 -iters 10000 -cpu 0:34:2
4608 msec/iter = 4.27 ROE[avg,max] = [0.246740330, 0.312500000] radices = 144 16 32 32 0 0 0 0 0 0 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107
[/CODE]


From the README.html should this be [I][B]-cpu 0:n-1[/B][/I] ?

[QUOTE]Hyperthreaded x86 CPUs: If Intel, use -cpu 0:n, where n is the number of physical cores on your system[/QUOTE]

ewmayer 2019-02-22 23:44

[QUOTE=ATH;509123]-DCARRY_16_WAY is not needed in v18 right?[/quote]
Correct - if you open platform.h and search for CARRY_16_WAY you'll see it's now on by default for avx-512 builds.

[QUOTE]This time all 18 cores was fastest for some reason.[/QUOTE]
What are your best-radix-set timings for 8 and 9 threads at 4608K? I'm curious how much more parallelism we're getting at the higher 16- and 18-threadcounts.

[QUOTE]From the README.html should this be [I][B]-cpu 0:n-1[/B][/I] ?[/QUOTE]

That snip indeed needs an edit, but of a different kind - the section in question is describing - or attempting to :) - the simplest way to maximize total system throughput on most multicore x86 systems. That is: 1 LL test per physical core, with each such job using 2 threads on Intel hyperthreaded CPUs and 1 thread otherwise (Intel non-HT, AMD, ARM, etc). Because of the way Intel numbers its logical cores, on a system with n physical cores, logical cores j and n+j map to phys-core j, for j = 0,...,n-1. So to generate a proper mlucas.cfg file for such a set-up, one should use -cpu 0,n (note: comma, not colon), then copy the resulting cfg-file to each of the n run directories which will host such a 2-thread-on-1-physical-core job.

From a job-management perspective it's of course easier to just run 1 job using all the physical cores, and as long as n <= 4 one won't sacrifice much total throughput by doing so. So on both my non-HT Intel quad Haswell and my quad-ARM64-core Odroid C2 I use -cpu 0:3, as I do on my HT-enabled dual-core Intel Broadwell NUC because there I want to use 2-threads-per-physical-core and a single 4-thread job gives me nearly the same throughput as separate jobs using -cpu 0,2 and -cpu 1,3.

I need to carefully re-read the README.html page to try to catch remaining such ,-versus-: mixups, because they are easy to overlook.

ewmayer 2019-02-26 22:04

One of the beta-code builders & testers reports build and runtime errors on several varieties of big-endian hardware ... a code review confirms that some byte-array-based bitwise-utilities functionality I added in the last few years for the sake of efficiency breaks endian-independence. Easy enough to fix the issue - just need to wrap the handful of byte-array-based utils in an endianness preprocessor clause and run the byte-processing in reverse order in the big-endian case. But - what compiler predefine to use for said preprocessor clauses? On my Mac, 'gcc -dM -E [random source file] < /dev/null | grep ENDIAN' gives this:

[code]#define __LITTLE_ENDIAN__ 1[/code]

My hopes that that would be a gcc-standard predef were quickly dashed - on my ARMv8/Linux system, things are far less straightforward:

[code]#define __ORDER_LITTLE_ENDIAN__ 1234
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __ORDER_PDP_ENDIAN__ 3412
#define __ORDER_BIG_ENDIAN__ 4321
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__[/code]

As long as the range of supported predefs across Posixworld is decently small that's OK - can folks reading this try the above gcc predef-dump command on their systems and let me know if they spot anything that would not be covered by the following?
[code]
#if defined(__BIG_ENDIAN__) || (defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__))
#define USE_BIG_ENDIAN
#endif[/code]
(Note the defined() guards: an undefined macro silently evaluates to 0 in an #if, so naively testing __LITTLE_ENDIAN__ == 0 would wrongly trigger on little-endian Linux, where that Mac-style predef isn't defined at all.)
[b]Edit:[/b] Another option would be to key off the relatively limited set of CPU families using big-endian in the platform.h file and set an internal USE_BIG_ENDIAN preprocessor flag based on CPU-family. Since most major CPU families on which the code has been built already have their own little predef-sections in the header file, it would simply be another predef that gets set-or-not there. Thoughts welcome!

ewmayer 2019-03-06 21:43

[url=http://www.mersenneforum.org/mayer/README.html]Mlucas v18 has gone live[/url] - I've updated the OP in this thread to note that and to remove the beta source-tarball. Thanks to all who built and provided feedback. Would someone with access to an ARMv8 CPU please try the prebuilt-with-SIMD binary I posted? I built it on my Odroid C2, with non-static linkage (which is how the v17.1 binaries were done, IIRC), and need to see how portable that is.
[b]
Summary of changes-since-beta-source posted:
[/b]
I found and fixed several bugs in Mlucas.c, the first 2 of which were exposed by the same testing circumstance, where I had a 1st-time LL-test with p in the 80M range complete and then the code started in on the next assignment, which was a partially-complete (migrated from another machine) DC in the 50M range. So those 2 bugs can be considered corner(ish)-case scenarios, but it was still important to get them fixed.

[b]Bug 1:[/b] During processing of the savefiles for the 50M-exponent, the code spat this out:
[i]
read_ppm1_savefiles: On restart: Res35m1 checksum error!ERROR: read_ppm1_savefiles Failed on savefile p55******!
[/i]
After inserting the obviously-missing newline that should follow the !, I dug into the source of the error, which refers to the Selfridge-Hurwitz residues (LL-test residue mod 2[sup]35[/sup]-1 and 2[sup]36[/sup]-1) which I compute as an integrity check for Mlucas savefiles. (The S-H residues were pioneered by those 2 luminaries in their Fermat-number-testing work during the 1950s, on hardware of the day which supported 36-bit integers in addition to floating-point numbers.) When writing an interim savefile I compute those during the conversion of the residue from floating-point to packed-bit form and tack them onto the full-length residue written to the 2 redundant savefiles. When I restart from a savefile, after reading the full-length residue R and the 2 checksums, I use a different method to compute R mod 2[sup]35[/sup]-1 and 2[sup]36[/sup]-1 on the fly, namely the fast Montgomery-mod remaindering I described in [url=http://arxiv.org/abs/1303.0328]this manuscript[/url] a few years back. I then compare those 2 just-computed remainders to the ones stored in the savefile, to make sure the savefile data were not corrupted in some fashion. It was that check which was failing, and doing so on both the primary and (normally identical) secondary savefiles. The problem turned out to be this: I use a bytewise array to store R, and in calling the aforementioned remaindering function, which is part of my mi64 (personal GMP-style library) function suite, I cast said array from (uint8*) to (uint64*). (If you're about to ask "but won't that break endian-portability?", indeed it does - more on that in Bug 5 below.) Problem was, in the above finish-big-exponent-then-proceed-to-DC scenario, I was failing to clear any high bytes in the topmost 64-bit limb of the resulting treated-as-64-bit-integer array, i.e. bytes above those needed for the current p-bit residue which had previously held bytes of the larger previous-test residue.
Adding the needed short (1-7 passes) clear-bytes loop, all is well, but then I hit...

[b]Bug 2:[/b] The first test in the above one-test-finishes-and-we-proceed-to-the-next-one scenario was for an exponent very close to the 4608K FFT-length upper limit, so much so that at several points during the run the program detected a 0.4375 fractional part during the per-iteration rounding step, causing it to stop execution, reread from the last savefile and restart at FFT length 5120K. Based on the relative rarity of the 0.4375 ROEs I decided running at 4608K was safe except for the occasional 0.4375-containing iteration interval, so whenever I noticed such auto-switching to 5120K had occurred I killed the code and restarted with an explicit '-fftlen 4608' added to the command line, which overrides the last-FFT-length-used stored in the savefile. Problem was, after finishing the first-time LL test using that length, the program also overrode the 3072K default length for the subsequent DC exponent with 4608K. So a bug in the control logic, now fixed.

Note that Bug 1 is not in play if the next-assignment is the typical from-scratch one.

Alex Vong, who is working to incorporate Mlucas v18 into the Debian freeware suite, also reported a few bugs:

[b]Bug 3:[/b] [i]In 'src/radix16_dyadic_square.c', the function 'SSE2_RADI16_CALC_TWIDDLES_1_2_4_8_13(...)' misses a 'X', it should be 'SSE2_RADIX16_CALC_TWIDDLES_1_2_4_8_13(...)' instead.[/i] -- This silly typo is in the 32-bit-build preprocessor-flag-wrapped portion of said sourcefile, and the code is used only for Fermat-number testing, not Mersenne, but it's still a showstopper in that build mode because it will prevent the object file from linking into a binary. Rather bizarrely, on my Mac both cc and clang fail to flag the no-macro-by-this-name error.

[b]Bug 4:[/b] Missing wide-integer-product macros and wide-mul macro syntax errors in PowerPC 32-bit builds. It's been so long since I've built on PPC32 that this is a wayback-machine exercise, but anyhow: The missing __MULL64 and __MULH64 macros have been added, and the macro name-collisions which caused the syntax errors have been fixed.

[b]Bug 5:[/b] Endian-portability broken due to several byte-array-based functions I added to Mlucas v17. This has been fixed. At least I believe it's been fixed - I don't have access to any big-endian hardware.

[b]Bug 6:[/b] Fixed one-dereference-too-few error "_cy_r[0] = -2" instead of "_cy_r[0][0] = -2" in non-SIMD code in radix[1008|1024|4032]*c. These were not present in v17, rather were introduced by some careless search-and-replace-across-multiple-files editing I did in my v18 development work. I only noticed the errors when I did a non-SIMD v18 build on ARM just prior to release and hit segfaults for those carry-step-wrapping radices during self-testing.

nomead 2019-03-07 08:06

[QUOTE=ewmayer;510279] Would someone with access to an ARMv8 CPU please try the prebuilt-with-SIMD binary I posted? I built it on my Odroid C2, with non-static linkage (which is how the v17.1 binaries were done, IIRC), and need to see how portable that is.
[/QUOTE]
Works on Raspberry Pi 3B+ and 3A+... No changes in performance though: at 5120K FFT size it can't use the radix-320 front end due to excessive roundoff. It works at 2560K, but there it's about the same speed as before (though it chose 160 32 16 16 on version 17.1 and 320 16 16 16 on version 18.0, so maybe there's some difference anyway).

ewmayer 2019-03-07 20:15

[QUOTE=nomead;510321]Works on Raspberry Pi 3B+ and 3A+... No changes in performance though, at 5120K FFT size it can't use the radix-320 front end due to excessive roundoff. Works at 2560K though, but there it's about the same speed as before (but it still chose 160 32 16 16 before on version 17.1, and 320 16 16 16 now on version 18.0, so maybe there's some difference anyway).[/QUOTE]

Hmm, just did a quick single-FFT-length self-test @2560 and 5120K on my Odroid C2 using the SIMD binary, '-cpu 0:3 -iters 1000' - here is the summary:

2560K: Radices 320,16x3 run @123.3 ms/iter, maxROE = 0.3125; 160,32,16,16 @124.1 ms/iter, maxROE = 0.34375, so radix 320 gives a tiny speedup here, and both top-candidate radix sets give acceptable ROE levels.

5120K: 320,32,16,16 gives 286.9 ms/iter but ROE = 0.4375 on iters 80,752, (thus deemed ineligible as the cfg-file entry for this FFT length); 160,32,32,16 gives 281.9 ms/iter and maxROE = 0.3125, thus is both fastest and has acceptably low ROE, thus gets the nod. Now were those 2 timings reversed and were I planning to do some first-time-tests @5120K on the hardware in question, I would consider manually hacking the mlucas.cfg file to force radix set 320,32,16,16 at this length. Do you still have your self-test screenlog so you can check the timings in this manner?

The timing deterioration on the C2 between 2560 and 5120K is marked - this hardware is thus ill-suited for first-time-tests even ignoring the long runtime and risk of assignment-expiry such a run would incur.

nomead 2019-03-08 22:44

Well I thought I had saved the output from the self-test but I forgot that it goes to stderr, not stdout. Meh. Ran it again, anyway.

2560K 320 16 16 16: 145.1 ms/iter, MaxErr = 0.34375 (chosen in mlucas.cfg)
2560K 160 16 16 32: 152.2 ms/iter, MaxErr = 0.3125
2560K 160 32 16 16: 146.7 ms/iter, MaxErr = 0.3125

5120K 320 16 16 32: 356.2 ms/iter, MaxErr = 0.46875 (limit exceeded on iters 425, 709, 795 if it makes a difference)
5120K 160 16 32 32: 349.3 ms/iter, MaxErr = 0.3125
5120K 160 32 32 16: 336.7 ms/iter, MaxErr = 0.28125 (chosen in mlucas.cfg)

So the same here, 5120K with radix-320 is slower for some reason.

I'm only running double checks, and the exponents I'm getting from Primenet seem to run on a 2816K FFT. I really don't know how the internals work, so what I'm asking may be totally silly: would a doubling of radix-176 (352?) be beneficial, or even work at all?

Oh, and one more thing. When I stop the program with Control-C as before, there is this error message:
received SIGINT signal.
ERROR: at line 2146 of file ../src/mers_mod_square.c
Assertion failed: nanosleep fail!

ewmayer 2019-03-08 23:10

[QUOTE=nomead;510459]Well I thought I had saved the output from the self-test but I forgot that it goes to stderr, not stdout. Meh. Ran it again, anyway.

2560K 320 16 16 16: 145.1 ms/iter, MaxErr = 0.34375 (chosen in mlucas.cfg)
2560K 160 16 16 32: 152.2 ms/iter, MaxErr = 0.3125
2560K 160 32 16 16: 146.7 ms/iter, MaxErr = 0.3125

5120K 320 16 16 32: 356.2 ms/iter, MaxErr = 0.46875 (limit exceeded on iters 425, 709, 795 if it makes a difference)
5120K 160 16 32 32: 349.3 ms/iter, MaxErr = 0.3125
5120K 160 32 32 16: 336.7 ms/iter, MaxErr = 0.28125 (chosen in mlucas.cfg)

So the same here, 5120K with radix-320 is slower for some reason.[/quote]
Thanks - eagle-eyed readers may note that while your overall results are essentially the same, some of the details - specifically the precise maxROE values and iterations-with-ROE-warning - differ from those I posted. Same binary, same ARMv8-compliant hardware, so shouldn't the numbers be *exactly* the same? The reason for the subtle differences lies in v18's use of a random residue shift - if your initial shift count differs from mine, the ROE numbers will as well.

[quote]I'm only running double checks, and the exponents I'm getting from Primenet seem to run on a 2816K FFT. I really don't know how the internals work, so what I'm asking may be totally silly: would a doubling of radix-176 (352?) be beneficial, or even work at all?[/quote]
Not silly at all - but the larger initial radices appear quite hit-or-miss in terms of speedups: 288 is better than 144 across most platforms, especially at 4608K (at 2304K it is more platform-dependent). Radix-320 was rather more disappointing in that regard. I plan to implement Radix-352 in v19, but it's more likely to have an impact at 5632K than at 2816K, i.e. once the GIMPS first-time-testing wavefront passes p ~106M, thus no huge rush.

[quote]Oh, and one more thing. When I stop the program with Control-C as before, there is this error message:
received SIGINT signal.
ERROR: at line 2146 of file ../src/mers_mod_square.c
Assertion failed: nanosleep fail![/QUOTE]
I get those errors sometimes, typically in the context of running under the debugger - they basically mean some signal has interacted badly with the nanosleep() call I use as part of my wait-for-all-threads-to-finish-current-task management in multithreaded execution mode. Future enhancements of the new signal-catching code using the supposedly more robust sigaction() may help here; for now, YMMV as to whether the signal code works as intended. Worst case, you lose the iterations done since the last normally scheduled checkpoint. I assume you restarted the above run - can you post the snippet from the p*.stat file bracketing the interrupt?

nomead 2019-03-09 03:52

[QUOTE=ewmayer;510460]Worst case, you lose the iterations done since the last normally scheduled checkpoint. I assume you restarted the above run - can you post the snippet from the p*.stat file bracketing the interrupt?[/QUOTE]
And indeed, that seems to happen. The program doesn't manage to save progress when interrupted, and restarts from the last save file. I started this test with 17.1 way back in December, so the residue shift is 0. There is also some very small random variance in the execution speed, but that doesn't seem to change while the program is running; this behaviour was the same with version 17.1. It was 168.4 ms/iter for some time before this restart (also spot on the same iteration speed as on 17.1, for a couple of months), and has stayed at 167.9 ms for the time it's been running since the restart.
[CODE][Mar 07 20:29:18] M5132xxxx Iter# = 36790000 [71.67% complete] clocks = 00:28:04.083 [168.4083 msec/iter] Res64: 74873862A50BB57E. AvgMaxErr = 0.071576885. MaxErr = 0.109375000. Residue shift count = 0.
Restarting M5132xxxx at iteration = 36790000. Res64: 74873862A50BB57E, residue shift count = 0
M5132xxxx: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 0
this gives an average 17.800407062877309 bits per digit
Using complex FFT radices 176 32 16 16
[Mar 08 17:10:27] M5132xxxx Iter# = 36800000 [71.69% complete] clocks = 00:27:58.834 [167.8834 msec/iter] Res64: C6533BF704CDF1F1. AvgMaxErr = 0.071412456. MaxErr = 0.101562500. Residue shift count = 0.
[/CODE]

ewmayer 2019-03-09 19:54

Sorry to see signal-handling not working for you - that was a *very* late-breaking addition to v18. In any event you're no worse off than with the previous version, you just need to stop it with the constant interruptions! How do you expect us to get any work done... :)

Also, a corrigendum to my note re. leading-radix 352 and FFT length 5632K:

[QUOTE=ewmayer;510460]I plan to implement Radix-352 in v19, but it's more likely to have an impact at 5632K than at 2816K, i.e. once the GIMPS first-time-testing wavefront passes p ~106M, thus no huge rush.[/QUOTE]

Actually p ~106M is the *upper* limit for 5632, lower limit (i.e. upper limit for 5120K) is ~96M. So I guess I better get on that!

nomead 2019-03-09 20:52

[QUOTE=ewmayer;510499]Sorry to see signal-handling not working for you - that was a *very* late-breaking addition to v18. In any event you're no worse off than with the previous version, you just need to stop it with the constant interruptions! How do you expect us to get any work done... :)[/QUOTE]
Yeah it's not a problem at all in real use, the last time I interrupted the run (before upgrading to 18.0) was a couple months ago... And I'm building a small cluster of RPi 3A+ boards in a "set and forget" configuration double checking LL. I actually got the first five up and almost running, but one of the boards has almost deaf WiFi for some reason, and I haven't had the time yet to check what's wrong with it. Maybe I'll just have to run that one board completely offline if it requires too much effort to fix. But once they're running, there will be no real need to interrupt them, ever.

ewmayer 2019-03-09 21:11

[QUOTE=nomead;510506]Yeah it's not a problem at all in real use, the last time I interrupted the run (before upgrading to 18.0) was a couple months ago... And I'm building a small cluster of RPi 3A+ boards in a "set and forget" configuration double checking LL. I actually got the first five up and almost running, but one of the boards has almost deaf WiFi for some reason, and I haven't had the time yet to check what's wrong with it. Maybe I'll just have to run that one board completely offline if it requires too much effort to fix. But once they're running, there will be no real need to interrupt them, ever.[/QUOTE]

In order to aid this "army ant" computing model - I'm taking delivery of a couple of for-parts cellphones for my part - I'm currently working with Aaron (MadPoo) on enhancing the primenet.py script to do a couple of v5-server things to support assignment-progress updates. That should allow ARM users to run longer 1st-time tests, should they desire to, without having said assignments expire once they hit the 180-day mark.

ewmayer 2019-03-12 19:21

[QUOTE=nomead;510506]Yeah it's not a problem at all in real use, the last time I interrupted the run (before upgrading to 18.0) was a couple months ago...[/quote]
Here's the signal stuff working on my Debian-running Intel Haswell quad, on a first-time LL test running on all 4 cores ... yesterday was the first really springlike day in my neck of the woods; my BR, where the box sits in a corner, has southern exposure and gets pretty warm on days like that. The Haswell uses just stock cooling, and even with the case side panel on the CPU side removed, I find the system starts getting flaky when ambient goes above 75F. So late morning yesterday I clicked the on/off switch on the case to turn the system off, then back on late evening once things had cooled off. Note the times-of-day in the following p*.stat snip are ~8 hours behind; this is a headless system and I've just let the internal clock drift in the years I've owned it:
[code]
[Mar 11 06:22:34] M86687009 Iter# = 8030000 [ 9.26% complete] clocks = 00:01:58.909 [ 11.8909 msec/iter] Res64: 5C4BB4BE6AE5BBB0. AvgMaxErr = 0.214745667. MaxErr = 0.312500000. Residue shift count = 12479875.
[Mar 11 06:24:32] M86687009 Iter# = 8040000 [ 9.27% complete] clocks = 00:01:58.691 [ 11.8691 msec/iter] Res64: 6757643F59CD637A. AvgMaxErr = 0.214732219. MaxErr = 0.312500000. Residue shift count = 15291034.
received SIGTERM signal.
Iter = 8041419: Writing savefiles and exiting.
[Mar 11 06:24:50] M86687009 Iter# = 8041419 [ 9.28% complete] clocks = 00:00:16.910 [ 11.9174 msec/iter] Res64: A15F129B39F5F7AD. AvgMaxErr = 0.214937561. MaxErr = 0.281250000. Residue shift count = 24137129.
...
Restarting M86687009 at iteration = 8041419. Res64: A15F129B39F5F7AD, residue shift count = 24137129
M86687009: using FFT length 4608K = 4718592 8-byte floats, initial residue shift count = 24137129
this gives an average 18.371372010972763 bits per digit
Using complex FFT radices 288 16 16 32
[Mar 11 13:15:57] M86687009 Iter# = 8050000 [ 9.29% complete] clocks = 00:01:40.629 [ 11.7271 msec/iter] Res64: FA296EE64B5710E2. AvgMaxErr = 0.214812786. MaxErr = 0.281250000. Residue shift count = 20164181.[/code]

[quote]And I'm building a small cluster of RPi 3A+ boards in a "set and forget" configuration double checking LL. I actually got the first five up and almost running, but one of the boards has almost deaf WiFi for some reason, and I haven't had the time yet to check what's wrong with it. Maybe I'll just have to run that one board completely offline if it requires too much effort to fix. But once they're running, there will be no real need to interrupt them, ever.[/QUOTE]
I just took delivery of two sold-for-parts-on-ebay Samsung Galaxy S7s in the past couple of days, project for the coming week is to get them rooted and running Mlucas, also awaiting delivery of a [url=https://www.amazon.com/gp/f.html?C=3JTDOSORXPWJG&K=1ICBB0J24TYRN&M=urn:rtn:msg:20190310222152d657b194a88642339e55a39b1a60p0na&R=1E1Z4PO5V0X9F&T=C&U=https%3A%2F%2Fwww.amazon.com%2Fgp%2Fcss%2Forder-details%3ForderId%3D114-3855172-4009063%26ref_%3Dpe_2640190_232748420_TE_simp_od&H=GD0PAX2VA717EFUA173ZKA8W6ACA&ref_=pe_2640190_232748420_TE_simp_od]USB charging station[/url] (which should have enough juice to power 4 such phones running Mlucas on all cores) and USB fan (which the Q&A section on the product page says draws just 0.8W at top speed in USB mode) ... the fan should be sufficient to cool a pair of such 4-phone mini farms, which should give me a total compute throughput comparable to the above-mentioned Haswell quad, in a rather smaller footprint.

Lorenzo 2019-03-29 10:07

Hello! Benchmark for v18 on [B]Ampere eMAG 32-Core @ 3.3GHz[/B] using pre-built Mlucas_v18_c2simd.

[CODE]root@lorenzoArm:~/mersenne/arm8# lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
CPU max MHz: 3300.0000
CPU min MHz: 363.9700
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
NUMA node0 CPU(s): 0-31[/CODE]

[CODE]root@lorenzoArm:~/mersenne/arm8# cat /proc/cpuinfo
processor : 0
BogoMIPS : 90.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x50
CPU architecture: 8
CPU variant : 0x3
CPU part : 0x000
CPU revision : 2
[/CODE]

[CODE]root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -s m -cpu 0:31
root@lorenzoArm:~/mersenne/arm8# cat mlucas.cfg
18.0
2048 msec/iter = 19.63 ROE[avg,max] = [0.000307249, 0.375000000] radices = 128 32 16 16 0 0 0 0 0 0
2304 msec/iter = 19.88 ROE[avg,max] = [0.000272423, 0.375000000] radices = 144 32 16 16 0 0 0 0 0 0
2560 msec/iter = 22.07 ROE[avg,max] = [0.000281943, 0.375000000] radices = 160 8 8 8 16 0 0 0 0 0
2816 msec/iter = 22.07 ROE[avg,max] = [0.000260572, 0.312500000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 22.24 ROE[avg,max] = [0.000265834, 0.375000000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 23.63 ROE[avg,max] = [0.000281118, 0.375000000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 25.02 ROE[avg,max] = [0.000250660, 0.343750000] radices = 224 32 16 16 0 0 0 0 0 0
3840 msec/iter = 26.60 ROE[avg,max] = [0.000222911, 0.312500000] radices = 60 32 32 32 0 0 0 0 0 0
4096 msec/iter = 25.42 ROE[avg,max] = [0.000244299, 0.312500000] radices = 64 32 32 32 0 0 0 0 0 0
4608 msec/iter = 30.73 ROE[avg,max] = [0.000298148, 0.375000000] radices = 144 8 8 16 16 0 0 0 0 0
5120 msec/iter = 31.50 ROE[avg,max] = [0.000235369, 0.312500000] radices = 160 32 32 16 0 0 0 0 0 0
5632 msec/iter = 33.74 ROE[avg,max] = [0.000257523, 0.343750000] radices = 176 32 32 16 0 0 0 0 0 0
6144 msec/iter = 36.94 ROE[avg,max] = [0.000247058, 0.312500000] radices = 192 32 32 16 0 0 0 0 0 0
6656 msec/iter = 36.74 ROE[avg,max] = [0.000313628, 0.406250000] radices = 208 8 8 16 16 0 0 0 0 0
7168 msec/iter = 36.94 ROE[avg,max] = [0.000233152, 0.312500000] radices = 224 8 8 16 16 0 0 0 0 0
7680 msec/iter = 36.94 ROE[avg,max] = [0.000246354, 0.312500000] radices = 240 8 8 16 16 0 0 0 0 0
[/CODE]

Lorenzo 2019-03-29 10:10

Just FYI
[CODE]root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -fftlen 18432 -iters 100 -cpu 0:7

Mlucas 18.0

http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 5.4.0 20160609.
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 8 cores: 0.1.2.3.4.5.6.7.

Mlucas selftest running.....

/****************************************************************************/

NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 49407158
this gives an average 17.887500180138481 bits per digit
Using complex FFT radices 288 32 32 32
mers_mod_square: Init threadpool of 8 threads
radix16_dif_dit_pass pfetch_dist = 32
radix16_wrapper_square: pfetch_dist = 1024
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 321038982
Res64: 69FF742497F16902. AvgMaxErr = 0.003191964. MaxErr = 0.375000000. Program: E18.0
Res mod 2^36 = 19729049858
Res mod 2^35 - 1 = 20161851329
Res mod 2^36 - 1 = 1044285462
Clocks = 00:00:21.067

NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 321038982
this gives an average 17.887500180138481 bits per digit
Using complex FFT radices 144 16 16 16 16
mers_mod_square: Init threadpool of 8 threads
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 171176556
Res64: 2258A7342961B652. AvgMaxErr = 0.002428013. MaxErr = 0.281250000. Program: E18.0
Res mod 2^36 = 17874138706
Res mod 2^35 - 1 = 28069471175
Res mod 2^36 - 1 = 53816329185
Clocks = 00:00:21.009
NTHREADS = 8
[B][COLOR="Red"]ERROR: at line 1540 of file ../src/Mlucas.c
Assertion failed: Return value of shift_word(): unpadded-array-index out of range![/COLOR][/B][/CODE]

ewmayer 2019-03-29 19:27

Thanks, Lorenzo: Could you also try the '-s m' tests using just -cpu 0:3 on that 32-core system and post the resulting cfg-file here? I'd like to see what kind of degradation of parallelism results from using more than one socket on that system.

I will look into the residue-shift assertion issue you hit in your 18432K FFT length test - first I need to see if I can reproduce it on any of the hardware I have. By way of a workaround, '-shift 0' should bypass all the new residue-shift code and allow you to get a best-radix-set timing at 18432K.

[b]Edit:[/b] I was able to reproduce the assertion on my 2-core Macbook using an x86 SSE2 build, so the bug appears to be code-logic-related rather than anything platform specific.

Lorenzo 2019-03-30 07:51

[QUOTE=ewmayer;512164]Thanks, Lorenzo: Could you also try the '-s m' tests using just -cpu 0:3 on that 32-core system and post the resulting cfg-file here? I'd like to see what kind of degradation of parallelism results from using more than one socket on that system.

I will look into the residue-shift assertion issue you hit in your 18432K FFT length test - first I need to see if I can reproduce it on any of the hardware I have. By way of a workaround, '-shift 0' should bypass all the new residue-shift code and allow you get a best-radix-set timing at 18432K.

[b]Edit:[/b] I was able to reproduce the assertion on my 2-core Macbook using an x86 SSE2 build, so the bug appears to be code-logic-related rather than anything platform specific.[/QUOTE]

Hello, ewmayer! Sorry but unfortunately I haven't access to this machine any more.

kriesel 2019-03-30 19:18

[QUOTE=Lorenzo;512109]Hello! Benchmark for v18 on [B]Ampere eMAG 32-Core @ 3.3GHz[/B] using pre-built Mlucas_v18_c2simd.
[CODE]
4608 msec/iter = 30.73 ROE[avg,max] = [0.000298148, 0.375000000] radices =
[/CODE][/QUOTE]
Yikes, 717 hours, so at a nominal $1/hour, that works out to over $700 per 84M primality test at [URL]https://www.packet.com/cloud/servers/[/URL]. It's triple the speed of Ernst's Samsung S7 phone, but at far higher (~83x) cost. I've bought whole used workstations capable of 10+ times the 30.73 ms/iter speed, for the price of one exponent at packet.com at that rate. (The spot rate of $0.25/hr helps, but not nearly enough.)

ewmayer 2019-03-30 19:29

[QUOTE=Lorenzo;512191]Hello, ewmayer! Sorry but unfortunately I haven't access to this machine any more.[/QUOTE]

OK - for future reference, on a 'typical' system with 1 or more sockets, each socket holding a 4-core CPU, I like to see the following timing tests:

1. All 4 cores on 1 socket: '-s m -iters 100 -cpu 0:3'

2. If there are differences between the CPUs on various sockets (use /proc/cpuinfo as your guide here), run the same self-tests on each distinct-CPU-type socket. If it's e.g. a BIG socket with a high-perf CPU having just 2 cores, fiddle the -cpu args to use just those 2 cores;

3. All cores across all sockets: Like the 32-core test you did above;

4. One program instance per socket: This can get tricky in self-test mode if the runspeed varies appreciably between sockets. Better is to create a rundir for each socket, e.g. run0-run7 on an 8-socket 32-core system, copy the mlucas.cfg files from your -cpu 0:3 self-test to each rundir, create a worktodo.ini file containing one exponent of the size range of interest (you can use a single-shot invocation of the primenet.py script to grab such an assignment), then copy that to each rundir. Then cd to run0 and fire up a production run using -cpu 0:3, let that get to the first 10000-iter checkpoint (you will see a pair of p|q-named binary savefiles get created, and the p*.stat file updated with a checkpoint entry), that gives you a production-run timing for 1 socket. At that point cd to each of the other rundirs in turn and start up an instance in each. Let those runs get through a couple checkpoints and average the last-line-of-statfile timings, compare that average to the 1-socket-used timing.

I have found and fixed the bug your 18432K self-test exposed, will post update on that once I finish creating new ARM binaries from the updated source tarball and uploading to the server.

ewmayer 2019-03-30 19:39

[QUOTE=kriesel;512248]Yikes, 717 hours, so at nominal $1/hour, that works out to over $700/84M primality test at [URL]https://www.packet.com/cloud/servers/[/URL] It's triple the speed of Ernst's Samsung S7 phone, at far higher cost (~83x) there. I've bought whole used workstations capable of 10+ times the 30.73ms/it speed, for the price of one exponent at packet.com at that rate. (Spot rate $0.25/hr helps but not nearly enough.)[/QUOTE]

I suspect the total throughput for that system would be several times greater using one instance per 4-core socket; that's why I asked Lorenzo if he could provide that timing. You get a hint of throughput loss due to too-many-threads-for-one-job from the cfg-file timings he posted: The 32-thread timing @7680K is less than 2x greater than that for 2048K. Larger FFT lengths tend to be more parallelizable than smaller ones because at the same threadcount the work units done by each thread are proportionally larger, resulting in that sort of timing pattern. But I'm sure even in optimum-usage mode such a system would be a lot more expensive than a cellphone compute node - reminiscent of the difference between the big-iron AWS-instance runs we use to verify new prime discoveries, as compared to a $/FLOP-optimized low-end retail Intel rig.

But manycore tests are always interesting because we hope to see signs of one manufacturer or another achieving a breakthrough in parallelism. Though in that regard, nearly all the action the past 5 years has been on the GPU side of the ledger.

kriesel 2019-03-31 00:21

[QUOTE=ewmayer;512250]I suspect the total throughput for that system would be several times greater using one instance per 4-core socket, that's why I asked Lorenzo if he could provide that timing. You get a hint of throughput loss due to too-many-threads-for-one-job from the cfg-file timings he posted: The 32-thread timing @7680K is less than 2x greater than that for 2048K. Larger FFT lengths tend to be more parallelizable than smaller ones because at the same threadcount the work units done by each thread are proportionally larger, resulting in that sort of timing pattern. But I'm sure even in optimum-usage mode such a system would be a lot more expensive than a cellphone compute node - reminiscent of the difference between the big-iron AWS-instance runs we use to verify new prime discoveries, as compared to a $/FLOP-optimized low-end retail Intel rig.

But manycore tests are always interesting because we hope to see signs of one manufacturer or another achieving a breakthrough in parallelism. Though in that regard, nearly all the action the past 5 years has been on the GPU side of the ledger.[/QUOTE]
Thought experiment: suppose one instance per 4-core socket was the same speed as his 32-core test, so 8 instances, 8-fold more throughput. It still loses to the dual-e5-2670 that I bought for a month's rent of the 32-arm-core system.
Another way to go at it would be to make 1-core, 4-core, and 8-core benchmark runs and compare to the 32.

ldesnogu 2019-03-31 17:39

[QUOTE=kriesel;512264]It still loses to the dual-e5-2670 that I bought for a month's rent of the 32-arm-core system.[/QUOTE]
How much power does your system consume? How much will that cost you?

For the record, the CPU from Ampere is not that great from a performance point of view, in particular its FP performance is less than Amazon Cortex-A72 chip despite running at 3.3 GHz vs 2.3 GHz: [URL]http://browser.geekbench.com/v4/cpu/compare/12589322?baseline=11678329[/URL]

It's not even that much faster than an S7: [URL]http://browser.geekbench.com/v4/cpu/compare/12589322?baseline=12621230[/URL]

BTW Ernst, I'm afraid I don't get why you're talking about multiple sockets. That system has a single socket.

kriesel 2019-03-31 19:21

[QUOTE=ldesnogu;512313]How much power does your system consume? How much will that cost you?[/QUOTE]~US$3 / 85M exponent total cost, equipment amortization and utilities and taxes. Details at [url]https://www.mersenneforum.org/showpost.php?p=512218&postcount=20[/url]
I have no reason to expect that figure to be optimal among cpu choices. It's just one of the better among my little fleet. (Then there's curtisc's and others' $0/exponent, when the participant is using someone else's hardware and electricity.)

ewmayer 2019-03-31 19:34

[QUOTE=ldesnogu;512313]BTW Ernst, I'm afraid I don't get why you're talking about multiple sockets. That system has a single socket.[/QUOTE]

Ah, I didn't look into the details of that kind of system, assumed it was a single-mobo cluster of 2 or 4-core cortex CPUs.

M344587487 2019-04-01 13:44

1 Attachment(s)
Got some errors in the build log compiling on a Ryzen 1700, log attached.
[code]gcc -c -O3 -DUSE_AVX2 -mavx2 -DUSE_THREADS ../src/*.c >& build.log[/code]Errors with and without -mavx2, and without -DUSE_AVX2. Tried txz and tbz2 archives to rule out download corruption. Haven't investigated beyond that but I can if necessary. gcc (Ubuntu 8.3.0-3ubuntu1) 8.3.0, the default gcc on a daily build of Ubuntu 19.04. There's a chance it's a problem due to being a daily build, but this is the only issue I've come across so far.

ewmayer 2019-04-01 19:40

[QUOTE=M344587487;512374]Got some errors in the build log compiling on a Ryzen 1700, log attached.
[code]gcc -c -O3 -DUSE_AVX2 -mavx2 -DUSE_THREADS ../src/*.c >& build.log[/code]Errors with and without -mavx2, and without -DUSE_AVX2. Tried txz and tbz2 archives to rule out download corruption. Haven't investigated beyond that but I can if necessary. gcc (Ubuntu 8.3.0-3ubuntu1) 8.3.0, the default gcc on a daily build of Ubuntu 19.04. There's a chance it's a problem due to being a daily build, but this is the only issue I've come across so far.[/QUOTE]

Thanks for the log - here's my summary:

o See a bunch of -Wformat-overflow warnings, those prints need to be replaced with buffer-overflow-proof ones;

o The "cast from pointer to integer of different size" warnings are benign: the statements in question are just checking alignment of various pointers using the bottom few bits, but I suppose changing the (uint32) casts to casts to a pointer-sized int can't hurt;

Ah, I think I see why you "encountered errors" - the version of gcc you are using (btw, what version is it?) clearly is aggressively warning for potential buffer-overflow-unsafe string I/O - which is good. But those previously-unseen warnings are, among other things, flagging Mlucas-internal error-print statements, so when you do the case-insensitive 'grep -i error build.log' as per the README page, you now see these print statements containing the string "ERROR" appearing by way of the aforementioned -Wformat-overflow warnings:
[i]
125: sprintf(cbuf , "*** ERROR: Non-numeric character encountered in -shift argument %s.\n", stFlag);
139: sprintf(cbuf , "*** ERROR: -shift argument %s overflows uint64 field.\n", stFlag);
154: sprintf(cbuf, "Error writing residue to restart file %s.\n",RESTARTFILE);
169: sprintf(cbuf,"ERROR: bit_depth_done of %u > max. allowed of %u. The ini file entry was %s\n", bit_depth_done, MAX_FACT_BITS, in_line);
183: sprintf(cbuf, "ERROR: Illegal 'fftlen = ' argument - suggested FFT length for this p = %u. The ini file entry was %s\n", kblocks, in_line);
225: sprintf(cbuf, "ERROR: read_ppm1_savefiles Failed on savefile %s!\n",RESTARTFILE);
239: sprintf(cbuf, "ERROR: convert_res_bytewise_FP Failed on savefile %s!\n",RESTARTFILE);
288: sprintf(cbuf,"ERROR: unable to rename %s restart file ==> %s ... skipping every-million-iteration restart file archiving\n",RANGEFILE, STATFILE);
302: sprintf(cbuf, "ERROR: unable to open restart file %s for write of checkpoint data.\n",RESTARTFILE);
446: sprintf(cbuf , "*** ERROR: -f argument %s overflows integer field.\n", stFlag);
474: sprintf(cbuf , "*** ERROR: -m argument %s overflows integer field.\n", stFlag);
488: sprintf(cbuf , "*** ERROR: Non-numeric character encountered in -nthread argument %s.\n", stFlag);
502: sprintf(cbuf , "*** ERROR: -nthread argument %s overflows integer field.\n", stFlag);
516: sprintf(cbuf , "*** ERROR: Non-numeric character encountered in -prp argument %s.\n", stFlag);
530: sprintf(cbuf , "*** ERROR: -prp argument %s overflows integer field.\n", stFlag);
544: sprintf(cbuf , "*** ERROR: Non-numeric character encountered in -shift argument %s.\n", stFlag);
558: sprintf(cbuf , "*** ERROR: -shift argument %s overflows uint64 field.\n", stFlag);
572: sprintf(cbuf , "*** ERROR: Non-numeric character encountered in -radset argument %s.\n", stFlag);
586: sprintf(cbuf , "*** ERROR: -radset argument %s overflows integer field.\n", stFlag);
600: sprintf(cbuf , "*** ERROR: Non-numeric character encountered in -fftlen argument %s.\n", stFlag);
614: sprintf(cbuf , "*** ERROR: -fftlen argument %s overflows integer field.\n", stFlag);
628: sprintf(cbuf , "*** ERROR: Non-numeric character encountered in -iters argument %s.\n", stFlag);
642: sprintf(cbuf , "*** ERROR: -iters argument %s overflows integer field.\n", stFlag);
[/i]
The quick workaround is to simply drop '-i' from the grep; when I do that to your build.log it comes up empty (and in fact I'm not sure why I ever used the '-i' in that context to begin with, maybe some long-ago-used compiler used e.g. 'Error' in its messaging). Are you able to link?

M344587487 2019-04-01 20:19

That's amusing, I blindly followed the instruction to only link if grep comes up empty and didn't pay close enough attention to the log. It works fine, sorry for lighting the bat signal unnecessarily. April fool? ;)

ewmayer 2019-04-01 20:28

[QUOTE=M344587487;512395]That's amusing, I blindly followed the instruction to only link if grep comes up empty and didn't pay close enough attention to the log. It works fine, sorry for lighting the bat signal unnecessarily. April fool? ;)[/QUOTE]

No worries, it was still useful in reminding me that I should fix up all those possible-buffer-overflow and point-to-shorter-int-cast warnings, and I need to get rid of the '-i' in my grep-your-build.log instructions on the README page.

ewmayer 2019-04-12 20:45

[QUOTE=Lorenzo;512110]Just FYI
[CODE]root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -fftlen 18432 -iters 100 -cpu 0:7
[snip]
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 321038982
Res64: 69FF742497F16902. AvgMaxErr = 0.003191964. MaxErr = 0.375000000. Program: E18.0
Res mod 2^36 = 19729049858
Res mod 2^35 - 1 = 20161851329
Res mod 2^36 - 1 = 1044285462
Clocks = 00:00:21.067
[snip]
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 171176556
Res64: 2258A7342961B652. AvgMaxErr = 0.002428013. MaxErr = 0.281250000. Program: E18.0
Res mod 2^36 = 17874138706
Res mod 2^35 - 1 = 28069471175
Res mod 2^36 - 1 = 53816329185
Clocks = 00:00:21.009
NTHREADS = 8
[B][COLOR="Red"]ERROR: at line 1540 of file ../src/Mlucas.c
Assertion failed: Return value of shift_word(): unpadded-array-index out of range![/COLOR][/B][/CODE][/QUOTE]

This bug has been fixed in the patch I uploaded to the ftp server last week. It only affects runs at FFT lengths >= 16M (16384K), and since it doesn't permit the program to create an mlucas.cfg file entry for the FFT lengths in question, no actual user runs should be affected, since you can't do a production run at an FFT length without a cfg-file entry for said length.

kriesel 2019-10-15 12:27

V18 is the current latest release, yes? How about making this thread sticky?

Dylan14 2019-10-15 20:56

Colab test of v18, spot check
 
I built Mlucas v18 successfully in Colab using the reverse-tunnel code that chalsall made. It built with no problems; however, as a spot check, one should run

[CODE]./Mlucas -fftlen 192 -iters 100 -radset 0[/CODE]When I did that, I got an excessive roundoff warning:


[CODE]root@colab_test:/content/mlucas/mlucas_v18# ./Mlucas -fftlen 192 -iters 100 -radset 0 > test1.txt

Mlucas 18.0

http://www.mersenneforum.org/mayer/README.html

INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing FFT radix tables...

Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
No CPU set or threadcount specified ... running single-threaded.
INFO: Maximum recommended exponent for this runlength = 3888516; p[ = 3888517]/pmax_rec = 1.0000002572.
specified FFT length 192 K is less than recommended 208 K for this p.
M3888517: using FFT length 192K = 196608 8-byte floats, initial residue shift count = 1942965
this gives an average 19.778020222981770 bits per digit
Using complex FFT radices 192 16 32
radix16_dif_dit_pass pfetch_dist = 4096
radix16_wrapper_square: pfetch_dist = 4096
Using 1 threads in carry step
M3888517 Roundoff warning on iteration 46, maxerr = 0.437500000000
100 iterations of M3888517 with FFT length 196608 = 192 K, final residue shift count = 3620533
Res64: 579D593FCE0707B2. AvgMaxErr = 0.003006696. MaxErr = 0.437500000. Program: E18.0
Res mod 2^36 = 67881076658
Res mod 2^35 - 1 = 21674900403
Res mod 2^36 - 1 = 42893438228
Clocks = 00:00:00.466
***** Excessive level of roundoff error detected - this radix set will not be used. *****

Done ...

[/CODE]but with radset 1 it works fine:


[CODE]root@colab_test:/content/mlucas/mlucas_v18# ./Mlucas -fftlen 192 -iters 100 -radset 1 > test1.txt

Mlucas 18.0

http://www.mersenneforum.org/mayer/README.html

INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing FFT radix tables...

Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
No CPU set or threadcount specified ... running single-threaded.
INFO: Maximum recommended exponent for this runlength = 3888516; p[ = 3888517]/pmax_rec = 1.0000002572.
specified FFT length 192 K is less than recommended 208 K for this p.
M3888517: using FFT length 192K = 196608 8-byte floats, initial residue shift count = 1942965
this gives an average 19.778020222981770 bits per digit
Using complex FFT radices 192 32 16
radix16_dif_dit_pass pfetch_dist = 4096
radix16_wrapper_square: pfetch_dist = 4096
Using 1 threads in carry step
100 iterations of M3888517 with FFT length 196608 = 192 K, final residue shift count = 3620533
Res64: 579D593FCE0707B2. AvgMaxErr = 0.002918527. MaxErr = 0.375000000. Program: E18.0
Res mod 2^36 = 67881076658
Res mod 2^35 - 1 = 21674900403
Res mod 2^36 - 1 = 42893438228
Clocks = 00:00:00.393

Done ...

[/CODE]I am presently running a test on a known prime, and I will report back when it's finished.

ewmayer 2019-10-15 21:14

@Dylan14 -- Thanks for the build attempt and info. The ROE is benign ... Mlucas uses the self-tests for both performance and accuracy-testing. When it hits ROE >= 0.4375 during one of the self-tests it will simply omit the particular FFT-radix set from consideration for writing to the mlucas.cfg file for said FFT length. The only problem is if that set of FFT radices happens to also give the best performance at the FFT length in question. What you really want to do is to run the code in self-test (benchmarking) mode - to do this at a specific single FFT length of interest, do like so, using your example at 192K:

./Mlucas -iters 100 -fftlen 192

Then have a look at the resulting 1-line entry in the mlucas.cfg file.

What kind of CPU does that hardware have? What sort of processor, how many cores?

Dylan14 2019-10-15 21:26

[QUOTE=ewmayer;528101]@Dylan14 -- Thanks for the build attempt and info. The ROE is benign ... Mlucas uses the self-tests for both performance and accuracy-testing. When it hits ROE >= 0.4375 during one of the self-tests it will simply omit the particular FFT-radix set from considering for writing to the mlucas.cfg file for said FFT length. The only problem is if that set of FFT radices happens to also give the best performance at the FFT length in question. What you really want to do is to run the the code in self-test (benchmarking) mode - to do this at a specific single FFT length of interest, do like so, using you example at 192K:

./Mlucas -iters 100 -fftlen 192

Then have a look at the resulting 1-line entry in the mlucas.cfg file.

What kind of CPU has that hardware you are testing on? What sort of processor, how many cores?[/QUOTE]

This is the proc/cpuinfo file of the machine in question:

[CODE]root@colab_test# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
microcode : 0x1
cpu MHz : 2200.000
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 4400.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
microcode : 0x1
cpu MHz : 2200.000
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 4400.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
[/CODE]

ewmayer 2019-10-15 21:33

Thanks - so it's a single-physical-core with hyperthreading and avx2/fma3, hence 2 entries in /proc/cpuinfo. Thus your compile should have used
[i]gcc -c -O3 -DUSE_AVX2 -mavx2 -DUSE_THREADS ../src/*.c >& build.log
grep error build.log[/i]
[Assuming above grep comes up empty] [i]gcc -o Mlucas *.o -lm -lpthread -lrt [/i]

Unlike mprime, Mlucas does often get some added speedup from using the virtual cores enabled by HT - a quick spot-check of this would be the following 2 single-FFT-length self-tests:

./Mlucas -fftlen 192 -iters 100
./Mlucas -fftlen 192 -iters 100 -cpu 0:1

Then post the 2 resulting mlucas.cfg lines here.

Dylan14 2019-10-16 00:38

Here are the resulting mlucas.cfg lines from the Colab build:


[CODE]192  msec/iter =    3.20  ROE[avg,max] = [0.002939732, 0.375000000]  radices =  48  8 16 16  0  0  0  0  0  0  100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 579D593FCE0707B2, 21674900403, 42893438228
192  msec/iter =    2.39  ROE[avg,max] = [0.002939732, 0.375000000]  radices =  48  8 16 16  0  0  0  0  0  0  100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 579D593FCE0707B2, 21674900403, 42893438228[/CODE]The first entry is the single-threaded run, the second is the two-threaded run.

And the program does show that M3021377 is prime, as I expected (this was another test to make sure it works).

ewmayer 2019-10-16 03:12

Thanks - so you're getting quite a decent speedup from using both logical cores, though I haven't a clue if the absolute timings are reasonable for the hardware in question - 2 ms/iter @192K is quite slow by (say) Haswell-and-beyond desktop-PC standards.

I suggest you proceed to the full production-run-oriented self-tests, and please post a zipped copy of the resulting self-test logfile here:
[i]
./Mlucas -s m -iters 100 -cpu 0:1 >& selftest.log[/i]

Dylan14 2019-10-16 04:55

3 Attachment(s)
[QUOTE=ewmayer;528121]Thanks - so you're getting quite a decent speedup from using both logical cores, though I haven't a clue if the absolute timings are reasonable for the hardware in question - 2 ms/iter @192K is quite slow by (say) Haswell-and-beyond desktop-PC standards.

I suggest you proceed to the full production-run-oriented self-tests, and please post a zipped copy of the resulting self-test logfile here:
[I]
./Mlucas -s m -iters 100 -cpu 0:1 >& selftest.log[/I][/QUOTE]


See attached file. Note: this is on a new session of Colab, so the processor is not the same as before. I have also attached the cpu info and cfg files.

ewmayer 2019-10-16 19:14

Thanks for the build & test data - I see this particular new instance supports avx-512, so you'll want to prepare a second build that invokes those inline-asm macros in the code:
[i]
gcc -c -O3 -DUSE_AVX512 -march=skylake-avx512 -DUSE_THREADS ../src/*.c >& build.log
[/i]
...and use a different name for the resulting executable, you could call the 2 binaries mlucas_avx2 and mlucas_avx512, say. "grep avx512 /proc/cpuinfo" on whatever system you get during a particular session will tell you which binary to use. Rerun the self-tests on this new system to see what kind of speedup you get from using avx-512.

(Wait - while working through your selftest.log data further down in this note, I came across these infoprints @7168K:
[i]
radix28_ditN_cy_dif1: No AVX-512 support; Skipping this leading radix.
[/i]
So you did prepare and use an avx-512 build as per above compile flags for this set of runs? If so, that obviates the avx2-vs-avx512 parts of the commentary below.)

As to your avx2-build timings, I realized after posting my "seems slow" comment yesterday that I was thinking in terms of multicore running on hardware like my Haswell. For a single physical core running at 2 GHz, ~50 msec/iter at the current GIMPS wavefront (5120K) is not at all bad - for comparison, here is the mlucas.cfg file for all 4 physical cores (no hyperthreading on this CPU) of my 3.3GHz Haswell. On a single core the runtimes would be perhaps ~3.5x as large, so (say) at 5120K we'd expect ~47 msec/iter, only ~10% faster than your 1-core/2-thread timings, and this is at 3.3GHz vs your 2GHz:
[code]
18.0
2048 msec/iter = 5.25 ROE[avg,max] = [0.222878714, 0.312500000] radices = 64 16 32 32 0 0 0 0 0 0
2304 msec/iter = 5.85 ROE[avg,max] = [0.259770659, 0.375000000] radices = 144 16 16 32 0 0 0 0 0 0
2560 msec/iter = 6.28 ROE[avg,max] = [0.252363335, 0.312500000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 7.44 ROE[avg,max] = [0.239182557, 0.312500000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 8.35 ROE[avg,max] = [0.251998996, 0.312500000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 9.02 ROE[avg,max] = [0.243424657, 0.312500000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 9.25 ROE[avg,max] = [0.248507344, 0.312500000] radices = 224 16 16 32 0 0 0 0 0 0
3840 msec/iter = 10.17 ROE[avg,max] = [0.256763639, 0.343750000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 10.63 ROE[avg,max] = [0.279075387, 0.343750000] radices = 256 16 16 32 0 0 0 0 0 0
4608 msec/iter = 12.21 ROE[avg,max] = [0.269211099, 0.343750000] radices = 288 16 16 32 0 0 0 0 0 0
5120 msec/iter = 13.48 ROE[avg,max] = [0.300527545, 0.375000000] radices = 320 16 16 32 0 0 0 0 0 0
5632 msec/iter = 15.42 ROE[avg,max] = [0.230105748, 0.281250000] radices = 176 16 32 32 0 0 0 0 0 0
6144 msec/iter = 17.51 ROE[avg,max] = [0.246608585, 0.312500000] radices = 192 16 32 32 0 0 0 0 0 0
6656 msec/iter = 18.60 ROE[avg,max] = [0.231292347, 0.312500000] radices = 208 16 32 32 0 0 0 0 0 0
[/code]
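The back-of-envelope scaling above can be checked directly; the numbers come from the cfg table and the ~3.5x multicore-to-single-core factor is my own estimate:
[code]
# Sanity-check the single-core estimate at 5120K (awk for the float math):
awk 'BEGIN {
    four_core = 13.48             # msec/iter at 5120K on 4 Haswell cores (cfg above)
    est_1core = four_core * 3.5   # assumed 4-core -> 1-core slowdown factor
    printf "est 1-core: %.1f msec/iter\n", est_1core        # ~47.2
    printf "ratio vs reported 50 msec/iter: %.2f\n", 50.0/est_1core
}'
[/code]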
Further using an avx-512 build on this type of instance should give a nice added speedup, perhaps as much as 1.6x. And if/when a Prime95/mprime build for these systems comes online, that should be faster still.

Looking more closely at your selftest.log and mlucas.cfg files, I see "Excessive level of roundoff error detected" messages for individual FFT radix sets at 2816K, 3328K, 5120K and 7168K, but in none of those cases did the skipped radix set(s) happen to be the fastest one(s) at the FFT length in question.

kracker 2019-11-28 02:09

Trying to compile under MSYS2/windows, getting 'SIGHUP' undeclared errors.
[code]
../src/fermat_mod_square.c:1869:18: error: 'SIGHUP' undeclared (first use in this function)
../src/mers_mod_square.c:2382:18: error: 'SIGHUP' undeclared (first use in this function)
../src/Mlucas.c:182:21: error: 'SIGHUP' undeclared (first use in this function)

[/code]

ewmayer 2019-11-28 02:53

[QUOTE=kracker;531614]Trying to compile under MSYS2/windows, getting 'SIGHUP' undeclared errors.
[code]
../src/fermat_mod_square.c:1869:18: error: 'SIGHUP' undeclared (first use in this function)
../src/mers_mod_square.c:2382:18: error: 'SIGHUP' undeclared (first use in this function)
../src/Mlucas.c:182:21: error: 'SIGHUP' undeclared (first use in this function)

[/code][/QUOTE]

I no longer have access to a Windows machine of any kind - SIGHUP is a POSIX signal with no analog in the Windows C runtime, so its absence there is expected. Anyhow, the quick workaround is to simply comment out any clauses giving such errors and recompile. E.g. in Mlucas.c:
[code]
void sig_handler(int signo)
{
	if (signo == SIGINT) {
		fprintf(stderr,"received SIGINT signal.\n");	sprintf(cbuf,"received SIGINT signal.\n");
	} else if(signo == SIGTERM) {
		fprintf(stderr,"received SIGTERM signal.\n");	sprintf(cbuf,"received SIGTERM signal.\n");
//	} else if(signo == SIGHUP) {
//		fprintf(stderr,"received SIGHUP signal.\n");	sprintf(cbuf,"received SIGHUP signal.\n");
	}
	// Toggle a global to allow desired code sections to detect signal-received and take appropriate action:
	MLUCAS_KEEP_RUNNING = 0;
}
[/code]
...and similarly in the other 2 files which define signal handlers and are giving errors.
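A less invasive alternative to commenting the clauses out by hand is a preprocessor guard: since Windows' signal.h does not define SIGHUP, the clause then compiles only on platforms where the symbol exists. A minimal self-contained sketch (the cbuf formatting of the real handler is omitted for brevity):
[code]
#include <signal.h>
#include <stdio.h>

static int MLUCAS_KEEP_RUNNING = 1;

void sig_handler(int signo)
{
	if (signo == SIGINT) {
		fprintf(stderr, "received SIGINT signal.\n");
	} else if (signo == SIGTERM) {
		fprintf(stderr, "received SIGTERM signal.\n");
#ifdef SIGHUP	/* only compiled where the platform defines SIGHUP */
	} else if (signo == SIGHUP) {
		fprintf(stderr, "received SIGHUP signal.\n");
#endif
	}
	/* Toggle the global so long-running loops can wind down cleanly: */
	MLUCAS_KEEP_RUNNING = 0;
}

int main(void)
{
	sig_handler(SIGINT);
	printf("%d\n", MLUCAS_KEEP_RUNNING);	/* prints 0 */
	return 0;
}
[/code]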



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.