mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Mlucas (https://www.mersenneforum.org/forumdisplay.php?f=118)
-   -   Mlucas v19 available (https://www.mersenneforum.org/showthread.php?t=24990)

ewmayer 2019-12-01 23:09

Mlucas v19 available
 
[url=http://www.mersenneforum.org/mayer/README.html]Mlucas v19 has gone live[/url]. Use this thread to report bugs, build issues, and for any other related discussion.

kriesel 2019-12-03 01:19

[LEFT]Haven't tried it yet, but congrats on getting it out.
[/LEFT]

ewmayer 2019-12-03 21:58

1 Attachment(s)
[QUOTE=kriesel;531891][LEFT]Haven't tried it yet, but congrats on getting it out.
[/LEFT][/QUOTE]

Thanks. Meanwhile I have discovered a bug related to the new PRP-handling logic, of the kind I expected would be shaken out by further testing. This one specifically affects exponents really close to an FFT-length breakover point - I discovered it when I fired up a first-time PRP test of M96365419, which is very close to the 5120K-FFT exponent limit. It turns out the Gerbicz-check-related breaking of the usual checkpointing interval into multiple smaller subintervals (at the end of each of which we update the G-checkproduct) breaks the is-roundoff-error-reproducible-on-retry logic. It's a simple fix; I just uploaded updated versions of the release tarball and ARM prebuilt binaries, but folks who previously built and are running the Dec 1 code snapshot can use the simpler expedient of an incremental rebuild-and-relink of the single attached sourcefile.

ewmayer 2019-12-05 21:51

An interesting subtheme re. the newly-added PRP assignment-type support and the Gerbicz check ... shortly after the initial v19 release, I got e-mail from George about the importance of adding redundancy to the G-checking mechanism:
[quote]On Dec 4, 2019, at 7:57 AM, George Woltman wrote:

You are basically examining the code looking for any place a one-bit error could doom your result. This is most likely to occur right after a Gerbicz compare. In my first implementation, after the compare succeeded I threw away the equal value and only had the one value in memory (and wrote a save file with one value). If a cosmic ray hit during that time period, the end result would be wrong.

I now keep the compared (and equal) value in memory. One starts the next Gerbicz comparison value, the other continues the PRP exponentiation. Similarly when the computation ends, I generate a residue from both values and check that the res64s match.[/quote]
My reply follows.

Well, let's review how my code does things, say, starting from a post-interrupt savefile-read:

1. Read PRP residue into array a[], accumulated G-checkproduct into b[]. Both of these residues are written to savefiles together with associated full-residue checksums - I use the Selfridge-Hurwitz residues (full-length residue mod (2^35-1) and mod (2^36-1)) for that - and the checksums compared with those recomputed during the read-from-file.

2. Do an iteration interval leading up to the next savefile update, 10k or 100k mod-squarings of a[]. Every 1000 squarings update b[] *= a[]. The initial b[] is in pure-integer form; on subsequent mul-by-a[] updates the result is left in the partially-fwd-FFTed form returned by the carry step, i.e. fwd-weighted and initial-fwd-FFT-pass done.

3. On final G-checkproduct update of the current iteration interval, 1000 iterations before the next savefile write, save a copy of the current G-checkproduct b[] in a third array c[], before doing the usual G-checkproduct update b[] *= a[].

4. At end of the current iteration interval, prior to writing savefiles, do 1000 mod-squarings of c[] and compare the result to b[]. If mismatch, no savefiles written, instead roll back to last 'good' G-checkproduct data, which in my current first implementation means the previous multiple of 1M iterations.

So during the above, the G-checkproduct accumulator b[] is vulnerable to a 1-bit error, of the kind which would not show up, say, via a roundoff error during the ensuing *= a[] FFT-mul update.

So, what to do? Since the b[] data are kept in partially-fwd-FFTed form for most of the iteration interval, the Selfridge-Hurwitz (or similar CRC-style) checksums can't be easily computed from that. I think the easiest thing would be, every time I do an update b[] *= a[], do a memcpy to save a separate copy of the result, and compare that vs b[] prior to each update of the latter.

[followup e-mail a few hours later] Additional thoughts:

We are essentially trying to guard against a "false G-check failure", in the sense that the G-check might fail not because the PRP-residue array a[] had gotten corrupted but rather because the G-checkproduct accumulator b[] had. So every time we update b[] (or read it from a savefile) we also make a copy c[] = b[], and prior to each b[] *= a[] update we check that b == c. OK, but if at some point we find b != c, how can we tell which of the 2 is the good one? Obvious answer is to compute some kind of whole-array checksum at every update. Since post-update b[] may be in some kind of partially-FFTed state (that is the case for my code) the checksum needs to not assume integer data - perhaps simply treat the floats in a[] as integer bitfields. Would something as simple as computing a mod-2^64 sum of the uint64-reinterpretation-casted elements of a[] suffice, do you think? Further, any such checksum will be a much smaller bit-corruption target than b[], but to be safe one should probably make at least 2 further copies of *it*, call our 3 redundant checksums s1,s2,s3, then the attendant logic would look something like this:

[code]// Mod-2^64 sum of elements of double-float array a[], treated as uint64 bitfields:
uint64 sum64(double a[], int n) {
	int i;
	uint64 sum = 0ull;
	for(i = 0; i < n; i++)
		sum += *((uint64*)(a+i)); // Type-punning cast of a[i]
	return sum;
}
// Simple majority-vote consensus:
uint64 consensus_checksum(uint64 s1, uint64 s2, uint64 s3) {
	if(s1 == s2) return s1;
	if(s1 == s3) return s1;
	if(s2 == s3) return s2;
	return 0ull;
}
int n; // FFT length in doubles
double a[], b[], c[]; // a[] is PRP residue; b,c are redundant copies of G-checkproduct array
uint64 s1,s2,s3; // Triply-redundant whole-array checksum on b,c-arrays
...
[bunch of mod-squaring updates of a[]]
// Prior to each b[]-update, check integrity of array data:
if(b[] != c[]) { // Houston, we have a problem
	s1 = consensus_checksum(s1,s2,s3);
	if(s1 == sum64(b,n)) // b-data good
		/* no-op */ ;
	else if(s1 == sum64(c,n)) // c-data good, copy back into b
		b[] = c[];
	else // Catastrophic data corruption
		[roll back to last-good G-check savefile]
}
b[] *= a[]; // G-checkproduct update
s1 = s2 = s3 = sum64(b,n); // Triply-redundant whole-array checksum update
c[] = b[]; // Make a copy
[/code]
And if that is an effective anti-corruption strategy, the obvious question is, why not apply it to the main residue array a[] itself? Likely performance impact is one issue - the cost of making a copy of a[] and of updating the whole-array checksum at each iteration, while O(n) and thus certainly smaller than that of an FFT-mod-squaring, is likely going to be nontrivial, a few percent I would guess.

kriesel 2019-12-05 23:24

After the dust settles, an update on the Mlucas save file format description to final v18, and to v19 PRP would be appreciated. For your convenience, [url]https://www.mersenneforum.org/showpost.php?p=489491&postcount=2[/url]

ewmayer 2019-12-06 02:25

[QUOTE=kriesel;532137]After the dust settles, an update on the Mlucas save file format description to final v18, and to v19 PRP would be appreciated. For your convenience, [url]https://www.mersenneforum.org/showpost.php?p=489491&postcount=2[/url][/QUOTE]

Simple - PRP-test savefiles tack on an additional copy of the last 4 items in the 'current Mlucas file format' list: a full-length residue byte-array (this one holding the accumulated Gerbicz checkproduct) and the 3 associated checksums, totaling a further 18 bytes. Thus, where an LL-savefile read reads one such residue+checksum data quartet, then 3 bytes for the FFT-length-in-Kdoubles the code was using at the time of the savefile write (*note* your quote in post #2 needs to change that from 4 to 3 bytes), then 8 bytes for the circular shift to be applied to the (shift-removed) savefile residue, a PRP-savefile read follows those reads with a second read of a residue+checksum data quartet.

kriesel 2019-12-06 07:18

[QUOTE=ewmayer;532150]Simple, PRP test savefiles tack on an additional version of the last 4 items in the 'current Mlucas file format' list - full-length residue byte-array (this one holding the accumulated Gerbicz checkproduct) and the 3 associated checksums totaling a further 18 bytes. Thus, where an LL-savefile read reads one such residue+checksum data quartet, 3 bytes for FFT-length-in-Kdoubles which the code was using at time of savefile write (*note* your quote in post #2 needs to change that from 4 to 3 bytes), 8 bytes for circular-shift to be applied to the (shift-removed) savefile residue, a PRP-savefile read follows those reads with another read of a residue+checksum data quartet.[/QUOTE]Thanks; [url]https://www.mersenneforum.org/showpost.php?p=489491&postcount=2[/url] is updated and extended.

ewmayer 2019-12-06 20:13

[QUOTE=kriesel;532174]Thanks; [url]https://www.mersenneforum.org/showpost.php?p=489491&postcount=2[/url] is updated and extended.[/QUOTE]

FYI, the "master reference" for savefile format is, as always, the actual code - the relevant functions are read|write_ppm1_savefiles in the Mlucas.c source. ('ppm1' is short for 'Primality-test and P-1' ... the latter as yet unsupported, but we remain ever-optimistic ... the 2-input FFT-modmul support added in v19 for Gerbicz checking will help in that regard, since p-1 stage 2 needs that capability.)

ewmayer 2020-01-03 03:05

2 Attachment(s)
[b]***Patch *** 03 Jan 2020:[/b] This patch adds one functionality-related item, namely adding redundancy to the PRP-test Gerbicz-check mechanism to prevent data corruption in the G-check residue from causing a "false Gerbicz-check failure", i.e. a failure not due to a corrupted PRP-test residue itself. This more or less follows the schema laid out in post #4.

I have also patched another logic bug related to roundoff-error-retry, this one was occasionally causing the run to switch to the next-larger FFT length when encountering a reproducible roundoff error, rather than first retrying at the current FFT length but with a shorter carry-chain recurrence computation for DWT weights. Not fatal, just suboptimal in terms of CPU usage.

[b]NOTE ALSO[/b] that I hit a Primenet-server-side bug on 31 Dec when I used the primenet.py script to submit my first batch of v19 LL-test results (my previous v19 submissions were all PRP-test ones). The server code was incorrectly expecting a Prime95-style checksum as part of such results lines. The really nasty part of this was that I almost missed it: until now, the primenet.py script grepped the page resulting from each attempted result-line submission for "Error code"; if it found that, it emitted a user-visible echo of the error message, and the attempted submission line was not copied to the results_sent.txt file for archiving. In this case - I only saw this after retrying one of the submits via the manual-test webform - there was "Error" on the returned page, but it was not followed by "code", so the script treated the submissions as successful. I only noticed the problem when I checked the exponent-status page for one of the expos and saw no result had been registered. James Heinrich has fixed the server-side issue, and to be safe I've tweaked the primenet.py script to grep for just "Error". But if you used the script to submit any v19 LL-test results (PRP tests were being correctly handled at both ends) prior to the current patch, please delete the corresponding lines from your results_sent.txt file and retry submitting using the patched primenet.py file. To be safe, check the exponent status at mersenne.org to make sure your results appear there.

I just uploaded updated versions of the release tarball and ARM prebuilt binaries, but folks who previously built and are running the Dec 3 code snapshot can use the simpler expedient of an incremental rebuild-and-relink of the attached Mlucas.c sourcefile. The also-attached tweaked primenet.py file - matching the updated one in the release tarball - is not strictly necessary now that James has made the above-described server-side bugfix, but better safe than sorry, I say.

Jumba 2020-01-16 14:14

Version 19.0 error
 
I'm getting the following error after getting to the 100% mark:

ERROR: at line 2313 of file ../src/Mlucas.c
Assertion failed: After short-div, R != 0 (mod B)

Nothing has been written to the results.txt file since I started the run a week ago. I can restart the process, and it resumes from just before the end, but still spits out the same error after a couple minutes.

ewmayer 2020-01-16 20:15

[QUOTE=Jumba;535213]I'm getting the following error after getting to the 100% mark:

ERROR: at line 2313 of file ../src/Mlucas.c
Assertion failed: After short-div, R != 0 (mod B)

Nothing has been written to the results.txt file since I started the run a week ago. I can restart the process, and it resumes from just before the end, but still spits out the same error after a couple minutes.[/QUOTE]

That looks like you've found a bug in the PRP-residue postprocessing code ... could you upload your p[exponent] savefile to Dropbox or similar site so I can download it and re-run the final few (whatever) iterations within a debug session? PM me the resulting download location and the worktodo.ini file entry.

In the meantime, if you've not already done so, I suggest you switch the top 2 entries in worktodo.ini and start on the next assignment. By the time that finishes you can grab a bug-patched version of the code, which should allow you to successfully complete your above run.

Oh, your data should be fine, like I said this appears to strictly be a postprocessing bug.

ewmayer 2020-01-18 03:54

1 Attachment(s)
OK, starting from a copy of the OP's last-checkpoint file and running the last few thousand iterations using a debug-enabled build under gdb, I have found the cause of the bug - there is a small-multiplier computation in PRP final-residue postprocessing which needed an additional (mod b^2) operation, where b is the PRP-test base. None of my PRP runs to date revealed the error.

Anyone who downloaded/built the current 3 Dec code snapshot can do an incremental rebuild by compiling the attached patched Mlucas.c file and relinking. I'll upload updated source tarballs and ARM binaries over the weekend.

Once I see that OP has successfully completed the bug-affected run with the updated build and submitted the result I'll ask for an early double-check of the exponent, just to be on the safe side.

Thanks again for the bug report!

ewmayer 2020-01-18 21:43

The PRP-residue postprocessing bug reported in post #10 above has been fixed - please grab latest code tarball (or ARM binary for folks running on ARM). If your current build is based on the 3 January code snapshot, my "fixed" Post #3 in the above thread has an incremental single-file-rebuild option.

[b]Update:[/b] Coincidentally, a few hours after I uploaded the patched code and posted the above 'fixed' message, my latest haswell-quad PRP test, of an exponent ~96M, completed and hit the very same postprocessing bug. So I sent the above [url=https://www.mersenne.org/report_exponent/?exp_lo=96365519&full=1]96M exponent[/url] from my run out for an early double-check, which completed yesterday and verified the first-time-test result.

Since no other bugs have been reported in the past month, I am officially pronouncing v19 as the current stable release, several PRP-functionality-related bugs having been turned up and fixed as a result of the last 2 months' shakedown testing. Going to update the README.html page now.

paulunderwood 2020-02-06 00:06

I downloaded mlucas19_c2simd for my Odroid N2 and obtained a couple of WR PRP assignments. The assignments end in ",2" but no P-1 performance was indicated in the ".stat" file. :help:

ewmayer 2020-02-06 00:26

[QUOTE=paulunderwood;536828]I downloaded mlucas19_c2simd for my Odroid N2 and obtained a couple of WR PRP assignments. The assignments end in ",2" but no P-1 performance was indicated in the ".stat" file. :help:[/QUOTE]

The readme page notes that the fields related to p-1 testing are ignored by Mlucas. Thinking aloud: is there a way to tweak the assignment-fetch to request only exponents for which p-1 has already been done to the current default depth(s)?

paulunderwood 2020-02-06 00:34

[QUOTE=ewmayer;536829]The readme page notes that the fields related to p-1 testing are ignored by Mlucas. Thinking aloud, Is there a way to tweak the assignment-fetch to request only exponents for which p-1 has been done to the current default depth(s)?[/QUOTE]

I popped "PFactor=..." into my gpuowl worktodo.txt file (2nd and 3rd lines). I should know soon whether to continue PRP'ing them with Mlucas.

ewmayer 2020-02-16 20:01

[QUOTE=paulunderwood;536831]I popped "PFactor=..." into my gpuowl worktodo.txt file (2nd and 3rd lines). I should know soon whether to continue PRP'ing them with Mlucas.[/QUOTE]

I just did similarly with my latest batch of PRPs to-be-farmed-out to my various devices running Mlucas. I'm planning to add a few lines to the Mlucas readme to detail this, but want to make sure I get the different assignment-line formats for PRP and LL right in this regard. Let's review:

PRP=C57FF1C644A0CB16F5E2B5B3A9FC4E1D,1,2,98024161,-1,77,0 - Trailing '0' indicates p-1 has been done. We only want to do p-1 if trailing digit = 1 or 2, in which case we change PRP to Pfactor and paste the resulting assignment into the worktodo file for Prime95/mprime or gpuOwl. (Does CudaLucas also support p-1? Unfamiliar with it.)

Test=DDD21F2A0B252E499A9F9020E02FE232,48295213,69,0 - Trailing '0' indicates p-1 has *not* been done. How to munge the assignment into a p-1 one?

For DC assignments of either type, the same 2 trailing-digit conventions hold, yes? And if some p-1 was done prior to the first-time test of a given exponent, in preparation for handing out a 2nd time as a DC, does the server ever change the trailing digit from "p-1 has been done" to "p-1 needed" to reflect that deeper p-1 should be done prior to the DC?

kriesel 2020-02-16 21:31

[QUOTE=ewmayer;537717]Let's review:

PRP=C57FF1C644A0CB16F5E2B5B3A9FC4E1D,1,2,98024161,-1,77,0 - Trailing '0' indicates p-1 has been done. We only want to do p-1 if trailing digit = 1 or 2, in which case we change PRP to Pfactor and paste the resulting assignment into the worktodo file for Prime95/mprime or gpuOwl. (Does CudaLucas also support p-1? Unfamiliar with it.)[/QUOTE]No, CUDALucas does LL only, without Jacobi check. If you want CUDA P-1, that would be CUDAPm1, a separate app described by its author as alpha software.[QUOTE]
Test=DDD21F2A0B252E499A9F9020E02FE232,48295213,69,0 - Trailing '0' indicates p-1 has *not* been done. How to munge the assignment into a p-1 one?

For DC assignments of either type, the same 2 trailing-digit conventions hold, yes? And if some p-1 was done prior to the first-time test of a given exponent, in preparation for handing out a 2nd time as a DC, does the server ever change the trailing digit from "p-1 has been done" to "p-1 needed" to reflect that deeper p-1 should be done prior to the DC?[/QUOTE]Requesting a manual or PrimeNet P-1 assignment for the exponent in question will generate the prime95 form of P-1 assignment record with the correct # of tests saved included. Prime95 uses that integer as input to select appropriate bounds to maximize probable testing time saved, as does CUDAPm1.
The reference post for assignment (worktodo entry) format by app, version and type is at [URL]https://www.mersenneforum.org/showpost.php?p=522098&postcount=22[/URL]

ewmayer 2020-02-22 20:52

Readme has been updated with guidance re. doing preliminary p-1 using one of the GIMPS clients which support it. Ken, thanks for the comments.

ewmayer 2020-03-07 19:43

Ken reports build failure using msys2 under Windows -- "signal issues and real time extensions lib not available."
[code]ken@condorella MINGW64 ~/mlucas/mlucas_v19/build
$ gcc -c -O3 ../src/*.c >& build.log

ken@condorella MINGW64 ~/mlucas/mlucas_v19/build
$ grep error build.log
../src/fermat_mod_square.c:1842:18: error: 'SIGHUP' undeclared (first use in this function); did you mean 'SIGFPE'?
../src/fermat_mod_square.c:1844:18: error: 'SIGALRM' undeclared (first use in this function); did you mean 'SIGABRT'?
../src/fermat_mod_square.c:1846:18: error: 'SIGUSR1' undeclared (first use in this function); did you mean 'SIG_ERR'?
../src/fermat_mod_square.c:1848:18: error: 'SIGUSR2' undeclared (first use in this function); did you mean 'SIG_ERR'?
../src/mers_mod_square.c:2533:18: error: 'SIGHUP' undeclared (first use in this function); did you mean 'SIGFPE'?
../src/mers_mod_square.c:2535:18: error: 'SIGALRM' undeclared (first use in this function); did you mean 'SIGABRT'?
../src/mers_mod_square.c:2537:18: error: 'SIGUSR1' undeclared (first use in this function); did you mean 'SIG_ERR'?
../src/mers_mod_square.c:2539:18: error: 'SIGUSR2' undeclared (first use in this function); did you mean 'SIG_ERR'?
../src/Mlucas.c:187:21: error: 'SIGHUP' undeclared (first use in this function); did you mean 'SIGFPE'?
../src/Mlucas.c:189:21: error: 'SIGALRM' undeclared (first use in this function); did you mean 'SIGABRT'?
../src/Mlucas.c:191:21: error: 'SIGUSR1' undeclared (first use in this function); did you mean 'SIG_ERR'?
../src/Mlucas.c:193:21: error: 'SIGUSR2' undeclared (first use in this function); did you mean 'SIG_ERR'?

ken@condorella MINGW64 ~/mlucas/mlucas_v19/build
$ gcc -o Mlucas-x86 *.o -lm -lrt
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lrt
collect2.exe: error: ld returned 1 exit status

ken@condorella MINGW64 ~/mlucas/mlucas_v19/build
$ pacman -Su librt
error: target not found: librt

ken@condorella MINGW64 ~/mlucas/mlucas_v19/build
$ pacman -Su rt
error: target not found: rt[/code]
Any advice from fellow msys2 users welcome - I no longer have any Win-box to play with this myself.

Ken, which version of Windows, and which precise built-in Linux distro?

kriesel 2020-03-07 21:04

The preceding was on msys2/mingw64 (on Windows 7 x64 Pro, as fully patched as I could make it, HP Z600 dual E5645).
I went a few rounds with msys2's pacman implementation (which is adapted from Arch Linux, from what I've read), searching for the rt library without success. POSSIBLY it's on some mirror somewhere, if it has been ported.

I'm currently updating msys2 and its packages. This is taking a while, since I have a lot running and am frequently getting "Connection timed out after 10030 milliseconds" or similar from pacman.

kriesel 2020-03-08 00:01

"MSYS2 is a software distro and building platform for Windows

At its core, it is an independent rewrite of MSYS, based on modern Cygwin (POSIX compatibility layer) and MinGW-w64 with the aim of better interoperability with native Windows software. It provides a bash shell, Autotools, revision control systems and the like for building native Windows applications using MinGW-w64 toolchains.

It features a package management system, Pacman, to provide easy installation of packages. It brings many powerful features such as dependency resolution and simple complete system upgrades, as well as straight-forward package building." [URL]https://www.msys2.org/[/URL]

Mlucas v17 built OK on msys2 a year ago, as did v17.1. But v19 did not this week, partly because it requires linking to librt, the real-time extensions library. [URL]https://docs.oracle.com/cd/E36784_01/html/E36873/librt-3lib.html[/URL]

But there is no librt in msys2's package list. [URL]https://github.com/msys2/MSYS2-packages[/URL] [URL]https://packages.msys2.org/search?t=pkg&q=rt[/URL]
And there's no libc listed either.

And v18 didn't go so well either, today, on msys2, after a full update of msys2.[CODE]ken@condorella MINGW64 ~/mlucas/mlucas_v18/build
$ grep error build.log
../src/fermat_mod_square.c:1869:18: error: 'SIGHUP' undeclared (first use in this function)
../src/mers_mod_square.c:2382:18: error: 'SIGHUP' undeclared (first use in this function)
../src/Mlucas.c:182:21: error: 'SIGHUP' undeclared (first use in this function)

ken@condorella MINGW64 ~/mlucas/mlucas_v18/build
$ gcc -o Mlucas-x86 *.o -lm -lrt
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lrt
collect2.exe: error: ld returned 1 exit status
[/CODE]

kriesel 2020-03-08 16:14

mlucas v19 build success on WSL
 
Mlucas V19 built very smoothly on an Ubuntu install (19.04 I think; what's linux's equivalent to dos/win's "ver"?) atop Windows Subsystem for Linux atop Windows 10 Home 64-bit build 18362, and passed self test. [URL]https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux[/URL] is only available for Windows 10 and Server 2019. (VM approaches such as Oracle Virtualbox could be taken for Windows 8.x and 7, presumably with considerable overhead.)

Mlucas for X86, SSE2, and FMA3 built multithreaded without issue.
[CODE]ken@peregrine:~/mlucas_v19/mlucas_v19/build$ gcc -c -O3 -DUSE_THREADS ../src/*.c >& build.log
ken@peregrine:~/mlucas_v19/mlucas_v19/build$ grep error build.log
ken@peregrine:~/mlucas_v19/mlucas_v19/build$ gcc -o mlucas-x86-mt *.o -lm -lpthread -lrt
[/CODE][CODE]ken@peregrine:~/mlucas_v19/mlucas_v19/build$ gcc -c -O3 -DUSE_SSE2 -DUSE_THREADS ../src/*.c >& build.log
ken@peregrine:~/mlucas_v19/mlucas_v19/build$ grep error build.log
ken@peregrine:~/mlucas_v19/mlucas_v19/build$ gcc -o mlucas-sse2-mt *.o -lm -lpthread -lrt[/CODE][CODE]ken@peregrine:~/mlucas_v19/mlucas_v19/build$ gcc -c -O3 -DUSE_AVX2 -mavx2 -DUSE_THREADS ../src/*.c >& build.log
ken@peregrine:~/mlucas_v19/mlucas_v19/build$ grep error build.log
ken@peregrine:~/mlucas_v19/mlucas_v19/build$ gcc -o mlucas-fma3-mt *.o -lm -lpthread -lrt
[/CODE][CODE]ken@peregrine:~/mlucas_v19/mlucas_v19/build$ ./mlucas-fma3-mt -fftlen 192 -iters 100 -radset 0 -nthread 12

Mlucas 19.0

http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 7.4.0.
INFO: Build uses AVX2 instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 12 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 12 cores: 0.1.2.3.4.5.6.7.8.9.10.11.

Mlucas selftest running.....

/****************************************************************************/

INFO: Unable to find/open mlucas.cfg file in r+ mode ... creating from scratch.
NTHREADS = 12
INFO: Maximum recommended exponent for this runlength = 3888516; p[ = 3888509]/pmax_rec = 0.9999981998.
Initial DWT-multipliers chain length = [short] in carry step.
M3888509: using FFT length 192K = 196608 8-byte floats, initial residue shift count = 3736240
this gives an average 19.777979532877605 bits per digit
Using complex FFT radices 192 16 32
mers_mod_square: Init threadpool of 12 threads
Using 8 threads in carry step
100 iterations of M3888509 with FFT length 196608 = 192 K, final residue shift count = 744463
Res64: 71E61322CCFB396C. AvgMaxErr = 0.262946429. MaxErr = 0.312500000. Program: E19.0
Res mod 2^35 - 1 = 29259839105
Res mod 2^36 - 1 = 50741070790
Clocks = 00:00:00.331


Done ...[/CODE]ken@peregrine:~/mlucas_v19/mlucas_v19/build$ ./mlucas-fma3-mt -s m -cpu 0,6 >& selftest.log produced this cfg file and a 78kb log file.[CODE]19.0
2048 msec/iter = 49.19 ROE[avg,max] = [0.231026786, 0.281250000] radices = 256 16 16 16 0 0 0 0 0 0
2304 msec/iter = 58.64 ROE[avg,max] = [0.188309152, 0.226562500] radices = 288 16 16 16 0 0 0 0 0 0
2560 msec/iter = 64.61 ROE[avg,max] = [0.223995536, 0.281250000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 72.57 ROE[avg,max] = [0.215401786, 0.250000000] radices = 352 16 16 16 0 0 0 0 0 0
3072 msec/iter = 83.32 ROE[avg,max] = [0.243191964, 0.281250000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 87.62 ROE[avg,max] = [0.216573661, 0.281250000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 91.89 ROE[avg,max] = [0.256696429, 0.312500000] radices = 224 16 16 32 0 0 0 0 0 0
3840 msec/iter = 96.61 ROE[avg,max] = [0.197767857, 0.218750000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 104.96 ROE[avg,max] = [0.184507533, 0.218750000] radices = 256 16 16 32 0 0 0 0 0 0
4608 msec/iter = 117.71 ROE[avg,max] = [0.195703125, 0.234375000] radices = 288 16 16 32 0 0 0 0 0 0
5120 msec/iter = 142.07 ROE[avg,max] = [0.194757952, 0.218750000] radices = 320 16 16 32 0 0 0 0 0 0
5632 msec/iter = 151.15 ROE[avg,max] = [0.187960379, 0.222656250] radices = 352 16 16 32 0 0 0 0 0 0
6144 msec/iter = 173.96 ROE[avg,max] = [0.214758301, 0.250000000] radices = 768 16 16 16 0 0 0 0 0 0
6656 msec/iter = 198.43 ROE[avg,max] = [0.189202009, 0.250000000] radices = 208 32 32 16 0 0 0 0 0 0
7168 msec/iter = 210.43 ROE[avg,max] = [0.199651228, 0.218750000] radices = 224 16 32 32 0 0 0 0 0 0
7680 msec/iter = 232.87 ROE[avg,max] = [0.233147321, 0.312500000] radices = 240 16 32 32 0 0 0 0 0 0
[/CODE]That 142ms/iter at 5120K above is about 7 iter/sec on one core using hyperthreading at 5M. One could optimistically extrapolate that to 42 iter/sec for all 6 cores assuming no memory bottlenecking.

For comparison, prime95 v29.8b6 on the same i7-8750H, running directly on Windows and using all 6 cores, gives benchmark values ranging from 42 to 80 iter/sec at 5M, and is currently running at ~16.5 ms/iter on a 93M exponent. Note that I've not spent any time trying to tune Mlucas performance or run all cores. A fairer test might be running mprime on Ubuntu on WSL on Win10 for comparison, or single-core prime95 on Win10 versus single-core Mlucas compiled on msys2, if I could get it to compile and link there.

kriesel 2020-03-08 17:48

Running mprime benchmarking, 5120K fft, 1 core hyperthreaded, gave 66ms/iter.
Next time I do these I should stop the production prime95, mfakto on the IGP, and anything else that might be competing for cpu cycles or memory bandwidth instead of leaving them run as in the previously described timings.

ewmayer 2020-03-08 22:02

By way of comparison, my current PRP-3 run of p~103M @5632K using all 4 cores of my aged Haswell system (stock 3.3GHz, no HT) is getting ~16-17 ms/iter, implying a 1-thread timing of no worse than 66 ms on that system. So a whopping 142 ms/iter @5120K using 2 threads on a single HT physical core of your system indicates ... I'm not sure what. OTOH if Prime95/mprime gets 66 ms/iter on the same system using a similar 2-threads-1-physical-core setup, that ~2x is typical of the single-physical-core speed difference I've seen.

I suggest first rerunning the self-test/benchmark with 1, 2, 4 and 6 threads on the same number of physical cores (i.e. -cpu 0, -cpu 0:1, -cpu 0:3, -cpu 0:5), using 100 iters to get the total runtime down to something reasonable, and posting those numbers; then we can examine || scaling, whether overloading 1 core (your -cpu 0,6) was faster than not, etc.

kriesel 2020-03-08 22:34

Linux and "Linux" version info
 
Ok, Duck-Duck-Go is my linux tutor:

"Ubuntu linux" from the MS Store on WSL on Win10:
[CODE]$ uname -a
Linux peregrine 4.4.0-18362-Microsoft #476-Microsoft Fri Nov 01 16:53:00 PST 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/version
Linux version 4.4.0-18362-Microsoft (Microsoft@Microsoft.com) (gcc version 5.4.0 (GCC) ) #476 Microsoft Fri Nov 01 16:53:00 PST 2019
$ cat /etc/issue
Ubuntu 18.04.2 LTS \n \l
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic[/CODE]msys2 on Win7:
[CODE]$ uname -a
MINGW64_NT-6.1-7601 condorella 3.0.7-338.x86_64 2019-07-11 10:58 UTC x86_64 Msys
$ cat /proc/version
MINGW64_NT-6.1-7601 version 3.0.7-338.x86_64 (Alexx@WARLOCK) (gcc version 9.1.0 (GCC) ) 2019-07-11 10:58 UTC
$ cat /etc/issue
cat: /etc/issue: No such file or directory
$ lsb_release -a
bash: lsb_release: command not found[/CODE]Ubuntu 19.04 on Oracle Virtualbox 6.1 atop Win10:
[CODE]$ uname -a
Linux ken-peregrine-ubuntu 5.0.0-38-generic #41-Ubuntu SMP Tue Dec 3 00:27:35 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/version
Linux version 5.0.0-38-generic (buildd@lgw01-amd64-036) (gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)) #41-Ubuntu SMP Tue Dec 3 00:27:35 UTC 2019
$ cat /etc/issue
Ubuntu 19.04 \n \l
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 19.04
Release: 19.04
Codename: disco
[/CODE]Win 7:[CODE]>ver

Microsoft Windows [Version 6.1.7601][/CODE]Win 10:[CODE]>ver

Microsoft Windows [Version 10.0.18362.657][/CODE]

kriesel 2020-03-09 21:51

Mlucas V19 on "Ubuntu" on WSL on Win10, i7-8750H laptop,
production prime95 and mfakto stopped for the test;
incidental other background activity using ~5% of cpu capacity.
Note only one of the two SODIMM sockets is occupied; 16GB RAM

./mlucas-fma3-mt -fftlen 192 -iters 1000 -radset 0 -nthread x, varied nthread
1: 3.108 sec
2: 1.819
3: 1.883
4: 2.012
6: 2.020
12: 2.100

./mlucas-fma3-mt -fftlen 5120 -iters 100 -radset 0 -nthread x, varied nthread
1: 4.389 sec
2: 2.689
3: 2.351
4: 1.832
6: 1.736
12: 1.825

./mlucas-fma3-mt -fftlen 5632 -iters 100 -radset 0 -nthread x, varied nthread
1: 4.796 sec
2: 2.829
3: 2.442
4: 2.025
6: 1.888
12: 2.035

./mlucas-fma3-mt -fftlen 5632 -iters 100 -radset 0 -cpu 0:11
2.012 sec

./mlucas-fma3-mt -fftlen 5632 -iters 100 -radset 0 -cpu 0:5
1.870 sec

./mlucas-fma3-mt -fftlen 5632 -iters 100 -radset 0 -cpu 0:3
1.929 sec

./mlucas-fma3-mt -fftlen 5632 -iters 100 -radset 0 -cpu 0:1
2.773 sec
repeated timings 2.781, 2.788, 2.805, 2.924.

./mlucas-fma3-mt -fftlen 5632 -iters 100 -radset 0 -cpu 0
4.634 sec
repeated timings 4.624, 4.647, 4.647, 4.665.

ewmayer 2020-03-22 22:57

Example of corrupted residue data caught by Gerbicz check
 
1 Attachment(s)
My PRP test of [url=https://www.mersenne.org/report_exponent/?exp_lo=103928393&full=1]M103928393[/url] on my famously glitchy and error-prone Haswell quad finished a couple days ago. I noticed it had no fewer than 4 Gerbicz-check failure-and-retries along the way, so I decided a DC using gpuOwl on my GPU was in order. The problem was, when I tried reserving it via the Manual Tests pages - not strictly necessary, since gpuOwl can test without a proper Primenet assignment ID - I got
[code]Error code: 40
Error text: No assignment available meeting CPU, program code and work preference requirements, cpu_id: 145323, cpu # = 0, user_id = 20047[/code]
E-mailed Aaron and George to ask about that, George replied with
[quote]The manual assignments page was refusing to give out a PRPDC assignment unless the first PRP was done by prime95.
It was trying to prevent a gpuowler from DCing a previous gpuowl result (no shift count).
I see that Mlucas uses shift counts, so I changed the SQL query to allow assigning PRPDC as long as the first test had a shift count.

I tested the fix and got your exponent assigned. I've put it on one of my GPUs. You should have a result within 2 days.[/quote]
His DC just finished, and we match, which is gratifying to see. It's useful to analyze one of the G-check failures - here is the relevant snippet from the .stat (Mlucas log) file, a full bzipped copy of which is attached:
[code][2020-02-24 12:17:07] M103928393 Iter# = 27000000 [25.98% complete] clocks = 00:03:23.189 [ 20.3190 msec/iter] Res64: [b]F3BE8DF410F5D624[/b]. AvgMaxErr = 0.148711378. MaxErr = 0.218750000. Residue shift count = 16345459.
Gerbicz check failed! Restarting from last-good-Gerbicz-check data, or from scratch if iteration < 1000000
Restarting M103928393 at iteration = 26000000. Res64: 0E65A742C2BB20B1, residue shift count = 69955037
M103928393: using FFT length 5632K = 5767168 8-byte floats, initial residue shift count = 69955037
this gives an average 18.020698027177289 bits per digit
The test will be done in form of a 3-PRP test.
[2020-02-24 12:20:53] M103928393 Iter# = 26010000 [25.03% complete] clocks = 00:03:23.089 [ 20.3089 msec/iter] Res64: 475BF991D56F1BA0. AvgMaxErr = 0.148887068. MaxErr = 0.218750000. Residue shift count = 84179301.
...
[2020-02-24 18:01:56] M103928393 Iter# = 27000000 [25.98% complete] clocks = 00:03:22.197 [ 20.2198 msec/iter] Res64: [b]6A0AD63777D57CDB[/b]. AvgMaxErr = 0.148770224. MaxErr = 0.203125000. Residue shift count = 16345459.
At iteration 27000000, shift = 16345459: Gerbicz check passed.
[2020-02-24 18:05:42] M103928393 Iter# = 27010000 [25.99% complete] clocks = 00:03:22.284 [ 20.2285 msec/iter] Res64: D5ECCFB2699E1849. AvgMaxErr = 0.148870173. MaxErr = 0.218750000. Residue shift count = 44187191.[/code]
Parsing backward from both iter = 27M entries, we find where the Res64s first diverged: it's when the code hits a sudden-onset fatal ROE and the retry of the same interval 'succeeds' but - as it turns out - with some kind of "silent" data corruption. That is, the same system glitchiness which triggered the first fatal ROE recurred during the ensuing interval retry, just not in a way which triggered a fatal ROE the second time around:
[code][2020-02-24 03:36:55] M103928393 Iter# = 26600000 [25.59% complete] clocks = 00:03:23.000 [ 20.3000 msec/iter] Res64: 24006083D0E3DC08. AvgMaxErr = 0.148862082. MaxErr = 0.218750000. Residue shift count = 26402859.
[2020-02-24 03:40:20] M103928393 Iter# = 26610000 [25.60% complete] clocks = 00:03:22.976 [ 20.2977 msec/iter] Res64: 3D5C81ED9B60E581. AvgMaxErr = 0.148812770. MaxErr = 0.203125000. Residue shift count = 17047481.
[2020-02-24 03:43:46] M103928393 Iter# = 26620000 [25.61% complete] clocks = 00:03:22.984 [ 20.2984 msec/iter] Res64: CC78A4D41D7C36B0. AvgMaxErr = 0.148618110. MaxErr = 0.203125000. Residue shift count = 90014462.
[2020-02-24 03:47:13] M103928393 Iter# = 26630000 [25.62% complete] clocks = 00:03:25.069 [ 20.5070 msec/iter] Res64: 28666DBCDDDACF95. AvgMaxErr = 0.148698978. MaxErr = 0.218750000. Residue shift count = 71209289.
[2020-02-24 03:50:41] M103928393 Iter# = 26640000 [25.63% complete] clocks = 00:03:25.529 [ 20.5530 msec/iter] Res64: 97A4BE71C4C69F35. AvgMaxErr = 0.148884626. MaxErr = 0.234375000. Residue shift count = 74415220.
[2020-02-24 03:54:06] M103928393 Iter# = 26650000 [25.64% complete] clocks = 00:03:23.486 [ 20.3487 msec/iter] Res64: E21867090F1ED800. AvgMaxErr = 0.148870620. MaxErr = 0.218750000. Residue shift count = 35507203.
[2020-02-24 03:57:33] M103928393 Iter# = 26660000 [25.65% complete] clocks = 00:03:23.890 [ 20.3891 msec/iter] Res64: 80B2F7731FECF0FF. AvgMaxErr = 0.148867484. MaxErr = 0.187500000. Residue shift count = 44352630.
[2020-02-24 04:00:58] M103928393 Iter# = 26670000 [25.66% complete] clocks = 00:03:23.111 [ 20.3112 msec/iter] Res64: BCE2C7BC49D369AB. AvgMaxErr = 0.148717655. MaxErr = 0.250000000. Residue shift count = 94577777.
[2020-02-24 04:04:23] M103928393 Iter# = 26680000 [25.67% complete] clocks = 00:03:22.826 [ 20.2826 msec/iter] Res64: 8F058E4D087CAA99. AvgMaxErr = 0.148834566. MaxErr = 0.218750000. Residue shift count = 43649701.
M103928393 Roundoff warning on iteration 26687954, maxerr = 0.500000000000
Retrying iteration interval to see if roundoff error is reproducible.
Restarting M103928393 at iteration = 26680000. Res64: 8F058E4D087CAA99, residue shift count = 43649701
M103928393: using FFT length 5632K = 5767168 8-byte floats, initial residue shift count = 43649701
this gives an average 18.020698027177289 bits per digit
The test will be done in form of a 3-PRP test.
Retry of iteration interval with fatal roundoff error was successful.
[2020-02-24 04:10:03] M103928393 Iter# = 26690000 [25.68% complete] clocks = 00:02:56.183 [ 17.6184 msec/iter] Res64: BBC72ACB4A413178. AvgMaxErr = 0.148836098. MaxErr = 0.250000000. Residue shift count = 21946860.[/code]
Compare the same iteration interval for the post-failed-G-check retry of the full 1M-multiple iteration interval:
[code][2020-02-24 15:43:21] M103928393 Iter# = 26600000 [25.59% complete] clocks = 00:03:23.937 [ 20.3937 msec/iter] Res64: 24006083D0E3DC08. AvgMaxErr = 0.148862082. MaxErr = 0.218750000. Residue shift count = 26402859.
[2020-02-24 15:46:48] M103928393 Iter# = 26610000 [25.60% complete] clocks = 00:03:24.153 [ 20.4153 msec/iter] Res64: 3D5C81ED9B60E581. AvgMaxErr = 0.148812770. MaxErr = 0.203125000. Residue shift count = 17047481.
[2020-02-24 15:50:13] M103928393 Iter# = 26620000 [25.61% complete] clocks = 00:03:22.396 [ 20.2396 msec/iter] Res64: CC78A4D41D7C36B0. AvgMaxErr = 0.148618110. MaxErr = 0.203125000. Residue shift count = 90014462.
[2020-02-24 15:53:38] M103928393 Iter# = 26630000 [25.62% complete] clocks = 00:03:23.522 [ 20.3522 msec/iter] Res64: 28666DBCDDDACF95. AvgMaxErr = 0.148698978. MaxErr = 0.218750000. Residue shift count = 71209289.
[2020-02-24 15:57:04] M103928393 Iter# = 26640000 [25.63% complete] clocks = 00:03:23.579 [ 20.3579 msec/iter] Res64: 97A4BE71C4C69F35. AvgMaxErr = 0.148884626. MaxErr = 0.234375000. Residue shift count = 74415220.
[2020-02-24 16:00:30] M103928393 Iter# = 26650000 [25.64% complete] clocks = 00:03:23.273 [ 20.3273 msec/iter] Res64: E21867090F1ED800. AvgMaxErr = 0.148870620. MaxErr = 0.218750000. Residue shift count = 35507203.
[2020-02-24 16:03:55] M103928393 Iter# = 26660000 [25.65% complete] clocks = 00:03:23.569 [ 20.3570 msec/iter] Res64: 80B2F7731FECF0FF. AvgMaxErr = 0.148867484. MaxErr = 0.187500000. Residue shift count = 44352630.
[2020-02-24 16:07:21] M103928393 Iter# = 26670000 [25.66% complete] clocks = 00:03:23.612 [ 20.3612 msec/iter] Res64: BCE2C7BC49D369AB. AvgMaxErr = 0.148717655. MaxErr = 0.250000000. Residue shift count = 94577777.
[2020-02-24 16:10:46] M103928393 Iter# = 26680000 [25.67% complete] clocks = 00:03:22.849 [ 20.2850 msec/iter] Res64: 8F058E4D087CAA99. AvgMaxErr = 0.148834566. MaxErr = 0.218750000. Residue shift count = 43649701.
[2020-02-24 16:14:12] M103928393 Iter# = 26690000 [25.68% complete] clocks = 00:03:23.453 [ 20.3454 msec/iter] Res64: 070333818EEE6AE7. AvgMaxErr = 0.148934921. [b]MaxErr = 0.203125000[/b]. Residue shift count = 21946860.
M103928393 Roundoff warning on iteration 26694909, maxerr = 0.500000000000
Retrying iteration interval to see if roundoff error is reproducible.
Restarting M103928393 at iteration = 26690000. Res64: 070333818EEE6AE7, residue shift count = 21946860
M103928393: using FFT length 5632K = 5767168 8-byte floats, initial residue shift count = 21946860
this gives an average 18.020698027177289 bits per digit
The test will be done in form of a 3-PRP test.
Retry of iteration interval with fatal roundoff error was successful.[/code]
Notice how the retry also has a data-corruption glitch right after it passes 26.69M - but in this case the later 27M G-check assures us that all is well. We hope - with multiple such failures on this run, it was very gratifying to see the final result matching George's gpuOwl-run one. He knows, as do I, that we can *simulate* G-check and other kinds of failures in our software-development efforts all we like, but nothing convinces like the acid test of the real world.

It's funny - I used to regularly curse this, um, accursed Haswell system, but now that it's no longer doing a significant % of my GIMPS crunching I'm really enjoying the if-I-can-get-reliable-results-from-this-piece-of-crap aspects. But no more LL-tests or LL-DCs on this system, that's for sure.
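The backward parse described above amounts to lining up the two passes' (iter, Res64) pairs and finding the first mismatch. Here's a minimal sketch of that comparison in Python (the helper name is hypothetical; the data are taken from the log excerpts above):

```python
# Find the first 10K-iteration checkpoint at which two passes over the same
# iteration range report different Res64 values.
def first_divergence(pass1, pass2):
    for (i1, r1), (i2, r2) in zip(pass1, pass2):
        assert i1 == i2, "checkpoint iterations must line up"
        if r1 != r2:
            return i1
    return None  # no divergence found

# first pass (pre-retry) vs. second pass (post-G-check restart), from the logs:
a = [(26680000, "8F058E4D087CAA99"), (26690000, "BBC72ACB4A413178")]
b = [(26680000, "8F058E4D087CAA99"), (26690000, "070333818EEE6AE7")]
div = first_divergence(a, b)   # divergence first appears at iter 26690000
```
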

Oh, here is the full set of every-1M-iter interim residues for my run, including the G-check failures:
[code][2020-02-17 01:50:20] M103928393 Iter# = 1000000 Res64: BBBC3545748228D8. MaxErr = 0.250000000. shift = 72344423.
[2020-02-17 07:43:10] M103928393 Iter# = 2000000 Res64: 025C080BC4C6055B. MaxErr = 0.250000000. shift = 44575791.
[2020-02-17 13:36:33] M103928393 Iter# = 3000000 Res64: 85DB2AC8E470DFEC. MaxErr = 0.281250000. shift = 89726158.
[2020-02-17 19:21:57] M103928393 Iter# = 4000000 Res64: BA66000615EF926A. MaxErr = 0.218750000. shift = 87626016.
[2020-02-18 00:57:26] M103928393 Iter# = 5000000 Res64: E721C49DB5954694. MaxErr = 0.203125000. shift = 16969828.
[2020-02-18 06:24:33] M103928393 Iter# = 6000000 Res64: 9FDEA3BDD4B7FC45. MaxErr = 0.218750000. shift = 99400703.
[2020-02-18 13:07:25] M103928393 Iter# = 7000000 Res64: BE95DA8D2E615F52. MaxErr = 0.203125000. shift = 20741788.
[2020-02-18 22:01:34] M103928393 Iter# = 8000000 Res64: 60B0C6EE5B0DDCE6. MaxErr = 0.203125000. shift = 37090148.
[2020-02-19 06:30:59] M103928393 Iter# = 9000000 Res64: 511F198D4F2BE727. MaxErr = 0.203125000. shift = 93281939.
[2020-02-19 11:18:56] M103928393 Iter# = 10000000 Res64: AB10CE8109FF2E7E. MaxErr = 0.218750000. shift = 40201342.
[2020-02-19 17:02:02] M103928393 Iter# = 11000000 Res64: 0D79D26162CBA4B5. MaxErr = 0.203125000. shift = 101752387.
[2020-02-20 00:06:20] M103928393 Iter# = 12000000 Res64: EDA5832B97C533DC. MaxErr = 0.195312500. shift = 10233705.
[2020-02-20 09:02:10] M103928393 Iter# = 13000000 Res64: BF586784A1B0D9DD. MaxErr = 0.218750000. shift = 48015562.
[2020-02-20 18:00:31] M103928393 Iter# = 14000000 Res64: 48452729BF0A0732. MaxErr = 0.203125000. shift = 3158862.
[2020-02-21 02:56:16] M103928393 Iter# = 15000000 Res64: 8B80BB93C1361106. MaxErr = 0.203125000. shift = 96856822.
[2020-02-21 11:32:12] M103928393 Iter# = 16000000 Res64: FA48BCC7E0ABC5F7. MaxErr = 0.203125000. shift = 32146366.
[2020-02-21 20:44:32] M103928393 Iter# = 17000000 Res64: 377C185E322A9F98. MaxErr = 0.218750000. shift = 17221864.
[2020-02-22 02:20:33] M103928393 Iter# = 18000000 Res64: CD8C589AFB89DB19. MaxErr = 0.218750000. shift = 50146915.
[2020-02-22 08:04:17] M103928393 Iter# = 19000000 Res64: 24A95ADC07142A48. MaxErr = 0.187500000. shift = 69632311.
[2020-02-22 13:52:49] M103928393 Iter# = 20000000 Res64: 3C675B06ED79F705. MaxErr = 0.203125000. shift = 26706482.
[2020-02-22 19:42:02] M103928393 Iter# = 21000000 Res64: EC5660CA1082F345. MaxErr = 0.203125000. shift = 45377540.
[2020-02-23 01:23:59] M103928393 Iter# = 22000000 Res64: 5A363FFEAF9FFDD8. MaxErr = 0.218750000. shift = 33006999.
[2020-02-23 07:07:14] M103928393 Iter# = 23000000 Res64: 134074260FFFEFA5. MaxErr = 0.218750000. shift = 63572438.
[2020-02-23 12:49:41] M103928393 Iter# = 24000000 Res64: 3A557523E1D1B19A. MaxErr = 0.203125000. shift = 78368660.
[2020-02-23 18:27:31] M103928393 Iter# = 25000000 Res64: 138552F613F4BD6D. MaxErr = 0.203125000. shift = 16942643.
[2020-02-24 00:10:02] M103928393 Iter# = 26000000 Res64: 0E65A742C2BB20B1. MaxErr = 0.218750000. shift = 69955037.
Gerbicz check failed! Restarting from last-good-Gerbicz-check data, or from scratch if iteration < 1000000
[2020-02-24 12:17:07] M103928393 Iter# = 27000000 Res64: F3BE8DF410F5D624. MaxErr = 0.218750000. shift = 16345459.
[2020-02-24 18:01:56] M103928393 Iter# = 27000000 Res64: 6A0AD63777D57CDB. MaxErr = 0.203125000. shift = 16345459.
[2020-02-25 13:36:24] M103928393 Iter# = 28000000 Res64: A8F349B3B5B1AFF7. MaxErr = 0.203125000. shift = 58352056.
[2020-02-25 19:04:37] M103928393 Iter# = 29000000 Res64: F1CF00263AA7CC28. MaxErr = 0.203125000. shift = 48574191.
[2020-02-26 12:58:12] M103928393 Iter# = 30000000 Res64: 99D4FB9E0C76ADE4. MaxErr = 0.203125000. shift = 74711714.
[2020-02-26 20:16:40] M103928393 Iter# = 31000000 Res64: CD7BA2D0CBF319C6. MaxErr = 0.203125000. shift = 46484356.
[2020-02-27 01:38:02] M103928393 Iter# = 32000000 Res64: D26AFB429CFD18AD. MaxErr = 0.203125000. shift = 60793685.
[2020-02-27 06:51:41] M103928393 Iter# = 33000000 Res64: 02C2DAC6FA89A512. MaxErr = 0.203125000. shift = 37220461.
[2020-02-27 12:07:02] M103928393 Iter# = 34000000 Res64: F73E3B502508EC17. MaxErr = 0.250000000. shift = 73230751.
[2020-02-27 17:20:32] M103928393 Iter# = 35000000 Res64: 378304DEB0552788. MaxErr = 0.203125000. shift = 16550084.
[2020-02-28 01:06:29] M103928393 Iter# = 36000000 Res64: 44931F56605C0665. MaxErr = 0.218750000. shift = 50708651.
[2020-02-28 06:17:32] M103928393 Iter# = 37000000 Res64: 440D64F93493A5F0. MaxErr = 0.203125000. shift = 87534239.
[2020-02-28 11:31:11] M103928393 Iter# = 38000000 Res64: D5A0E22C8AC79E65. MaxErr = 0.218750000. shift = 35802507.
[2020-02-28 17:02:25] M103928393 Iter# = 39000000 Res64: 2C198B4B490A06E3. MaxErr = 0.218750000. shift = 30983032.
[2020-02-28 22:18:04] M103928393 Iter# = 40000000 Res64: 9F79C2894C3D7FBC. MaxErr = 0.187500000. shift = 44669779.
[2020-02-29 03:34:38] M103928393 Iter# = 41000000 Res64: B13F55C1BAE6FCC8. MaxErr = 0.218750000. shift = 46462316.
[2020-02-29 22:57:48] M103928393 Iter# = 42000000 Res64: EF9620EBB65ACC84. MaxErr = 0.218750000. shift = 12147797.
[2020-03-01 15:15:37] M103928393 Iter# = 43000000 Res64: 052C1F13AE12DFD3. MaxErr = 0.218750000. shift = 22454771.
[2020-03-01 20:24:23] M103928393 Iter# = 44000000 Res64: 7CDD9FA077AFEF10. MaxErr = 0.187500000. shift = 41381271.
[2020-03-02 01:33:42] M103928393 Iter# = 45000000 Res64: B1A45BB8243F2532. MaxErr = 0.203125000. shift = 46613693.
[2020-03-02 13:17:34] M103928393 Iter# = 46000000 Res64: 2C0A2FFA764C82BA. MaxErr = 0.203125000. shift = 92985671.
[2020-03-02 18:22:38] M103928393 Iter# = 47000000 Res64: 00CD4BD9278888D4. MaxErr = 0.218750000. shift = 46300678.
[2020-03-02 23:45:02] M103928393 Iter# = 48000000 Res64: B98EDEC6431ACA7A. MaxErr = 0.218750000. shift = 76984717.
[2020-03-03 16:22:45] M103928393 Iter# = 49000000 Res64: D3C28C254DD0D26F. MaxErr = 0.218750000. shift = 100048267.
Gerbicz check failed! Restarting from last-good-Gerbicz-check data, or from scratch if iteration < 1000000
[2020-03-03 21:20:31] M103928393 Iter# = 50000000 Res64: 5752753AF252BE20. MaxErr = 0.203125000. shift = 57343354.
[2020-03-04 03:14:25] M103928393 Iter# = 50000000 Res64: 4063D9F2B7BE34EE. MaxErr = 0.203125000. shift = 57343354.
[2020-03-04 08:23:33] M103928393 Iter# = 51000000 Res64: 490F2A5D543108AA. MaxErr = 0.218750000. shift = 45026483.
[2020-03-04 14:27:26] M103928393 Iter# = 52000000 Res64: 8A931552D40A1C3C. MaxErr = 0.187500000. shift = 50135037.
[2020-03-04 20:42:32] M103928393 Iter# = 53000000 Res64: 46C9814B2B4B4429. MaxErr = 0.218750000. shift = 60994797.
[2020-03-05 13:27:18] M103928393 Iter# = 54000000 Res64: F424E1ABA000A55D. MaxErr = 0.218750000. shift = 36384582.
[2020-03-05 18:46:37] M103928393 Iter# = 55000000 Res64: 9695E2732CD478FE. MaxErr = 0.218750000. shift = 66125097.
[2020-03-06 00:08:23] M103928393 Iter# = 56000000 Res64: 9BD8E857DAFCCD37. MaxErr = 0.218750000. shift = 58385411.
[2020-03-06 05:32:35] M103928393 Iter# = 57000000 Res64: 3E97DC5E3101937E. MaxErr = 0.203125000. shift = 30148663.
[2020-03-06 10:59:50] M103928393 Iter# = 58000000 Res64: 681CA06F306588A8. MaxErr = 0.218750000. shift = 57683924.
[2020-03-06 16:36:30] M103928393 Iter# = 59000000 Res64: 2313B6026B7D5B5A. MaxErr = 0.218750000. shift = 100909260.
[2020-03-06 21:54:06] M103928393 Iter# = 60000000 Res64: 79DA089E14798438. MaxErr = 0.203125000. shift = 52052766.
[2020-03-07 03:10:30] M103928393 Iter# = 61000000 Res64: 0604748846C7BA43. MaxErr = 0.203125000. shift = 50100806.
[2020-03-07 13:24:02] M103928393 Iter# = 62000000 Res64: 73BA8F7177734BFC. MaxErr = 0.218750000. shift = 24937938.
[2020-03-07 18:42:52] M103928393 Iter# = 63000000 Res64: E15DB4126ED9FDCC. MaxErr = 0.203125000. shift = 102809206.
[2020-03-08 00:02:44] M103928393 Iter# = 64000000 Res64: ACE78A6D6AFCD490. MaxErr = 0.187500000. shift = 10622924.
[2020-03-08 06:24:27] M103928393 Iter# = 65000000 Res64: 8B6F19963789D44D. MaxErr = 0.203125000. shift = 15187609.
[2020-03-08 11:48:59] M103928393 Iter# = 66000000 Res64: A13DA6AEB79BA5BC. MaxErr = 0.218750000. shift = 19703128.
[2020-03-08 17:08:14] M103928393 Iter# = 67000000 Res64: 9DE7325DEBECE0A7. MaxErr = 0.203125000. shift = 31776614.
[2020-03-08 22:27:35] M103928393 Iter# = 68000000 Res64: 0DD39D6A8AEA83C5. MaxErr = 0.187500000. shift = 16852816.
[2020-03-09 03:47:32] M103928393 Iter# = 69000000 Res64: CB9C15D14E09FA8A. MaxErr = 0.218750000. shift = 34242940.
[2020-03-09 09:10:22] M103928393 Iter# = 70000000 Res64: 2EF92E9C7474B7DF. MaxErr = 0.203125000. shift = 13081306.
[2020-03-09 14:32:28] M103928393 Iter# = 71000000 Res64: 68E9F52257DED1CB. MaxErr = 0.203125000. shift = 83966239.
[2020-03-09 19:42:51] M103928393 Iter# = 72000000 Res64: FB96E02765CAE316. MaxErr = 0.218750000. shift = 67499467.
Gerbicz check failed! Restarting from last-good-Gerbicz-check data, or from scratch if iteration < 1000000
[2020-03-10 02:44:11] M103928393 Iter# = 73000000 Res64: 684D004C6CB2BE79. MaxErr = 0.203125000. shift = 89238658.
[2020-03-10 07:58:53] M103928393 Iter# = 73000000 Res64: E618711B71EB2D54. MaxErr = 0.218750000. shift = 89238658.
[2020-03-10 13:41:19] M103928393 Iter# = 74000000 Res64: 17DDE850894B6914. MaxErr = 0.218750000. shift = 5272140.
[2020-03-10 19:02:27] M103928393 Iter# = 75000000 Res64: 32B399B8B93D9210. MaxErr = 0.218750000. shift = 45348029.
Gerbicz check failed! Restarting from last-good-Gerbicz-check data, or from scratch if iteration < 1000000
[2020-03-11 01:25:59] M103928393 Iter# = 76000000 Res64: 541D72E853D20C99. MaxErr = 0.218750000. shift = 93566275.
[2020-03-11 13:02:27] M103928393 Iter# = 76000000 Res64: C98E4FDC71CBB181. MaxErr = 0.218750000. shift = 93566275.
[2020-03-11 18:15:54] M103928393 Iter# = 77000000 Res64: 3D4B7BCF6B1A8368. MaxErr = 0.203125000. shift = 22817827.
[2020-03-11 23:27:29] M103928393 Iter# = 78000000 Res64: 7E17BD3314DBBEE4. MaxErr = 0.218750000. shift = 25349038.
[2020-03-12 11:59:14] M103928393 Iter# = 79000000 Res64: D4179244E4CEC500. MaxErr = 0.203125000. shift = 35774788.
[2020-03-12 17:35:40] M103928393 Iter# = 80000000 Res64: F91127EB7F404C68. MaxErr = 0.187500000. shift = 33102786.
[2020-03-12 23:08:51] M103928393 Iter# = 81000000 Res64: 8A0E250E067312BA. MaxErr = 0.203125000. shift = 98858751.
[2020-03-13 11:16:35] M103928393 Iter# = 82000000 Res64: 6C5E069A5B68F3A0. MaxErr = 0.203125000. shift = 56389958.
[2020-03-13 16:33:22] M103928393 Iter# = 83000000 Res64: 95B59DE3DC7CCF89. MaxErr = 0.218750000. shift = 7674443.
[2020-03-13 21:52:00] M103928393 Iter# = 84000000 Res64: 594F77368081548A. MaxErr = 0.203125000. shift = 15739078.
[2020-03-14 03:11:26] M103928393 Iter# = 85000000 Res64: 0D438FDBFFA56306. MaxErr = 0.234375000. shift = 21319387.
[2020-03-14 08:33:02] M103928393 Iter# = 86000000 Res64: 05C10F4FEA66F3EF. MaxErr = 0.218750000. shift = 14519481.
[2020-03-14 13:54:05] M103928393 Iter# = 87000000 Res64: E53A17F251CB45DE. MaxErr = 0.218750000. shift = 76083446.
[2020-03-14 19:14:57] M103928393 Iter# = 88000000 Res64: C1C67FC031E0C7C2. MaxErr = 0.218750000. shift = 59669909.
[2020-03-15 00:34:47] M103928393 Iter# = 89000000 Res64: 4C71AE552A4C1882. MaxErr = 0.203125000. shift = 5762860.
[2020-03-15 11:00:37] M103928393 Iter# = 90000000 Res64: 4FEC488B88836FAE. MaxErr = 0.203125000. shift = 58418774.
[2020-03-15 16:17:27] M103928393 Iter# = 91000000 Res64: F769669BA8561D38. MaxErr = 0.203125000. shift = 19813633.
[2020-03-15 21:34:46] M103928393 Iter# = 92000000 Res64: 6D5B294ED5503BF6. MaxErr = 0.207031250. shift = 37832227.
[2020-03-16 02:53:00] M103928393 Iter# = 93000000 Res64: 3ACB5D989A7F0AFF. MaxErr = 0.218750000. shift = 21992959.
[2020-03-16 08:14:54] M103928393 Iter# = 94000000 Res64: B33A4FFDC97BEA3F. MaxErr = 0.203125000. shift = 59446616.
[2020-03-16 13:34:35] M103928393 Iter# = 95000000 Res64: C9E06367147C632E. MaxErr = 0.203125000. shift = 49196708.
[2020-03-16 18:56:31] M103928393 Iter# = 96000000 Res64: 111AAB6935CA25B3. MaxErr = 0.218750000. shift = 9592755.
[2020-03-17 00:15:11] M103928393 Iter# = 97000000 Res64: 5DF28D3B980EDDC1. MaxErr = 0.218750000. shift = 100780262.
[2020-03-17 05:35:12] M103928393 Iter# = 98000000 Res64: 61712C5A42CD31F5. MaxErr = 0.218750000. shift = 76892552.
[2020-03-17 10:53:45] M103928393 Iter# = 99000000 Res64: 3C5CBC3DC2FE5DAF. MaxErr = 0.218750000. shift = 38276972.
[2020-03-17 16:11:59] M103928393 Iter# = 100000000 Res64: C8A48D71DC6613FD. MaxErr = 0.218750000. shift = 73920976.
[2020-03-17 21:19:25] M103928393 Iter# = 101000000 Res64: F6C5F459CD73AFFC. MaxErr = 0.218750000. shift = 40478354.
[2020-03-18 02:28:11] M103928393 Iter# = 102000000 Res64: 1A67D6A14E2C10C8. MaxErr = 0.203125000. shift = 56433856.
[2020-03-18 18:36:41] M103928393 Iter# = 103000000 Res64: D01FC617E6CD6FAB. MaxErr = 0.218750000. shift = 76799979.[/code]

LaurV 2020-03-23 08:31

What do I see there? Do you have some kind of "random" shift for every iteration? :shock:

R. Gerbicz 2020-03-23 13:08

[QUOTE=ewmayer;540551]My PRP test of [url=https://www.mersenne.org/report_exponent/?exp_lo=103928393&full=1]M103928393[/url] on my famously glitchy and error-prone Haswell quad finished a couple days ago. I noticed it had no fewer than 4 Gerbicz-check failure-and-retries along the way, so decided a DC using gpuOwl on my GPU was in order. [/QUOTE]
Are you using error checking for the last few iterations - in your case the final 928393 iterations?
You can do this in multiple ways.
Say, force an error check at iteration 103928000 and then (boringly) double-check the last 393 iterations.
Another way is gpuowl's idea: go past the end, do 929000 iterations, and then force an error check.
George's code is a little more complicated on this (not even using a fixed block size = 1000).

LaurV, the whole shifting idea isn't really necessary (though it is still good to have shifting).
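For readers following along, the error check under discussion can be sketched in a few lines: for a base-3 PRP residue sequence u_i = 3^(2^i) mod N, keep a running product d of the block-boundary residues; after each block of B squarings, the updated d must equal d_prev^(2^B) * 3 (mod N). A toy Python sketch - block size, exponent, and function name are illustrative, not Mlucas's or gpuowl's actual internals:

```python
# Toy Gerbicz check for a base-3 PRP test of N = 2^p - 1 (illustrative only).
def gerbicz_prp(p, B=100, nblocks=5):
    N = (1 << p) - 1
    u = 3            # u_0 = 3; after i squarings, u = 3^(2^i) mod N
    d = 3            # checksum: product of residues at block boundaries
    for _ in range(nblocks):
        d_prev = d
        for _ in range(B):
            u = u * u % N        # one PRP squaring
        d = d * u % N            # fold block-end residue into checksum
        # the check: d must equal d_prev^(2^B) * u_0 (mod N); any arithmetic
        # error inside the block breaks this identity with high probability
        if pow(d_prev, 1 << B, N) * 3 % N != d:
            raise RuntimeError("Gerbicz check failed")
    return u

res = gerbicz_prp(89)   # 500 error-checked squarings mod M89
```

The key property is that the checksum update costs one extra multiply per block, while the verification redoes only B squarings' worth of work on the small checksum value.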

LaurV 2020-03-23 16:11

[QUOTE=R. Gerbicz;540594]LaurV, the whole shifting idea isn't that very necessary (but it is still good to have shifting).[/QUOTE]
My argument was that having the shift value change randomly every time would be stupid. You should start with a shift and keep it to the end. Otherwise, restoring any test is impossible (e.g. for hunting bugs, fixing FFT messes, etc.). I used to find bugs in cudaLucas that appeared for one particular shift but not for another, i.e. for particular FFT values, and those bugs could be traced by repeating the test with the same shift and watching where the residues start to differ (doing a binary search on the checkpointed residues, etc.). Maybe my understanding of the shift column is wrong, but with random shifts not only does such "bug fixing" become impossible, the tests themselves become unsure: how do you know that, even if you started with two different shifts, it didn't happen that at exactly the same point of both tests (FC and DC) the shifts were the same, and both tests are screwed? (probabilistically impossible, but mother life is a bitch..., if you catch my point...)

ewmayer 2020-03-23 19:34

[QUOTE=R. Gerbicz;540594]Are you using error checking for the last few iterations, in your case (928393) ?[/QUOTE]
Not yet - that is on my to-do list for v20. Thus I currently have "99% coverage" in terms of applying your check.

[QUOTE=LaurV;540585]What do I see there? do you have some kind of "random" shift for every iteration? :shock:[/QUOTE]

No - the initial shift is pseudorandom in [0,p) (I should probably fiddle the code to make sure it's nonzero), but subsequent shifts are the result of successive (mod p) doublings of the initial shift. If you look at the top of my logfile, you see initial residue shift count = 13429491. 1 million mod-squarings later, we expect shift = 13429491*2^1000000 (mod 103928393). Here's a tiny snip of *nix bc code to illustrate:
[code]define shift_update(s0,niters,p) {
	auto s,i;
	s = s0;
	for(i = 0; i < niters; i++) {
		s = 2*s % p;
	}
	print s0," * 2^",niters," (mod ",p,") = ",s,".\n";
}[/code]
And let's see what the shift should be after 10^6 iterations:
[code]p=103928393
shift_update(13429491,10^6,p)
13429491 * 2^1000000 (mod 103928393) = 72344423.[/code]
...which matches the value in my every-1M-iteration list.
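Since the update is just repeated mod-doubling, one can also skip the loop entirely via modular exponentiation; e.g. in Python, with the same numbers as in the bc example:

```python
p = 103928393
s0 = 13429491
# closed form for the shift after n mod-squarings: s0 * 2^n (mod p);
# three-argument pow() does the modular exponentiation directly
s = s0 * pow(2, 10**6, p) % p
assert s == 72344423   # matches the bc loop's result
```
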

Interestingly, though, adding randomization to this process proves crucial for doing *Fermat* number testing-with-shift. Here's why: for squaring chains modulo a Fermat number Fm, each subsequent squaring doubles the shift count (mod 2^m), so we have the problem that any nonzero initial shift s0 will lead to a shift s = 0 after precisely m square-mods, and the shift will remain 0 throughout the remainder of the squaring chain. One way around this problem is to define an auxiliary random-bit array and, on the (i)th square-mod, if bit[i] = 1, do a further mod-doubling of the residue, thus yielding the shift-count update s[i] = 2*s[i-1] + bit[i]. For a pair of Pepin tests of Fm, such as are typically done to allow for real-time cross-validation of interim residues, it is also advisable to seed the bit-arrays differently for each run.
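The shift-goes-to-zero behavior is easy to demonstrate; here's a hedged Python sketch (m and s0 are arbitrary, and the bit array is a fixed pattern rather than truly random, purely to illustrate the update rule):

```python
m, s0 = 20, 12345      # shift for squarings mod Fm lives in [0, 2^m)
mod = 1 << m

# Without the random bits: m doublings multiply s0 by 2^m, killing any shift.
s = s0
for _ in range(m):
    s = 2 * s % mod
assert s == 0          # shift collapses to zero and stays there

# With the auxiliary bit array: s[i] = 2*s[i-1] + bit[i] (mod 2^m).
bits = [i % 2 for i in range(2 * m)]   # fixed illustrative pattern
s = s0
for b in bits:
    s = (2 * s + b) % mod
assert s != 0          # the injected bits keep the shift nonzero
```
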

W.r.t. debug-purposes reproducibility, the key is pseudorandomness in all this, i.e. that one's various 'random' numbers are 'reproducibly random', as the admittedly oxymoronic term of art has it.

LaurV 2020-03-24 02:48

[QUOTE=ewmayer;540649]
No - the initial shift is pseudorandom in [0,p)
<snip>
W.r.t. debug-purposes reproducibility, the key is pseudorandomness in all this, i.e. that one's various 'random' numbers are 'reproducibly random', as the admittedly oxymoronic term of art has it.[/QUOTE]
Exactly. We now talk the same language. So my understanding was wrong; I was put off by that random-looking string of shift counts on each line (which don't really need to be printed - you could just print the shift at the beginning). In the past there was talk on the forum about "random shifts" and I was afraid that you had subscribed to that silly idea. False alarm. Sorry.

tdulcet 2020-05-20 12:40

Install Script for Linux
 
I wrote a script to download, setup, build and run Mlucas on Linux. It supports x86 Intel and AMD and ARM CPUs: [URL]https://github.com/tdulcet/Distributed-Computing-Scripts#mlucas[/URL]

If the required dependencies (Make and the GNU C compiler) are already installed, it should work on any Linux distribution. Otherwise, it will install the required dependencies on Ubuntu and Debian/Raspbian. Pull requests are welcome!

By creating a Makefile and using Make's jobs ([C]-j[/C]) feature with one job for each CPU thread, this script will build Mlucas significantly faster than if you manually ran the gcc commands in the [URL="https://www.mersenneforum.org/mayer/README.html#download"]Mlucas README[/URL]. For example, if your CPU has four CPU threads, it will build approximately four times faster. This script follows the recommended instructions on the Mlucas README for each architecture and CPU, which should provide the best performance for most users. It also saves the Makefile so users can easily change the gcc parameters and rerun [C]make[/C].
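By way of illustration, the pattern behind this is the standard one-object-per-source Makefile, whose independent object rules are what let make -j compile in parallel. A hypothetical minimal sketch - compiler flags, library list, and target name are illustrative, not the script's actual output:

```makefile
# Hypothetical minimal Makefile in the spirit of the script described above.
# Each .c file compiles to its own .o independently, so `make -j$(nproc)`
# runs one compile job per CPU thread.  (Recipe lines must be tab-indented.)
CC     = gcc
CFLAGS = -c -O3 -DUSE_THREADS
OBJS   = $(patsubst %.c,%.o,$(wildcard *.c))

Mlucas: $(OBJS)
	$(CC) -o $@ $(OBJS) -lm -lpthread -lrt

%.o: %.c
	$(CC) $(CFLAGS) $< -o $@
```
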

There is a log of the script running on Travis CI [URL="https://travis-ci.org/github/tdulcet/Distributed-Computing-Scripts/jobs/688790504"]here[/URL]. Note that there are over 20,000 lines of output, most of which are warnings from gcc. There are also separate scripts to download, setup and run [URL="https://www.mersenneforum.org/showthread.php?p=491112#post491112"]Prime95[/URL] and [URL="https://www.mersenneforum.org/showthread.php?p=490167#post490167"]CUDALucas[/URL] on Linux.

ewmayer 2020-05-21 19:02

@tdulcet - thanks for this! Yes, parallel-buildability is the one clear advantage automake has over my beloved command-line mode.

I think a good way to proceed is to let other Mlucas-building GIMPSers try your script out, if feedback is positive I'll be happy to add a link to it, along with suitable text and credit-to-you, on the README page. Sound good?

tdulcet 2020-05-22 02:00

@ewmayer Sounds great. Thanks! Feedback is always welcome.

ewmayer 2020-06-07 20:40

[QUOTE=tdulcet;546167]@ewmayer Sounds great. Thanks! Feedback is always welcome.[/QUOTE]

FYI, I added a note re. your script under "News" at the Mlucas readme page - hopefully that will encourage a few more people to try it out and provide feedback.

bayanne 2020-07-05 11:05

Curiosity has got the better of me.
Will the new ARM cpu being designed to run Mac OS11 be able to run Mlucas, or is it far too early to know ...

Thanks

ewmayer 2020-07-05 23:35

[QUOTE=bayanne;549813]Curiosity has got the better of me.
Will the new ARM cpu being designed to run Mac OS11 be able to run Mlucas, or is it far too early to know ...

Thanks[/QUOTE]

I intend to port my ARM inline-asm routines to the forthcoming "Apple cores", yes.

ewmayer 2020-07-08 23:36

1 Attachment(s)
Couple significant announcements, of the I-have-good-news-and-bad-news variety.

o First, the bad: I have found and fixed a pair of critical bugs affecting FFT lengths of form 3*2[sup]k[/sup]. This means current GIMPS double-checks at FFT length 3M (3072K) and recently-reached-by-the-GIMPS-first-time-test-wavefront 6M (6144K). The bug is specific to 256-bit SIMD builds, meaning x86 avx and avx2-targeting builds. Assuming a multithreaded build, only FFT radix sets of form 192,[powers of 2] are affected, but radix-192 is the most common leading radix appearing in the runtime-based mlucas.cfg tuning file for GIMPS-relevant FFT lengths of the above form, in my experience. ARM builds are not affected, but I rebuilt those binaries as a matter of course, to make sure my bugfix didn't inadvertently break anything there.

o Now the good news: Thanks to Loïc Le Loarer, we have a vastly improved version of the primenet.py assignment-management script for users to try out. This uses Primenet v5 API calls to do cool stuff like register one's computer and monitor progress of one's various runs across multiple devices, as shown in the attachment below.

Check out the Mlucas readme page for details and updated links!

kriesel 2020-07-09 03:41

[QUOTE=ewmayer;550081]I have found and fixed a pair of critical bugs affecting FFT lengths of form 3*2[sup]k[/sup]. This means current GIMPS double-checks at FFT length 3M (3072K) and recently-reached-by-the-GIMPS-first-time-test-wavefront 6M (6144K). The bug is specific to 256-bit SIMD builds, meaning x86 avx and avx2-targeting builds. Assuming a multithreaded build, only FFT radix sets of form 192,[powers of 2] are affected, but radix-192 is the most common leading radix appearing in the runtime-based mlucas.cfg tuning file for GIMPS-relevant FFT lengths of the above form, in my experience. ARM builds are not affected, but I rebuilt those binaries as a matter of course, to make sure my bugfix didn't inadvertently break anything there.[/QUOTE]
This raises some questions.
100Mdigit exponents use 18M ffts.
For example, complex FFT radices 36 16 16 32 32
which is 3[SUP]2[/SUP] 2[SUP]k[/SUP].
Are they affected too? Have they been checked for whether they're affected?
How many versions back do the relevant bugs go?
But it would not affect SSE2 builds, correct?

ewmayer 2020-07-09 05:29

[QUOTE=kriesel;550088]This raises some questions.
100Mdigit exponents use 18M ffts.
For example, complex FFT radices 36 16 16 32 32
which is 3[SUP]2[/SUP] 2[SUP]k[/SUP].
Are they affected too? Have they been checked for whether they're affected?
How many versions back do the relevant bugs go?
But it would not affect SSE2 builds, correct?[/QUOTE]

The patch-related notes on the readme detail all this - your example is FFT length 9*2^k, not the 3*2^k possibly affected by the bug. The bug is actually quite easy to look for based on certain key params in the carry-related sourcefiles, now that I know to double-check those prior to release.

The bug is only related to new code introduced in v19, and if your run at whatever FFT length and radices doesn't crash for exponents < 98% of the max permitted for the given FFT length (look for the p[ = ...]/pmax_rec printed on startup), it's unaffected by the bug.
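As a rough sketch of that 98% rule (the pmax value below is a placeholder, not the real limit — the actual per-FFT-length maximum is what Mlucas prints as p/pmax_rec at startup):

```shell
p=96365419        # exponent being tested
pmax=96800000     # hypothetical max exponent for this FFT length
# Flag the run for a closer look only if it sits in the top 2% of the range:
awk -v p="$p" -v pmax="$pmax" \
    'BEGIN { if (p > 0.98*pmax) print "near limit - possibly affected"; else print "unaffected" }'
```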

henryzz 2020-07-09 11:59

Are any exponents potentially affected tested and double checked with mlucas?

tdulcet 2020-07-09 14:10

New PrimeNet script
 
[QUOTE=ewmayer;550081]Thanks to Loïc Le Loarer, we have a vastly improved version of the primenet.py assignment-management script for users to try out.[/QUOTE]

Thank you Loïc Le Loarer for creating this new PrimeNet script! I just updated my [URL="https://www.mersenneforum.org/showthread.php?p=545949#post545949"]Install Script for Linux[/URL] to use it. It will automatically set most of the [C]--register[/C] options with the computer's system info.

[QUOTE=ewmayer;550081]I have found and fixed a pair of critical bugs affecting FFT lengths of form 3*2[sup]k[/sup].[/QUOTE]

@ewmayer I tried to download the new source tarballs, but I am getting the old MD5 checksums and not what is listed on the Mlucas README. I was going to update my scripts with the new checksums.

ewmayer 2020-07-09 20:14

[QUOTE=henryzz;550101]Are any exponents potentially affected tested and double checked with mlucas?[/QUOTE]

The signature of the bug is an outright crash - no data corruption is involved, I can be sure of this since I hit it during a PRP run, and after fixing the bug and resuming the run, the ensuing Gerbicz check passed. But I'll be happy to post the expo to the strategic-DC thread in a couple of weeks once my run finishes, if that helps put your mind at ease.

[QUOTE=tdulcet;550110]Thank you Loïc Le Loarer for creating this new PrimeNet script! I just updated my [URL="https://www.mersenneforum.org/showthread.php?p=545949#post545949"]Install Script for Linux[/URL] to use it. It will automatically set most of the [C]--register[/C] options with the computers system info.[/QUOTE]
Thanks - I hope you and other Mlucas users find the new primenet.py useful.

[QUOTE]@ewmayer I tried to download the new source tarballs, but I am getting the old MD5 checksums and not what is listed on the Mlucas README. I was going to update my scripts with the new checksums.[/QUOTE]
Grr - I uploaded the 4 code-tarballs (2 ARM binaries together with sample mlucas.cfg files from my self-tests of said binaries, 2 source-tarballs with different compression methods) one dir above where they should go, into /src rather than /src/C. I only have ftp access to that site, so I am unable to 'mv' them into the C-subdir. And bizarrely, when I re-upload them into the proper /src/C dir, the server isn't updating the files! See ftp log below; note the 4 files at the end are still dated 18 Jan.

By way of workaround, I have uploaded a fresh version of the README, which links to the 4 src-files, not the old ones in src/C. Please reload README.html and try the links again.

[code]MacBook:src ewmayer$ ll ../*19*.t*
-rwxrwxrwx 1 ewmayer staff 1210672 Jul 5 13:47 ../Mlucas_v19_c2simd.txz
-rwxrwxrwx 1 ewmayer staff 1301896 Jul 5 13:47 ../Mlucas_v19_nosimd.txz
-rwxr-xr-x 1 ewmayer staff 2819344 Jul 5 13:47 ../mlucas_v19.tbz2
-rwxrwxrwx 1 ewmayer staff 1663952 Jul 5 13:47 ../mlucas_v19.txz
MacBook:src ewmayer$ md5 ../*19*.t*
MD5 (../Mlucas_v19_c2simd.txz) = 281361ac4eb32cc922299fc0d8916035
MD5 (../Mlucas_v19_nosimd.txz) = dab56d0566043b4081ee5048b5cc3842
MD5 (../mlucas_v19.tbz2) = 127178db00bc2852be85bcea4b10988e
MD5 (../mlucas_v19.txz) = 10906d3f1f4206ae93ebdb045f36535c
MacBook:src ewmayer$ mftp
Connected to 209.68.5.141.
220 ProFTPD Server (pair Networks, Inc FTP server) [209.68.5.141]
331 Password required for vang_mayer
Password:
230 User vang_mayer logged in
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd src/C
250 CWD command successful
ftp> mput ../*19*.t*
mput ../Mlucas_v19_c2simd.txz [anpqy?]? y
229 Entering Extended Passive Mode (|||6186|)
150 Opening BINARY mode data connection for ../Mlucas_v19_c2simd.txz
100% |******************************************************************************************| 1182 KiB 548.29 KiB/s 00:00 ETA
226 Transfer complete
1210672 bytes sent in 00:02 (504.38 KiB/s)
mput ../Mlucas_v19_nosimd.txz [anpqy?]? y
229 Entering Extended Passive Mode (|||7867|)
150 Opening BINARY mode data connection for ../Mlucas_v19_nosimd.txz
100% |******************************************************************************************| 1271 KiB 507.57 KiB/s 00:00 ETA
226 Transfer complete
1301896 bytes sent in 00:02 (486.17 KiB/s)
mput ../mlucas_v19.tbz2 [anpqy?]? y
229 Entering Extended Passive Mode (|||1071|)
150 Opening BINARY mode data connection for ../mlucas_v19.tbz2
100% |******************************************************************************************| 2753 KiB 662.17 KiB/s 00:00 ETA
226 Transfer complete
2819344 bytes sent in 00:04 (643.03 KiB/s)
mput ../mlucas_v19.txz [anpqy?]? y
229 Entering Extended Passive Mode (|||25712|)
150 Opening BINARY mode data connection for ../mlucas_v19.txz
100% |******************************************************************************************| 1624 KiB 619.33 KiB/s 00:00 ETA
226 Transfer complete
1663952 bytes sent in 00:02 (589.63 KiB/s)
ftp> qui
221 Goodbye.
MacBook:src ewmayer$ mftp
Connected to 209.68.5.141.
220 ProFTPD Server (pair Networks, Inc FTP server) [209.68.5.141]
331 Password required for vang_mayer
Password:
230 User vang_mayer logged in
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd src/C
250 CWD command successful
ftp> ls *19*.t*
229 Entering Extended Passive Mode (|||48690|)
150 Opening BINARY mode data connection for file list
-rw-r--r-- 1 vang_mayer users 1210636 Jan 18 16:33 Mlucas_v19_c2simd.txz
-rw-r--r-- 1 vang_mayer users 1301848 Jan 18 16:34 Mlucas_v19_nosimd.txz
-rw-r--r-- 1 vang_mayer users 2769578 Jan 18 16:34 mlucas_v19.tbz2
-rw-r--r-- 1 vang_mayer users 1611944 Jan 18 16:34 mlucas_v19.txz[/code]

Nick 2020-07-09 21:17

Is the mput taking the whole source file path as the destination file path,
including the initial two dots?

ewmayer 2020-07-09 21:34

[QUOTE=Nick;550129]Is the mput taking the whole source file path as the destination file path,
including the initial two dots?[/QUOTE]

I thought it was, based on the fact that it echoes the precise file size transferred, e.g. "1210672 bytes sent." Never had any issues with local-path stuff like this before that I can recall, but anyhow, tried cd'ing up one dir locally, into the precise dir containing the 4 tar-files, re-did the ftp, now as 'mput *19*.*t*', same file sizes echoed, more-or-less-same upload times, but *now* the files actually get updated. WTF? If [shitty ftp utility] can't handle local-paths, then [shitty ftp utility] shouldn't tell me it's uploading files, and take the expected time to upload files, and echo exactly how-many-bytes-transferred, while not actually doing the upload, or uploading to some kind of byte-limbo at the far end. Ridiculous for such a basic Linux utility to misbehave like this, especially since one can't well change the local-dir while in an ftp session, so the need to support local-paths is obvious, for the same reason we support local-paths in a Linux term session: because we want to be able to access files in multiple locations, without having to cd into each one, and then back into the dir we are working from. Jeebus, that's the kind of stupid getting-in-the-user's-way one typically associates with Windows.

/rant

chris2be8 2020-07-10 15:36

Could it have taken the file name you gave it, including the ../, as the desired destination, and put the files one level up from where you meant to put them? That would explain things.

Also check if mftp has a [c]lcd[/c] command. I don't have mftp installed, but ftp does have a [c]lcd[/c] command to change local directory.

Chris

ewmayer 2020-07-11 00:31

[QUOTE=chris2be8;550184]Could it have taken the file name you gave it ,including the ../, as the desired destination, and put the files one level up from where you meant to put them? That would explain things.[/QUOTE]
Yep, that's it - I had Mike/Xyzzy (who owns the account and thus has file-move-and-delete ability, whereas I can only upload) delete the 4 copies that ended up one-dir-too-high (at the same time I updated the README links to point back to src/C, the correct server-side dir, rather than to the 4 one-dir-too-high files), then re-did my 'failed upload' from my local src dir using 'mput ../[wildcarded tar-file abbreviation]', and indeed saw what you describe. How silly - in a Linux term-session, if I 'cp ../[file] [destination dir]', [file] ends up in [destination dir], not in [destination dir]/.. . Why would one write the code for ftp to behave fundamentally differently?

[QUOTE]Also check if mftp has a [c]lcd[/c] command. I don't have mftp installed, but ftp does have a [c]lcd[/c] command to change local directory.

Chris[/QUOTE]
Clarification - 'mftp' is my alias for ftp-under-mike's-userid-to-the-server-in-question, i.e. it's just ftp. Using lcd is a good tip - thanks.

ewmayer 2020-07-12 01:03

[QUOTE=bayanne;549813]Curiosity has got the better of me.
Will the new ARM cpu being designed to run Mac OS11 be able to run Mlucas, or is it far too early to know ...[/QUOTE]

I am keen to start porting my ARM assembly-code routines via the ARM SVE framework to the new Apple/ARM CPUs, but for the life of me cannot find basic info like an Instruction Set Reference and whether the vector-math will be 128-bit-wide like current ARMs, or 256-bits like current x86_64. Had a look here, no joy - the 2nd link sounds promising, but is just a list of typealiases:

[url]https://developer.apple.com/documentation/apple_silicon/addressing_architectural_differences_in_your_macos_code[/url]

[url]https://developer.apple.com/documentation/accelerate/simd[/url]

Also, will code built using command-line gcc and llvm/clang - as I use on my venerable Core2Duo MacBook Classic - run on the new Apple-Silicon versions of MacOS, or are binaries going to be inextricably tied to the iOS/Xcode/App-dev framework?

bayanne 2020-07-12 05:58

[QUOTE=ewmayer;550308]I am keen to start porting my ARM assembly-code routines via the ARM SVE framework to the new Apple/ARM CPUs, but for the life of me cannot find basic info like an Instruction Set Reference and whether the vector-math will be 128-bit-wide like current ARMs, or 256-bits like current x86_64. Had a look here, no joy - the 2nd link sounds promising, but is just a list of typealiases:

[url]https://developer.apple.com/documentation/apple_silicon/addressing_architectural_differences_in_your_macos_code[/url]

[url]https://developer.apple.com/documentation/accelerate/simd[/url]

Also, will code built using command-line gcc and llvm/clang - as I use on my venerable Core2Duo MacBook Classic - run on the new Apple-Silicon versions of MacOS, or are binaries going to be inextricably tied to the iOS/Xcode/App-dev framework?[/QUOTE]

Perhaps there will be some similarity to how procedures are run using iPad/iPhone etc ...

chris2be8 2020-07-12 15:31

[QUOTE=ewmayer;550235]Clarification - 'mftp' is my alias for ftp-under-mike's-userid-to-the-server-in-question, i.e. it's just ftp. [/QUOTE]

From the man page for ftp:
[quote]
put local-file [remote-file]
Store a local file on the remote machine. If remote-file is left unspecified, the local file name is used after processing according to
any ntrans or nmap settings in naming the remote file. File transfer uses the current settings for type, format, mode, and structure.
[/quote]

So you can copy and rename a file in one operation. The whole man page for ftp is well worth reading if you use it at all often.

Chris

ldesnogu 2020-07-12 22:30

[QUOTE=ewmayer;550308]I am keen to start porting my ARM assembly-code routines via the ARM SVE framework to the new Apple/ARM CPUs, but for the life of me cannot find basic info like an Instruction Set Reference and whether the vector-math will be 128-bit-wide like current ARMs, or 256-bits like current x86_64.[/QUOTE]
What makes you think Apple CPU will have SVE by end of this year?

ewmayer 2020-07-12 22:49

[QUOTE=ldesnogu;550387]What makes you think Apple CPU will have SVE by end of this year?[/QUOTE]

SWAG - but if the new Apple CPUs are just running the vanilla Arm 128-bit ASIMD behind all that hoopla, hey, less work for me.

I just hope they're not gonna walled-garden it by requiring code running on it to be packaged as an iOS app, but it does rather seem like just the sort of dickish move modern Big Tech companies like to pull.

chris2be8 2020-07-13 15:44

I learnt a lot about FTP by typing help when in it. That gets a list of commands, and you can then ask for help on each command, which is a bit more concise than the man page.

It's especially useful on a platform where you can't read the man page for some reason (e.g. z/OS). FTP is similar but not identical on all platforms.

As you may be able to tell I've had to use it on several platforms. But I prefer scp or sftp if they are available.

Chris

ewmayer 2020-07-13 19:26

[QUOTE=chris2be8;550454]As you may be able to tell I've had to use it on several platforms. But I prefer scp or sftp if they are available.[/QUOTE]

I always use scp when available, perhaps my expectations re. fs-path handling have been colored by that. But on this particular server (or perhaps my remote-access privileges to it), only ftp is available.

kriesel 2020-07-29 16:48

PRP proof
 
Are you implementing patnashev's prp proof generation in Mlucas?

Uncwilly 2020-07-29 20:17

:goodposting:
Hadn't thought to ask that myself. If it had come to mind, I would have. It will be useful when we find the next candidate prime.

ewmayer 2020-07-29 22:27

PRP-proof support will be in v20, yes. I am alas behind the curve there - between the pandemic and a series of non-life-threatening but still frequently day-week-and-month-ruining health bugaboos, this year has been one of continual annoying distractions. And EOM my housemates-of-2-years (young professional couple who just bought a starter home in the area) are vacating the MBR suite of our large shared apartment, so I have tons of busywork to do getting the place ready to show to prospective renters. What a year...

My one main concern re. PRP-proof support is that it appears that the memory needs will relegate many smaller compute devices (Android phones, Odroid and RPi-style micros) to doing LL-DC and cleanup PRP-DC. It's downright undemocratic elitism, it is. ;)

kriesel 2020-07-30 04:01

[QUOTE=ewmayer;551947]My one main concern re. PRP-proof support is that it appears that the memory needs will relegate many smaller compute devices (Android phones, Odroid and RPi-style micros) to doing LL-DC and cleanup PRP-DC. It's downright undemocratic elitism, it is. ;)[/QUOTE]
Low-power proofs are better than none. Standalone devices could drop to proof power 6 (or even 5 if necessary?) and still save ~90+% of a DC.

Per [URL]https://mersenneforum.org/showpost.php?p=548565&postcount=46[/URL] power 7 takes 1.5GB disk space for residues at 100M p. Since Odroid is Ubuntu and GigE, why not pile residues on a network shared drive and then clean them up after the proof file exists? A Droid, Pi or phone farm could share a single TB drive.
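That 1.5GB figure checks out with simple arithmetic: a power-P proof keeps 2[SUP]P[/SUP] interim residues, each about p/8 bytes for exponent p (a back-of-envelope sketch; exact on-disk overhead will differ slightly):

```shell
p=100000000; P=7
# 2^7 residues * (100e6/8) bytes each, converted to GiB:
awk -v p="$p" -v P="$P" 'BEGIN { printf "%.1f GiB\n", 2^P * (p/8) / 1024^3 }'
# prints: 1.5 GiB
```

Dropping to power 6 halves that to about 0.75 GiB, which is the trade-off being discussed for small devices.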
[QUOTE=ewmayer;551947]PRP-proof support will be in v20, yes. I am alas behind the curve[/QUOTE]Right is more important than soon. And life happening affects how soon is practical.

Dylan14 2020-11-28 19:46

I have posted a working PKGBUILD for the latest Mlucas to the AUR. You can find it [URL="https://aur.archlinux.org/packages/mlucas/"]here.[/URL]

There are two patches that I had to make to the source to get it to build correctly:
1. In the file platform.h, I had to comment out line 1304:
[code]#include <sys/sysctl.h>[/code]
This is because the sysctl.h header was removed in Linux Kernel 5.5, per [url=https://github.com/PowerShell/PowerShell-Native/issues/33]this issue on the PowerShell GitHub.[/url]
2. In the file Mlucas.c, I removed the *fp part of the FILE declaration on line 100, because the linker (gcc 10.2.0) complained that fp was defined elsewhere (namely, in gcd_lehmer.c).
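The two patches can also be applied mechanically with sed; the file contents below are minimal stand-ins (the real line-100 declaration may differ), and matching on the pattern rather than a fixed line number keeps this robust across releases:

```shell
# Stand-in files containing just the offending lines:
printf '#include <sys/sysctl.h>\n' > platform.h
printf 'FILE *fp, *fq;\n' > Mlucas.c   # hypothetical line-100 content

# 1. Comment out the header removed in Linux kernel 5.5:
sed -i 's|^#include <sys/sysctl.h>|/* & */|' platform.h
# 2. Drop the duplicate definition of fp (also defined in gcd_lehmer.c):
sed -i 's|^FILE \*fp, |FILE |' Mlucas.c
```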

pvn 2020-11-30 18:54

I just built v19 and I'm fairly new to the Arm universe. I am poking around on some of the AWS EC2 instances with "Graviton" processors. I notice that if I run with 4 cores, using a command line like this:


[CODE]# ./Mlucas -s m -cpu 0:3[/CODE]

then I get this message in the output quite a bit:

[CODE]mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.[/CODE]

and, sure enough, runs with four cores tend to be (much) slower than 2 cores or even 1 core on the same instance. Is there something I should be doing differently?

ewmayer 2020-12-11 21:41

@pvn: Sorry for the belated reply - that warning message is more common for larger threadcounts; it's basically telling you that part of the FFT code needs the leading radix (leftmost in the "Using complex FFT radices" info-print) to be divisible by #threads in order to run optimally. Example from a DC my last functioning bought-cheap-used Android phone is currently doing:
[i]
Using complex FFT radices 192 32 16 16
[/i]
The leading radix here is radix0 = 192, thus radix0/2 = 96 = 32*3. Sticking to power-of-2 thread counts (which the other main part of my 2-phases-per-iteration FFT code needs to run optimally) we'd be fine for #threads = 2,4,8,16,32, but 64 would give you the warning you saw.

Do you recall which precise radix set you saw the warning at in your case? To see it for 4-threads implies radix0/2 is not divisible by 4, which is only true for a handful of small leading radices: radix0 = 12,20,28,36,44,52,60. That's no problem, it just means that in using the self-tests to create the mlucas.cfg file for your particular -cpu [lo:hi] choice, the above suboptimality will likely cause a different FFT-radix-combo at the given FFT length to run best, which will be reflected in the corresponding mlucas.cfg file entry.
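The warning condition itself is just a divisibility test; a POSIX-shell sketch using the radix0 values mentioned above:

```shell
# Warn when radix0/2 is not an exact multiple of the thread count.
check() {
    if [ $(( ($1 / 2) % $2 )) -eq 0 ]; then
        echo "radix0=$1, NTHREADS=$2: ok"
    else
        echo "radix0=$1, NTHREADS=$2: will hurt performance"
    fi
}
check 192 4    # 96 % 4 == 0  -> ok
check 36 4     # 18 % 4 != 0  -> the 4-thread warning case
check 192 64   # 96 % 64 != 0 -> warning at high thread counts
```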

I've always gotten quite good multithreaded scaling on my Arm devices (Odroid min-PC and Android phone) up to 4-threads - did you run separate self-tests for -cpu 0, -cpu 0:1 and -cpu 0:3 and compare the resulting mlucas.cfg files?

On the Graviton instance you're using, what does /proc/cpuinfo show in terms of #cores?

pvn 2021-01-10 17:03

Hi Ernst, thanks for looking at this, and apologies for delays on my end.

[QUOTE]Do you recall which precise radix set you saw the warning at in your case? To see it for 4-threads implies radix0/2 is not divisible by 4, which is only true for a handful of small leading radices: radix0 = 12,20,28,36,44,52,60. That's no problem, it just means that in using the self-tests to create the mlucas.cfg file for your particular -cpu [lo:hi] choice, the above suboptimality will likely cause a different FFT-radix-combo at the given FFT length to run best, which will be reflected in the corresponding mlucas.cfg file entry.[/QUOTE]Does this mean that the self-test run is taking longer because it's... weeding out the unsuitable radices? I think this makes sense given what I see in the resulting cfg files: at any given FFT length, the msec/iter roughly scales with the number of cores used, even when the 4-core self-test takes unexpectedly long overall.


Also, it seems important to note that all of the radices that actually get saved in the mlucas.cfg when running -cpu 0:3 are evenly divisible by NTHREADS*2 (in this case, NTHREADS=4).


Here's some of the output with the radix sets that gave the "this will hurt performance" message (these runs seem to take about 50% more time than the other runs at the same FFT size):

M43765019: using FFT length 2304K = 2359296 8-byte floats, initial residue shift count = 29224505
this gives an average 18.550033145480686 bits per digit
Using complex FFT radices 36 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M48515021: using FFT length 2560K = 2621440 8-byte floats, initial residue shift count = 31467905
this gives an average 18.507011795043944 bits per digit
Using complex FFT radices 20 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 35280290
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 23722047
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 8 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M62705077: using FFT length 3328K = 3407872 8-byte floats, initial residue shift count = 61480382
this gives an average 18.400068136361931 bits per digit
Using complex FFT radices 52 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M67417873: using FFT length 3584K = 3670016 8-byte floats, initial residue shift count = 63290971
this gives an average 18.369912556239537 bits per digit
Using complex FFT radices 28 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M72123137: using FFT length 3840K = 3932160 8-byte floats, initial residue shift count = 65799790
this gives an average 18.341862233479819 bits per digit
Using complex FFT radices 60 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M86198291: using FFT length 4608K = 4718592 8-byte floats, initial residue shift count = 21266494
this gives an average 18.267799165513779 bits per digit
Using complex FFT radices 36 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 93620243
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 43929528
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M104884309: using FFT length 5632K = 5767168 8-byte floats, initial residue shift count = 24783492
this gives an average 18.186449397693981 bits per digit
Using complex FFT radices 44 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M123493333: using FFT length 6656K = 6815744 8-byte floats, initial residue shift count = 30371346
this gives an average 18.118833835308369 bits per digit
Using complex FFT radices 52 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 24638813
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 92450206
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M142037359: using FFT length 7680K = 7864320 8-byte floats, initial residue shift count = 90349695
this gives an average 18.060984166463218 bits per digit
Using complex FFT radices 60 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

ewmayer 2021-01-12 21:02

@pvn:

The self-tests are intended to do two things:

[1] Check correctness of the compiled code;

[2] Find the best-performing combination of radices for each FFT length on the user's platform. That means trying each combination of radices available for assembling each FFT length and picking the one which runs fastest, unless the fastest happens to show unacceptably high levels of roundoff error, in which case the combo which runs fastest *and* has acceptable ROE levels gets stored to the mlucas.cfg file.

The mlucas.cfg file is read at start of each LL or PRP test: for the current exponent being tested, the program computes the default FFT length based on expected levels of roundoff error, then reads the radix-combo data for that FFT length from mlucas.cfg and uses those FFT radices for the run.
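The selection logic can be sketched as a filter-then-minimize pass over (radix combo, msec/iter, max ROE) triples; the numbers and the 0.40 error threshold below are made up for illustration, not real Mlucas timings:

```shell
# Pick the fastest radix combo whose max roundoff error is acceptable (< 0.40).
printf '%s\n' \
  '192,32,16,16 11.8 0.2500' \
  '256,16,16,16 11.2 0.4375' \
  '128,16,32,16 12.5 0.2812' |
awk '$3 < 0.40 && (best == "" || $2 < best) { best = $2; combo = $1 }
     END { print combo }'
```

Here the 11.2 msec/iter combo is fastest overall but gets rejected on ROE, so the 11.8 msec/iter combo wins.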

The user is still expected to have a basic understanding of their hardware's multicore aspects in terms of running the self-tests using one or more -cpu [core number range] settings. I haven't found a good way to automate this "identify best core topology" step, but it's usually pretty obvious which candidate core-combos to try. Some examples:

o On my Intel Haswell quad, there are 4 physical cores, no hyperthreading: run self-tests with '-s m -cpu 0:3' to use all 4 cores;

o On my Intel Broadwell NUC mini, there are 2 physical cores, but with hyperthreading: I ran self-tests with '-s m -cpu 0:1' to use just the 2 physical cores, then 'mv mlucas.cfg mlucas.cfg.2' to not get those timings mixed up with the next self-test. Next ran with '-s m -cpu 0:3' to use all 4 cores (2 physical, 2 logical), then 'mv mlucas.cfg mlucas.cfg.4'. Comparing the msec/iter numbers between the 2 files showed the latter set of timings to be 5-10% faster, meaning the hyperthreading was beneficial, so that's the run mode I use: 'ln -s -f mlucas.cfg.4 mlucas.cfg' to link the desired .4-renamed cfg-file to the name 'mlucas.cfg' looked for by the code at runtime, then queue up some work using the primenet.py script and fire up the program using flags '-cpu 0:3'.
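That compare-and-symlink workflow can be scripted; the mlucas.cfg lines below are simplified stand-ins for the real self-test output (the actual cfg format carries more fields per line):

```shell
# Hypothetical timing lines from two self-test runs at one FFT length:
printf '4096  msec/iter = 24.1\n' > mlucas.cfg.2   # 2 physical cores
printf '4096  msec/iter = 22.3\n' > mlucas.cfg.4   # 2 physical + 2 HT
# Link whichever file recorded the faster (smaller) msec/iter, under the
# name Mlucas reads at runtime:
fast=$(awk '{ print $4, FILENAME }' mlucas.cfg.2 mlucas.cfg.4 |
       sort -n | head -n1 | cut -d' ' -f2)
ln -sf "$fast" mlucas.cfg
```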

On manycore and multisocket systems finding the run mode which gives best total throughput takes a bit more work, but "don't split runs across sockets" is rule #1, so then you find the way to max out throughput on an individual socket, and duplicate that setup on socket 2, by incrementing the low:high indices following the -cpu flag appropriately.

Regarding your other observations:

o It's not surprising that all of the radix sets that appear in your mlucas.cfg when running -cpu 0:3 have a leading radix evenly divisible by NTHREADS*2 - like the runtime warning says, if that does not hold (say radix0 = 12 and 4 threads using -cpu 0:3), it will generally hurt performance, meaning such combos will run more slowly due to suboptimal thread utilization, and will nearly always be bested by one or more radix combos which satisfy the divisibility criterion. Nothing the user need worry about; it's all automated, and whichever combo runs fastest appears in the cfg file.

o The reason the self-tests with 4 threads (-cpu 0:3) take longer than you expected is that for 4 or more threads the default #iters used for each timing test gets raised from 100 to 1000, in order to get a more accurate timing sample. You can override that by specifying -iters 100 for such tests.

Cheers, and have fun,
-E

tdulcet 2021-01-13 12:10

[QUOTE=ewmayer;569113]The user is still expected to have a basic understanding of their hardware's multicore aspects in terms of running the self-tests using one or more -cpu [core number range] settings. I haven't found a good way to automate this "identify best core topology" step, but it's usually pretty obvious which candidate core-combos to try.[/QUOTE]

My [URL="https://www.mersenneforum.org/showthread.php?p=545949#post545949"]install script for Linux[/URL] currently follows the recommended instructions on the Mlucas README for each architecture, to hopefully provide the best performance for most users. I would be interested in adding a feature to automatically try different combinations of CPU cores/threads and then pick the one with the best performance, although I am not sure what the correct procedure is to do this for each architecture and CPU, or how the [C]-DUSE_THREADS[/C] compile flag factors in. The script's goal is to automate the entire download, build, setup and run process for Mlucas, so I think this could be an important component of that. I have not received any feedback on the script so far, so I am not even sure whether there is interest in this feature or what percentage of systems it would affect.

pvn 2021-01-13 15:17

tdulcet,


ah, this is very helpful. I spent a good bit of time yesterday doing something similar, building essentially a barebones version of this to build Docker images.

For Intel, I just built multiple binaries (for SSE, AVX, AVX2, AVX-512) and use an entrypoint script to determine at runtime what hardware is available and run the right binary.

I have had some trouble with the build on ARM, though, so for now I'm just using the precompiled binaries, with a similar routine in the entrypoint script to run the nosimd/c2simd binary as needed.

The Docker image is at pvnovarese/mlucas_v19:latest (it's a multi-arch image, with both aarch64 and x86_64).

Dockerfile etc. can be found here: [url]https://github.com/pvnovarese/mlucas_v19[/url]

I will review your script as well, it looks like you've thought a lot more about this than I have :)

ewmayer 2021-01-14 00:09

[QUOTE=tdulcet;569157]My [URL="https://www.mersenneforum.org/showthread.php?p=545949#post545949"]install script for Linux[/URL] currently follows the recommended instructions in the Mlucas README for each architecture, to hopefully provide the best performance for most users. I would be interested in adding a feature to automatically try different combinations of CPU cores/threads and then pick the one with the best performance, although I am not sure what the correct procedure for doing this is for each architecture and CPU, or how the [C]-DUSE_THREADS[/C] compile flag factors in. The script's goal is to automate the entire download, build, setup and run process for Mlucas, so I think this could be an important component of that. I have not received any feedback on the script so far, so I am not even sure whether there is interest in this feature or what percentage of systems it would affect.[/QUOTE]

-DUSE_THREADS is needed to enable multithreaded build mode; without it you get a single-threaded build, which would only be useful if all you ever wanted to do is run one such 1-thread job per core. Even then, the core-affinity support (-cpu argument) is not available for such builds, so you'd basically be stuck firing up a bunch of executable images, each from its own run directory (unique worktodo.ini file, copy of mlucas.cfg and primenet.py script to manage work for that directory) and hoping the OS does a good job managing the core affinities.

(Basically, there's just no good reason to omit the above flag anymore).

Re. some kind of script to automate the self-testing using various suitable candidate -cpu arguments, that would indeed be useful. George uses the freeware hwloc library in his Prime95 code to suss out the topology of the machine running the code - I'd considered using it for my own as well in the past, but had seen a few too many threads that boiled down to "hwloc doesn't work properly on my machine" and needing some intervention re. that library by George for my taste. In any event, let me think on it more, and perhaps some playing-around with that library by those of you interested in this aspect would be a good starting point.

tdulcet 2021-01-14 13:21

[QUOTE=pvn;569167]
Dockerfile etc can be found here: [URL]https://github.com/pvnovarese/mlucas_v19[/URL]


I will review your script as well, it looks like you've thought a lot more about this than I have :)[/QUOTE]

Nice! Thanks. With my script you should be able to compile Mlucas on demand; since it uses a parallel Makefile with one job for each CPU thread, it should only take a couple of minutes or less to compile on most systems. It uses the [C]-march=native[/C] compile flag on x86 systems, so the resulting binaries should also be slightly faster, although they are generally not portable. What was the issue you had building on ARM?

There is a longstanding issue with 32-bit ARM, where the [C]mi64.c[/C] file hangs when compiling with GCC. If you remove the [C]-O3[/C] optimization you get these errors:
[QUOTE]
../src/mi64.c: In function ‘mi64_shl_short’:
../src/mi64.c:1038:2: error: unknown register name ‘rsi’ in ‘asm’
__asm__ volatile (\
^~~~~~~
../src/mi64.c:1038:2: error: unknown register name ‘rcx’ in ‘asm’
../src/mi64.c:1038:2: error: unknown register name ‘rbx’ in ‘asm’
../src/mi64.c:1038:2: error: unknown register name ‘rax’ in ‘asm’
../src/mi64.c: In function ‘mi64_shrl_short’:
../src/mi64.c:1536:2: error: unknown register name ‘rsi’ in ‘asm’
__asm__ volatile (\
^~~~~~~
../src/mi64.c:1536:2: error: unknown register name ‘rcx’ in ‘asm’
../src/mi64.c:1536:2: error: unknown register name ‘rbx’ in ‘asm’
../src/mi64.c:1536:2: error: unknown register name ‘rax’ in ‘asm’[/QUOTE][QUOTE=ewmayer;569228](Basically, there's just no good reason to omit the above flag anymore).[/QUOTE]

OK, thanks for the info. That is what I thought. I just wanted to make sure that there was not some edge case where my script should omit the flag.

[QUOTE=ewmayer;569228]Re. some kind of script to automate the self-testing using various suitable candidate -cpu arguments, that would indeed be useful. George uses the freeware hwloc library in his Prime95 code to suss out the topology of the machine running the code - I'd considered using it for my own as well in the past, but had seen a few too many threads that boiled down to "hwloc doesn't work properly on my machine" and needing some intervention re. that library by George for my taste. In any event, let me think on it more, and perhaps some playing-around with that library by those of you interested in this aspect would be a good starting point.[/QUOTE]

OK, I was just thinking that there was some procedure my script could use given the CPU (Intel, AMD or ARM), the number of CPU Cores and the number of CPU threads to generate all possible candidate combinations for the [C]-cpu[/C] argument that could realistically generate the best performance. It could then try the different candidate combinations (as described in the two examples of your previous post) and pick the one with the best performance.

Based on the "Advanced Users" and "Advanced Usage" sections of the Mlucas README, for an example 8 core/16 thread system, this is my best guess of the candidate combinations to try with the [C]-cpu[/C] argument:

[B]Intel[/B]
[CODE]
0 (1-threaded)
0:1 (2-threaded)
0:3 (4-threaded)
0:7 (8-threaded)
0:15 (16-threaded)
0,8 (2 threads per core, 1-threaded) (current default)
0:1,8:9 (2 threads per core, 2-threaded)
0:3,8:11 (2 threads per core, 4-threaded)[/CODE][B]
AMD[/B]
[CODE]
0 (1-threaded)
0:3:2 (2-threaded)
0:7:2 (4-threaded)
0:15:2 (8-threaded)
0:1 (2 threads per core, 1-threaded) (current default)
0:3 (2 threads per core, 2-threaded)
0:7 (2 threads per core, 4-threaded)
0:15 (2 threads per core, 8-threaded)[/CODE][B]
ARM[/B] (8 core/8 thread)
[CODE]
0 (1-threaded)
0:3 (4-threaded) (current default)
0:7 (8-threaded)[/CODE]I am not sure if these are all the combinations worth testing or if we could rule any of them out.

ewmayer 2021-01-16 21:06

1 Attachment(s)
@tdulcet: Extremely busy this past month working on a high-priority 'intermediate' v19.1 release (this will restore Clang/llvm buildability on Arm, problem was first IDed on the new Apple M1 CPU but is more general), alas no time to give the automation of best-total-throughput-finding the attention it deserves. But that's where folks like you come in. :)

First off - the mi64.c compile issue has been fixed in the as-yet-unreleased 19.1 code; as the mods in that file are small I will attach it here - I suggest you save a copy of the old one so you can diff and see the changes for yourself. Briefly, a big chunk of x86_64 inline-asm needed extra wrapping inside a '#ifdef YES_ASM' preprocessor directive. That flag is def'd (or not) in mi64.h like so:
[code] #if(defined(CPU_IS_X86_64) && defined(COMPILER_TYPE_GCC) && (OS_BITS == 64))
#define YES_ASM
#endif[/code]
Re. your core/thread-combos-to-try on an example 8c/16t system, those look correct. The remaining trick, though, is figuring out which of the most promising c/t combos gives the best total throughput on the user's system. For example - sticking to just 1 thread per physical core for the moment - we expect 1t to run roughly 2x slower than 2t. Say the ratio is 1.8, and the user has an 8-core system. The real question is, how does the total throughput compare for 8x1t jobs versus 4x2t?

Similarly, we usually see a steep dropoff in || scaling beyond 4 cores - but that need not imply that running two 4-thread jobs is better than one 8-thread one. If said dropoff is due to the workload saturating the memory bandwidth, we might well see a similar performance hit with two 4-thread jobs.

ewmayer 2021-01-16 22:23

[b]Addendum:[/b] OK, I think the roadmap needs to look something like this - abbreviation-wise, 'c' refers to physical cores, 't' to threadcount:

[b]1.[/b] Based on the user's HW topology, identify a set of 'most likely to succeed' core/thread combos, like tdulcet did in his above post. For x86 this needs to take into account the different core-numbering conventions used by Intel and AMD;

[b]2.[/b] For each combo in [1], run the automated self-tests, and save the resulting mlucas.cfg file under a unique name, e.g. for 4c/8t call it mlucas.cfg.4c.8t;

[b]3.[/b] The various cfg-files hold the best FFT-radix combo to use at each FFT length for the given c/t combo, i.e. in terms of maximizing total throughput on the user's system we can focus on just those. So let's take a hypothetical example: Say on my 8c/16t AMD processor the round of self-tests in [1] has shown that using just 1c, 1c2t is 10% faster than 1c1t. We now need to see how 1c2t scales to all physical cores, across the various FFT lengths in the self-test. E.g. at FFT length 4096K, say the best radix combo found for 1c2t is 64,32,32,32 (note the product of those = 2048K rather than 4096K because to match general GIMPS convention "FFT length" refers to #doubles, but Mlucas uses an underlying complex-FFT, so the individual radices are complex and refer to pairs-of-doubles). So we next want to fire up 8 separate 1c2t jobs at 4096K, each using that radix combo and running on a distinct physical core, thus our 8 jobs would use -cpu flags (I used AMD for my example to avoid the confusion Intel's convention would cause here) 0:1,2:3,4:5,6:7,8:9,10:11,12:13 and 14:15, respectively. I would further like to specify the foregoing radix combo via the -radset flag, but here we hit a small snag: at present, there is no way to specify an actual radix-combo. Instead one must find the target FFT length in the big case() table in get_fft_radices.c and match the desired radix-combo to a case-index. For 4096K, we see 64,32,32,32 maps to 'case 7', so we'd use -radset 7 for each of our 8 launch-at-same-time jobs. I may need to do some code-fiddling to make that less awkward.

Anyhow, since we're now using just 1 radix-combo at each FFT length and we want a decent timing sample not dominated by start-up init and thread-management overhead, we might use -iters 1000 for each of our 8 jobs. Launch at more-or-less same time, they will have a range of msec/iter timings t0-t7 which we convert into total throughput in iters/sec via 1000*(1/t0+1/t1+1/t2+1/t3+1/t4+1/t5+1/t6+1/t7). Repeat for each FFT length of interest, generating a set of total throughput numbers.
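The msec/iter-to-total-throughput conversion described above is simple enough to script; here is a minimal shell sketch (the eight msec/iter timings are made-up placeholder values, not measurements from any real system):

```shell
# Combine per-job msec/iter timings into total throughput in iters/sec:
# each job contributes 1000/t_i iters/sec when t_i is in msec/iter.
timings="92.1 93.0 92.5 94.2 91.8 95.0 92.9 93.7"
total=$(echo "$timings" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += 1000 / $i; printf "%.2f", s }')
echo "total throughput: $total iters/sec"
```

Repeating this at each FFT length of interest yields the per-FFT-length throughput numbers to be aggregated later.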

[b]4.[/b] Repeat [3] for each c/t combo in [1]. It may well prove the case that a single c/t combo does not give best total throughput across all FFT lengths, but for a first cut it seems best to somehow generate some kind of weighted average-across-all-FFT-lengths for each c/t combo and pick the best one. In [3] we generated total throughput iters/sec numbers at each FFT length, so maybe multiply each by its corresponding FFT length and sum over all FFT lengths.

tdulcet 2021-01-17 16:05

[QUOTE=ewmayer;569481]@tdulcet: Extremely busy this past month working on a high-priority 'intermediate' v19.1 release (this will restore Clang/llvm buildability on Arm, problem was first IDed on the new Apple M1 CPU but is more general), alas no time to give the automation of best-total-throughput-finding the attention it deserves. But that's where folks like you come in. :)[/QUOTE]

No problem. I will look forward to your new v19.1 release.

[QUOTE=ewmayer;569481]First off - the mi64.c compile issue has been fixed in the as-yet-unreleased 19.1 code, as the mods in that file are small I will attach it here[/QUOTE]

Thanks for the fix. This had been preventing me from running Mlucas on my Raspberry Pis for a couple years, so that is great that it will now work.

[QUOTE=ewmayer;569481]Re. your core/thread-combos-to-try on an example 8c/16t system, those look correct. The remaining trick, though, is figuring out which of the most promising c/t combos gives the best total throughput on the user's system. For example - sticking to just 1 thread per physical core for the moment - we expect 1t to run roughly 2x slower than 2t. Say the ratio is 1.8, and the user has an 8-core system. The real question is, how does the total throughput compare for 8x1t jobs versus 4x2t?[/QUOTE]

Yes, this will be difficult. I implemented a preliminary version that follows the instructions on the Mlucas README. Specifically, it will multiply the [C]4x2t[/C] msec/iter times by 1.5 before comparing them. Multiplying by 2 would of course produce different results in this case.

[QUOTE=ewmayer;569485][B]Addendum:[/B] OK, I think the roadmap needs to look something like this[/QUOTE]

Wow, thanks for the detailed roadmap, it is very helpful!

1. OK, I wrote Bash code to automatically generate the combinations from my previous post above, for the user's CPU and number of CPU cores/threads. It will generate a nice table like one of these for example:
[CODE]The CPU is Intel.
# Workers/Runs Threads -cpu arguments
1 1 16, 2 per core 0:15
2 1 8, 1 per core 0:7
3 2 4, 1 per core 0:3 4:7
4 4 2, 1 per core 0:1 2:3 4:5 6:7
5 8 1, 1 per core 0 1 2 3 4 5 6 7
6 2 4, 2 per core 0:3,8:11 4:7,12:15
7 4 2, 2 per core 0:1,8:9 2:3,10:11 4:5,12:13 6:7,14:15
8 8 1, 2 per core 0,8 1,9 2,10 3,11 4,12 5,13 6,14 7,15

The CPU is AMD.
# Workers/Runs Threads -cpu arguments
1 1 16, 2 per core 0:15
2 1 8, 1 per core 0:15:2
3 2 4, 1 per core 0:7:2 8:15:2
4 4 2, 1 per core 0:3:2 4:7:2 8:11:2 12:15:2
5 8 1, 1 per core 0 2 4 6 8 10 12 14
6 2 4, 2 per core 0:7 8:15
7 4 2, 2 per core 0:3 4:7 8:11 12:15
8 8 1, 2 per core 0:1 2:3 4:5 6:7 8:9 10:11 12:13 14:15

The CPU is ARM.
# Workers/Runs Threads -cpu arguments
1 1 8 0:7
2 2 4 0:3 4:7
3 4 2 0:1 2:3 4:5 6:7
4 8 1 0 1 2 3 4 5 6 7[/CODE]The combinations are the same as my previous post above, except I added a 2-threaded combination for ARM and the ordering is different.
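For what it's worth, the 1-thread-per-physical-core rows in the Intel-style table follow a simple enough pattern to generate mechanically. A sketch, with W workers of T threads each as purely illustrative values (the 2-threads-per-core and AMD cases need the offset/interleaved numbering shown in the tables above):

```shell
# Worker i of W workers, each T threads wide, gets the contiguous
# core range i*T : i*T+T-1 (Intel-style numbering, 1 thread/core).
W=2   # number of workers (illustrative)
T=4   # threads per worker (illustrative)
args=""
i=0
while [ "$i" -lt "$W" ]; do
    lo=$((i * T))
    hi=$((lo + T - 1))
    args="$args $lo:$hi"
    i=$((i + 1))
done
echo "-cpu arguments:$args"
```

With W=2 and T=4 this reproduces the "0:3 4:7" row of the Intel table.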

2. Done.
3./4. Interesting, this is going to be a lot more complex to implement than I originally thought. The switch statement in [C]get_fft_radices.c[/C] is too big to store in my script, and creating an [C]awk[/C] command to extract the case number based on the FFT length and radix combo would obviously be extremely difficult, particularly because there are nested switch statements. I am going to have to think about how best to do this... I welcome suggestions from anyone who is reading this. In the meantime, I wrote code to directly compare the adjusted msec/iter times from the [C]mlucas.cfg[/C] files from step #2. This of course does not account for any of the scaling issues that @ewmayer described. It will generate two tables (the fastest combination and the rest of the combinations tested) like these for my 6 core/12 thread Intel system for example:
[CODE]
Fastest
# Workers/Runs Threads First -cpu argument Adjusted msec/iter times
6 6 1, 2 per core 0,6 8.47 9.69 10.72 12.26 12.71 14.53 14.76 16.54 16.1 18.89 20.94 23.94 26.39 28.85 29.16 32.98

Mean/Average faster # Workers/Runs Threads First -cpu argument Adjusted msec/iter times
3.248 ± 0.101 (324.8%) 1 1 12, 2 per core 0:11 28.92 31.74 33.78 38.64 42.66 44.52 46.56 51.06 52.26 61.8 70.14 79.38 88.62 92.94 97.26 106.2
3.627 ± 0.146 (362.7%) 2 1 6, 1 per core 0:5 34.14 34.8 39.66 45.48 47.28 51.18 51.3 56.64 60.78 66.24 73.02 87.78 96.12 102.6 108.12 116.22
2.607 ± 0.068 (260.7%) 3 2 3, 1 per core 0:2 22.98 25.53 27.3 30.12 33.63 37.83 36.72 42.15 42.69 48.9 54.66 61.68 71.19 76.44 77.49 87.36
1.736 ± 0.029 (173.6%) 4 6 1, 1 per core 0 14.41 17.1 18.46 20.88 22.72 24.99 25.36 28.38 28.82 32.67 36.06 40.67 46.12 49.87 51.32 58.26
1.816 ± 0.047 (181.6%) 5 2 3, 2 per core 0:2,6:8 16.11 18.09 19.32 21.99 23.64 25.41 25.92 29.85 30.57 33.78 38.7 42.42 48.57 51.12 52.53 59.19[/CODE]The two tables show, for example, that 1-threaded with 2 threads per core is ~1.7 times faster than 1-threaded with 1 thread per core.

ewmayer 2021-01-18 20:49

@tdulcet - How about I add support in v19.1 for the -radset flag to take either an index into the big table, or an actual set of comma-separated FFT radices? Shouldn't be difficult - if the expected -radset[whitespace]numeric arg pair is immediately followed by a comma, the code assumes it's a set of radices, reads those in, checks that said set is supported and if so runs with it.

I expect - fingers crossed, still plenty of work to do - to be able to release v19.1 around EOM, so you'd have to wait a little while, but it sounds like this is the way to go.

[b]Edit:[/b] Why make people wait - here is a modified version of Mlucas.c which supports the above-described -radset argument. You should be able to drop it into your current v19 source archive, but I suggest you save the old Mlucas.c under a different name - maybe add a '.bak' - so you can diff the 2 versions to see the changes, the first and most obvious of which is the version number, now bumped up to 19.1.

Note user-supplied radix sets are considered "advanced usage" in the sense that I assume users of them know what they are doing, though I have included a set of sanity checks on inputs. Most important is to understand the difference in FFT-length conventions between the -fftlen and -radset args: -fftlen supplies a real-vector FFT length in Kdoubles; -radset [comma-separated list of radices] specifies a corresponding set of complex-FFT radices. If the user has supplied a real-FFT length (in Kdoubles) via -fftlen, the product of the complex-FFT radices (call it 'rad_prod') must correspond to half that value, accounting for the Kdoubles scaling of the former. In C-code terms, we require that (rad_prod>>9) == fftlen .

Note that even though this is strictly speaking redundant, the -fftlen arg is required even if the user supplies an actual radix set; this is for purposes of sanity-checking the latter, because the above-described differing conventions make it easy to get confused. Using any of the radix sets listed in the mlucas.cfg file along with the corresponding FFT length is of course guaranteed to be OK.
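The (rad_prod>>9) == fftlen consistency check is easy to reproduce outside the program; a shell sketch using the 1664K / 208,16,16,16 pairing from the examples that follow:

```shell
# Product of the complex-FFT radices, right-shifted by 9 (i.e. divided by
# 512 = 1024/2), must equal the real-FFT length in Kdoubles.
fftlen=1664                # real-FFT length in Kdoubles
radset="208,16,16,16"      # comma-separated complex-FFT radices
rad_prod=1
for r in $(printf '%s' "$radset" | tr ',' ' '); do
    rad_prod=$((rad_prod * r))
done
if [ $((rad_prod >> 9)) -eq "$fftlen" ]; then
    echo "radset $radset is consistent with fftlen $fftlen"
else
    echo "radset $radset does NOT match fftlen $fftlen"
fi
```

Here 208*16*16*16 = 851968 complex elements = 1703936 doubles = 1664 Kdoubles, so the pair passes.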

Examples: After building the attached Mlucas.c file and relinking, try running the resulting binary with the following sets of command-line arguments to see what happens:

-iters 100 -fftlen 1664 -radset 0
-iters 100 -fftlen 1664 -radset 208,16,16,16
-iters 100 -fftlen 1668 -radset 208,16,16,16
-iters 100 -fftlen 1664 -radset 207,16,16,16
-iters 100 -fftlen 1664 -radset 208,8,8,8,8

tdulcet 2021-01-19 13:46

[QUOTE=ewmayer;569613]@tdulcet - How about I add support in v19.1 for the -radset flag to take either an index into the big table, or an actual set of comma-separated FFT radices?[/QUOTE]

That would be very helpful to automate this!

[QUOTE=ewmayer;569613][B]Edit:[/B] Why make people wait - here is a modified version of Mlucas.c which supports the above-described -radset argument.[/QUOTE]

Wow, thanks for doing it so quickly! This will be very helpful. I committed and pushed the changes I described in my previous post to GitHub [URL="https://github.com/tdulcet/Distributed-Computing-Scripts/commit/9e64f15704d6c26b29a8fd558d94b0eb2a37c1db"]here[/URL], which basically implements steps 1, 2 and part of 4. I will now get started on step 3 and the rest of 4 using your new version of [C]Mlucas.c[/C].

In my previous post on an example 8c/16t system, I said it will multiply the [C]4x2t[/C] msec/iter times by 1.5 before comparing them to the [C]8x1t[/C] times, following the instructions on the Mlucas README. After doing more testing, I was getting unexpected results with this formula ([C](CPU cores / workers) - 0.5[/C]), so it will now multiply the times by 2 ([C]CPU cores / workers[/C]) for this example. This should be irrelevant once I implement step 3.

I thought I should note that some systems like the Intel [URL="https://en.wikipedia.org/wiki/Xeon_Phi"]Xeon Phi[/URL] can have more than two CPU threads per CPU core. The Mlucas README does not mention this case, but my script should correctly handle it for Intel and AMD x86 systems. For example, on a 64 core/256 thread Intel Xeon Phi system it would try these combinations (only showing the first [C]-cpu[/C] argument for brevity):
[CODE]
# Workers/Runs Threads -cpu arguments
1 1 64, 1 per core 0:63
2 2 32, 1 per core 0:31
3 4 16, 1 per core 0:15
4 8 8, 1 per core 0:7
5 16 4, 1 per core 0:3
6 32 2, 1 per core 0:1
7 64 1, 1 per core 0
8 1 128, 2 per core 0:63,64:127
9 2 64, 2 per core 0:31,64:95
10 4 32, 2 per core 0:15,64:79
11 8 16, 2 per core 0:7,64:71
12 16 8, 2 per core 0:3,64:67
13 32 4, 2 per core 0:1,64:65
14 64 2, 2 per core 0,64
15 1 256, 4 per core 0:63,64:127,128:191,192:255
16 2 128, 4 per core 0:31,64:95,128:159,192:223
17 4 64, 4 per core 0:15,64:79,128:143,192:207
18 8 32, 4 per core 0:7,64:71,128:135,192:199
19 16 16, 4 per core 0:3,64:67,128:131,192:195
20 32 8, 4 per core 0:1,64:65,128:129,192:193
21 64 4, 4 per core 0,64,128,192[/CODE]

ewmayer 2021-01-19 20:19

@tdulcet: Glad to be of service to someone else who wants to be of service, or something. :)

o Re. KNL, yes I have a barebones one sitting next to me and running a big 64M-FFT primality test, 1 thread on each of physical cores 0:63. On KNL I've never found any advantage from running this kind of code with more than 1 thread per physical core.

o One of your timing samples above mentioned getting nearly 2x speedup from running 2 threads on 1 physical core, with the other cores unused. I suspect that may be the OS actually putting 1 thread on each of 2 physical cores. Remember, those pthread affinity settings are treated as *hints* to the OS; we hope that under heavy load the OS will respect them, because then there are no otherwise-idle physical cores it can bounce threads to.

o You mentioned the mi64.c missing-x86-preprocessor-flag-wrapper was keeping you from building on your Raspberry Pi - that was even with -O3? And did you as a result just use the precompiled Arm/Linux binaries on that machine?

joniano 2021-01-20 07:46

Possible Symptoms of a Bug on ARM64 Build - Running too fast
 
Hello Folks - I recently got Mlucas running on a Raspberry Pi 4, 8GB of RAM, running Ubuntu and I am doing PRP checks on large primes.

I'm assuming either Mlucas is [I]extremely fast[/I] and consistent or I'm running into some sort of a bug.

If you look at a few lines of the ".stat" file for one of my recent primes, you'll see that every few seconds I blast through 10,000 iterations at exactly the same ms/iter speed and it seems to take [I]under a day[/I] to fully PRP test a new number.

[CODE][2021-01-19 21:42:45] M110899639 Iter# = 110780000 [99.89% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:48] M110899639 Iter# = 110790000 [99.90% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:50] M110899639 Iter# = 110800000 [99.91% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:52] M110899639 Iter# = 110810000 [99.92% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:55] M110899639 Iter# = 110820000 [99.93% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:57] M110899639 Iter# = 110830000 [99.94% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:59] M110899639 Iter# = 110840000 [99.95% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:01] M110899639 Iter# = 110850000 [99.96% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:04] M110899639 Iter# = 110860000 [99.96% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:06] M110899639 Iter# = 110870000 [99.97% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:08] M110899639 Iter# = 110880000 [99.98% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:10] M110899639 Iter# = 110890000 [99.99% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:13] M110899639 Iter# = 110899639 [100.00% complete] clocks = 00:15:20.953 [ 95.5445 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
M110899639 is not prime. Res64: 243C3E785D7D8345. Program: E19.0. Final residue shift count = 13555775
M110899639 mod 2^35 - 1 = 20387533375
M110899639 mod 2^36 - 1 = 12983321457[/CODE]

Does this look suspicious to anyone else?

I also run Prime95 on a seemingly much more powerful Core i7-7700 and that is taking about 14 days to PRP-test a single number, which is what is making me question this.

I'm glad to provide more detail if it would help troubleshoot.

LaurV 2021-01-20 09:28

[QUOTE=joniano;569708]Does this look suspicious to anyone else?
[/QUOTE]
Yep. Very. The residues are the same, which is close to impossible. Like one in k chances, where k is much larger than the number of particles in the universe :razz:
Unfortunately I can't help, not the Linux neither mLucas guy. :sad:

tdulcet 2021-01-20 14:30

[QUOTE=ewmayer;569688]o Re. KNL, yes I have a barebones one sitting next to me and running a big 64M-FFT primality test, 1 thread on each of physical cores 0:63. On KNL I've never found any advantage from running this kind of code with more than 1 thread per physical core.[/QUOTE]

Nice. I used to have access to that system in college. MPrime defaulted to configuration #5 on it. I never tried running Mlucas, but I would have thought configuration 19 or 21 would provide the best performance.

[QUOTE=ewmayer;569688]o One of your timing sample above mentioned getting nearly 2x speedup from running 2 threads on 1 physical core, with the other cores unused. I suspect that may be the OS actually putting 1 thread on each of 2 physical cores.[/QUOTE]

OK, interesting. I am guessing that this will be fixed after I finish implementing your step # 3 and 4.

[QUOTE=ewmayer;569688]o You mentioned the mi64.c missing-x86-preprocessor-flag-wrapper was keeping you from building on your Raspberry Pi - that was even with -O3?[/QUOTE]

With [C]-O3[/C] optimization (and [C]-O2[/C] and [C]-O1[/C]), GCC would just run forever at 100% CPU usage; I am guessing because of some bug in GCC. Without optimization, GCC would immediately output those errors I posted, after the usual warnings. It happened with both Mlucas v18 and v19.

[QUOTE=ewmayer;569688]And did you as a result just use the precompiled Arm/Linux binaries on that machine?[/QUOTE]

No, because I wanted to use my script to automatically set up everything. I also suspect that compiling Mlucas directly on the Pi with my script will provide better performance, since GCC by default on the Raspberry Pi adds a bunch of compile flags:
[CODE]
pi@raspberrypi:~ $ gcc -march=native -Q --help=target | grep -iv disabled
The following options are target specific:
-mabi= aapcs-linux
-march= armv8-a+crc+simd
-marm [enabled]
-mbe32 [enabled]
-mbranch-cost= -1
-mcpu=
-mfloat-abi= hard
-mfp16-format= none
-mfpu= vfp
-mglibc [enabled]
-mhard-float
-mlittle-endian [enabled]
-mpic-data-is-text-relative [enabled]
-mpic-register=
-msched-prolog [enabled]
-msoft-float
-mstructure-size-boundary= 8
-mtls-dialect= gnu
-mtp= cp15
-mtune=
-munaligned-access [enabled]
-mvectorize-with-neon-quad [enabled]
[/CODE]a few of which should improve the resulting performance.

@ewmayer Regarding step # 3, I have a quick question. What is the correct way to get the needed msec/iter times from each job? The Mlucas output has a [C]Clocks = [/C] line, so should I parse that, convert it to milliseconds and then divide by 1000 (the number of iterations)?

ewmayer 2021-01-20 19:56

[QUOTE=tdulcet;569719]I wanted to use my script to automatically setup everything. I also suspect that compiling Mlucas directly on the Pi and with my script will provide better performance, since GCC by default on the Raspberry Pi adds a bunch of compile flags:
[snip]
a few of which should improve the resulting performance.[/QUOTE]
For SIMD builds the runtime is dominated by the asm-macro instructions, so there might in fact be little or no difference. But in any case, with the patched mi64.c file I posted earlier in this thread (which will be part of the soon-to-come v19.1 release) you can now directly compare performance of the prebuilt binary and your own.

OTOH there is clearly some benefit to be had from improved optimization of the C "glue" code and the integration of the asm-macros into that ... I still have a few problematic-for-Clang asm-macros to convert so that compiler will build them on Armv8, but I also installed Clang on my main Ubuntu Linux box, a quad-core Haswell mostly used for builds and hosting a couple GPUs, and built v19 using it there a week ago - the result looks to run 5-10% faster than my GCC build of the same source base. We hope for a similar speedup from build of v19.1 on Arm.

[QUOTE]Regarding step # 3, I have a quick question. What is the correct way to get the needed msec/iter times from each job? The Mlucas output has a [C]Clocks = [/C] line, so should I parse that, convert it to milliseconds and then divide by 1000 (the number of iterations)?[/QUOTE]
Yep!
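A sketch of that parse-and-divide step in shell; the sample line is a made-up stand-in for a real .stat checkpoint line, and the 1000-iteration count matches the -iters 1000 timing runs discussed earlier:

```shell
# Pull the hh:mm:ss.sss value following "clocks =" out of a stat-style
# line, convert to milliseconds, and divide by the iteration count.
iters=1000
line="M110899639 Iter# = 1000 [0.00% complete] clocks = 00:01:32.095"
msec_per_iter=$(printf '%s\n' "$line" | awk -v iters="$iters" '{
    for (i = 1; i < NF; i++) if ($i == "clocks") {
        split($(i+2), t, ":")                       # t[1]=hh t[2]=mm t[3]=ss.sss
        ms = (t[1]*3600 + t[2]*60 + t[3]) * 1000    # elapsed time in msec
        printf "%.3f", ms / iters
    }
}')
echo "$msec_per_iter msec/iter"
```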

ewmayer 2021-01-20 20:21

Hi, Joniano -

[QUOTE=joniano;569708]Hello Folks - I recently got Mlucas running on a Raspberry Pi 4, 8GB of RAM, running Ubuntu and I am doing PRP checks on large primes.

I'm assuming either Mlucas is [I]extremely fast[/I] and consistent or I'm running into some sort of a bug.

If you look at a few lines of the ".stat" file for one of my recent primes, you'll see that every few seconds I blast through 10,000 iterations at exactly the same ms/iter speed and it seems to take [I]under a day[/I] to fully PRP test a new number.[/quote]
Yep, that's a weird one - the timings in the "clocks = ... [msec/iter]" part of each 10Kiter checkpoint-status line look perfectly reasonable for a Pi4, around 15 minutes per interval. But those lines are getting written every 2-3 *seconds*, not every 15 minutes, and as LaurV noted, the Res64 value is frozen.

Looking deeper: the time for each interval is exactly the same, 15:20.953 [ 92.0954 msec/iter] - in normal runs that never happens. And your floating-point errors are 0, also not something you'll ever see with a good build and normally functioning hardware.

So some questions for you:

o Did you use the precompiled v19 binary for Arm SIMD, or did you build v19 yourself?

o If you built it yourself, which compiler did you use, and did you use the recommended build flags from the README webpage?

o Did your post-build self-tests appear to run at "normal" speed? (If you're not sure, just try a quick sample using an FFT length suitable for your exponent:
[i]
./[name of your binary] -m 110899639 -iters 100 -fftlen 6144 -radset 1 -cpu 0:3
[/i]
That should take on the order of 10 seconds on your hardware, and produce Res64: A4F77554A5DC940F.)

Also, please post a copy of the mlucas.cfg file resulting from your post-build self-tests, and a copy of the first 10 lines of your p110899639.stat file. Thanks!

tdulcet 2021-01-22 13:57

[QUOTE=ewmayer;569728]I also installed Clang on my main Ubuntu Linux box, a quad-core Haswell mostly used for builds and hosting a couple GPUs, and built v19 using it there a week ago - the result looks to run 5-10% faster than my GCC build of the same source base. We hope for a similar speedup from build of v19.1 on Arm.[/QUOTE]

Wow, that is an impressive speedup. I will also update my install script to support Clang after v19.1 is released.

I finished implementing steps 3 and 4 from post #71, although I get a few errors. For example on my 4 core/8 thread Intel system, there is this line in one of the [C]mlucas.cfg[/C] files:
[CODE]
...
3840 msec/iter = 20.20 ROE[avg,max] = [0.194384766, 0.218750000] radices = 60 32 32 32 0 0 0 0 0 0
...
[/CODE]However, if I try to run [C]./Mlucas -fftlen 3840 -iters 1000 -radset 60,32,32,32 -cpu "0:1"[/C], I get this error:
[QUOTE]
$ ./Mlucas -fftlen 3840 -iters 1000 -radset 60,32,32,32 -cpu "0:1"

Mlucas 19.1

[URL]http://www.mersenneforum.org/mayer/README.html[/URL]

INFO: testing qfloat routines...
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 9.3.0.
INFO: Build uses AVX2 instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 8 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 2 cores: 0.1.
ERROR: radix set index 5 for FFT length 3840 K exceeds maximum allowable of 4.
ERROR: at line 3897 of file ../src/Mlucas.c
Assertion failed: ERROR: radix set index 5 for FFT length 3840 K exceeds maximum allowable of 4.

[/QUOTE]This also happens with these other FFT length and radix combos on the system:

[C]./Mlucas -fftlen 4096 -iters 1000 -radset 16,16,16,16,32 -cpu "0"[/C]
[C]./Mlucas -fftlen 2816 -iters 1000 -radset 44,8,16,16,16 -cpu "0,4"[/C]

It seems to be some kind of off-by-one error, since the radix set index is always one greater than the maximum.

ewmayer 2021-01-22 20:45

@tdulcet - yes, off-by-one indexing error is precisely what it is - good catch. At line 3595 of the v19.1 Mlucas.c I posted above, that "radset = i" needs to be changed to "radset = i-1" to undo the last post-increment of i in the enclosing while() loop's call to get_fft_radices(). You can do that yourself, or grab the updated attachment to my [url=https://mersenneforum.org/showpost.php?p=569613&postcount=73]Post #73[/url].
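For anyone following along, the pattern is the classic search-loop post-increment. A toy stand-alone sketch (not the actual Mlucas source - get_fft_radices() is stubbed out here):
[code]/* Toy illustration of the off-by-one: i is bumped once more after the
 * last valid radix set, so the saved index must be i-1, not i. */
#include <assert.h>

/* Stand-in for get_fft_radices(): "succeeds" (returns 0) for radix-set
 * indices 0..nvalid-1 of a given FFT length, nonzero after that. */
static int get_fft_radices_stub(int radset_index, int nvalid)
{
	return (radset_index < nvalid) ? 0 : 1;
}

int main(void)
{
	const int nvalid = 5;	/* e.g. 3840K has radix sets 0..4 */
	int i = 0;
	while (get_fft_radices_stub(i, nvalid) == 0)
		i++;	/* runs once past the last valid set */
	int radset = i - 1;	/* the fix: undo the final increment */
	assert(radset == nvalid - 1);
	return 0;
}[/code]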

tdulcet 2021-01-23 15:14

🆕 New Install Script for Linux
 
[QUOTE=ewmayer;569880]@tdulcet - yes, off-by-one indexing error is precisely what it is - good catch.[/QUOTE]

Great, thanks for fixing it so quickly!

I attached the new version of my [URL="https://www.mersenneforum.org/showthread.php?p=545949#post545949"]install script for Linux[/URL], which implements @ewmayer's steps 1-4 from [URL="https://www.mersenneforum.org/showpost.php?p=569485&postcount=71"]post #71[/URL], and I will push it to GitHub after v19.1 is released. This attached version of the script is for testing and will automatically download, build and partially set up Mlucas as described in post #71. It will also download the required new v19.1 [C]Mlucas.c[/C] file from [URL="https://www.mersenneforum.org/showpost.php?p=569613&postcount=73"]post #73[/URL]. The command line arguments are not used, so users can just run it with [C]bash mlucas.sh[/C].

To completely set up and run Mlucas for production with the PrimeNet Python script, remove the [C]exit[/C] command on line 354. In this case, users will need to provide the command line arguments if the defaults are incorrect.

As with the [URL="https://www.mersenneforum.org/showpost.php?p=569661&postcount=74"]previous version[/URL], it will generate two tables (the fastest combination and the rest of the combinations tested) like these for my 4 core/8 thread Intel system for example:
[CODE]
Fastest combination
# Workers/Runs Threads First -cpu argument
1 1 4, 1 per core 0:3

Mean/Average faster # Workers/Runs Threads First -cpu argument
1.020 ± 0.103 (102.0%) 2 2 2, 1 per core 0:1
1.092 ± 0.263 (109.2%) 3 4 1, 1 per core 0
1.058 ± 0.075 (105.8%) 4 1 8, 2 per core 0:3,4:7
1.043 ± 0.067 (104.3%) 5 2 4, 2 per core 0:1,4:5
1.084 ± 0.168 (108.4%) 6 4 2, 2 per core 0,4
[/CODE]The two tables show that 4-threaded with 1 thread per core is ~1.06 times faster than 8-threaded with 2 threads per core, for example.

On many of the systems I have tested it on so far, I actually get significantly different results than with the previous version, which directly compared the adjusted msec/iter times from the [C]mlucas.cfg[/C] files. I would be interested to hear whether other people get the results they were expecting. Feedback is also welcome.

tdulcet 2021-01-24 14:41

Self-test issues
 
1 Attachment(s)
These are probably known issues, but I thought I should note that the [C]./Mlucas -s a[/C], [C]./Mlucas -s h[/C] and [C]./Mlucas -s t[/C] self-test options do not work. I do not have a personal interest in these options; I was just trying to test my install script against more than the default [C]./Mlucas -s m[/C] FFT lengths to verify that it properly scales.

For both [C]./Mlucas -s a[/C] and [C]./Mlucas -s t[/C], I get this error before it immediately exits:
[QUOTE]
ERROR: at line 85 of file ../src/radix8_ditN_cy_dif1.c
Assertion failed: CY routines with radix < 16 do not support shifted residues!
[/QUOTE]Before that error, I also get two of these warnings:
[QUOTE]
WARN: At line 327 of file ../src/mers_mod_square.c:
n/radix0 must be >= 1024! Skipping this radix combo.
[/QUOTE]For [C]./Mlucas -s h[/C], I get errors like this for every FFT length and radix combo tested and it never adds anything to the [C]mlucas.cfg[/C] file:
[QUOTE]
Res mod 2^35 - 1 = 7270151463
Res mod 2^36 - 1 = 68679090081
*** Res35m1 Error ***
current = 7270151463
should be = 29128713056
*** Res36m1 Error ***
current = 68679090081
should be = 7270151463
--
Return with code ERR_INCORRECT_RES64
Error detected - this radix set will not be used.
WARNING: 0 of 10 radix-sets at FFT length 65536 K passed - skipping it. PLEASE CHECK YOUR BUILD OPTIONS.
[/QUOTE]Also, for [C]./Mlucas -s s[/C], the 1792K FFT length does not work:
[QUOTE]
Res mod 2^35 - 1 = 679541321
Res mod 2^36 - 1 = 62692450676
*** Res64 Error ***
current = 9796448591002464256
should be = 11513515421623922688
*** Res35m1 Error ***
current = 679541321
should be = 1603847275
*** Res36m1 Error ***
current = 62692450676
should be = 51947401644
--
Return with code ERR_INCORRECT_RES64
Error detected - this radix set will not be used.
WARNING: 0 of 5 radix-sets at FFT length 1792 K passed - skipping it. PLEASE CHECK YOUR BUILD OPTIONS.
[/QUOTE]I was able to reproduce these issues on multiple systems, although the above quotes were from a 4 core/8 thread Intel system, the same as with my above post. Here are the specific details copied from the top of my install script's output:
[CODE]
Linux Distribution: Ubuntu 20.04.1 LTS
Linux Kernel: 5.4.0-58-generic
Computer Model: Dell Inc. Precision T1700 01
Processor (CPU): Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz
CPU Cores/Threads: 4/8
Architecture: x86_64 (64-bit)
Total memory (RAM): 15,954 MiB (16GiB) (16,729 MB (17GB))
Total swap space: 3,903 MiB (3.9GiB) (4,093 MB (4.1GB))
[/CODE]I attached all four of the respective output files. For reference, the above quotes were found with the [C]grep -i -B 2 -A 2 'error\|warn\|assert' <file>[/C] command on the attached output files.

ewmayer 2021-01-24 19:55

Thanks for the list, T - while few or no users will be interested in those options, the issues need to get addressed on the way to the 19.1 release.

[b]Edit:[/b]

OK, worked through the issues you listed above:

1. [i]ERROR: at line 85 of file ../src/radix8_ditN_cy_dif1.c
Assertion failed: CY routines with radix < 16 do not support shifted residues![/i]

In fact, no leading radices between 8 and 15 aside from 12 support shift - in all those I changed the assertion to return(ERR_ASSERT), which is a way of telling the self-test control logic "skip this radix set and continue."

2. [i]WARN: At line 327 of file ../src/mers_mod_square.c:
n/radix0 must be >= 1024! Skipping this radix combo.[/i]

That is expected - certain speed-related data-structures introduced a few years ago come at the price of this limitation, which mainly affects small FFT lengths.

3. 1792K self-test residue error message: Somehow this one ended up with the wrong set of reference residues.

All the above fixes will be in the soon-to-come 19.1 release.

ewmayer 2021-01-29 22:39

This is more of an extended "code fun" diary entry, but thought it might be of interest to other close-to-the-metal coders who hang around here:

v19.1 shakedown testing is ongoing - Laurent D. was able to successfully build using Clang/LLVM on Apple M1; comparison of timings between that build and a GCC/Brew build of the same source base showed the two to be more or less identical in terms of speed. OTOH Clang builds on my Android phone with a quad Armv8+SIMD CPU are consistently 5-10% *slower* than GCC builds on same - go figure. On my old Haswell quad the Clang build seems a bit faster, but that machine has a lot of run-to-run timing variability, so I'm still working out how best to get consistent timing data.

One thing the testing has reminded me of: when testing significant amounts of new assembly, [b]always test on the oldest CPU family supporting the particular instruction set architecture (ISA) targeted by the asm[/b]. This is especially true for Intel SIMD, because Intel is notorious for releasing first-cut ISAs with glaringly obvious missing key functionality, then patching that with later add-ons. SSE2 is perhaps the best (as in worst) example of this; there the add-ons went all the way up to SSE4.2. My old Macbook on which I'm typing this only has SSE2, SSE3 and SSSE3, so e.g. my Mlucas SSE2 carry routines don't use the ROUNDPD instruction, which was only belatedly added in SSE4.1. When Intel released 256-bit AVX, that only had full 256-bit vector instructions for floating-point data types, not integer - the latter were only added with AVX2.

Getting back to the above bolded point - the reduced-length asm-macro-arglist constraint imposed by Clang for the Arm builds which are the focus of the v19.1 release means that a lot of what were once sets of I/O addresses for various short-length discrete Fourier transforms (DFTs) in the macro arglist now get replaced by pairs of [base-address, start-of-offset-index-vector] pointers, from which the needed multiple I/O addresses are computed inside the asm-macro. Those pointer pairs - one for inputs, one for outputs - each need 2 general-purpose registers (GPRs) to store. Not a problem for Armv8, since it comes with a generous 32 GPRs. But we want identical macro interfaces across architectures, so the same kinds of code changes go into the Intel x86_64 versions of each macro, and x86_64 only gives us 14 usable GPRs (rax,rbx,rcx,rdx,rsi,rdi,r8-r15; rsp and rbp are reserved for OS and compiler use). To mitigate the resulting not-enough-GPRs problem, I resorted in a couple places to what I thought was a nifty hack: when needed, copy one or more selected GPR contents into the otherwise-unused legacy MMX registers, which are the same 64-bit width and of which there are 8, rather than spilling to and later reading back from memory. Note that the spill issue in my case didn't affect the SSE2 version of the macro, as that uses a less register-dense way of structuring the DFT algorithm in question, since we're not targeting Intel FMA3 instructions with their 2-per-cycle throughput and 4-5-cycle latency.
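For the curious, here is a toy stand-alone illustration of the spill-to-MMX idea (GCC/Clang inline asm, x86_64 only; this is not the actual Mlucas macro code, just the bare round-trip):
[code]/* Park a 64-bit GPR value in legacy MMX register mm0 and fetch it back,
 * instead of spilling to memory. EMMS restores clean x87/MMX state. */
#include <assert.h>
#include <stdint.h>

static uint64_t spill_roundtrip(uint64_t x)
{
	uint64_t y;
	__asm__ volatile(
		"movq %1, %%mm0 \n\t"	/* spill: GPR -> MMX */
		"movq %%mm0, %0 \n\t"	/* restore: MMX -> GPR */
		"emms"			/* clear MMX/x87 state */
		: "=r"(y) : "r"(x) : "mm0", "st");
	return y;
}

int main(void)
{
	uint64_t v = 0x0123456789ABCDEFull;
	assert(spill_roundtrip(v) == v);
	return 0;
}[/code]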

For my AVX, AVX2 and AVX-512 builds of the new code I was using an Intel NUC8I3CYSM mini with a i3-8121U-Processor for build & test. Convenient, because once I do the first-cut proof-of-principle recode of a given DFT macro on my Mac, I port to the AVX, AVX2 and AVX-512 versions of same, and can test all 3 of those on the same machine. For the code involving the spill-to-MMX trick, all the timings looked good. Yesterday I figured, better also build and test on the old Haswell quad just to see how things look there - uh-oh, there spill-to-MMX is a complete disaster, performance-wise. So spill-to-memory it is.

Second example of the same lesson: AVX-512 is probably the best-designed ISA in first-release form Intel has done - when I was first porting my Mlucas asm macros to it in 2H2017 using the GIMPS KNL for builds, the AVX512F (F = Foundation) instructions gave me more or less everything I needed, with just a few small "this particular piece of code only needs a 256-bit vector width, but AVX512F only supports full-width ZMM operands, not YMM, so we use just the low 256 bits of a ZMM for our data" instances. Integer support in AVX512F is not quite so good, especially for wide multiply, but that's not an issue for FFT code. So after hitting the above don't-use-this-instruction-on-Haswell issue, I also built the new code in AVX-512 mode on the barebones 68-core KNL I bought late last year. That crashed with a SIGILL (illegal instruction) exception very early in the post-build self-test. I'd been careful to avoid later-release-than-AVX512F instructions in those versions of the recoded macros - did I maybe miss something? Nope - gdb revealed the problem was GCC-generated code for a simple section of C code consisting of nothing more than some simple pointer-arithmetic adds. Here are the relevant snips of C code and the roughly corresponding disassembly, with the gdb-added => indicating the offender:
[code] nisrt2 = tmp + 0x00; // For the +- isrt2 pair put the - datum first, thus cc0 satisfies
isrt2 = tmp + 0x01; // the same "cc-1 gets you isrt2" property as do the other +-[cc,ss] pairs.
...
0x0000000000723126 <+1718>: vmovdqa %xmm5,0x690(%rsp)
0x000000000072312f <+1727>: vmovq %rsi,%xmm5
0x0000000000723134 <+1732>: lea 0x4540(%rbp),%rsi
0x000000000072313b <+1739>: vpinsrq $0x1,%rcx,%xmm5,%xmm13
0x0000000000723141 <+1745>: vmovq %rsi,%xmm5
0x0000000000723146 <+1750>: lea 0x4580(%rbp),%rsi
=> 0x000000000072314d <+1757>: vmovdqa64 %xmm16,%xmm6
0x0000000000723153 <+1763>: vpinsrq $0x1,%rsi,%xmm5,%xmm5[/code]
Note that there are lots of vmovdqa instructions in the complete disassembly, but the arrowed one is the only 64-suffixed one. This family of instructions is listed as "Move Aligned Packed Integer Values" and comes in a bunch of different flavors depending on the precise ISA one's CPU supports - thanks, Intel - and ta da! Only the 512-bit ZMM-operand version is available in AVX512F; the XMM and YMM forms need AVX512VL, and GCC resorted to the XMM form above. Here is the list of 512-bit instruction subsets supported by the KNL - note no 'vl' suffix in there:

[c]avx512f avx512pf avx512er avx512cd[/c]

Here is the analogous list for my NUC, which explains why no problems occurred for that build:

[c]avx512f avx512cd avx512dq avx512ifma avx512bw avx512vl avx512vbmi[/c]

On the KNL I specified '-O3 -march=knl' for my build (versus '-O3 -march=skylake-avx512' on the NUC), so it's clearly a GCC bug - but in sympathy, I'm guessing the same scattershot-ISA-release fun makes for a real headache for compiler writers. Anyway, what to do? This is not a bug in my assembly, it's simple gcc-compiled C code. Thankfully, the workaround proved to be to compile just this one source file with -O2 and the rest with the usual -O3, and fortunately this was the only source file needing such compile-flag hackery.
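One cheap guard against exactly this failure mode - a sketch using GCC/Clang's x86-only __builtin_cpu_supports(), not anything currently in Mlucas - is to probe at runtime for the specific subsets a code path assumes before dispatching to it:
[code]/* Probe the AVX-512 subsets at runtime: a KNL reports avx512f but not
 * avx512vl; a Skylake-X or Cannon Lake box reports both. */
#include <assert.h>
#include <stdio.h>

int main(void)
{
	__builtin_cpu_init();	/* initialize the CPU-feature cache */
	int f  = !!__builtin_cpu_supports("avx512f");
	int vl = !!__builtin_cpu_supports("avx512vl");
	printf("avx512f=%d avx512vl=%d\n", f, vl);
	/* Sanity checks: flags are 0/1, and any x86_64 CPU has SSE2. */
	assert(f == 0 || f == 1);
	assert(vl == 0 || vl == 1);
	assert(__builtin_cpu_supports("sse2") != 0);
	return 0;
}[/code]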

tdulcet 2021-01-30 14:34

[QUOTE=ewmayer;570037]Thanks for the list, T - while few or no users will be interested in those options, the issues need to get addressed on the way to the 19.1 release.

[B]Edit:[/B]

...

All the above fixes will be in the soon-to-come 19.1 release.[/QUOTE]

No problem. There was also an issue with the [C]./Mlucas -s h[/C] command, where none of the FFT lengths (65536K - 196608K) and radix combos tested worked (see [URL="https://www.mersenneforum.org/showpost.php?p=569998&postcount=84"]post #84[/URL]). Sorry if you already fixed this, but it was not mentioned in your post.

[QUOTE=ewmayer;570455]I'm still working on how to best get consistent timing data.[/QUOTE]

I have been using my [URL="https://github.com/tdulcet/Benchmarking-Tool"]Benchmarking Tool[/URL] to verify the results of the install script, but I think it would also work for this. For example, if you had two binaries, [C]gcc_Mlucas[/C] and [C]clang_Mlucas[/C] compiled with the respective compilers, a command like this would run each 10 times by default and compute the mean, median and standard deviation of the runtimes among other info (Bash syntax):

[C]./time.sh ./{gcc,clang}'_Mlucas -fftlen 6144 -iters 1000 -radset 48,16,16,16,16 -cpu 0,4'[/C]

[QUOTE=ewmayer;570455]On the KNL I specified '-O3 -march=knl' for my build (versus '-O3 -march=skylake-avx512' on the NUC), so it's clearly a GCC bug - but in sympathy, I'm guessing the same scattershot-ISA-release fun makes for a real headache for compiler writers. Anyway, what to do? This is not a bug in my assembly, it's simple gcc-compiled C code.[/QUOTE]

I had a similar issue trying to automatically set the [C]-march=[/C] flag correctly on AVX512 systems in early versions of my install script. The script is automatically tested after every commit [URL="https://travis-ci.org/github/tdulcet/Distributed-Computing-Scripts"]with Travis CI[/URL], which uses Google Cloud for their x86 VMs. After Google Cloud started providing AVX512 systems, the script would sometimes fail. I tried a bunch of different solutions before I figured out I could set the [C]-march=[/C] flag to [C]native[/C] and the compiler would automatically set it and a few other flags to the correct value for the current system, reducing the complexity of the script. This is what the script does now for x86 systems and I have not had any issues on Travis CI or any of the other systems I have tested it on since. I am not sure if this would also work around your GCC bug...

ewmayer 2021-01-31 23:52

[QUOTE=tdulcet;570500]No problem. There was also an issue with the [C]./Mlucas -s h[/C] command, where none of the FFT lengths (65536K - 196608K) and radix combos tested worked (see [URL="https://www.mersenneforum.org/showpost.php?p=569998&postcount=84"]post #84[/URL]). Sorry if you already fixed this, but it was not mentioned in your post.[/quote]
Ah, forgot to note - those incorrect reference residues for the huge-FFT self-test are low-priority, I've added them to my v20 to-do list.

[quote]I have been using my [URL="https://github.com/tdulcet/Benchmarking-Tool"]Benchmarking Tool[/URL] to verify the results of the install script, but I think it would also work for this. For example, if you had two binaries, [C]gcc_Mlucas[/C] and [C]clang_Mlucas[/C] compiled with the respective compilers, a command like this would run each 10 times by default and compute the mean, median and standard deviation of the runtimes among other info (Bash syntax):

[C]./time.sh ./{gcc,clang}'_Mlucas -fftlen 6144 -iters 1000 -radset 48,16,16,16,16 -cpu 0,4'[/C][/quote]
I plan to release 19.1 in the next few days; first I need to play with your recently-enhanced build&tune script so I can add suitable text about that to the README page. More feedback soon.

[quote]I had a similar issues trying to correctly automatically set the [C]-march=[/C] flag on AVX512 systems in early versions of my install script. The script is automatically tested after every commit [URL="https://travis-ci.org/github/tdulcet/Distributed-Computing-Scripts"]with Travis CI[/URL], which uses Google Cloud for their x86 VMs. After Google Cloud started providing AVX512 systems, the script would sometimes fail. I tried a bunch of different solutions before I figured out I could set the [C]-march=[/C] flag to [C]native[/C] and the compiler would automatically set it and a few other flags to the correct value for the current system, reducing the complexity of the script. This is what the script does now for x86 systems and I have not had any issues on Travis CI or any of the other systems I have tested it on since. I am not sure if this would also workaround your GCC bug...[/QUOTE]
-march=native is a good suggestion; alas, it did not cure the illegal-instruction issue with that one .c file in my KNL build. However, it should allow me to simplify the manual-build instructions on the README page, for the same reason you note above. This is the first such GCC bug I've hit in my KNL builds of various Mlucas releases, and since very few people have a KNL and even fewer of them run Mlucas on it, one hopes this sort of issue continues to be a rare glitch over coming GCC releases.

tdulcet 2021-02-01 15:03

New Install Script for Linux version 3
 
[QUOTE=ewmayer;570615]Ah, forgot to note - those incorrect reference residues for the huge-FFT self-test are low-priority, I've added them to my v20 to-do list.[/QUOTE]

OK, thanks.

[QUOTE=ewmayer;570615]I plan to release 19.1 in next few days, first need to play with your recently-enhanced build&tune script so I can add suitable text about that to the README page. More feedback soon.[/QUOTE]

Great, I will look forward to that. I cannot seem to update my attachment on [URL="https://www.mersenneforum.org/showpost.php?p=569920&postcount=83"]post #83[/URL], so I attached version 3 of the new install script for Linux to this post with some minor fixes. It will now correctly generate the combinations of CPU cores/threads to test (step #1) on systems where the number of CPU cores is not a power of two (mainly VMs). It will also handle the case where the FFT lengths in each [C]mlucas.cfg[/C] file from step #2 are not all the same, for example if one or more of the files is missing one or more FFT lengths. I am not sure if this case is possible without there being a bug in Mlucas, but the script should now handle it. Everything else from post #83 still applies, except the [C]exit[/C] command to remove is now on line 400.

[QUOTE=ewmayer;570615]-march=native is a good suggestion, alas it did not cure the illegal-instruction issue with that one .c file in my KNL build. However, it should allow me to simplify the manual-build instructions on the README page, for the same reason you note above. This is the first such GCC bug I've hit in my KNL builds of various Mlucas releases, so since very few people have a KNL and even fewer of them run Mlucas on it, one hopes this sort of issue continue to be a rare glitch over coming GCC releases.[/QUOTE]

With the new install script for Mlucas v19.1, users will be able to run [C]export CC=clang[/C] before the script to build Mlucas with Clang instead of the default GCC, which would be another possible workaround, if anyone else has issues with GCC.

@ewmayer BTW, I just saw [URL="https://www.mersenneforum.org/showthread.php?t=24813"]your thread[/URL] over on the Linux forum from a couple years ago and I thought I should note that the install script will automatically create a script and cron job to do that by default. Early versions of the script put the commands to run both Mlucas and the PrimeNet script entirely in a cron job, similar to what is described in that thread, but on systems with many CPU cores, the cron job was too long, so as of a few months ago, it will put the commands in a separate [C]obj/Mlucas.sh[/C] script, which is then automatically run from the cron job. (The attached version is of course for testing and will not do this unless you remove the [C]exit[/C] command as described in post #83.)

ewmayer 2021-02-01 20:37

A ha ha, I'm such an idiot - let's again look at your example '-s h' self-test error message:
[QUOTE=tdulcet;569998]Res mod 2^35 - 1 = 7270151463
Res mod 2^36 - 1 = 68679090081
*** Res35m1 Error ***
current = 7270151463
should be = 29128713056
*** Res36m1 Error ***
current = 68679090081
should be = 7270151463
--
Return with code ERR_INCORRECT_RES64
Error detected - this radix set will not be used.
WARNING: 0 of 10 radix-sets at FFT length 65536 K passed - skipping it. PLEASE CHECK YOUR BUILD OPTIONS. [/QUOTE]
Notice that the 'current' (= from self-test) Res35m1 value matches the 'should be' (= pretabulated reference value) Res36m1 one? Let's have a look at the pretabulated reference residues for this FFT length in Mlucas.c:
[i]
{ 65536,1166083297u, { {0xA0FE3066C834E360ull, 29128713056ull, 7270151463ull}, ...
[/i]
65536K FFT, reference test exponent 1166083297, followed by 3 ref-residue triplets, corresponding to the test residue modulo 2^64 (low 64 bits of full residue = Res64, in hex), 2^35-1 and 2^36-1. The triplets are for -iters 100,1000,10000, I have copied only the 100-iter one above.

The residues modulo 2^35-1 and 2^36-1 are a.k.a. the Selfridge-Hurwitz residues, after the 2 worthies who used them for their Fermat-number primality-testing work in the 1960s, on mainframe hardware which supported a 36-bit integer type. They also included the residue modulo 2^36, but as that is just the low 36 bits of the GIMPS-used Res64, it's redundant. But until a couple years ago, Mlucas would print all 4 residues like so - let's again use the above case to illustrate, since we can trivially extract the Res36 from the Res64:
[code]Res64: 0xA0FE3066C834E360
Res mod 2^36 = 29128713056
Res mod 2^35 - 1 = 7270151463
Res mod 2^36 - 1 = 68679090081[/code]
Now it's clear what must've happened - a few years back when I tabulated those '-s h' entries, I must've done all the needed runs, assembled the resulting 4-line entries as above, then batch-edited them. Should've deleted the Res mod 2^36 line and used the remaining 3, but must've instead kept the first 2 decimal residues and deleted the mod 2^36 - 1 one in a bunch of cases.

The good news is that that makes it easy to see which tabulated entries are fubar in this manner - just extract the low 9 hexits of the Res64 for each triplet, print them in decimal, and see if the result matches the first of the 2 following 36-bit decimal entries in the triplet; if so, it's fubar.
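To make the check concrete, a small stand-alone sketch using the 65536K entry quoted above - the low 9 hexits of the Res64 are exactly its residue mod 2^36, and for a fubar entry they equal the value sitting in the tabulated Res35m1 slot:
[code]/* Fubar-check: if the low 36 bits of the Res64 match the first decimal
 * entry of the triplet, the mod-2^36 residue was kept by mistake. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t res64 = 0xA0FE3066C834E360ull;	/* 65536K, 100-iter */
	const uint64_t slot1 = 29128713056ull;		/* tabulated "Res35m1" */
	const uint64_t res36 = res64 & ((1ull << 36) - 1);	/* low 9 hexits */
	printf("Res mod 2^36 = %llu\n", (unsigned long long)res36);
	/* Match ==> this triplet is fubar and needs a redo: */
	assert(res36 == slot1);
	return 0;
}[/code]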

Per that, for the 100-iter triplets, all but the last 3 need a redo - but those 3, for FFT lengths 212992, 229376 and 245760, should be OK. Further, none of the 1000-iter reference triplets show the above 'whoops', so those should all be OK.

But I see I never got around to filling in the 10000-iter table entries for these large FFT lengths, so since the 100- and 1000-iter residues are pretty fast to recompute compared to those, gonna redo them all.



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.