|
|
#12 |
|
∂²ω=0
Sep 2002
República de California
103×113 Posts |
One of the beta-code build & test-ers reports build and runtime errors on several varieties of big-endian hardware ... a code review confirms that some byte-array-based bitwise-utilities functionality I added in the last few years for the sake of efficiency breaks endian-independence. Easy enough to fix the issue - just need to wrap the handful of byte-array-based utils in an endian-ness preprocessor clause and run the byte-processing in reverse order in the big-endian case. But - what compiler predefine to use for said preprocessor clauses? On my Mac, 'gcc -dM -E [random source file] < /dev/null | grep ENDIAN' gives this:
#define __LITTLE_ENDIAN__ 1
My hopes that that would be a gcc-standard predef were quickly dashed - on my ARMv8/linux, things are far less straightforward:
#define __ORDER_LITTLE_ENDIAN__ 1234
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __ORDER_PDP_ENDIAN__ 3412
#define __ORDER_BIG_ENDIAN__ 4321
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
As long as the range of supported predefs across Posixworld is decently small that's OK - can folks reading this try the above gcc predef-dump command on their systems and let me know if they spot anything that would not be covered by the following?
Code:
#if (__LITTLE_ENDIAN__ == 0) || (__BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__)
	#define USE_BIG_ENDIAN
#endif
Last fiddled with by ewmayer on 2019-02-26 at 22:17
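One caveat worth checking with a test of this shape: inside `#if`, an identifier that is not defined evaluates to 0, so on a platform whose predef dump has no `__LITTLE_ENDIAN__` at all (as in the ARMv8/linux dump), the clause `__LITTLE_ENDIAN__ == 0` is true and `USE_BIG_ENDIAN` gets defined on little-endian hardware. Below is a more defensive sketch that tests `defined(...)` first. `USE_BIG_ENDIAN` is the name from the post; `__BIG_ENDIAN__` is an assumption (it appears in neither dump above), and `compiled_as_big_endian` and `host_is_big_endian` are hypothetical helpers added only to cross-check the compile-time answer at runtime.

```c
#include <stdint.h>
#include <string.h>

/* Prefer the __BYTE_ORDER__ family when the compiler provides it. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__)
  #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    #define USE_BIG_ENDIAN
  #endif
/* Assumption: compilers without __BYTE_ORDER__ define __BIG_ENDIAN__
   on big-endian targets (not confirmed by the dumps in this thread). */
#elif defined(__BIG_ENDIAN__) && !defined(__LITTLE_ENDIAN__)
  #define USE_BIG_ENDIAN
#endif

/* Hypothetical helper: what the preprocessor logic above decided. */
static int compiled_as_big_endian(void)
{
#ifdef USE_BIG_ENDIAN
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: inspect the first byte of a known 32-bit pattern. */
static int host_is_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0x01;
}
```

A cheap startup sanity check would be to compare `compiled_as_big_endian()` with `host_is_big_endian()` and abort on a mismatch, which would catch a platform whose predefines are not covered.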
|
|
|
|
|
#13 |
|
∂²ω=0
Sep 2002
República de California
103·113 Posts |
Mlucas v18 has gone live - I've updated the OP in this thread to note that and to remove the beta source-tarball. Thanks to all who built and provided feedback. Would someone with access to an ARMv8 CPU please try the prebuilt-with-SIMD binary I posted? I built it on my Odroid C2, with non-static linkage (which is how the v17.1 binaries were done, IIRC), and need to see how portable that is.
Summary of changes-since-beta-source posted: I found and fixed several bugs in Mlucas.c, the first 2 of which were exposed by the same testing circumstance, where I had a 1st-time LL-test with p in the 80M range complete and then the code started in on the next assignment, which was a partially-complete (migrated from another machine) DC in the 50M range. So those 2 bugs can be considered corner(ish)-case scenarios, but it was still important to get them fixed.
Bug 1: During processing of the savefiles for the 50M-exponent, the code spat this out:
read_ppm1_savefiles: On restart: Res35m1 checksum error!ERROR: read_ppm1_savefiles Failed on savefile p55******!
After inserting the obviously-missing newline that should follow the !, I dug into the source of the error, which refers to the Selfridge-Hurwitz residues (LL-test residue mod 2^35-1 and 2^36-1) which I compute as an integrity-check for Mlucas savefiles. (The S-H residues were pioneered by those 2 luminaries in their Fermat-number-testing work during the 1950s on hardware of the day which supported 36-bit integers in addition to floating-point numbers.) When writing an interim savefile I compute those during the conversion of the residue from floating-point to packed-bit form and tack them onto the full-length residue written to the 2 redundant savefiles. When I restart from a savefile, after reading the full-length residue R and the 2 checksums, I use a different method to on-the-fly compute R mod 2^35-1 and 2^36-1, namely the fast Montgomery-mod remaindering I described in this manuscript a few years back. I then compare those 2 just-computed remainders to the ones stored in the savefile, to make sure the savefile data were not corrupted in some fashion. It was that check which was failing, and doing so on both the primary and (normally identical) secondary savefiles.
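For readers unfamiliar with this kind of checksum, here is a minimal sketch of computing a multiword residue mod 2^35-1 and 2^36-1. It is not the Montgomery-mod remaindering mentioned above; it is a plain limb-by-limb Horner reduction that relies on 2^64 being congruent to 2^(64-k) modulo 2^k-1. The function name `residue_mod_2k_minus_1` and the least-significant-limb-first layout are assumptions made for illustration, not taken from the Mlucas source.

```c
#include <stdint.h>
#include <stddef.h>

/* R mod (2^k - 1) for k = 35 or 36 (any 32 < k < 64 works), with R held
   as n 64-bit limbs, least-significant limb first (assumed layout).
   Uses 2^64 == 2^(64-k) (mod 2^k - 1) to fold in one limb at a time. */
static uint64_t residue_mod_2k_minus_1(const uint64_t *limb, size_t n, unsigned k)
{
    const uint64_t m = (UINT64_C(1) << k) - 1;
    const unsigned shift = 64 - k;
    uint64_t rem = 0;
    for (size_t i = n; i-- > 0; ) {
        /* rem < 2^k, so rem << shift stays below 2^64 */
        rem = ((rem << shift) % m + limb[i] % m) % m;
    }
    return rem;
}
```

Calling it once with k = 35 and once with k = 36 on the same limb array gives the two values that would be compared against the checksums stored in a savefile.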
The problem turned out to be this: I use a bytewise array to store R, and in calling the aforementioned remaindering function, which is part of my mi64 (personal GMP-style library) function suite, I cast said array from (uint8*) to (uint64*). (If you're about to ask "but won't that break endian-portability?", indeed it does - more on that in Bug 5 below.) Problem was, in the above finish-big-exponent-then-proceed-to-DC scenario, I was failing to clear any high bytes in the topmost 64-bit limb of the resulting treated-as-64-bit-integer array, i.e. bytes above those needed for the current p-bit residue which had previously held bytes of the larger previous-test residue. After adding the needed short (1-7 passes) clear-bytes loop, all was well, but then I hit...
Bug 2: The first test in the above one-test-finishes-and-we-proceed-to-the-next-one scenario was for an exponent very close to the 4608K FFT-length upper limit, so much so that at several points during the run the program detected a 0.4375 fractional part during the per-iteration rounding step, causing it to stop execution, reread from the last savefile and restart at FFT length 5120K. Based on the relative rarity of the 0.4375 ROEs I decided running at 4608K was safe except for the occasional 0.4375-containing iteration interval, so whenever I noticed such auto-switching to 5120K had occurred I killed the code and restarted with an explicit '-fftlen 4608' added to the command line, which overrides the last-FFT-length-used stored in the savefile. Problem was, after finishing the first-time LL test using that length, the program also overrode the 3072K default length for the subsequent DC exponent with 4608K. So a bug in the control logic, now fixed. Note that Bug 1 is not in play if the next-assignment is the typical from-scratch one.
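The high-byte clearing described under Bug 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Mlucas fix: `clear_top_limb_padding` is a hypothetical name, the buffer is assumed to be allocated in whole 8-byte limbs, and `memset` stands in for the short clear-bytes loop.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Zero the bytes that lie above a p-bit residue but still inside its
   topmost 64-bit limb, so stale data from an earlier, larger residue
   cannot leak in when the byte array is read as 64-bit limbs.
   Assumes buf holds at least a whole number of 8-byte limbs.
   Returns the number of limbs covering the residue. */
static size_t clear_top_limb_padding(uint8_t *buf, uint64_t p)
{
    const size_t nbytes = (size_t)((p + 7) / 8);  /* bytes holding the residue */
    const size_t nlimbs = (nbytes + 7) / 8;       /* limbs covering those bytes */
    /* Between 0 and 7 padding bytes sit below the limb boundary. */
    memset(buf + nbytes, 0, nlimbs * 8 - nbytes);
    return nlimbs;
}
```

The point of returning the limb count is that the caller can pass exactly that many limbs to whatever remaindering routine it uses, with the padding already known to be zero.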
Alex Vong, who is working to incorporate Mlucas v18 into the Debian freeware suite, also reported a few bugs:
Bug 3: In 'src/radix16_dyadic_square.c', the function 'SSE2_RADI16_CALC_TWIDDLES_1_2_4_8_13(...)' is missing an 'X'; it should be 'SSE2_RADIX16_CALC_TWIDDLES_1_2_4_8_13(...)' instead. -- This silly typo is in the 32-bit-build preprocessor-flag-wrapped portion of said sourcefile, and the code is used only for Fermat-number testing, not Mersenne, but it's still a showstopper in that build mode because it will prevent object-linkability-into-a-binary. Rather bizarrely, on my Mac both cc and clang fail to flag the no-macro-by-this-name error.
Bug 4: Missing wide-integer-product macros and wide-mul macro syntax errors in PowerPC 32-bit builds. It's been so long since I've built on PPC32 that this is a wayback-machine exercise, but anyhow: The missing __MULL64 and __MULH64 macros have been added, and the macro name-collisions which caused the syntax errors have been fixed.
Bug 5: Endian-portability broken due to several byte-array-based functions I added to Mlucas v17. This has been fixed. At least I believe it's been fixed - I don't have access to any big-endian hardware.
Bug 6: Fixed one-dereference-too-few error "_cy_r[0] = -2" instead of "_cy_r[0][0] = -2" in non-SIMD code in radix[1008|1024|4032]*c. These were not present in v17, rather were introduced by some careless search-and-replace-across-multiple-files editing I did in my v18 development work. I only noticed the errors when I did a non-SIMD v18 build on ARM just prior to release and hit segfaults for those carry-step-wrapping radices during self-testing.
Last fiddled with by ewmayer on 2019-03-06 at 21:49
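On the endian-portability point in Bug 5: one common way to sidestep the (uint8*)-to-(uint64*) cast is to assemble each limb from its bytes with shifts, which gives the same value on any host. This is an alternative technique shown for illustration, not the fix that went into v18 (which, per the earlier post, wraps the byte-array utilities in a preprocessor clause and reverses the byte processing on big-endian). `load_limb_le` is a hypothetical name, and the least-significant-byte-first storage order is an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Read 64-bit limb number i from a byte array that stores the residue
   least-significant byte first (assumed layout). Building the value
   with shifts gives the same result on little- and big-endian hosts. */
static uint64_t load_limb_le(const uint8_t *bytes, size_t i)
{
    uint64_t limb = 0;
    for (int b = 7; b >= 0; --b)
        limb = (limb << 8) | bytes[8 * i + (size_t)b];
    return limb;
}
```

The trade-off is speed: a direct cast or `memcpy` reads a limb in one load on little-endian hardware, whereas this version does eight byte loads unless the compiler recognizes the pattern.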
|
|
|
|
|
#14 |
|
"Sam Laur"
Dec 2018
Turku, Finland
13E₁₆ Posts |
Works on Raspberry Pi 3B+ and 3A+... No changes in performance though, at 5120K FFT size it can't use the radix-320 front end due to excessive roundoff. Works at 2560K though, but there it's about the same speed as before (but it still chose 160 32 16 16 before on version 17.1, and 320 16 16 16 now on version 18.0, so maybe there's some difference anyway).
|
|
|
|
|
|
#15 | |
|
∂²ω=0
Sep 2002
República de California
10110101110111₂ Posts |
Quote:
2560K: Radices 320,16x3 run @123.3 ms/iter, maxROE = 0.3125; 160,32,16,16 @124.1 ms/iter, maxROE = 0.34375, so radix 320 gives a tiny speedup here, and both top-candidate radix sets give acceptable ROE levels.
5120K: 320,32,16,16 gives 286.9 ms/iter but ROE = 0.4375 on iters 80,752 (thus deemed ineligible as the cfg-file entry for this FFT length); 160,32,32,16 gives 281.9 ms/iter and maxROE = 0.3125, thus is both fastest and has acceptably low ROE, thus gets the nod.
Now were those 2 timings reversed and were I planning to do some first-time-tests @5120K on the hardware in question, I would consider manually hacking the mlucas.cfg file to force radix set 320,32,16,16 at this length. Do you still have your self-test screenlog so you can check the timings in this manner?
The timing deterioration on the C2 between 2560 and 5120K is marked - this hardware is thus ill-suited for first-time-tests even ignoring the long runtime and risk of assignment-expiry such a run would incur.
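The selection rule implied by those numbers (take the fastest radix set whose maximum ROE stays acceptable) can be sketched like this. It is only an illustration: the struct, the function name `pick_radix_set`, and the choice of 0.4375 as the rejection threshold are assumptions inferred from the timings quoted above, not the self-test logic in the Mlucas source.

```c
#include <stddef.h>

/* One self-test result for a candidate radix set (hypothetical layout). */
struct radix_timing {
    const char *radices;   /* e.g. "160,32,32,16" */
    double ms_per_iter;    /* measured time per iteration */
    double max_roe;        /* largest roundoff error seen */
};

/* Return the index of the fastest entry whose max ROE is strictly below
   roe_limit, or -1 if every candidate is rejected. */
static int pick_radix_set(const struct radix_timing *t, size_t n, double roe_limit)
{
    int best = -1;
    for (size_t i = 0; i < n; ++i) {
        if (t[i].max_roe >= roe_limit)
            continue;  /* too close to a rounding failure */
        if (best < 0 || t[i].ms_per_iter < t[(size_t)best].ms_per_iter)
            best = (int)i;
    }
    return best;
}
```

Fed the two 5120K entries above with a limit of 0.4375, this picks 160,32,32,16; fed the two 2560K entries, it picks the radix-320 set.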
|
|
|
|
|
|
#16 |
|
"Sam Laur"
Dec 2018
Turku, Finland
2·3·53 Posts |
Well I thought I had saved the output from the self-test but I forgot that it goes to stderr, not stdout. Meh. Ran it again, anyway.
2560K 320 16 16 16: 145.1 ms/iter, MaxErr = 0.34375 (chosen in mlucas.cfg)
2560K 160 16 16 32: 152.2 ms/iter, MaxErr = 0.3125
2560K 160 32 16 16: 146.7 ms/iter, MaxErr = 0.3125
5120K 320 16 16 32: 356.2 ms/iter, MaxErr = 0.46875 (limit exceeded on iters 425, 709, 795 if it makes a difference)
5120K 160 16 32 32: 349.3 ms/iter, MaxErr = 0.3125
5120K 160 32 32 16: 336.7 ms/iter, MaxErr = 0.28125 (chosen in mlucas.cfg)
So the same here, 5120K with radix-320 is slower for some reason. I'm only running double checks, and the exponents I'm getting from Primenet seem to run on a 2816K FFT. I really don't know how the internals work, so what I'm asking may be totally silly: would a doubling of radix-176 (352?) be beneficial, or even work at all?
Oh, and one more thing. When I stop the program with Control-C as before, there is this error message:
received SIGINT signal.
ERROR: at line 2146 of file ../src/mers_mod_square.c
Assertion failed: nanosleep fail!
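For context on that last message: under POSIX, `nanosleep` returns -1 with `errno` set to `EINTR` when a signal handler runs during the sleep, so a bare "must return 0" assertion can trip on Control-C even though nothing is actually wrong. Whether that is what happens at line 2146 is not established here. The sketch below shows the usual retry pattern; `sleep_uninterrupted` is a hypothetical helper, not a function from the Mlucas source.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Sleep for the requested interval, resuming with the remaining time if
   a signal handler interrupts the call. Returns 0 on success and -1 on
   any failure other than EINTR. */
static int sleep_uninterrupted(struct timespec req)
{
    struct timespec rem;
    while (nanosleep(&req, &rem) != 0) {
        if (errno != EINTR)
            return -1;   /* genuine failure, e.g. invalid interval */
        req = rem;       /* interrupted: continue with the time left */
    }
    return 0;
}
```

A program that wants to exit promptly on SIGINT would check its own "signal received" flag inside the loop instead of always resuming the sleep.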
|
|
|
|
|
#17 | |||
|
∂²ω=0
Sep 2002
República de California
103×113 Posts |
Quote:
Quote:
Quote:
|
|||
|
|
|
|
|
#18 | |
|
"Sam Laur"
Dec 2018
Turku, Finland
2·3·53 Posts |
Quote:
Code:
[Mar 07 20:29:18] M5132xxxx Iter# = 36790000 [71.67% complete] clocks = 00:28:04.083 [168.4083 msec/iter] Res64: 74873862A50BB57E. AvgMaxErr = 0.071576885. MaxErr = 0.109375000. Residue shift count = 0.
Restarting M5132xxxx at iteration = 36790000. Res64: 74873862A50BB57E, residue shift count = 0
M5132xxxx: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 0
this gives an average 17.800407062877309 bits per digit
Using complex FFT radices 176 32 16 16
[Mar 08 17:10:27] M5132xxxx Iter# = 36800000 [71.69% complete] clocks = 00:27:58.834 [167.8834 msec/iter] Res64: C6533BF704CDF1F1. AvgMaxErr = 0.071412456. MaxErr = 0.101562500. Residue shift count = 0.
|
|
|
|
|
|
#19 |
|
∂²ω=0
Sep 2002
República de California
103×113 Posts |
Sorry to see signal-handling not working for you - that was a *very* late-breaking addition to v18. In any event you're no worse off than with the previous version, you just need to stop it with the constant interruptions! How do you expect us to get any work done... :)
Also, a corrigendum to my note re. leading-radix 352 and FFT length 5632K: Actually p ~106M is the *upper* limit for 5632, lower limit (i.e. upper limit for 5120K) is ~96M. So I guess I better get on that! |
|
|
|
|
|
#20 | |
|
"Sam Laur"
Dec 2018
Turku, Finland
2·3·53 Posts |
Quote:
|
|
|
|
|
|
|
#21 | |
|
∂²ω=0
Sep 2002
República de California
103×113 Posts |
Quote:
|
|
|
|
|
|
|
#22 | ||
|
∂²ω=0
Sep 2002
República de California
103×113 Posts |
Quote:
Code:
[Mar 11 06:22:34] M86687009 Iter# = 8030000 [ 9.26% complete] clocks = 00:01:58.909 [ 11.8909 msec/iter] Res64: 5C4BB4BE6AE5BBB0. AvgMaxErr = 0.214745667. MaxErr = 0.312500000. Residue shift count = 12479875.
[Mar 11 06:24:32] M86687009 Iter# = 8040000 [ 9.27% complete] clocks = 00:01:58.691 [ 11.8691 msec/iter] Res64: 6757643F59CD637A. AvgMaxErr = 0.214732219. MaxErr = 0.312500000. Residue shift count = 15291034.
received SIGTERM signal. Iter = 8041419: Writing savefiles and exiting.
[Mar 11 06:24:50] M86687009 Iter# = 8041419 [ 9.28% complete] clocks = 00:00:16.910 [ 11.9174 msec/iter] Res64: A15F129B39F5F7AD. AvgMaxErr = 0.214937561. MaxErr = 0.281250000. Residue shift count = 24137129.
...
Restarting M86687009 at iteration = 8041419. Res64: A15F129B39F5F7AD, residue shift count = 24137129
M86687009: using FFT length 4608K = 4718592 8-byte floats, initial residue shift count = 24137129
this gives an average 18.371372010972763 bits per digit
Using complex FFT radices 288 16 16 32
[Mar 11 13:15:57] M86687009 Iter# = 8050000 [ 9.29% complete] clocks = 00:01:40.629 [ 11.7271 msec/iter] Res64: FA296EE64B5710E2. AvgMaxErr = 0.214812786. MaxErr = 0.281250000. Residue shift count = 20164181.
Quote:
|
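The "bits per digit" figure in the restart log above is simply the exponent divided by the FFT length in 8-byte floats, where a length quoted in K means multiples of 1024. A small sketch that reproduces the logged numbers follows; the function names are made up for illustration and are not taken from the Mlucas source.

```c
#include <stdint.h>

/* FFT length in 8-byte floats for a length quoted in K (4608K -> 4718592). */
static uint64_t fft_len_floats(uint64_t fft_len_k)
{
    return fft_len_k * 1024;
}

/* Average number of residue bits carried by each FFT input word. */
static double bits_per_digit(uint64_t p, uint64_t fft_len_k)
{
    return (double)p / (double)fft_len_floats(fft_len_k);
}
```

For p = 86687009 at 4608K this gives about 18.3714 bits per digit, matching the log. The higher that figure is for a given FFT length, the closer the run sits to the roundoff limit discussed earlier in the thread.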
||
|
|
|