mersenneforum.org Mlucas v18 available

2019-02-26, 22:04   #12
ewmayer
∂²ω=0

Sep 2002
República de California

2×3×29×67 Posts

One of the beta-code build & test-ers reports build and runtime errors on several varieties of big-endian hardware ... a code review confirms that some byte-array-based bitwise-utilities functionality I added in the last few years for the sake of efficiency breaks endian-independence. Easy enough to fix the issue - just need to wrap the handful of byte-array-based utils in an endian-ness preprocessor clause and run the byte-processing in reverse order in the big-endian case.

But - what compiler predefine to use for said preprocessor clauses? On my Mac, 'gcc -dM -E [random source file] < /dev/null | grep ENDIAN' gives this:

#define __LITTLE_ENDIAN__ 1

My hopes that that would be a gcc-standard predef were quickly dashed - on my ARMv8/linux, things are far less straightforward:

#define __ORDER_LITTLE_ENDIAN__ 1234
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __ORDER_PDP_ENDIAN__ 3412
#define __ORDER_BIG_ENDIAN__ 4321
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__

As long as the range of supported predefs across Posixworld is decently small that's OK - can folks reading this try the above gcc predef-dump command on their systems and let me know if they spot anything that would not be covered by the following?

Code:
#if (__LITTLE_ENDIAN__ == 0) || (__BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__)
  #define USE_BIG_ENDIAN
#endif

Edit: Another option would be to key off the relatively limited set of CPU families using big-endian in the platform.h file and set an internal USE_BIG_ENDIAN preprocessor flag based on CPU-family. Since most major CPU families on which the code has been built already have their own little predef-sections in the header file, it would simply be another predef that gets set-or-not there. Thoughts welcome!

Last fiddled with by ewmayer on 2019-02-26 at 22:17
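One caveat with the proposed test, offered as a sketch rather than a definitive fix: inside `#if`, an identifier that is not defined as a macro evaluates to 0. On a compiler that predefines only the `__BYTE_ORDER__` family (as in the ARMv8/linux dump above), `__LITTLE_ENDIAN__ == 0` is therefore true and `USE_BIG_ENDIAN` would get set on a little-endian machine. A more defensive variant guards each comparison with `defined(...)`. The `__BIG_ENDIAN__` fallback branch and the two helper function names below are assumptions made for illustration; they are not taken from the Mlucas sources.

```c
#include <stdint.h>
#include <string.h>

/* Prefer the __BYTE_ORDER__ family when the compiler provides it. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__)
  #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    #define USE_BIG_ENDIAN
  #endif
/* Assumed fallback: some compilers predefine __BIG_ENDIAN__ or
   __LITTLE_ENDIAN__ (double trailing underscore) instead. */
#elif defined(__BIG_ENDIAN__) && !defined(__LITTLE_ENDIAN__)
  #define USE_BIG_ENDIAN
#endif

/* Hypothetical helper: what the preprocessor logic above decided. */
int compile_time_big_endian(void)
{
#ifdef USE_BIG_ENDIAN
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: runtime probe, looks at the first byte in memory
   of a known 32-bit pattern. */
int host_is_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0x01;
}
```

Comparing the two helpers at startup gives a cheap sanity check that the compile-time choice matches the machine the binary actually runs on.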
2019-03-07, 08:06   #14
nomead

"Sam Laur"
Dec 2018
Turku, Finland

317 Posts

Quote:
 Originally Posted by ewmayer Would someone with access to an ARMv8 CPU please try the prebuilt-with-SIMD binary I posted? I built it on my Odroid C2, with non-static linkage (which is how the v17.1 binaries were done, IIRC), and need to see how portable that is.
Works on Raspberry Pi 3B+ and 3A+... No changes in performance though, at 5120K FFT size it can't use the radix-320 front end due to excessive roundoff. Works at 2560K though, but there it's about the same speed as before (but it still chose 160 32 16 16 before on version 17.1, and 320 16 16 16 now on version 18.0, so maybe there's some difference anyway).

2019-03-07, 20:15   #15
ewmayer
∂²ω=0

Sep 2002
República de California

11658₁₀ Posts

Quote:
 Originally Posted by nomead Works on Raspberry Pi 3B+ and 3A+... No changes in performance though, at 5120K FFT size it can't use the radix-320 front end due to excessive roundoff. Works at 2560K though, but there it's about the same speed as before (but it still chose 160 32 16 16 before on version 17.1, and 320 16 16 16 now on version 18.0, so maybe there's some difference anyway).
Hmm, just did a quick single-FFT-length self-test @2560 and 5120K on my Odroid C2 using the SIMD binary, '-cpu 0:3 -iters 1000' - here is the summary:

2560K: Radices 320,16x3 run @123.3 ms/iter, maxROE = 0.3125; 160,32,16,16 @124.1 ms/iter, maxROE = 0.34375, so radix 320 gives a tiny speedup here, and both top-candidate radix sets give acceptable ROE levels.

5120K: 320,32,16,16 gives 286.9 ms/iter but ROE = 0.4375 on iters 80 and 752 (thus deemed ineligible as the cfg-file entry for this FFT length); 160,32,32,16 gives 281.9 ms/iter and maxROE = 0.3125, so it is both fastest and has acceptably low ROE, and thus gets the nod. Now were those 2 timings reversed and were I planning to do some first-time-tests @5120K on the hardware in question, I would consider manually hacking the mlucas.cfg file to force radix set 320,32,16,16 at this length. Do you still have your self-test screenlog so you can check the timings in this manner?

The timing deterioration on the C2 between 2560 and 5120K is marked - this hardware is thus ill-suited for first-time-tests even ignoring the long runtime and risk of assignment-expiry such a run would incur.
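A small arithmetic check on the radix sets quoted in this thread: judging purely from the logged numbers (e.g. 2816K = 2883584 8-byte floats with complex radices 176, 32, 16, 16), the product of the complex radices equals half the real FFT length. The sketch below encodes that inferred relation; the function names are hypothetical and this is not how Mlucas itself validates a radix set.

```c
#include <stddef.h>
#include <stdint.h>

/* Product of the complex FFT radices in one radix set. */
uint64_t radix_product(const unsigned *radices, size_t n)
{
    uint64_t prod = 1;
    for (size_t i = 0; i < n; i++)
        prod *= radices[i];
    return prod;
}

/* Returns 1 if the radix set is consistent with a real FFT length of
   fft_k * 1024 doubles, i.e. product of complex radices == length / 2. */
int radices_match_fft_length(const unsigned *radices, size_t n, uint64_t fft_k)
{
    return 2 * radix_product(radices, n) == fft_k * 1024;
}
```

For instance, 320,16,16,16 and 160,32,16,16 both multiply out to 1310720, which is half of 2560K = 2621440, so they are alternative factorizations of the same transform length.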

2019-03-08, 22:44   #16
nomead

"Sam Laur"
Dec 2018
Turku, Finland

317₁₀ Posts

Well I thought I had saved the output from the self-test but I forgot that it goes to stderr, not stdout. Meh. Ran it again, anyway.

2560K 320 16 16 16: 145.1 ms/iter, MaxErr = 0.34375 (chosen in mlucas.cfg)
2560K 160 16 16 32: 152.2 ms/iter, MaxErr = 0.3125
2560K 160 32 16 16: 146.7 ms/iter, MaxErr = 0.3125
5120K 320 16 16 32: 356.2 ms/iter, MaxErr = 0.46875 (limit exceeded on iters 425, 709, 795 if it makes a difference)
5120K 160 16 32 32: 349.3 ms/iter, MaxErr = 0.3125
5120K 160 32 32 16: 336.7 ms/iter, MaxErr = 0.28125 (chosen in mlucas.cfg)

So the same here, 5120K with radix-320 is slower for some reason.

I'm only running double checks, and the exponents I'm getting from Primenet seem to run on a 2816K FFT. I really don't know how the internals work, so what I'm asking may be totally silly: would a doubling of radix-176 (352?) be beneficial, or even work at all?

Oh, and one more thing. When I stop the program with Control-C as before, there is this error message:

received SIGINT signal.
ERROR: at line 2146 of file ../src/mers_mod_square.c
Assertion failed: nanosleep fail!

2019-03-08, 23:10   #17
ewmayer
∂²ω=0

Sep 2002
República de California

2×3×29×67 Posts

Quote:
 Originally Posted by nomead Well I thought I had saved the output from the self-test but I forgot that it goes to stderr, not stdout. Meh. Ran it again, anyway.
 2560K 320 16 16 16: 145.1 ms/iter, MaxErr = 0.34375 (chosen in mlucas.cfg)
 2560K 160 16 16 32: 152.2 ms/iter, MaxErr = 0.3125
 2560K 160 32 16 16: 146.7 ms/iter, MaxErr = 0.3125
 5120K 320 16 16 32: 356.2 ms/iter, MaxErr = 0.46875 (limit exceeded on iters 425, 709, 795 if it makes a difference)
 5120K 160 16 32 32: 349.3 ms/iter, MaxErr = 0.3125
 5120K 160 32 32 16: 336.7 ms/iter, MaxErr = 0.28125 (chosen in mlucas.cfg)
 So the same here, 5120K with radix-320 is slower for some reason.
Thanks - eagle-eyed readers may note that while your overall results are essentially the same, some of the details - specifically the precise maxROE value and iterations-with-ROE-warning - differ from those I posted. Same binary, same-ARMv8-compliant-hardware, so shouldn't the numbers be *exactly* the same? The reason for subtle differences lies in v18's usage of random residue shift - if your initial shift count differs from mine, the ROE numbers will as well.
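To illustrate why a different shift count changes the intermediate numbers but not the final answer, here is a toy integer model of a shifted Lucas-Lehmer test. It assumes the usual scheme in which the residue is stored multiplied by 2^s mod 2^p - 1, the shift doubles (mod p) with every squaring, and the subtracted 2 is shifted to match. This is a sketch of the general technique only: the function name is hypothetical, it handles only p <= 31 so everything fits in 64-bit integers, and it is not Mlucas's implementation. Exact integers have no roundoff error; the point is just that the stored digits, and hence the floating-point FFT inputs in a real run, depend on the shift.

```c
#include <stdint.h>

/* Toy shifted Lucas-Lehmer test for odd prime p with 3 <= p <= 31.
   Returns the final UNSHIFTED residue mod M = 2^p - 1 (0 iff M is prime),
   computed while carrying a residue shift that starts at s0. */
uint64_t ll_residue_shifted(unsigned p, unsigned s0)
{
    const uint64_t m = ((uint64_t)1 << p) - 1;
    unsigned s = s0 % p;
    uint64_t x = ((uint64_t)4 << s) % m;         /* stored value: 4 * 2^s */

    for (unsigned i = 0; i < p - 2; i++) {
        s = (2 * s) % p;                         /* squaring doubles the shift */
        uint64_t two = ((uint64_t)2 << s) % m;   /* the "-2" carries the same shift */
        x = ((x * x) % m + m - two) % m;
    }
    /* Undo the shift: multiply by 2^(p - s), which is 2^(-s) mod M. */
    return (x << ((p - s) % p)) % m;
}
```

Every starting shift yields the same unshifted residue (for example, zero for p = 13 and p = 31, and one common nonzero value for p = 29), even though the stored value x differs at each iteration.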

Quote:
 I'm only running double checks, and the exponents I'm getting from Primenet seem to run on a 2816K FFT. I really don't know how the internals work, so what I'm asking may be totally silly: would a doubling of radix-176 (352?) be beneficial, or even work at all?
Not silly at all - but the larger initial radices appear quite hit-or-miss in terms of speedups: 288 is better than 144 across most platforms, especially at 4608K. (2304K is more precise-platform dependent). Radix-320 was rather more disappointing in that regard. I plan to implement Radix-352 in v19, but it's more likely to have an impact at 5632K than at 2816K, i.e. once the GIMPS first-time-testing wavefront passes p ~106M, thus no huge rush.

Quote:
 Oh, and one more thing. When I stop the program with Control-C as before, there is this error message: received SIGINT signal. ERROR: at line 2146 of file ../src/mers_mod_square.c Assertion failed: nanosleep fail!
I get those errors sometimes, typically in the context of running under the debugger - they basically mean some signal has interacted badly with the nanosleep() command I use as part of my wait-for-all-threads-to-finish-current-task management in multithreaded execution mode. Future enhancements of the new signal-catching code using the supposedly more robust sigaction() may help here; for now, YMMV as to whether the signal code works as intended. Worst case you lose the iterations done since the last normally scheduled checkpoint. I assume you restarted the above run - can you post the snip of code from the p*.stat file bracketing the interrupt?
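For readers wondering how a signal can make nanosleep() "fail": per POSIX, when a signal handler runs during the sleep, nanosleep() returns -1 with errno set to EINTR and stores the unslept time in its second argument. One plausible reading of the assertion above is that any nonzero return is treated as fatal. The sketch below shows the common retry-on-EINTR pattern together with a sigaction()-installed handler that only sets a flag; all function and variable names are hypothetical, and this is not the Mlucas code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Flag set by the handler; the main loop polls it at a safe point
   (e.g. end of an iteration) and then writes savefiles and exits. */
static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int signo)
{
    (void)signo;
    stop_requested = 1;   /* only async-signal-safe work in here */
}

/* Install the flag-setting handler for one signal via sigaction(). */
int install_stop_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

int stop_was_requested(void)
{
    return stop_requested != 0;
}

/* Sleep for the full interval, resuming after signal interruptions.
   Returns 0 on success, -1 on a genuine nanosleep() error. */
int sleep_fully(time_t sec, long nsec)
{
    struct timespec req, rem;
    req.tv_sec = sec;
    req.tv_nsec = nsec;
    while (nanosleep(&req, &rem) != 0) {
        if (errno != EINTR)
            return -1;    /* real failure, e.g. EINVAL */
        req = rem;        /* interrupted: sleep the remainder */
    }
    return 0;
}
```

With this shape, a Control-C during the wait no longer looks like an error; the worker notices `stop_was_requested()` at its next check and can save state before exiting.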

2019-03-09, 03:52   #18
nomead

"Sam Laur"
Dec 2018
Turku, Finland

317 Posts

Quote:
 Originally Posted by ewmayer Worst case you lose the iterations done since the last normally scheduled checkpoint. I assume you restarted the above run - can you post the snip of code from the p*.stat file bracketing the interrupt?
And indeed, that seems to happen. The program doesn't manage to save progress when interrupted and restarts from the last save file. I started this test with 17.1 way back in December so the residue shift is 0. There is also some very small random variance in the execution speed, but that doesn't seem to change while the program is running. This behaviour was the same with version 17.1. It was 168.4 ms for some time before this restart (and this was also spot on the same iteration speed as on 17.1, for a couple of months), and has stayed at 167.9 ms now for the time it's been running since the restart.
Code:
[Mar 07 20:29:18] M5132xxxx Iter# = 36790000 [71.67% complete] clocks = 00:28:04.083 [168.4083 msec/iter] Res64: 74873862A50BB57E. AvgMaxErr = 0.071576885. MaxErr = 0.109375000. Residue shift count = 0.
Restarting M5132xxxx at iteration = 36790000. Res64: 74873862A50BB57E, residue shift count = 0
M5132xxxx: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 0
this gives an average   17.800407062877309 bits per digit
Using complex FFT radices       176        32        16        16
[Mar 08 17:10:27] M5132xxxx Iter# = 36800000 [71.69% complete] clocks = 00:27:58.834 [167.8834 msec/iter] Res64: C6533BF704CDF1F1. AvgMaxErr = 0.071412456. MaxErr = 0.101562500. Residue shift count = 0.

2019-03-09, 19:54   #19
ewmayer
∂²ω=0

Sep 2002
República de California

2·3·29·67 Posts

Sorry to see signal-handling not working for you - that was a *very* late-breaking addition to v18. In any event you're no worse off than with the previous version, you just need to stop it with the constant interruptions! How do you expect us to get any work done... :)

Also, a corrigendum to my note re. leading-radix 352 and FFT length 5632K:

Quote:
 Originally Posted by ewmayer I plan to implement Radix-352 in v19, but it's more likely to have an impact at 5632K than at 2816K, i.e. once the GIMPS first-time-testing wavefront passes p ~106M, thus no huge rush.
Actually p ~106M is the *upper* limit for 5632K; the lower limit (i.e. the upper limit for 5120K) is ~96M. So I guess I better get on that!
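As a rough cross-check of these limits: the stat-file line "this gives an average ... bits per digit" appears to be simply the exponent divided by the FFT length in doubles. For instance, a 4608K (4718592-double) run of M86687009 logged in this thread reports 18.371372010972763. Assuming that reading, the hypothetical helper below reproduces the figure and can be applied to the quoted exponent limits.

```c
#include <stdint.h>

/* Average bits per FFT digit for exponent p at a real FFT length of
   fft_k * 1024 doubles (fft_k is the "K" figure, e.g. 4608 for 4608K). */
double bits_per_digit(uint64_t p, unsigned fft_k)
{
    return (double)p / ((double)fft_k * 1024.0);
}
```

Plugging in the limits above gives roughly 18.31 bits per digit for p = 96M at 5120K and roughly 18.38 for p = 106M at 5632K, in line with the ~18.37 logged at 4608K.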

2019-03-09, 20:52   #20
nomead

"Sam Laur"
Dec 2018
Turku, Finland

13D₁₆ Posts

Quote:
 Originally Posted by ewmayer Sorry to see signal-handling not working for you - that was a *very* late-breaking addition to v18. In any event you're no worse off than with the previous version, you just need to stop it with the constant interruptions! How do you expect us to get any work done... :)
Yeah it's not a problem at all in real use, the last time I interrupted the run (before upgrading to 18.0) was a couple months ago... And I'm building a small cluster of RPi 3A+ boards in a "set and forget" configuration double checking LL. I actually got the first five up and almost running, but one of the boards has almost deaf WiFi for some reason, and I haven't had the time yet to check what's wrong with it. Maybe I'll just have to run that one board completely offline if it requires too much effort to fix. But once they're running, there will be no real need to interrupt them, ever.

2019-03-09, 21:11   #21
ewmayer
∂²ω=0

Sep 2002
República de California

2×3×29×67 Posts

Quote:
 Originally Posted by nomead Yeah it's not a problem at all in real use, the last time I interrupted the run (before upgrading to 18.0) was a couple months ago... And I'm building a small cluster of RPi 3A+ boards in a "set and forget" configuration double checking LL. I actually got the first five up and almost running, but one of the boards has almost deaf WiFi for some reason, and I haven't had the time yet to check what's wrong with it. Maybe I'll just have to run that one board completely offline if it requires too much effort to fix. But once they're running, there will be no real need to interrupt them, ever.
In order to aid this "army ant" computing model - I'm taking delivery of a couple of for-parts cellphones for my part - I'm currently working with Aaron (MadPoo) on enhancing the primenet.py script to do a couple of v5-server things to support assignment progress update. That should allow ARM users to run longer 1st-time tests, should they desire to, without having said assignments expire once they hit the 180-day mark.

2019-03-12, 19:21   #22
ewmayer
∂²ω=0

Sep 2002
República de California

2D8A₁₆ Posts

Quote:
 Originally Posted by nomead Yeah it's not a problem at all in real use, the last time I interrupted the run (before upgrading to 18.0) was a couple months ago...
Here's the signal stuff working on my Debian-running Intel Haswell quad, on a first-time LL test running on all 4 cores ... yesterday was the first really springlike day in my neck of the woods, and my BR, where the box sits in a corner, has southern exposure and gets pretty warm on days like that. The Haswell uses just stock cooling and even with the case side panel on the CPU side removed, I find the system starts getting flaky when ambient goes above 75F. So late morning yesterday I clicked the on/off switch on the case to turn the system off, then back on late evening once things had cooled off. Note the times-of-day in the following p*.stat snip are ~8 hours behind; this is a headless system and I've just let the internal clock drift in the years I've owned it:
Code:
[Mar 11 06:22:34] M86687009 Iter# = 8030000 [ 9.26% complete] clocks = 00:01:58.909 [ 11.8909 msec/iter] Res64: 5C4BB4BE6AE5BBB0. AvgMaxErr = 0.214745667. MaxErr = 0.312500000. Residue shift count = 12479875.
[Mar 11 06:24:32] M86687009 Iter# = 8040000 [ 9.27% complete] clocks = 00:01:58.691 [ 11.8691 msec/iter] Res64: 6757643F59CD637A. AvgMaxErr = 0.214732219. MaxErr = 0.312500000. Residue shift count = 15291034.
Iter = 8041419: Writing savefiles and exiting.
[Mar 11 06:24:50] M86687009 Iter# = 8041419 [ 9.28% complete] clocks = 00:00:16.910 [ 11.9174 msec/iter] Res64: A15F129B39F5F7AD. AvgMaxErr = 0.214937561. MaxErr = 0.281250000. Residue shift count = 24137129.
...
Restarting M86687009 at iteration = 8041419. Res64: A15F129B39F5F7AD, residue shift count = 24137129
M86687009: using FFT length 4608K = 4718592 8-byte floats, initial residue shift count = 24137129
this gives an average   18.371372010972763 bits per digit
Using complex FFT radices       288        16        16        32
[Mar 11 13:15:57] M86687009 Iter# = 8050000 [ 9.29% complete] clocks = 00:01:40.629 [ 11.7271 msec/iter] Res64: FA296EE64B5710E2. AvgMaxErr = 0.214812786. MaxErr = 0.281250000. Residue shift count = 20164181.
Quote:
 And I'm building a small cluster of RPi 3A+ boards in a "set and forget" configuration double checking LL. I actually got the first five up and almost running, but one of the boards has almost deaf WiFi for some reason, and I haven't had the time yet to check what's wrong with it. Maybe I'll just have to run that one board completely offline if it requires too much effort to fix. But once they're running, there will be no real need to interrupt them, ever.
I just took delivery of two sold-for-parts-on-ebay Samsung Galaxy S7s in the past couple of days, project for the coming week is to get them rooted and running Mlucas, also awaiting delivery of a USB charging station (which should have enough juice to power 4 such phones running Mlucas on all cores) and USB fan (which the Q&A section on the product page says draws just 0.8W at top speed in USB mode) ... the fan should be sufficient to cool a pair of such 4-phone mini farms, which should give me a total compute throughput comparable to the above-mentioned Haswell quad, in a rather smaller footprint.
