#56 | ∂2ω=0 | Sep 2002 | República de California | 2·7·829 Posts
I always use scp when available, perhaps my expectations re. fs-path handling have been colored by that. But on this particular server (or perhaps my remote-access privileges to it), only ftp is available.
#57 | "TF79LL86GIMPS96gpu17" | Mar 2017 | US midwest | 3²·5·109 Posts
Are you implementing patnashev's PRP proof generation in Mlucas?
#58 | 6809 > 6502 | Aug 2003 | 101×103 Posts
Hadn't thought to ask that myself. If it had come to mind, I would have. It will be useful when we find the next candidate prime.
#59 | ∂2ω=0 | Sep 2002 | República de California | 2·7·829 Posts
PRP-proof support will be in v20, yes. I am alas behind the curve there - between the pandemic and a series of non-life-threatening but still frequently day-, week- and month-ruining health bugaboos, this year has been one of continual annoying distractions. And at end of month my housemates of 2 years (a young professional couple who just bought a starter home in the area) are vacating the MBR suite of our large shared apartment, so I have tons of busywork to do getting the place ready to show to prospective renters. What a year...
My one main concern re. PRP-proof support is that it appears that the memory needs will relegate many smaller compute devices (Android phones, Odroid and RPi-style micros) to doing LL-DC and cleanup PRP-DC. It's downright undemocratic elitism, it is. ;)
#60 | "TF79LL86GIMPS96gpu17" | Mar 2017 | US midwest | 114518 Posts
Quote:
Per https://mersenneforum.org/showpost.p...5&postcount=46, power 7 takes 1.5GB disk space for residues at 100M p. Since the Odroid runs Ubuntu and has GigE, why not pile residues on a network shared drive and then clean them up after the proof file exists? A Droid, Pi, or phone farm could share a single TB drive.

Right is more important than soon. And life happening affects how soon is practical.

Last fiddled with by kriesel on 2020-07-30 at 04:09
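As a back-of-envelope sanity check on that 1.5GB figure (my own arithmetic, not from the linked post; the per-residue-size assumption is mine): a PRP residue for exponent p occupies roughly p/8 bytes, and a power-P proof buffers roughly 2^P interim residues on disk.

```shell
# Rough estimate of proof-generation disk needs. Assumptions (mine):
# one residue ~ p/8 bytes, a power-P proof keeps ~2^P residues on disk.
p=100000000   # exponent near 100M
power=7
residue_bytes=$(( p / 8 ))
total_bytes=$(( residue_bytes * (1 << power) ))
echo "$(( total_bytes / 1024 / 1024 )) MiB"   # prints: 1525 MiB, consistent with the ~1.5GB quoted
```

So a shared 1TB network drive could indeed hold the interim residues for hundreds of such runs at once.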
#61 | "Dylan" | Mar 2017 | 3·11·17 Posts
I have posted a working PKGBUILD for the latest Mlucas to the AUR. You can find it here.
There are two patches that I had to make to the source to get it to build correctly:

1. In the file platform.h, I had to comment out line 1304:
Code:
#include <sys/sysctl.h>
2. In the file Mlucas.c, I removed the *fp part of FILE on line 100. This is because the linker (gcc 10.2.0) was complaining that fp was defined elsewhere (namely, in gcd_lehmer.c).
#62 | Nov 2020 | 48 Posts
I just built v19 and I'm fairly new to the Arm universe. I am poking around on some of the AWS EC2 instances with "graviton" processors. I notice that if I run with 4 cores, using a command line like this:
Code:
# ./Mlucas -s m -cpu 0:3
then I get this message in the output quite a bit:
Code:
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
and, sure enough, runs with four cores tend to be (much) slower than runs with 2 cores or even 1 core on the same instance. Is there something I should be doing differently?
#63 | ∂2ω=0 | Sep 2002 | República de California | 2×7×829 Posts
@pvn: Sorry for the belated reply - that warning message is more common for larger threadcounts. It's basically telling you that part of the FFT code needs the leading radix (leftmost in the "Using complex FFT radices" info-print) to be divisible by #threads in order to run optimally. Example from a DC my last functioning bought-cheap-used Android phone is currently doing:
Code:
Using complex FFT radices 192 32 16 16
The leading radix here is radix0 = 192, thus radix0/2 = 96 = 32*3. Sticking to power-of-2 thread counts (which the other main part of my 2-phases-per-iteration FFT code needs to run optimally), we'd be fine for #threads = 2,4,8,16,32, but 64 would give you the warning you saw.

Do you recall which precise radix set you saw the warning at in your case? To see it for 4 threads implies radix0/2 is not divisible by 4, which is only true for a handful of small leading radices: radix0 = 12,20,28,36,44,52,60. That's no problem - it just means that in using the self-tests to create the mlucas.cfg file for your particular -cpu [lo:hi] choice, the above suboptimality will likely cause a different FFT-radix combo at the given FFT length to run best, which will be reflected in the corresponding mlucas.cfg file entry.

I've always gotten quite good multithreaded scaling on my Arm devices (Odroid mini-PC and Android phone) up to 4 threads. Did you run separate self-tests for -cpu 0, -cpu 0:1 and -cpu 0:3 and compare the resulting mlucas.cfg files? On the Graviton instance you're using, what does /proc/cpuinfo show in terms of #cores?
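The divisibility test behind that warning can be checked mechanically. A small sketch (my reconstruction of the arithmetic, not actual Mlucas source) confirms that the leading radices listed above are exactly the ones that trip it at 4 threads:

```shell
# For NTHREADS=4, flag any leading radix where radix0/2 is not divisible by NTHREADS.
# Radix values taken from the post above; the check itself is my reconstruction.
nthreads=4
for radix0 in 12 20 28 36 44 52 60 192; do
  if [ $(( (radix0 / 2) % nthreads )) -ne 0 ]; then
    echo "radix0=$radix0 would trigger the warning with $nthreads threads"
  fi
done
```

Running this prints a line for each of 12, 20, 28, 36, 44, 52 and 60, but not for 192, matching the example above.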
#64 | Nov 2020 | 22 Posts
Hi Ernst, thanks for looking at this, and apologies for delays on my end.
Quote:
Also, it seems important to note that all of the radix sets that actually get saved in the mlucas.cfg when running -cpu 0:3 have leading radix evenly divisible by NTHREADS*2 (in this case, NTHREADS=4). Here's some of the output with the radix sets that gave the "This will hurt performance" message (these runs seem to take about 50% more time than the other runs at the same FFT size):
Code:
M43765019: using FFT length 2304K = 2359296 8-byte floats, initial residue shift count = 29224505
this gives an average 18.550033145480686 bits per digit
Using complex FFT radices 36 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M48515021: using FFT length 2560K = 2621440 8-byte floats, initial residue shift count = 31467905
this gives an average 18.507011795043944 bits per digit
Using complex FFT radices 20 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 35280290
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 23722047
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 8 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M62705077: using FFT length 3328K = 3407872 8-byte floats, initial residue shift count = 61480382
this gives an average 18.400068136361931 bits per digit
Using complex FFT radices 52 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M67417873: using FFT length 3584K = 3670016 8-byte floats, initial residue shift count = 63290971
this gives an average 18.369912556239537 bits per digit
Using complex FFT radices 28 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M72123137: using FFT length 3840K = 3932160 8-byte floats, initial residue shift count = 65799790
this gives an average 18.341862233479819 bits per digit
Using complex FFT radices 60 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M86198291: using FFT length 4608K = 4718592 8-byte floats, initial residue shift count = 21266494
this gives an average 18.267799165513779 bits per digit
Using complex FFT radices 36 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 93620243
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 43929528
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M104884309: using FFT length 5632K = 5767168 8-byte floats, initial residue shift count = 24783492
this gives an average 18.186449397693981 bits per digit
Using complex FFT radices 44 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M123493333: using FFT length 6656K = 6815744 8-byte floats, initial residue shift count = 30371346
this gives an average 18.118833835308369 bits per digit
Using complex FFT radices 52 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 24638813
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 92450206
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M142037359: using FFT length 7680K = 7864320 8-byte floats, initial residue shift count = 90349695
this gives an average 18.060984166463218 bits per digit
Using complex FFT radices 60 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

Last fiddled with by pvn on 2021-01-10 at 17:06
#65 | ∂2ω=0 | Sep 2002 | República de California | 2·7·829 Posts
@pvn:
The self-tests are intended to do two things:

[1] Check correctness of the compiled code;
[2] Find the best-performing combination of radices for each FFT length on the user's platform.

That means trying each combination of radices available for assembling each FFT length and picking the one which runs fastest, unless the fastest happens to show unacceptably high levels of roundoff error, in which case the combo which runs fastest *and* has acceptable ROE levels gets stored to the mlucas.cfg file.

The mlucas.cfg file is read at the start of each LL or PRP test: for the current exponent being tested, the program computes the default FFT length based on expected levels of roundoff error, then reads the radix-combo data for that FFT length from mlucas.cfg and uses those FFT radices for the run.

The user is still expected to have a basic understanding of their hardware's multicore aspects in terms of running the self-tests using one or more -cpu [core number range] settings. I haven't found a good way to automate this "identify best core topology" step, but it's usually pretty obvious which candidate core-combos to try. Some examples:

o On my Intel Haswell quad, there are 4 physical cores, no hyperthreading: run self-tests with '-s m -cpu 0:3' to use all 4 cores;

o On my Intel Broadwell NUC mini, there are 2 physical cores, but with hyperthreading: I ran self-tests with '-s m -cpu 0:1' to use just the 2 physical cores, then 'mv mlucas.cfg mlucas.cfg.2' so as not to get those timings mixed up with those from the next self-test. Next I ran with '-s m -cpu 0:3' to use all 4 cores (2 physical, 2 logical), then 'mv mlucas.cfg mlucas.cfg.4'. Comparing the msec/iter numbers between the 2 files showed the latter set of timings to be 5-10% faster, meaning the hyperthreading was beneficial, so that's the run mode I use: 'ln -s -f mlucas.cfg.4 mlucas.cfg' to link the desired .4-renamed cfg-file to the name 'mlucas.cfg' looked for by the code at runtime, then queue up some work using the primenet.py script and fire up the program using flags '-cpu 0:3'.

On manycore and multisocket systems, finding the run mode which gives best total throughput takes a bit more work, but "don't split runs across sockets" is rule #1; after that, you find the way to max out throughput on an individual socket and duplicate that setup on socket 2 by incrementing the low:high indices following the -cpu flag appropriately.

Regarding your other observations:

o It's not surprising that all of the radix sets that appear in your mlucas.cfg when running -cpu 0:3 have leading radix evenly divisible by NTHREADS*2 - like the runtime warning says, if that does not hold (say radix0 = 12 and 4 threads using -cpu 0:3), it will generally hurt performance, meaning such combos will run more slowly due to suboptimal thread utilization, and will nearly always be bested by one or more radix combos which satisfy the divisibility criterion. Nothing the user need worry about - it's all automated; whichever combo runs fastest appears in the cfg file.

o The reason the self-tests with 4 threads (-cpu 0:3) take longer than you expected is that for 4 or more threads the default #iters used for each timing test gets raised from 100 to 1000, in order to get a more accurate timing sample. You can override that by specifying -iters 100 for such tests.

Cheers, and have fun,
-E
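The Broadwell-NUC comparison described above amounts to a short bit of file juggling. A sketch (the actual Mlucas invocations are stubbed out here with a shell function so the cfg-file handling is visible in isolation; on real hardware the first two lines would each be a './Mlucas -s m -cpu <range>' run):

```shell
# Stand-in for './Mlucas -s m -cpu <range>', which writes its
# best-radix timings for each FFT length to mlucas.cfg:
run_selftest() { echo "timings for -cpu $1" > mlucas.cfg; }

run_selftest 0:1 && mv mlucas.cfg mlucas.cfg.2   # 2 physical cores only
run_selftest 0:3 && mv mlucas.cfg mlucas.cfg.4   # 2 physical + 2 logical cores
# After comparing msec/iter between the .2 and .4 files, link the winner
# to the name the program reads at runtime:
ln -s -f mlucas.cfg.4 mlucas.cfg
cat mlucas.cfg   # prints: timings for -cpu 0:3
```

The symlink trick lets you switch run modes later without re-running the self-tests, by just re-pointing mlucas.cfg at the other saved file.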
#66 | "Teal Dulcet" | Jun 2018 | 3×7 Posts
Quote:
Thread | Thread Starter | Forum | Replies | Last Post |
Mlucas v18 available | ewmayer | Mlucas | 48 | 2019-11-28 02:53 |
Mlucas version 17 | ewmayer | Mlucas | 3 | 2017-06-17 11:18 |
MLucas on IBM Mainframe | Lorenzo | Mlucas | 52 | 2016-03-13 08:45 |
Mlucas on Sparc - | Unregistered | Mlucas | 0 | 2009-10-27 20:35 |
mlucas on sun | delta_t | Mlucas | 14 | 2007-10-04 05:45 |