#67
Nov 2020
22 Posts
tdulcet,
ah, this is very helpful. I spent a good bit of time yesterday doing something similar, building essentially a barebones version of this to build Docker images. For Intel, I just built multiple binaries (for SSE, AVX, AVX2, AVX512) and use an entrypoint script to determine at runtime what hardware is available and run the right binary. I have had some trouble with the build on ARM, though, so for now I'm just using the precompiled binaries, with a similar routine in the entrypoint script to run the nosimd/c2simd binary as needed. The Docker image is at pvnovarese/mlucas_v19:latest (it's a multi-arch image, with both aarch64 and x86_64); the Dockerfile etc. can be found here: https://github.com/pvnovarese/mlucas_v19

I will review your script as well; it looks like you've thought a lot more about this than I have :)
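A minimal sketch of such a runtime-dispatch entrypoint (the binary names mlucas_avx512/mlucas_avx2/mlucas_avx/mlucas_sse2 and the install path are placeholders, not the actual names used in the image above):

Code:
#!/bin/bash
# Hypothetical entrypoint: pick the most capable prebuilt binary this host supports.
flags=$(grep -m1 '^flags' /proc/cpuinfo)    # x86_64; aarch64 exposes a "Features" line instead

if   grep -q avx512f <<< "$flags"; then bin=mlucas_avx512
elif grep -q avx2    <<< "$flags"; then bin=mlucas_avx2
elif grep -q avx     <<< "$flags"; then bin=mlucas_avx
else                                    bin=mlucas_sse2
fi

exec "/usr/local/bin/$bin" "$@"             # hand the container's arguments straight through

On aarch64 the dispatch would key off the "Features" line (e.g. asimd) rather than "flags".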
#68
∂²ω=0
Sep 2002
República de California
10110101010110₂ Posts

Quote:
(Basically, there's just no good reason to omit the above flag anymore.)

Re. some kind of script to automate the self-testing using various suitable candidate -cpu arguments: that would indeed be useful. George uses the freeware hwloc library in his Prime95 code to suss out the topology of the machine running the code. I'd considered using it for my own code as well in the past, but had seen a few too many threads that boiled down to "hwloc doesn't work properly on my machine" and needed intervention by George, for my taste. In any event, let me think on it more; perhaps some playing around with that library by those of you interested in this aspect would be a good starting point.
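A script could also get the basic topology facts without hwloc by scraping lscpu on Linux. A rough sketch, assuming the usual util-linux lscpu field names (Vendor ID, Thread(s) per core, Core(s) per socket, Socket(s)):

Code:
#!/bin/bash
# Sketch only: pull the facts a combo-generating script would need from lscpu.
get() { lscpu | awk -F: -v k="$1" '$1 == k { gsub(/^[ \t]+/, "", $2); print $2 }'; }

vendor=$(get "Vendor ID")
tpc=$(get "Thread(s) per core")
cps=$(get "Core(s) per socket")
sockets=$(get "Socket(s)")

cores=$(( cps * sockets ))
echo "vendor=$vendor  physical cores=$cores  threads/core=$tpc  logical CPUs=$(( cores * tpc ))"
# vendor + these counts are enough to choose between the Intel-style and AMD-style
# logical-CPU numbering when building candidate -cpu strings.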
#69
"Teal Dulcet"
Jun 2018
3·7 Posts

Quote:
There is a longstanding issue with 32-bit ARM, where GCC hangs when compiling the mi64.c file with -O3. If you remove the -O3 optimization, you instead get these errors: Quote:
Quote:
Quote:
Based on the "Advanced Users" and "Advanced Usage" sections of the Mlucas README, for an example 8 core/16 thread system, this is my best guess of the candidate combinations to try with the -cpu argument (a rough sketch of how such a list could be generated automatically follows these lists):

Intel
Code:
0 (1-threaded)
0:1 (2-threaded)
0:3 (4-threaded)
0:7 (8-threaded)
0:15 (16-threaded)
0,8 (2 threads per core, 1-threaded) (current default)
0:1,8:9 (2 threads per core, 2-threaded)
0:3,8:11 (2 threads per core, 4-threaded)

AMD
Code:
0 (1-threaded)
0:3:2 (2-threaded)
0:7:2 (4-threaded)
0:15:2 (8-threaded)
0:1 (2 threads per core, 1-threaded) (current default)
0:3 (2 threads per core, 2-threaded)
0:7 (2 threads per core, 4-threaded)
0:15 (2 threads per core, 8-threaded)

ARM (8 core/8 thread)
Code:
0 (1-threaded)
0:3 (4-threaded) (current default)
0:7 (8-threaded)

Last fiddled with by tdulcet on 2021-01-14 at 13:22
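A rough Bash sketch of how the Intel list above could be generated mechanically, assuming the usual Intel numbering on an 8c/16t box (logical CPUs 8-15 are the hyperthread siblings of physical cores 0-7); the AMD and ARM variants would differ only in how sibling threads are paired:

Code:
#!/bin/bash
# Sketch: candidate -cpu strings for an 8-core/16-thread Intel-style layout.
cores=8

# One thread per physical core:
for c in 1 2 4 8; do
    (( c == 1 )) && cpu="0" || cpu="0:$((c-1))"
    echo "$c core(s) x 1 thread/core:   -cpu $cpu"
done
echo "8 cores x 2 threads/core:    -cpu 0:$(( 2*cores - 1 ))"

# Two threads per core on the first c physical cores (sibling of core k is k+8):
for c in 1 2 4; do
    (( c == 1 )) && cpu="0,$cores" || cpu="0:$((c-1)),$cores:$((cores+c-1))"
    echo "$c core(s) x 2 threads/core:  -cpu $cpu"
done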
||||
#70
∂²ω=0
Sep 2002
República de California
2D56₁₆ Posts
@tdulcet: Extremely busy this past month working on a high-priority 'intermediate' v19.1 release (this will restore Clang/llvm buildability on Arm, problem was first IDed on the new Apple M1 CPU but is more general), alas no time to give the automation of best-total-throughput-finding the attention it deserves. But that's where folks like you come in. :)
First off - the mi64.c compile issue has been fixed in the as-yet-unreleased 19.1 code. As the mods in that file are small I will attach it here; suggest you save a copy of the old one so you can diff and see the changes for yourself. Briefly, a big chunk of x86_64 inline-asm needed extra wrapping inside a '#ifdef YES_ASM' preprocessor directive. That flag is def'd (or not) in mi64.h like so: Code:
#if(defined(CPU_IS_X86_64) && defined(COMPILER_TYPE_GCC) && (OS_BITS == 64))
	#define YES_ASM
#endif

Similarly, we usually see a steep dropoff in || scaling beyond 4 cores - but that need not imply that running two 4-thread jobs is better than one 8-thread one. If said dropoff is due to the workload saturating the memory bandwidth, we might well see a similar performance hit with two 4-thread jobs.
#71
∂²ω=0
Sep 2002
República de California
2·7·829 Posts
Addendum: OK, I think the roadmap needs to look something like this - abbreviation-wise, 'c' refers to physical cores, 't' to threadcount:
1. Based on the user's HW topology, identify a set of 'most likely to succeed' core/thread combos, like tdulcet did in his above post. For x86 this needs to take into account the different core-numbering conventions used by Intel and AMD;

2. For each combo in [1], run the automated self-tests, and save the resulting mlucas.cfg file under a unique name, e.g. for 4c/8t call it mlucas.cfg.4c.8t;

3. The various cfg-files hold the best FFT-radix combo to use at each FFT length for the given c/t combo, i.e. in terms of maximizing total throughput on the user's system we can focus on just those. So let's take a hypothetical example: say on my 8c/16t AMD processor the round of self-tests in [1] has shown that, using just 1c, 1c2t is 10% faster than 1c1t. We now need to see how 1c2t scales to all physical cores, across the various FFT lengths in the self-test. E.g. at FFT length 4096K, say the best radix combo found for 1c2t is 64,32,32,32 (note the product of those = 2048K rather than 4096K because, to match general GIMPS convention, "FFT length" refers to #doubles, but Mlucas uses an underlying complex FFT, so the individual radices are complex and refer to pairs-of-doubles). So we next want to fire up 8 separate 1c2t jobs at 4096K, each using that radix combo and running on a distinct physical core; thus our 8 jobs would use -cpu flags (I used AMD for my example to avoid the core-numbering confusion Intel's convention would cause here) 0:1, 2:3, 4:5, 6:7, 8:9, 10:11, 12:13 and 14:15, respectively. I would further like to specify the foregoing radix combo via the -radset flag, but here we hit a small snag: at present, there is no way to specify an actual radix combo. Instead one must find the target FFT length in the big case() table in get_fft_radices.c and match the desired radix combo to a case index. For 4096K, we see 64,32,32,32 maps to 'case 7', so we'd use -radset 7 for each of our 8 launch-at-same-time jobs. I may need to do some code-fiddling to make that less awkward. Anyhow, since we're now using just 1 radix combo at each FFT length and we want a decent timing sample not dominated by start-up init and thread-management overhead, we might use -iters 1000 for each of our 8 jobs. Launched at more or less the same time, they will have a range of msec/iter timings t0-t7, which we convert into total throughput in iters/sec via 1000*(1/t0+1/t1+1/t2+1/t3+1/t4+1/t5+1/t6+1/t7). Repeat for each FFT length of interest, generating a set of total-throughput numbers. (A rough scripting sketch of steps 2-3 follows below.)

4. Repeat [3] for each c/t combo in [1]. It may well prove the case that a single c/t combo does not give best total throughput across all FFT lengths, but for a first cut it seems best to somehow generate some kind of weighted average-across-all-FFT-lengths for each c/t combo and pick the best one. In [3] we generated total-throughput iters/sec numbers at each FFT length; maybe multiply each by its corresponding FFT length and sum over all FFT lengths.

Last fiddled with by ewmayer on 2021-01-16 at 22:24
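A rough, untested Bash sketch of steps 2-3 for the hypothetical 8c/16t AMD box; the './Mlucas' path, the use of '-s m' for the medium self-test range, the cfg-file renaming scheme and the 'msec/iter' grep pattern are all assumptions:

Code:
#!/bin/bash
# Step 2: self-test two candidate c/t combos, stashing each combo's cfg file.
./Mlucas -s m -cpu 0:15 && mv mlucas.cfg mlucas.cfg.8c.16t
./Mlucas -s m -cpu 0:1  && mv mlucas.cfg mlucas.cfg.1c.2t

# Step 3: eight simultaneous 1c2t jobs at FFT length 4096K, all using the radix
# combo found best for 1c2t (case index 7 = 64,32,32,32 in the example above).
logs=()
for c in 0 2 4 6 8 10 12 14; do            # AMD numbering: siblings are c, c+1
    log="timing.4096K.cpu$c.log"
    ./Mlucas -fftlen 4096 -radset 7 -iters 1000 -cpu $c:$((c+1)) > "$log" 2>&1 &
    logs+=("$log")
done
wait

# Total throughput in iters/sec = sum over the 8 jobs of 1000/t_i (t_i in msec/iter).
for log in "${logs[@]}"; do
    grep -Eo '[0-9.]+ msec/iter' "$log" | tail -1 | awk '{print $1}'
done | awk '{ tot += 1000.0/$1 } END { printf "4096K total throughput: %.2f iters/sec\n", tot }'

The same loop, re-pointed at each FFT length and each cfg file's best radix set, would give the per-combo total-throughput numbers described in step 4.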
#72
"Teal Dulcet"
Jun 2018
3·7 Posts

Quote:
Quote:
Quote:
Quote:
1. OK, I wrote Bash code to automatically generate the combinations from my previous post above, for the user's CPU and number of CPU cores/threads. It will generate a nice table like one of these for example: Code:
The CPU is Intel.

#  Workers/Runs  Threads         -cpu arguments
1  1             16, 2 per core  0:15
2  1             8, 1 per core   0:7
3  2             4, 1 per core   0:3 4:7
4  4             2, 1 per core   0:1 2:3 4:5 6:7
5  8             1, 1 per core   0 1 2 3 4 5 6 7
6  2             4, 2 per core   0:3,8:11 4:7,12:15
7  4             2, 2 per core   0:1,8:9 2:3,10:11 4:5,12:13 6:7,14:15
8  8             1, 2 per core   0,8 1,9 2,10 3,11 4,12 5,13 6,14 7,15

The CPU is AMD.

#  Workers/Runs  Threads         -cpu arguments
1  1             16, 2 per core  0:15
2  1             8, 1 per core   0:15:2
3  2             4, 1 per core   0:7:2 8:15:2
4  4             2, 1 per core   0:3:2 4:7:2 8:11:2 12:15:2
5  8             1, 1 per core   0 2 4 6 8 10 12 14
6  2             4, 2 per core   0:7 8:15
7  4             2, 2 per core   0:3 4:7 8:11 12:15
8  8             1, 2 per core   0:1 2:3 4:5 6:7 8:9 10:11 12:13 14:15

The CPU is ARM.

#  Workers/Runs  Threads  -cpu arguments
1  1             8        0:7
2  2             4        0:3 4:7
3  4             2        0:1 2:3 4:5 6:7
4  8             1        0 1 2 3 4 5 6 7

2. Done.

3./4. Interesting, this is going to be a lot more complex to implement than I originally thought. The switch statement in get_fft_radices.c is too big to store in my script, and creating an awk command to extract the case number based on the FFT length and radix combo would obviously be extremely difficult, particularly because there are nested switch statements. I am going to have to think about how best to do this... I welcome suggestions from anyone who is reading this.

In the meantime, I wrote code to directly compare the adjusted msec/iter times from the mlucas.cfg files from step #2. This of course does not account for any of the scaling issues that @ewmayer described. It will generate two tables (the fastest combination and the rest of the combinations tested) like these for my 6 core/12 thread Intel system, for example (a rough sketch of the comparison follows the tables below): Code:
Fastest

#  Workers/Runs  Threads         First -cpu argument  Adjusted msec/iter times
6  6             1, 2 per core   0,6                  8.47 9.69 10.72 12.26 12.71 14.53 14.76 16.54 16.1 18.89 20.94 23.94 26.39 28.85 29.16 32.98

Mean/Average faster     #  Workers/Runs  Threads         First -cpu argument  Adjusted msec/iter times
3.248 ± 0.101 (324.8%)  1  1             12, 2 per core  0:11                 28.92 31.74 33.78 38.64 42.66 44.52 46.56 51.06 52.26 61.8 70.14 79.38 88.62 92.94 97.26 106.2
3.627 ± 0.146 (362.7%)  2  1             6, 1 per core   0:5                  34.14 34.8 39.66 45.48 47.28 51.18 51.3 56.64 60.78 66.24 73.02 87.78 96.12 102.6 108.12 116.22
2.607 ± 0.068 (260.7%)  3  2             3, 1 per core   0:2                  22.98 25.53 27.3 30.12 33.63 37.83 36.72 42.15 42.69 48.9 54.66 61.68 71.19 76.44 77.49 87.36
1.736 ± 0.029 (173.6%)  4  6             1, 1 per core   0                    14.41 17.1 18.46 20.88 22.72 24.99 25.36 28.38 28.82 32.67 36.06 40.67 46.12 49.87 51.32 58.26
1.816 ± 0.047 (181.6%)  5  2             3, 2 per core   0:2,6:8              16.11 18.09 19.32 21.99 23.64 25.41 25.92 29.85 30.57 33.78 38.7 42.42 48.57 51.12 52.53 59.19
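The core of that comparison can be sketched in a few lines of Bash; the "<fft>  msec/iter = <time>" cfg-line format and the file names below are assumptions, and the per-worker adjustment factor discussed earlier is omitted for brevity:

Code:
#!/bin/bash
# Sketch: per-FFT-length ratio of two saved cfg files' msec/iter figures, plus the mean.
extract() {   # print "<fft-length> <msec/iter>" pairs from one cfg file
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\).*msec\/iter[[:space:]]*=[[:space:]]*\([0-9.][0-9.]*\).*/\1 \2/p' "$1"
}

paste <(extract mlucas.cfg.6c.6t) <(extract mlucas.cfg.1c.12t) |
awk '$1 == $3 { r = $4 / $2; sum += r; n++; printf "FFT %5sK: %.3fx\n", $1, r }
     END      { if (n) printf "mean: %.3fx over %d FFT lengths\n", sum / n, n }'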
#73
∂²ω=0
Sep 2002
República de California
26526₈ Posts
@tdulcet - How about I add support in v19.1 for the -radset flag to take either an index into the big table, or an actual set of comma-separated FFT radices? Shouldn't be difficult - if the expected -radset[whitespace]numeric arg pair is immediately followed by a comma, the code assumes it's a set of radices, reads those in, checks that said set is supported and if so runs with it.
I expect - fingers crossed, still plenty of work to do - to be able to release v19.1 around EOM, so you'd have to wait a little while, but it sounds like this is the way to go.

Edit: Why make people wait - here is a modified version of Mlucas.c which supports the above-described -radset argument. You should be able to drop it into your current v19 source archive, but suggest you save the old Mlucas.c under a different name - maybe add a '.bak' - so you can diff the 2 versions to see the changes, the first and most obvious of which is the version number, now bumped up to 19.1.

Note a user-supplied radix set is considered "advanced usage" in the sense that I assume users of it know what they are doing, though I have included a set of sanity checks on inputs. Most important is to understand the difference between the FFT-length conventions of the -fftlen and -radset args: -fftlen supplies a real-vector FFT length in Kdoubles; -radset [comma-separated list of radices] specifies a corresponding set of complex-FFT radices. If the user has supplied a real-FFT length (in Kdoubles) via -fftlen, the product of the complex-FFT radices (call it 'rad_prod') must correspond to half that value, accounting for the Kdoubles scaling of the former. In C-code terms, we require that (rad_prod>>9) == fftlen. Note that even though this is, strictly speaking, redundant, the -fftlen arg is required even if the user supplies an actual radix set; this is for purposes of sanity-checking the latter, because the above-described differing conventions make it easy to get confused. Using any of the radix sets listed in the mlucas.cfg file along with the corresponding FFT length is of course guaranteed to be OK.

Examples: After building the attached Mlucas.c file and relinking, try running the resulting binary with the following sets of command-line arguments to see what happens:

-iters 100 -fftlen 1664 -radset 0
-iters 100 -fftlen 1664 -radset 208,16,16,16
-iters 100 -fftlen 1668 -radset 208,16,16,16
-iters 100 -fftlen 1664 -radset 207,16,16,16
-iters 100 -fftlen 1664 -radset 208,8,8,8,8

Last fiddled with by ewmayer on 2021-02-13 at 20:06 Reason: Deleted attachment; code now part of v19.1
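To make the Kdoubles bookkeeping concrete, here is the (rad_prod>>9) == fftlen check done by hand in the shell for a few of the radix sets above (plain arithmetic, not program output):

Code:
# 208*16*16*16 = 208*8*8*8*8 = 851968 complex data (pairs of doubles); >>9 gives
# 1664 Kdoubles, so both of those radix sets are length-consistent with -fftlen 1664.
echo $(( (208*16*16*16) >> 9 ))   # 1664
echo $(( (208*8*8*8*8)  >> 9 ))   # 1664
# A leading radix of 207 gives 207*4096 = 847872, >>9 = 1656 != 1664, so that
# combination should trip the length sanity check.
echo $(( (207*16*16*16) >> 9 ))   # 1656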
#74
"Teal Dulcet"
Jun 2018
15₁₆ Posts

Quote:
Quote:
In my previous post, for an example 8c/16t system, I said it will multiply the 4x2t msec/iter times by 1.5 before comparing them to the 8x1t times, following the instructions in the Mlucas README. After doing more testing, I was getting unexpected results with that formula ((CPU cores / workers) - 0.5), so it will now multiply the times by 2 (= CPU cores / workers) for this example (the arithmetic is spelled out after the table below). This should be irrelevant once I implement step 3.

I thought I should note that some systems, like the Intel Xeon Phi, can have more than two CPU threads per CPU core. The Mlucas README does not mention this case, but my script should correctly handle it for Intel and AMD x86 systems. For example, on a 64 core/256 thread Intel Xeon Phi system it would try these combinations (only showing the first -cpu argument of each run for brevity): Code:
#   Workers/Runs  Threads          -cpu arguments
1   1             64, 1 per core   0:63
2   2             32, 1 per core   0:31
3   4             16, 1 per core   0:15
4   8             8, 1 per core    0:7
5   16            4, 1 per core    0:3
6   32            2, 1 per core    0:1
7   64            1, 1 per core    0
8   1             128, 2 per core  0:63,64:127
9   2             64, 2 per core   0:31,64:95
10  4             32, 2 per core   0:15,64:79
11  8             16, 2 per core   0:7,64:71
12  16            8, 2 per core    0:3,64:67
13  32            4, 2 per core    0:1,64:65
14  64            2, 2 per core    0,64
15  1             256, 4 per core  0:63,64:127,128:191,192:255
16  2             128, 4 per core  0:31,64:95,128:159,192:223
17  4             64, 4 per core   0:15,64:79,128:143,192:207
18  8             32, 4 per core   0:7,64:71,128:135,192:199
19  16            16, 4 per core   0:3,64:67,128:131,192:195
20  32            8, 4 per core    0:1,64:65,128:129,192:193
21  64            4, 4 per core    0,64,128,192

Last fiddled with by tdulcet on 2021-01-19 at 13:52
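For completeness, the adjustment-factor arithmetic from the paragraph above the table, evaluated for 4 workers on an 8-core system:

Code:
cores=8; workers=4
awk -v c=$cores -v w=$workers 'BEGIN { printf "old factor: %.1f\n", c/w - 0.5 }'   # 1.5
echo "new factor: $(( cores / workers ))"                                          # 2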
#75
∂²ω=0
Sep 2002
República de California
2·7·829 Posts
@tdulcet: Glad to be of service to someone else who wants to be of service, or something. :)

o Re. KNL, yes, I have a barebones one sitting next to me and running a big 64M-FFT primality test, 1 thread on each of physical cores 0:63. On KNL I've never found any advantage from running this kind of code with more than 1 thread per physical core.

o One of your timing samples above mentioned getting nearly 2x speedup from running 2 threads on 1 physical core, with the other cores unused. I suspect that may be the OS actually putting 1 thread on each of 2 physical cores. Remember, those pthread affinity settings are treated as *hints* to the OS; we hope that under heavy load the OS will respect them, because there are no otherwise-idle physical cores it can bounce threads to.

o You mentioned the mi64.c missing-x86-preprocessor-flag-wrapper was keeping you from building on your Raspberry Pi - that was even with -O3? And did you as a result just use the precompiled Arm/Linux binaries on that machine?
#76
Jan 2021
1 Posts
Hello Folks - I recently got Mlucas running on a Raspberry Pi 4, 8GB of RAM, running Ubuntu and I am doing PRP checks on large primes.
I'm assuming either Mlucas is extremely fast and consistent or I'm running into some sort of a bug. If you look at a few lines of the ".stat" file for one of my recent primes, you'll see that every few seconds I blast through 10,000 iterations at exactly the same ms/iter speed and it seems to take under a day to fully PRP test a new number. Code:
[2021-01-19 21:42:45] M110899639 Iter# = 110780000 [99.89% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:48] M110899639 Iter# = 110790000 [99.90% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:50] M110899639 Iter# = 110800000 [99.91% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:52] M110899639 Iter# = 110810000 [99.92% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:55] M110899639 Iter# = 110820000 [99.93% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:57] M110899639 Iter# = 110830000 [99.94% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:59] M110899639 Iter# = 110840000 [99.95% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:01] M110899639 Iter# = 110850000 [99.96% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:04] M110899639 Iter# = 110860000 [99.96% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:06] M110899639 Iter# = 110870000 [99.97% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:08] M110899639 Iter# = 110880000 [99.98% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:10] M110899639 Iter# = 110890000 [99.99% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:13] M110899639 Iter# = 110899639 [100.00% complete] clocks = 00:15:20.953 [ 95.5445 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
M110899639 is not prime. Res64: 243C3E785D7D8345. Program: E19.0. Final residue shift count = 13555775
M110899639 mod 2^35 - 1 = 20387533375
M110899639 mod 2^36 - 1 = 12983321457

I also run Prime95 on a seemingly much more powerful Core i7-7700, and that is taking about 14 days to PRP-test a single number, which is what is making me question this. I'm glad to provide more detail if it would help troubleshoot.
#77
Romulan Interpreter
Jun 2011
Thailand
10010000110001₂ Posts
Thread | Thread Starter | Forum | Replies | Last Post |
Mlucas v18 available | ewmayer | Mlucas | 48 | 2019-11-28 02:53 |
Mlucas version 17 | ewmayer | Mlucas | 3 | 2017-06-17 11:18 |
MLucas on IBM Mainframe | Lorenzo | Mlucas | 52 | 2016-03-13 08:45 |
Mlucas on Sparc - | Unregistered | Mlucas | 0 | 2009-10-27 20:35 |
mlucas on sun | delta_t | Mlucas | 14 | 2007-10-04 05:45 |