mersenneforum.org  

2021-01-20, 14:30   #78
tdulcet
"Teal Dulcet"
Jun 2018
15₁₆ Posts

Quote:
Originally Posted by ewmayer
o Re. KNL, yes I have a barebones one sitting next to me and running a big 64M-FFT primality test, 1 thread on each of physical cores 0:63. On KNL I've never found any advantage from running this kind of code with more than 1 thread per physical core.
Nice. I used to have access to that system in college. MPrime defaulted to configuration #5 on it. I never tried running Mlucas, but I would have thought configuration 19 or 21 would provide the best performance.

Quote:
Originally Posted by ewmayer
o One of your timing samples above mentioned getting nearly 2x speedup from running 2 threads on 1 physical core, with the other cores unused. I suspect that may be the OS actually putting 1 thread on each of 2 physical cores.
OK, interesting. I am guessing that this will be fixed after I finish implementing your steps # 3 and 4.

Quote:
Originally Posted by ewmayer
o You mentioned the mi64.c missing-x86-preprocessor-flag-wrapper was keeping you from building on your Raspberry Pi - that was even with -O3?
With -O3 optimization (and -O2 and -O1), GCC would just run forever at 100% CPU usage, I am guessing because of some bug in GCC. Without optimization, GCC would immediately output those errors I posted after the usual warnings. It happened with both Mlucas v18 and v19.

Quote:
Originally Posted by ewmayer
And did you as a result just use the precompiled Arm/Linux binaries on that machine?
No, because I wanted to use my script to automatically set up everything. I also suspect that compiling Mlucas directly on the Pi and with my script will provide better performance, since GCC by default on the Raspberry Pi adds a bunch of compile flags:
Code:
pi@raspberrypi:~ $ gcc -march=native -Q --help=target | grep -iv disabled
The following options are target specific:
  -mabi=                                aapcs-linux
  -march=                               armv8-a+crc+simd
  -marm                                 [enabled]
  -mbe32                                [enabled]
  -mbranch-cost=                        -1
  -mcpu=
  -mfloat-abi=                          hard
  -mfp16-format=                        none
  -mfpu=                                vfp
  -mglibc                               [enabled]
  -mhard-float
  -mlittle-endian                       [enabled]
  -mpic-data-is-text-relative           [enabled]
  -mpic-register=
  -msched-prolog                        [enabled]
  -msoft-float
  -mstructure-size-boundary=            8
  -mtls-dialect=                        gnu
  -mtp=                                 cp15
  -mtune=
  -munaligned-access                    [enabled]
  -mvectorize-with-neon-quad            [enabled]
a few of which should improve the resulting performance.

@ewmayer Regarding step # 3, I have a quick question. What is the correct way to get the needed msec/iter times from each job? The Mlucas output has a Clocks = line, so should I parse that, convert it to milliseconds and then divide by 1000 (the number of iterations)?

Last fiddled with by tdulcet on 2021-01-20 at 14:36

2021-01-20, 19:56   #79
ewmayer
2ω=0
Sep 2002
República de California
26526₈ Posts

Quote:
Originally Posted by tdulcet
I wanted to use my script to automatically set up everything. I also suspect that compiling Mlucas directly on the Pi and with my script will provide better performance, since GCC by default on the Raspberry Pi adds a bunch of compile flags:
[snip]
a few of which should improve the resulting performance.
For SIMD builds the runtime is dominated by the asm-macro instructions, so there might in fact be little or no difference. But in any case, with the patched mi64.c file I posted earlier in this thread (which will be part of the soon-to-come v19.1 release) you can now directly compare performance of the prebuilt binary and your own.

OTOH there is clearly some benefit to be had from improved optimization of the C "glue" code and the integration of the asm-macros into that ... I still have a few problematic-for-Clang asm-macros to convert so that compiler will build them on Armv8, but I also installed Clang on my main Ubuntu Linux box, a quad-core Haswell mostly used for builds and hosting a couple GPUs, and built v19 using it there a week ago - the result looks to run 5-10% faster than my GCC build of the same source base. We hope for a similar speedup from build of v19.1 on Arm.

Quote:
Regarding step # 3, I have a quick question. What is the correct way to get the needed msec/iter times from each job? The Mlucas output has a Clocks = line, so should I parse that, convert it to milliseconds and then divide by 1000 (the number of iterations)?
Yep!
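
A minimal sketch of that conversion in Python, assuming the timing line has the form "Clocks = HH:MM:SS.mmm" (the exact layout is an assumption here, and msec_per_iter is a hypothetical helper name, not part of Mlucas):

```python
import re

# Assumed layout of the timing line: "Clocks = HH:MM:SS.mmm"
CLOCKS_RE = re.compile(r"Clocks\s*=\s*(\d+):(\d+):(\d+(?:\.\d+)?)")

def msec_per_iter(output: str, iterations: int = 1000) -> float:
    """Return the average msec/iter from the first 'Clocks =' line in output."""
    match = CLOCKS_RE.search(output)
    if match is None:
        raise ValueError("no 'Clocks =' line found")
    hours, minutes, seconds = match.groups()
    # Convert the wall-clock time to milliseconds, then average over the run.
    total_msec = (int(hours) * 3600 + int(minutes) * 60 + float(seconds)) * 1000.0
    return total_msec / iterations

print(msec_per_iter("Clocks = 00:00:20.200", iterations=1000))
```

Raising an error when no timing line is present keeps a crashed or aborted job from silently contributing a bogus timing.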

Last fiddled with by ewmayer on 2021-01-20 at 20:01

2021-01-20, 20:21   #80
ewmayer
2ω=0
Sep 2002
República de California
26526₈ Posts

Hi, Joniano -

Quote:
Originally Posted by joniano
Hello Folks - I recently got Mlucas running on a Raspberry Pi 4, 8GB of RAM, running Ubuntu and I am doing PRP checks on large primes.

I'm assuming either Mlucas is extremely fast and consistent or I'm running into some sort of a bug.

If you look at a few lines of the ".stat" file for one of my recent primes, you'll see that every few seconds I blast through 10,000 iterations at exactly the same ms/iter speed and it seems to take under a day to fully PRP test a new number.
Yep, that's a weird one - the timings in the "clocks = ... [msec/iter]" part of each 10Kiter checkpoint-status line look perfectly reasonable for a Pi4, around 15 minutes per interval. But those lines are getting written every 2-3 *seconds*, not every 15 minutes, and as LaurV noted, the Res64 value is frozen.

Looking deeper: the time for each interval is exactly the same, 15:20.953 [ 92.0954 msec/iter] - in normal runs that never happens. And your floating-point errors are 0, also not something you'll ever see with a good build and normally functioning hardware.
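
The three warning signs above (identical interval times, a frozen Res64, zero roundoff error) can be checked mechanically. A rough Python sketch, assuming the checkpoint lines have already been parsed into tuples; the parsing itself is left out because the exact .stat layout is not shown in this thread, and looks_stuck is a hypothetical name:

```python
from typing import List, Tuple

# Each checkpoint: (interval time as printed, Res64 hex string, max roundoff error).
Checkpoint = Tuple[str, str, float]

def looks_stuck(checkpoints: List[Checkpoint]) -> List[str]:
    """Return the warning signs found across consecutive checkpoints."""
    signs: List[str] = []
    if len(checkpoints) < 2:
        return signs  # nothing to compare against
    if len({c[0] for c in checkpoints}) == 1:
        signs.append("identical interval times")
    if len({c[1] for c in checkpoints}) == 1:
        signs.append("frozen Res64")
    if all(c[2] == 0.0 for c in checkpoints):
        signs.append("zero roundoff error")
    return signs

# Placeholder residue value, for illustration only.
print(looks_stuck([("15:20.953", "0123456789ABCDEF", 0.0)] * 3))
```

Any one sign alone might be a coincidence over two or three checkpoints; all three together point at a broken build or run.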

So some questions for you:

o Did you use the precompiled v19 binary for Arm SIMD, or build v19 yourself?

o If you built a binary, which compiler are you using, and did you use the recommended build flags on the README webpage?

o Did your post-build self-tests appear to run at "normal" speed? (If you're not sure, just try a quick sample using an FFT length suitable for your exponent:

./[name of your binary] -m 110899639 -iters 100 -fftlen 6144 -radset 1 -cpu 0:3

That should take on the order of 10 seconds on your hardware, and produce Res64: A4F77554A5DC940F.)
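
For scripted checks, the residue comparison can be automated. A small Python sketch, assuming the captured program output contains the residue as "Res64: " followed by 16 hex digits, as quoted above; res64_matches is a hypothetical helper name:

```python
import re

EXPECTED_RES64 = "A4F77554A5DC940F"  # reference value for the 100-iteration test above

def res64_matches(output: str, expected: str = EXPECTED_RES64) -> bool:
    """True if the last 'Res64: <hex>' in the output equals the expected value."""
    found = re.findall(r"Res64:\s*([0-9A-Fa-f]{16})", output)
    return bool(found) and found[-1].upper() == expected.upper()

print(res64_matches("Res64: A4F77554A5DC940F."))
```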

Also, please post a copy of the mlucas.cfg file resulting from your post-build self-tests, and a copy of the first 10 lines of your p110899639.stat file. Thanks!

2021-01-22, 13:57   #81
tdulcet
"Teal Dulcet"
Jun 2018
3·7 Posts

Quote:
Originally Posted by ewmayer
I also installed Clang on my main Ubuntu Linux box, a quad-core Haswell mostly used for builds and hosting a couple GPUs, and built v19 using it there a week ago - the result looks to run 5-10% faster than my GCC build of the same source base. We hope for a similar speedup from build of v19.1 on Arm.
Wow, that is an impressive speedup. I will also update my install script to support Clang after v19.1 is released.

I finished implementing steps 3 and 4 from post #71, although I get a few errors. For example on my 4 core/8 thread Intel system, there is this line in one of the mlucas.cfg files:
Code:
...
3840  msec/iter =   20.20  ROE[avg,max] = [0.194384766, 0.218750000]  radices =  60 32 32 32  0  0  0  0  0  0
...
However, if I try to run ./Mlucas -fftlen 3840 -iters 1000 -radset 60,32,32,32 -cpu "0:1", I get this error:
Quote:
$ ./Mlucas -fftlen 3840 -iters 1000 -radset 60,32,32,32 -cpu "0:1"

Mlucas 19.1

http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 9.3.0.
INFO: Build uses AVX2 instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 8 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 2 cores: 0.1.
ERROR: radix set index 5 for FFT length 3840 K exceeds maximum allowable of 4.
ERROR: at line 3897 of file ../src/Mlucas.c
Assertion failed: ERROR: radix set index 5 for FFT length 3840 K exceeds maximum allowable of 4.
This also happens with these other FFT length and radix combos on the system:

./Mlucas -fftlen 4096 -iters 1000 -radset 16,16,16,16,32 -cpu "0"
./Mlucas -fftlen 2816 -iters 1000 -radset 44,8,16,16,16 -cpu "0,4"

It seems to be some kind of off-by-one error, since the radix set index is always one greater than the maximum.

2021-01-22, 20:45   #82
ewmayer
2ω=0
Sep 2002
República de California
2·7·829 Posts

@tdulcet - yes, off-by-one indexing error is precisely what it is - good catch. At line 3595 of the v19.1 Mlucas.c I posted above, that "radset = i" needs to be changed to "radset = i-1" to undo the last post-increment of i in the enclosing while() loop's call to get_fft_radices(). You can do that yourself, or grab the updated attachment to my Post #73.
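
For readers unfamiliar with this bug pattern, here is an illustration in Python. It is not the Mlucas source: the loop shape is inferred only from the description above, the radix sets are stand-in labels, and find_radset is a hypothetical name.

```python
def find_radset(sets, wanted, buggy=False):
    """Search for `wanted`, mimicking a C loop of the form
    while (lookup(i++) succeeds) { if (match) break; }"""
    i = 0
    while i < len(sets):
        current = sets[i]
        i += 1  # the post-increment has already happened when we compare
        if current == wanted:
            # Buggy variant reports i; the fix undoes the last increment.
            return i if buggy else i - 1
    return -1  # not found

# Five stand-in radix sets, so the valid indices are 0..4.
sets = ["set0", "set1", "set2", "set3", "set4"]
print(find_radset(sets, "set4", buggy=True), find_radset(sets, "set4"))
```

With five sets, the buggy variant reports index 5 for the last one, matching the "index 5 ... exceeds maximum allowable of 4" message quoted earlier.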

2021-01-23, 15:14   #83
tdulcet
"Teal Dulcet"
Jun 2018
3×7 Posts
🆕 New Install Script for Linux

Quote:
Originally Posted by ewmayer
@tdulcet - yes, off-by-one indexing error is precisely what it is - good catch.
Great, thanks for fixing it so quickly!

I attached the new version of my install script for Linux, which implements @ewmayer's steps 1-4 from post #71, and I will push it to GitHub after v19.1 is released. This attached version of the script is for testing and will automatically download, build and partially set up Mlucas as described in post #71. It will also download the required new v19.1 Mlucas.c file from post #73. The command line arguments are not used, so users can just run it with bash mlucas.sh.

To completely set up and run Mlucas for production with the PrimeNet Python script, remove the exit command on line 354. In this case, users will need to provide the command line arguments if the defaults are incorrect.

As with the previous version, it will generate two tables (the fastest combination and the rest of the combinations tested) like these for my 4 core/8 thread Intel system for example:
Code:
Fastest combination
#  Workers/Runs  Threads        First -cpu argument
1  1             4, 1 per core  0:3

Mean/Average faster     #  Workers/Runs  Threads        First -cpu argument
1.020 ± 0.103 (102.0%)  2  2             2, 1 per core  0:1
1.092 ± 0.263 (109.2%)  3  4             1, 1 per core  0
1.058 ± 0.075 (105.8%)  4  1             8, 2 per core  0:3,4:7
1.043 ± 0.067 (104.3%)  5  2             4, 2 per core  0:1,4:5
1.084 ± 0.168 (108.4%)  6  4             2, 2 per core  0,4
The two tables show, for example, that 4-threaded with 1 thread per core is ~1.06 times faster than 8-threaded with 2 threads per core.
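
The "First -cpu argument" column follows a regular pattern. The Python sketch below reproduces it under the assumption, true for this system, that logical CPUs 0..cores-1 are one thread on each physical core and that CPU c+cores is the hyperthread sibling of CPU c. It is not taken from the install script, and first_cpu_arg is a hypothetical name.

```python
def first_cpu_arg(cores: int, workers: int, threads_per_core: int) -> str:
    """Build the -cpu argument for the first of `workers` equal-sized workers."""
    cores_per_worker = cores // workers
    parts = []
    for t in range(threads_per_core):
        lo = t * cores  # sibling threads are assumed to be offset by `cores`
        hi = lo + cores_per_worker - 1
        parts.append(str(lo) if lo == hi else f"{lo}:{hi}")
    return ",".join(parts)

# Enumerate the six combinations tested on the 4 core/8 thread system.
for tpc in (1, 2):
    for workers in (1, 2, 4):
        print(workers, tpc, first_cpu_arg(4, workers, tpc))
```

On systems that number sibling threads differently (for example adjacent pairs 0,1), the offset rule would need to change.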

On many of the systems I have tested it on so far, I actually get significantly different results than the previous version that directly compared the adjusted msec/iter times from the mlucas.cfg files. I would be interested to hear whether other people get the results they were expecting. Feedback is also welcome.

Last fiddled with by ewmayer on 2021-02-13 at 20:07 Reason: Deleted attachment at poster's request

2021-01-24, 14:41   #84
tdulcet
"Teal Dulcet"
Jun 2018
25₈ Posts
Self-test issues

These are probably known issues, but I thought I should note that the ./Mlucas -s a, ./Mlucas -s h and ./Mlucas -s t self-test options do not work. I do not have a personal interest in these options, I was just trying to test my install script against more than the default ./Mlucas -s m FFT lengths to verify that it properly scales.

For both ./Mlucas -s a and ./Mlucas -s t, I get this error before it immediately exits:
Quote:
ERROR: at line 85 of file ../src/radix8_ditN_cy_dif1.c
Assertion failed: CY routines with radix < 16 do not support shifted residues!
Before that error, I also get two of these warnings:
Quote:
WARN: At line 327 of file ../src/mers_mod_square.c:
n/radix0 must be >= 1024! Skipping this radix combo.
For ./Mlucas -s h, I get errors like this for every FFT length and radix combo tested and it never adds anything to the mlucas.cfg file:
Quote:
Res mod 2^35 - 1 = 7270151463
Res mod 2^36 - 1 = 68679090081
*** Res35m1 Error ***
current = 7270151463
should be = 29128713056
*** Res36m1 Error ***
current = 68679090081
should be = 7270151463
--
Return with code ERR_INCORRECT_RES64
Error detected - this radix set will not be used.
WARNING: 0 of 10 radix-sets at FFT length 65536 K passed - skipping it. PLEASE CHECK YOUR BUILD OPTIONS.
Also, for ./Mlucas -s s, the 1792K FFT length does not work:
Quote:
Res mod 2^35 - 1 = 679541321
Res mod 2^36 - 1 = 62692450676
*** Res64 Error ***
current = 9796448591002464256
should be = 11513515421623922688
*** Res35m1 Error ***
current = 679541321
should be = 1603847275
*** Res36m1 Error ***
current = 62692450676
should be = 51947401644
--
Return with code ERR_INCORRECT_RES64
Error detected - this radix set will not be used.
WARNING: 0 of 5 radix-sets at FFT length 1792 K passed - skipping it. PLEASE CHECK YOUR BUILD OPTIONS.
I was able to reproduce these issues on multiple systems, although the above quotes were from a 4 core/8 thread Intel system, the same as with my above post. Here are the specific details copied from the top of my install script's output:
Code:
Linux Distribution:             Ubuntu 20.04.1 LTS
Linux Kernel:                   5.4.0-58-generic
Computer Model:                 Dell Inc. Precision T1700 01
Processor (CPU):                Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz
CPU Cores/Threads:              4/8
Architecture:                   x86_64 (64-bit)
Total memory (RAM):             15,954 MiB (16GiB) (16,729 MB (17GB))
Total swap space:               3,903 MiB (3.9GiB) (4,093 MB (4.1GB))
I attached all four of the respective output files. For reference, the above quotes were found with the grep -i -B 2 -A 2 'error\|warn\|assert' <file> command on the attached output files.
Attached Files
File Type: zip self-tests.zip (18.5 KB, 20 views)

Last fiddled with by tdulcet on 2021-01-24 at 14:45

2021-01-24, 19:55   #85
ewmayer
2ω=0
Sep 2002
República de California
2D56₁₆ Posts

Thanks for the list, T - while few or no users will be interested in those options, the issues need to get addressed on the way to the 19.1 release.

Edit:

OK, worked through the issues you listed above:

1. ERROR: at line 85 of file ../src/radix8_ditN_cy_dif1.c
Assertion failed: CY routines with radix < 16 do not support shifted residues!


In fact, no leading radices between 8 and 15 aside from 12 support shift - in all those I changed the assertion to return(ERR_ASSERT), which is a way of telling the self-test control logic "skip this radix set and continue."

2. WARN: At line 327 of file ../src/mers_mod_square.c:
n/radix0 must be >= 1024! Skipping this radix combo.


That is expected - certain speed-related data-structures introduced a few years ago come at the price of this limitation, which mainly affects small FFT lengths.

3. 1792K self-test residue error message: Somehow this one ended up with the wrong set of reference residues.

All the above fixes will be in the soon-to-come 19.1 release.

Last fiddled with by ewmayer on 2021-01-25 at 04:08

2021-01-29, 22:39   #86
ewmayer
2ω=0
Sep 2002
República de California
2×7×829 Posts

This is more of an extended "code fun" diary entry, but thought it might be of interest to other close-to-the-metal coders who hang around here:

v19.1 shakedown testing is ongoing - Laurent D. was able to successfully build using Clang/LLVM on Apple M1, comparison of timings between that build and a GCC/Brew build of the same source base showed the Clang and GCC builds to be more or less identical in terms of speed. OTOH Clang builds on my Android phone with a quad Armv8+SIMD CPU are consistently 5-10% *slower* than GCC builds on same - go figure. On my old Haswell quad the Clang build seems a bit faster, but that machine has a lot of run-to-run timing variability, so I'm still working on how to best get consistent timing data.

One thing the testing has reminded me of: when testing significant amounts of new assembly, always test on the oldest CPU family supporting the particular instruction set architecture (ISA) targeted by the asm. This is especially true for Intel SIMD, because Intel is notorious for releasing first-cut ISAs with glaringly obvious missing key functionality, then patching that with later add-ons. SSE2 is perhaps the best (as in worst) example of this; there the add-ons went all the way up to SSE4.2. My old Macbook on which I'm typing this only has SSE2, SSE3 and SSSE3, so e.g. my Mlucas SSE2 carry routines don't use the ROUNDPD instruction, which was only belatedly added in SSE4.1. When Intel released 256-bit AVX, that only had full 256-bit vector instructions for floating data types, not integer - the latter were only added with AVX2.

Getting back to the above point about testing on the oldest CPU family - the reduced-length asm-macro arglist constraint imposed by Clang for Arm builds, which are the focus of the v19.1 release, means a lot of what were once sets of I/O addresses for various short-length discrete Fourier transforms (DFTs) in the macro arglist now get replaced by pairs of [base-address, start-of-offset-index-vector] pointers, from which the needed multiple I/O addresses are computed inside the asm-macro. Those pointer-pairs - one for inputs, one for outputs - each need 2 general-purpose registers (GPRs) to store. Not a problem for Armv8 since it comes with a generous 32 GPRs. But we want identical macro interfaces across architectures, so the same kinds of code changes go into the Intel x86_64 versions of each macro, and x86_64 only gives us 14 usable GPRs (rax,rbx,rcx,rdx,rsi,rdi,r8-15; rsp and rbp are reserved for OS and compiler use). To mitigate the resulting not-enough-GPRs problem, I resorted in a couple places to what I thought was a nifty hack: when needed, copy one or more selected GPR contents into the otherwise-unused legacy MMX registers, which are the same 64-bit width and of which there are 8, rather than spilling to and later reading back from memory. Note that the spill issue in my case didn't affect the SSE2 version of the macro, as that uses a less register-dense way of structuring the DFT algorithm in question, since we're not targeting Intel FMA3 instructions with their 2-per-cycle throughput and 4-5-cycle latency.

For my AVX, AVX2 and AVX-512 builds of the new code I was using an Intel NUC8I3CYSM mini with an i3-8121U processor for build & test. Convenient, because once I do the first-cut proof-of-principle recode of a given DFT macro on my Mac, I port to the AVX, AVX2 and AVX-512 versions of same, and can test all 3 of those on the same machine. For the code involving the spill-to-MMX trick, all the timings looked good. Yesterday I figured, better also build and test on the old Haswell quad just to see how things look there - uh-oh, there spill-to-MMX is a complete disaster, performance-wise. So spill-to-memory it is.

Second example of same lesson: AVX512 is probably the best-designed ISA in first-release form Intel has done - when I was first porting my Mlucas asm macros to it in 2H2017 using the GIMPS KNL for builds, the AVX512F (F = foundation) instructions gave me more or less everything I needed, just a few small "this particular piece of code only needs a 256-bit vector width, but AVX512F only supports full-width ZMM operands, not YMM, so we use just the low 256 bits of a ZMM for our data" instances. Integer support in AVX512F is not quite so good, especially for wide-multiply, but that's not an issue for FFT code. So after hitting the above don't-use-this-instruction-on-Haswell issue, I also built the new code in AVX-512 mode on the barebones 68-core KNL I bought late last year. That crashed with a SIGILL, illegal instruction exception, very early in the post-build self-test. I'd been careful to avoid later-release-than-AVX512F instructions in those versions of the recoded macros, did I maybe miss something? Nope - gdb revealed the problem was GCC-generated code for a simple section of C code consisting of nothing more than some simple pointer-arithmetic adds - here are the relevant snips of C code and the roughly corresponding disassembly, with the gdb-added ==> indicating the offender:
Code:
		nisrt2	= tmp + 0x00;	// For the +- isrt2 pair put the - datum first, thus cc0 satisfies
		 isrt2	= tmp + 0x01;	// the same "cc-1 gets you isrt2" property as do the other +-[cc,ss] pairs.
	...
   0x0000000000723126 <+1718>:	vmovdqa %xmm5,0x690(%rsp)
   0x000000000072312f <+1727>:	vmovq  %rsi,%xmm5
   0x0000000000723134 <+1732>:	lea    0x4540(%rbp),%rsi
   0x000000000072313b <+1739>:	vpinsrq $0x1,%rcx,%xmm5,%xmm13
   0x0000000000723141 <+1745>:	vmovq  %rsi,%xmm5
   0x0000000000723146 <+1750>:	lea    0x4580(%rbp),%rsi
=> 0x000000000072314d <+1757>:	vmovdqa64 %xmm16,%xmm6
   0x0000000000723153 <+1763>:	vpinsrq $0x1,%rsi,%xmm5,%xmm5
Note that there are lots of vmovdqa instructions in the complete disassembly, but the arrowed one is the only 64-suffixed one. This family of instructions is listed as "Move Aligned Packed Integer Values", comes in a bunch of different flavors depending on the precise ISA one's CPU uses - thanks, Intel - and ta da! Only the 512-bit ZMM-operand version is available in AVX512F; the XMM,YMM forms need AVX512VL, and GCC resorted to the XMM form above. Here is the list of 512-bit instruction subsets supported by the KNL, note no 'vl'-suffix in there:

avx512f avx512pf avx512er avx512cd

Here is the analogous list for my NUC, which explains why no problems occurred for that build:

avx512f avx512cd avx512dq avx512ifma avx512bw avx512vl avx512vbmi

On the KNL I specified '-O3 -march=knl' for my build (versus '-O3 -march=skylake-avx512' on the NUC), so it's clearly a GCC bug - but in sympathy, I'm guessing the same scattershot-ISA-release fun makes for a real headache for compiler writers. Anyway, what to do? This is not a bug in my assembly, it's simple gcc-compiled C code. Thankfully, the workaround proved to be to compile just this 1 source file with -O2, the rest with the usual -O3, and fortunately this was the only source file needing such compile-flag hackery.
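
The width rule described above can be written down as a small check. A Python sketch using the two flag lists quoted in this post; evex_form_ok is a hypothetical helper, and obtaining the flags from a live system (e.g. from /proc/cpuinfo on Linux) is deliberately left out:

```python
# Flag lists as quoted above for the two machines.
KNL_FLAGS = "avx512f avx512pf avx512er avx512cd".split()
NUC_FLAGS = "avx512f avx512cd avx512dq avx512ifma avx512bw avx512vl avx512vbmi".split()

def evex_form_ok(operand_bits: int, cpu_flags) -> bool:
    """Can an AVX512F instruction such as vmovdqa64 be used at this vector width?"""
    flags = set(cpu_flags)
    if "avx512f" not in flags:
        return False
    if operand_bits == 512:
        return True  # ZMM form: AVX512F alone suffices
    return "avx512vl" in flags  # XMM/YMM forms additionally need AVX512VL

print(evex_form_ok(128, KNL_FLAGS), evex_form_ok(128, NUC_FLAGS))
```

This covers only the F-versus-VL distinction discussed here; instructions from other subsets (DQ, BW and so on) would need their own flag checks.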

Last fiddled with by ewmayer on 2021-01-30 at 22:29 Reason: Restored accidentally-deleted section in between "notorious for releasing first-cut ISAs..." and "...(DFTs) in the macro arglist"

2021-01-30, 14:34   #87
tdulcet
"Teal Dulcet"
Jun 2018
3·7 Posts

Quote:
Originally Posted by ewmayer
Thanks for the list, T - while few or no users will be interested in those options, the issues need to get addressed on the way to the 19.1 release.

Edit:

...

All the above fixes will be in the soon-to-come 19.1 release.
No problem. There was also an issue with the ./Mlucas -s h command, where none of the FFT lengths (65536K - 196608K) and radix combos tested worked (see post #84). Sorry if you already fixed this, but it was not mentioned in your post.

Quote:
Originally Posted by ewmayer
I'm still working on how to best get consistent timing data.
I have been using my Benchmarking Tool to verify the results of the install script, but I think it would also work for this. For example, if you had two binaries, gcc_Mlucas and clang_Mlucas compiled with the respective compilers, a command like this would run each 10 times by default and compute the mean, median and standard deviation of the runtimes among other info (Bash syntax):

./time.sh ./{gcc,clang}'_Mlucas -fftlen 6144 -iters 1000 -radset 48,16,16,16,16 -cpu 0,4'
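
For anyone who wants the same summary statistics without the tool, they can be computed with Python's standard library. This is only a sketch: the runtimes below are made-up placeholders rather than measurements, and summarize is a hypothetical helper, not part of the Benchmarking Tool.

```python
import statistics

def summarize(runtimes_sec):
    """Return mean, median and sample standard deviation of a list of runtimes."""
    return {
        "mean": statistics.mean(runtimes_sec),
        "median": statistics.median(runtimes_sec),
        "stdev": statistics.stdev(runtimes_sec),  # needs at least 2 samples
    }

# Placeholder values for illustration only (not real Mlucas timings).
gcc_runs = [10.0, 10.5, 9.5, 10.0]
clang_runs = [9.0, 9.5, 9.25, 9.25]

for name, runs in (("gcc", gcc_runs), ("clang", clang_runs)):
    print(name, summarize(runs))
```

Comparing the difference in means against the standard deviations gives a first indication of whether a few-percent speedup is real or just run-to-run noise.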

Quote:
Originally Posted by ewmayer
On the KNL I specified '-O3 -march=knl' for my build (versus '-O3 -march=skylake-avx512' on the NUC), so it's clearly a GCC bug - but in sympathy, I'm guessing the same scattershot-ISA-release fun makes for a real headache for compiler writers. Anyway, what to do? This is not a bug in my assembly, it's simple gcc-compiled C code.
I had similar issues trying to automatically set the -march= flag correctly on AVX512 systems in early versions of my install script. The script is automatically tested after every commit with Travis CI, which uses Google Cloud for their x86 VMs. After Google Cloud started providing AVX512 systems, the script would sometimes fail. I tried a bunch of different solutions before I figured out I could set the -march= flag to native and the compiler would automatically set it and a few other flags to the correct value for the current system, reducing the complexity of the script. This is what the script does now for x86 systems and I have not had any issues on Travis CI or any of the other systems I have tested it on since. I am not sure if this would also work around your GCC bug...

2021-01-31, 23:52   #88
ewmayer
2ω=0
Sep 2002
República de California
2×7×829 Posts

Quote:
Originally Posted by tdulcet
No problem. There was also an issue with the ./Mlucas -s h command, where none of the FFT lengths (65536K - 196608K) and radix combos tested worked (see post #84). Sorry if you already fixed this, but it was not mentioned in your post.
Ah, forgot to note - those incorrect reference residues for the huge-FFT self-test are low-priority, I've added them to my v20 to-do list.

Quote:
I have been using my Benchmarking Tool to verify the results of the install script, but I think it would also work for this. For example, if you had two binaries, gcc_Mlucas and clang_Mlucas compiled with the respective compilers, a command like this would run each 10 times by default and compute the mean, median and standard deviation of the runtimes among other info (Bash syntax):

./time.sh ./{gcc,clang}'_Mlucas -fftlen 6144 -iters 1000 -radset 48,16,16,16,16 -cpu 0,4'
I plan to release 19.1 in the next few days, first need to play with your recently-enhanced build&tune script so I can add suitable text about that to the README page. More feedback soon.

Quote:
I had similar issues trying to automatically set the -march= flag correctly on AVX512 systems in early versions of my install script. The script is automatically tested after every commit with Travis CI, which uses Google Cloud for their x86 VMs. After Google Cloud started providing AVX512 systems, the script would sometimes fail. I tried a bunch of different solutions before I figured out I could set the -march= flag to native and the compiler would automatically set it and a few other flags to the correct value for the current system, reducing the complexity of the script. This is what the script does now for x86 systems and I have not had any issues on Travis CI or any of the other systems I have tested it on since. I am not sure if this would also work around your GCC bug...
-march=native is a good suggestion; alas, it did not cure the illegal-instruction issue with that one .c file in my KNL build. However, it should allow me to simplify the manual-build instructions on the README page, for the same reason you note above. This is the first such GCC bug I've hit in my KNL builds of various Mlucas releases, so since very few people have a KNL and even fewer of them run Mlucas on it, one hopes this sort of issue continues to be a rare glitch over coming GCC releases.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.