Mlucas version 17.1
The most recent official release is always available at
[url]http://www.mersenneforum.org/mayer/README.html[/url]

I'll announce major updates/bugfixes/new-prebuilt-binaries on this thread.

=======================

[b][i]06 November 2009: An Alpha version of Mlucas 3.0 is available at the above page[/i][/b]

Major new features:

- SSE2 support for Win32 and 32-and-64-bit Linux. Thanks to my late-in-life conversion to assembly coding I'm a few years behind George on this, but I think it's not too shabby for a first go. It's a bit slower than Prime95 cycle-for-cycle, but I'd appreciate it if some folks would be willing to give up a bit of throughput in order to help test the software. Suggestions for speedups from the ASM experts are especially welcome.

- Platform-independent savefile support.

- Coming soon: Trial-factoring support.

- Coming soon: Primenet support.

- Coming later: Multithreading support for SSE2 code. (This is more important for new-prime verification than for GIMPS users.)

- Coming later: QT-based GUI.

Let me know if you have any download/test/build issues,
-Ernst |
[crickets chirping]
|
Be happy to test. Note that an edit of an existing post is not caught by the "new post" mechanism. Send me a PM.
|
[QUOTE=garo;195277]Be happy to test. Note that an edit of an existing post is not caught by the "new post" mechanism. Send me a PM.[/QUOTE]
Yeah, I realized this morning that although I'd updated the thread, it needed an actual new post to advertise the fact. All the code and build/run instructions are at the above link, so let me know how it goes, if you think the readme page could be clearer about anything, etc.

BTW, I've been running the new code (even as I continued to expand and improve the SSE2 support) more or less continuously for the past 18 months, first on my Win32 box, then (after its fan died and I bought a macbook for my 64-bit linux port work) on my 6-month-old macbook, so I have full confidence in the stability and functional correctness of the LL-test core.

At this point the coming year will be all about adding primenet support, speeding the trial-factoring capability (also already thoroughly tested) enough to make it releaseworthy, and (hopefully) squeezing some extra speed out of the inline assembler by way of detailed profiling and playing with stuff like prefetch, TLB priming, etc. |
[QUOTE=ewmayer;195281]TLB priming.[/QUOTE]
TLB priming is only necessary on early versions of the Pentium 4. |
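[Editorial aside: "TLB priming" here means touching each page of a large working array up front, so page-table walks (TLB fills) happen before, rather than during, the timing-critical FFT passes. A minimal C illustration of the idea — the 4 KB page size and the volatile-sink idiom are my assumptions for this sketch, not Mlucas code:]

```c
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumed; query with sysconf(_SC_PAGESIZE) in real code */

/* Sink is volatile so the compiler cannot optimize the priming loads away. */
static volatile double tlb_sink;

/* Touch one element per page of the FFT data array before the main loop,
   so the page-table walks are paid here rather than inside the timed pass. */
void tlb_prime(const double *a, size_t n)
{
    size_t stride = PAGE_SIZE / sizeof(double);  /* 512 doubles per 4 KB page */
    for (size_t i = 0; i < n; i += stride)
        tlb_sink = a[i];
}
```

On CPUs with hardware page-table walkers and large TLBs this buys little, which matches the observation above that it only mattered on early Pentium 4s.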
Yay, it is finally publicly available. I will take a look at this in a couple of weeks after things settle down a bit here.
|
Here is the result of -s m/l on my i7 920 (stock speed) running x86_64; the compiler is gcc 4.4.1.
[code] 1024 sec/iter = 0.028 ROE[min,max] = [0.250000000, 0.312500000] radices = 32 16 32 32 0 0 0 0 0 0
 1152 sec/iter = 0.033 ROE[min,max] = [0.250000000, 0.250000000] radices = 36 32 32 16 0 0 0 0 0 0
 1280 sec/iter = 0.037 ROE[min,max] = [0.250000000, 0.343750000] radices = 20 32 32 32 0 0 0 0 0 0
 1408 sec/iter = 0.042 ROE[min,max] = [0.312500000, 0.312500000] radices = 44 16 32 32 0 0 0 0 0 0
 1536 sec/iter = 0.045 ROE[min,max] = [0.265625000, 0.269042969] radices = 24 32 32 32 0 0 0 0 0 0
 1792 sec/iter = 0.055 ROE[min,max] = [0.312500000, 0.312500000] radices = 28 32 32 32 0 0 0 0 0 0
 2048 sec/iter = 0.061 ROE[min,max] = [0.281250000, 0.343750000] radices = 16 16 16 16 16 0 0 0 0 0
 2304 sec/iter = 0.072 ROE[min,max] = [0.242187500, 0.281250000] radices = 36 32 32 32 0 0 0 0 0 0
 2560 sec/iter = 0.078 ROE[min,max] = [0.281250000, 0.312500000] radices = 20 16 16 16 16 0 0 0 0 0
 2816 sec/iter = 0.093 ROE[min,max] = [0.328125000, 0.343750000] radices = 44 32 32 32 0 0 0 0 0 0
 3072 sec/iter = 0.098 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 16 0 0 0 0 0
 3584 sec/iter = 0.114 ROE[min,max] = [0.281250000, 0.281250000] radices = 28 16 16 16 16 0 0 0 0 0
 4096 sec/iter = 0.122 ROE[min,max] = [0.250000000, 0.312500000] radices = 16 16 16 16 32 0 0 0 0 0
 4608 sec/iter = 0.147 ROE[min,max] = [0.257812500, 0.257812500] radices = 36 16 16 16 16 0 0 0 0 0
 5120 sec/iter = 0.157 ROE[min,max] = [0.281250000, 0.312500000] radices = 20 16 16 16 32 0 0 0 0 0
 5632 sec/iter = 0.191 ROE[min,max] = [0.375000000, 0.375000000] radices = 44 16 16 16 16 0 0 0 0 0
 6144 sec/iter = 0.198 ROE[min,max] = [0.250000000, 0.296875000] radices = 24 16 16 16 32 0 0 0 0 0
 7168 sec/iter = 0.232 ROE[min,max] = [0.268554688, 0.281250000] radices = 28 16 16 16 32 0 0 0 0 0
 8192 sec/iter = 0.253 ROE[min,max] = [0.281250000, 0.312500000] radices = 16 16 16 32 32 0 0 0 0 0[/code]EDIT : add results of -s l. |
[QUOTE=ldesnogu;195316]Here is the result of -s m/l on my i7 920 (stock speed) running x86_64; the compiler is gcc 4.4.1.[/QUOTE]
Thanks, Laurent - Interesting that FFT lengths of the form 11*2^k are actually (modestly) useful on your 920 ... on both my Core2-based machines (WinXP/32-bit/MSVC and MacOS/64-bit/GCC-4.2) those are slower than the next-larger FFT length, often by quite a lot - you can see this in the sample timing tables on my README page. Your timings are much closer to what I would expect based on arithmetic opcount -- since data access patterns and memory footprints are similar, I expected opcount to be the major determinant of timing across a variety of platforms. (It is, except for the "surprise" I got with the 11*2^k data.) |
At first I thought it could be some compiler issue but running your executable (compiled with gcc 4.2.1) gives very similar results:
[code] 1024 sec/iter = 0.028 ROE[min,max] = [0.250000000, 0.312500000] radices = 32 16 32 32 0 0 0 0 0 0
 1152 sec/iter = 0.034 ROE[min,max] = [0.250000000, 0.250000000] radices = 36 32 32 16 0 0 0 0 0 0
 1280 sec/iter = 0.037 ROE[min,max] = [0.250000000, 0.343750000] radices = 20 32 32 32 0 0 0 0 0 0
 1408 sec/iter = 0.042 ROE[min,max] = [0.312500000, 0.312500000] radices = 44 16 32 32 0 0 0 0 0 0
 1536 sec/iter = 0.045 ROE[min,max] = [0.265625000, 0.269042969] radices = 24 32 32 32 0 0 0 0 0 0
 1792 sec/iter = 0.056 ROE[min,max] = [0.312500000, 0.312500000] radices = 28 32 32 32 0 0 0 0 0 0
 2048 sec/iter = 0.060 ROE[min,max] = [0.281250000, 0.343750000] radices = 16 16 16 16 16 0 0 0 0 0
 2304 sec/iter = 0.073 ROE[min,max] = [0.242187500, 0.281250000] radices = 36 32 32 32 0 0 0 0 0 0
 2560 sec/iter = 0.077 ROE[min,max] = [0.281250000, 0.312500000] radices = 20 16 16 16 16 0 0 0 0 0
 2816 sec/iter = 0.094 ROE[min,max] = [0.328125000, 0.343750000] radices = 44 32 32 32 0 0 0 0 0 0
 3072 sec/iter = 0.097 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 16 0 0 0 0 0
 3584 sec/iter = 0.114 ROE[min,max] = [0.281250000, 0.281250000] radices = 28 16 16 16 16 0 0 0 0 0
 4096 sec/iter = 0.122 ROE[min,max] = [0.250000000, 0.312500000] radices = 16 16 16 16 32 0 0 0 0 0
 4608 sec/iter = 0.147 ROE[min,max] = [0.257812500, 0.257812500] radices = 36 16 16 16 16 0 0 0 0 0
 5120 sec/iter = 0.156 ROE[min,max] = [0.281250000, 0.312500000] radices = 20 16 16 16 32 0 0 0 0 0
 5632 sec/iter = 0.193 ROE[min,max] = [0.375000000, 0.375000000] radices = 44 16 16 16 16 0 0 0 0 0
 6144 sec/iter = 0.196 ROE[min,max] = [0.250000000, 0.296875000] radices = 24 16 16 16 32 0 0 0 0 0
 7168 sec/iter = 0.231 ROE[min,max] = [0.268554688, 0.281250000] radices = 28 16 16 16 32 0 0 0 0 0
 8192 sec/iter = 0.252 ROE[min,max] = [0.281250000, 0.312500000] radices = 16 16 16 32 32 0 0 0 0 0[/code] |
Congratulations on this milestone!
May I ask about the roadmap for the RISC versions of Mlucas? It is fully understandable why they wouldn't be a priority, but one can still hope, right? A feature like PrimeNet integration would be an awesome advance! -smoky |
While trying Mlucas 3.0x (binary download for Linux 64)
./Mlucas_AMD64 -s a

on an AMD Sempron 64 on 2.6.26-2-amd64 x86_64 GNU/Linux

[code]model name : AMD Sempron(tm) Processor 2600+
stepping   : 2
cpu MHz    : 1600.059
cache size : 128 KB[/code]

It ran all the way through the full set of sizes on the first try, but mprime was running in the background, so I deleted mlucas.cfg and tried again just to see if it was different. It crashes now at:

[code]M4521557: using FFT length 224K = 229376 8-byte floats.
this gives an average 19.712424142020090 bits per digit
Using complex FFT radices 28 16 16 16
Segmentation fault[/code]

3 tries, always the same place. I tried again with mprime in the background and it crashes again, same place.

Trying it now with -s m failed at:

[code]M34573867: using FFT length 1792K = 1835008 8-byte floats.
this gives an average 18.841262272426061 bits per digit
Using complex FFT radices 28 8 16 16 16
Segmentation fault[/code]

with -s l:

[code]M134113933: using FFT length 7168K = 7340032 8-byte floats.
this gives an average 18.271573339189803 bits per digit
Using complex FFT radices 28 32 16 16 16
Segmentation fault[/code]

Seems like a problem with the radix 28? |
[QUOTE=smoky;195554]Congratulations on this milestone!
May I ask about the roadmap for the RISC versions of Mlucas? It is fully understandable why they wouldn't be a priority, but one can still hope, right? A feature like PrimeNet integration would be an awesome advance! -smoky[/QUOTE]

The code should build fine without modification on most RISC platforms - no SSE2 support for those, obviously - users may simply have to find the best set of compiler options for their individual platforms.

Regarding Primenet support, my plan is to first get it working for x86-style platforms, then, if the resulting code can be ported to support a wider variety of platforms without terrible difficulty, to proceed with that. I will likely ask for the open-source community's help with the latter, to encompass as broad a variety of platforms as possible without requiring me to work on that aspect full-time.

[QUOTE=lfm;195603]While trying Mlucas 3.0x (binary download for Linux 64) ./Mlucas_AMD64 -s a ... seems like a problem with the radix 28?[/QUOTE]

More likely it's a shared-library issue. Could you try building the source locally (just copy and paste the one-line compile sequence on the README page) and retry the self-test? I may have to post a static binary instead.

Thanks,
-Ernst |
[QUOTE=ewmayer;195631]
More likely it's a shared-library issue. Could you try building the source locally (just copy and paste the one-line compile sequence on the README page) and retry the self-test? I may have to post a static binary instead. [/QUOTE] Seems like that was it. After a local build it runs OK (so far). |
[QUOTE=lfm;195690]Seems like that was it. After a local build it runs OK (so far).[/QUOTE]
I just replaced the Mlucas_AMD64.gz zipped binary with a new statically-linked one ... if you get the chance, please try it out and let me know if that solves the self-test issues you saw with the shared-lib build. Thanks, -Ernst |
[QUOTE=ewmayer;195734]I just replaced the Mlucas_AMD64.gz zipped binary with a new statically-linked one ... if you get the chance, please try it out and let me know if that solves the self-test issues you saw with the shared-lib build.
[/QUOTE] Very strange. Today when I tried a few more tests of the old(er) dynamically linked version, it won't fail for me any more. I'm not sure exactly why, but I think Ubuntu sent out a libc/libm patch and now it doesn't fail (just a theory). For the sake of smaller downloads, so far as I am concerned, you can go back to dynamically linked. |
Hi, below are the results for AMD 6000
AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
CPU speed: 1800.45 MHz, 2 cores
CPU features: RDTSC, CMOV, Prefetch, 3DNow!, MMX, SSE, SSE2
L1 cache size: 64 KB
L2 cache size: 1 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 32
L2 TLBS: 512

[COLOR="DarkRed"]Running openSUSE 11.2
Linux athlon 2.6.32-rc5-git3-1-desktop #1 SMP PREEMPT 2009-11-03 15:41:35 +0100 x86_64 x86_64 x86_64 GNU/Linux[/COLOR]

3.0x

[code] 1024 sec/iter = 0.057 ROE[min,max] = [0.250000000, 0.312500000] radices = 32 16 32 32 0 0 0 0 0 0
 1152 sec/iter = 0.067 ROE[min,max] = [0.250000000, 0.250000000] radices = 36 32 32 16 0 0 0 0 0 0
 1280 sec/iter = 0.072 ROE[min,max] = [0.250000000, 0.343750000] radices = 40 16 32 32 0 0 0 0 0 0
 1408 sec/iter = 0.081 ROE[min,max] = [0.312500000, 0.312500000] radices = 44 16 32 32 0 0 0 0 0 0
 1536 sec/iter = 0.091 ROE[min,max] = [0.265625000, 0.269042969] radices = 24 8 16 16 16 0 0 0 0 0
 1792 sec/iter = 0.111 ROE[min,max] = [0.312500000, 0.312500000] radices = 28 8 16 16 16 0 0 0 0 0
 2048 sec/iter = 0.128 ROE[min,max] = [0.281250000, 0.343750000] radices = 32 32 32 32 0 0 0 0 0 0
 2304 sec/iter = 0.142 ROE[min,max] = [0.242187500, 0.281250000] radices = 36 8 16 16 16 0 0 0 0 0
 2560 sec/iter = 0.160 ROE[min,max] = [0.281250000, 0.312500000] radices = 40 8 16 16 16 0 0 0 0 0
 2816 sec/iter = 0.181 ROE[min,max] = [0.328125000, 0.343750000] radices = 44 32 32 32 0 0 0 0 0 0
 3072 sec/iter = 0.208 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 16 0 0 0 0 0
 3584 sec/iter = 0.248 ROE[min,max] = [0.281250000, 0.281250000] radices = 28 16 16 16 16 0 0 0 0 0

 1024 sec/iter = 0.057 ROE[min,max] = [0.250000000, 0.312500000] radices = 32 16 32 32 0 0 0 0 0 0
 1152 sec/iter = 0.068 ROE[min,max] = [0.250000000, 0.250000000] radices = 36 32 32 16 0 0 0 0 0 0
 1280 sec/iter = 0.072 ROE[min,max] = [0.250000000, 0.343750000] radices = 40 16 32 32 0 0 0 0 0 0
 1408 sec/iter = 0.082 ROE[min,max] = [0.312500000, 0.312500000] radices = 44 16 32 32 0 0 0 0 0 0
 1536 sec/iter = 0.092 ROE[min,max] = [0.265625000, 0.269042969] radices = 24 8 16 16 16 0 0 0 0 0
 1792 sec/iter = 0.110 ROE[min,max] = [0.312500000, 0.312500000] radices = 28 8 16 16 16 0 0 0 0 0
 2048 sec/iter = 0.128 ROE[min,max] = [0.281250000, 0.343750000] radices = 32 32 32 32 0 0 0 0 0 0
 2304 sec/iter = 0.142 ROE[min,max] = [0.242187500, 0.281250000] radices = 36 8 16 16 16 0 0 0 0 0
 2560 sec/iter = 0.160 ROE[min,max] = [0.281250000, 0.312500000] radices = 40 8 16 16 16 0 0 0 0 0
 2816 sec/iter = 0.182 ROE[min,max] = [0.328125000, 0.343750000] radices = 44 8 16 16 16 0 0 0 0 0
 3072 sec/iter = 0.209 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 16 0 0 0 0 0
 3584 sec/iter = 0.249 ROE[min,max] = [0.281250000, 0.281250000] radices = 28 16 16 16 16 0 0 0 0 0

  128 sec/iter = 0.006 ROE[min,max] = [0.312500000, 0.312500000] radices = 16 16 16 16 0 0 0 0 0 0
  144 sec/iter = 0.007 ROE[min,max] = [0.273437500, 0.273437500] radices = 36 8 16 16 0 0 0 0 0 0
  160 sec/iter = 0.008 ROE[min,max] = [0.265625000, 0.265625000] radices = 20 16 16 16 0 0 0 0 0 0
  192 sec/iter = 0.009 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 0 0 0 0 0 0
  224 sec/iter = 0.011 ROE[min,max] = [0.312500000, 0.312500000] radices = 28 16 16 16 0 0 0 0 0 0
  256 sec/iter = 0.012 ROE[min,max] = [0.257812500, 0.296875000] radices = 16 16 32 16 0 0 0 0 0 0
  288 sec/iter = 0.015 ROE[min,max] = [0.312500000, 0.312500000] radices = 36 16 16 16 0 0 0 0 0 0
  320 sec/iter = 0.016 ROE[min,max] = [0.250000000, 0.312500000] radices = 20 16 32 16 0 0 0 0 0 0
  384 sec/iter = 0.020 ROE[min,max] = [0.234375000, 0.250000000] radices = 24 16 16 32 0 0 0 0 0 0
  448 sec/iter = 0.024 ROE[min,max] = [0.281250000, 0.312500000] radices = 28 16 32 16 0 0 0 0 0 0
  512 sec/iter = 0.026 ROE[min,max] = [0.281250000, 0.312500000] radices = 16 16 32 32 0 0 0 0 0 0
  576 sec/iter = 0.030 ROE[min,max] = [0.250000000, 0.281250000] radices = 36 16 32 16 0 0 0 0 0 0
  640 sec/iter = 0.035 ROE[min,max] = [0.281250000, 0.343750000] radices = 40 16 16 32 0 0 0 0 0 0
  704 sec/iter = 0.040 ROE[min,max] = [0.312500000, 0.312500000] radices = 44 16 16 32 0 0 0 0 0 0
  768 sec/iter = 0.043 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 32 32 16 0 0 0 0 0 0
  896 sec/iter = 0.053 ROE[min,max] = [0.312500000, 0.312500000] radices = 28 32 32 16 0 0 0 0 0 0
 1024 sec/iter = 0.057 ROE[min,max] = [0.250000000, 0.312500000] radices = 32 16 32 32 0 0 0 0 0 0
 1152 sec/iter = 0.068 ROE[min,max] = [0.250000000, 0.250000000] radices = 36 32 32 16 0 0 0 0 0 0
 1280 sec/iter = 0.072 ROE[min,max] = [0.250000000, 0.343750000] radices = 40 16 32 32 0 0 0 0 0 0
 1408 sec/iter = 0.082 ROE[min,max] = [0.312500000, 0.312500000] radices = 44 16 32 32 0 0 0 0 0 0
 1536 sec/iter = 0.091 ROE[min,max] = [0.265625000, 0.269042969] radices = 24 32 32 32 0 0 0 0 0 0
 1792 sec/iter = 0.109 ROE[min,max] = [0.312500000, 0.312500000] radices = 28 8 16 16 16 0 0 0 0 0
 2048 sec/iter = 0.126 ROE[min,max] = [0.281250000, 0.343750000] radices = 32 32 32 32 0 0 0 0 0 0
 2304 sec/iter = 0.140 ROE[min,max] = [0.242187500, 0.281250000] radices = 36 8 16 16 16 0 0 0 0 0
 2560 sec/iter = 0.158 ROE[min,max] = [0.281250000, 0.312500000] radices = 40 8 16 16 16 0 0 0 0 0
 2816 sec/iter = 0.179 ROE[min,max] = [0.328125000, 0.343750000] radices = 44 8 16 16 16 0 0 0 0 0
 3072 sec/iter = 0.207 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 16 0 0 0 0 0
 3584 sec/iter = 0.246 ROE[min,max] = [0.281250000, 0.281250000] radices = 28 16 16 16 16 0 0 0 0 0
 4096 sec/iter = 0.281 ROE[min,max] = [0.250000000, 0.312500000] radices = 16 16 16 16 32 0 0 0 0 0
 4608 sec/iter = 0.314 ROE[min,max] = [0.257812500, 0.257812500] radices = 36 16 16 16 16 0 0 0 0 0[/code]

Best regards,
Carlos |
[quote=ewmayer;195734]I just replaced the Mlucas_AMD64.gz zipped binary with a new statically-linked one ... if you get the chance, please try it out and let me know if that solves the self-test issues you saw with the shared-lib build.
Thanks, -Ernst[/quote] [SIZE=5]I wanted to try your software at a windows XP-32 bit system, but the FTP server does not seem to be up.[/SIZE] |
No need to shout!
|
[QUOTE=moebius;196487]I wanted to try your software at a windows XP-32 bit system, but the FTP server does not seem to be up.[/QUOTE]
It seems ftp service is down – I can view http pages, but not upload/download anything via ftp. I just sent e-mail to John Pierce (owner of the Hogranch) about the problem. This also made me realize that there is an inconsistency in my README - some files are linked via http, others (including the source tarball you are trying to get) via ftp. I made the needed changes so all files use http, but I can't upload the new file, since that needs ftp! :( As a workaround (while we wait for ftp to be revived), you can manually change over from ftp to http for any file you need by copying the URL and changing the leading [url]ftp://hogranch.com/pub/mayer...[/url] to [url]http://hogranch.com/mayer...[/url] For example to get the source tarball via http, use [url]http://hogranch.com/mayer/src/C/Mlucas_11.06.2009.zip[/url] To get the .vcproj file needed for Win32/Visual Studio builds, use [url]http://hogranch.com/mayer/bin/Mlucas.vcproj[/url] |
compile error (linux64)
I get these compilation errors... how do I compile it?
[code]$ gcc -m64 -o Mlucas *.o -lm
fermat_mod_square.o: In function `fermat_mod_square':
fermat_mod_square.c:(.text+0x1c8a): undefined reference to `radix32_ditN_cy_dif1'
fermat_mod_square.c:(.text+0x2072): undefined reference to `radix16_ditN_cy_dif1'
fermat_mod_square.c:(.text+0x4ab5): undefined reference to `radix16_dif_pass1'
fermat_mod_square.c:(.text+0x4b96): undefined reference to `radix32_dif_pass1'
fermat_mod_square.c:(.text+0x4e0a): undefined reference to `radix32_dit_pass1'
fermat_mod_square.c:(.text+0x4ed2): undefined reference to `radix16_dit_pass1'
mers_mod_square.o: In function `mers_mod_square':
mers_mod_square.c:(.text+0x173f): undefined reference to `radix32_dit_pass1'
mers_mod_square.c:(.text+0x1807): undefined reference to `radix16_dit_pass1'
mers_mod_square.c:(.text+0x19a2): undefined reference to `radix32_dif_pass1'
mers_mod_square.c:(.text+0x1a6a): undefined reference to `radix16_dif_pass1'
mers_mod_square.c:(.text+0x1dab): undefined reference to `radix32_ditN_cy_dif1'
mers_mod_square.c:(.text+0x2199): undefined reference to `radix16_ditN_cy_dif1'
secure5.o: In function `make_v5_client_key':
secure5.c:(.text+0xe): undefined reference to `md5_raw_output'
secure5.c:(.text+0x18e): undefined reference to `md5_raw_input'
secure5.c:(.text+0x198): undefined reference to `strupper'
secure5.o: In function `secure_v5_url':
secure5.c:(.text+0x210): undefined reference to `md5'
secure5.c:(.text+0x21a): undefined reference to `strupper'
collect2: ld returned 1 exit status[/code] |
1 Attachment(s)
Hello!
I get an error executing the line at carry_gcc64.h:687, which causes a SIGILL at radix16_ditN_cy_dif1.c:2156.

[CODE]Program received signal SIGILL, Illegal instruction.
0x000000000047c953 in radix16_ditN_cy_dif1 (a=a@entry=0x7ffff61de080, n=n@entry=1048576, nwt=1024, nwt_bits=10,
    wt0=0x1, wt1=<optimized out>, si=0x9e1340, rn0=rn0@entry=0x0, rn1=rn1@entry=0x0,
    base=base@entry=0x9c11e0 <base.6704>, baseinv=baseinv@entry=0x9c11f0 <baseinv.6705>, iter=iter@entry=1,
    fracmax=fracmax@entry=0x7fffffffbc48, p=p@entry=20000047) at radix16_ditN_cy_dif1.c:2156[/CODE]

[CODE]   0x47c8f4 <radix16_ditN_cy_dif1+12540>  add    %rax,%rbx
   0x47c8f7 <radix16_ditN_cy_dif1+12543>  add    %rax,%rdx
   0x47c8fa <radix16_ditN_cy_dif1+12546>  add    %rax,%rcx
   0x47c8fd <radix16_ditN_cy_dif1+12549>  mulpd  0x100(%rax),%xmm2
   0x47c905 <radix16_ditN_cy_dif1+12557>  mulpd  0x100(%rax),%xmm6
   0x47c90d <radix16_ditN_cy_dif1+12565>  mulpd  0x110(%rax),%xmm3
   0x47c915 <radix16_ditN_cy_dif1+12573>  mulpd  0x110(%rax),%xmm7
   0x47c91d <radix16_ditN_cy_dif1+12581>  mulpd  (%rdi),%xmm2
   0x47c921 <radix16_ditN_cy_dif1+12585>  mulpd  (%rbx),%xmm6
   0x47c925 <radix16_ditN_cy_dif1+12589>  mulpd  0x40(%rdx),%xmm3
   0x47c92a <radix16_ditN_cy_dif1+12594>  mulpd  0x40(%rcx),%xmm7
   0x47c92f <radix16_ditN_cy_dif1+12599>  mov    0x545332(%rip),%rcx        # 0x9c1c68 <cy_r01.6782>
   0x47c936 <radix16_ditN_cy_dif1+12606>  mov    0x54533b(%rip),%rdx        # 0x9c1c78 <cy_r23.6783>
   0x47c93d <radix16_ditN_cy_dif1+12613>  mulpd  %xmm3,%xmm1
   0x47c941 <radix16_ditN_cy_dif1+12617>  mulpd  %xmm7,%xmm5
   0x47c945 <radix16_ditN_cy_dif1+12621>  addpd  (%rcx),%xmm1
   0x47c949 <radix16_ditN_cy_dif1+12625>  addpd  (%rdx),%xmm5
   0x47c94d <radix16_ditN_cy_dif1+12629>  movaps %xmm1,%xmm3
   0x47c950 <radix16_ditN_cy_dif1+12632>  movaps %xmm5,%xmm7
  >0x47c953 <radix16_ditN_cy_dif1+12635>  roundpd $0x0,%xmm3,%xmm3
   0x47c959 <radix16_ditN_cy_dif1+12641>  roundpd $0x0,%xmm7,%xmm7
   0x47c95f <radix16_ditN_cy_dif1+12647>  mov    0x54549a(%rip),%rbx        # 0x9c1e00 <sign_mask.6724>
   0x47c966 <radix16_ditN_cy_dif1+12654>  subpd  %xmm3,%xmm1
   0x47c96a <radix16_ditN_cy_dif1+12658>  subpd  %xmm7,%xmm5
   0x47c96e <radix16_ditN_cy_dif1+12662>  andpd  (%rbx),%xmm1
   0x47c972 <radix16_ditN_cy_dif1+12666>  andpd  (%rbx),%xmm5
   0x47c976 <radix16_ditN_cy_dif1+12670>  maxpd  %xmm5,%xmm1
   0x47c97a <radix16_ditN_cy_dif1+12674>  maxpd  -0x20(%rax),%xmm1
   0x47c97f <radix16_ditN_cy_dif1+12679>  movaps %xmm1,-0x20(%rax)
   0x47c983 <radix16_ditN_cy_dif1+12683>  mov    %rsi,%rdi
   0x47c986 <radix16_ditN_cy_dif1+12686>  mov    %rsi,%rbx
   0x47c989 <radix16_ditN_cy_dif1+12689>  shr    $0x14,%rdi
   0x47c98d <radix16_ditN_cy_dif1+12693>  shr    $0x16,%rbx
   0x47c991 <radix16_ditN_cy_dif1+12697>  and    $0x30,%rdi
   0x47c995 <radix16_ditN_cy_dif1+12701>  and    $0x30,%rbx
   0x47c999 <radix16_ditN_cy_dif1+12705>  add    %rax,%rdi
   0x47c99c <radix16_ditN_cy_dif1+12708>  add    %rax,%rbx
   0x47c99f <radix16_ditN_cy_dif1+12711>  movaps %xmm3,%xmm1
   0x47c9a2 <radix16_ditN_cy_dif1+12714>  movaps %xmm7,%xmm5
   0x47c9a5 <radix16_ditN_cy_dif1+12717>  mulpd  0xc0(%rdi),%xmm3
   0x47c9ad <radix16_ditN_cy_dif1+12725>  mulpd  0xc0(%rbx),%xmm7

child process 24789  In: radix16_ditN_cy_dif1  Line: 2156  PC: 0x47c953[/CODE]

Output: attachment. Machine: sse sse2 sse4a |
It looks like roundpd is an SSE4.1 instruction which your Opteron 6124 doesn't support (it's not part of SSE4a; see [URL="http://en.wikipedia.org/wiki/SSE4"]Wikipedia[/URL]). I guess Ernst will have to explain why he claims that Mlucas is an SSE2 program :smile:
|
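[Editorial aside: this SIGILL is the classic symptom of running a binary on a CPU that lacks an instruction the compiler emitted. A build can guard against it with a startup check of the CPUID feature bits; a minimal sketch using GCC's cpuid.h wrapper — the helper name is mine, not part of Mlucas:]

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Return 1 if the CPU reports SSE4.1 support (and hence roundpd), else 0. */
int cpu_has_sse41(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return 0;
    return (c >> 19) & 1;   /* CPUID.01H:ECX bit 19 = SSE4.1 (SSE4a is a different, AMD-only bit) */
#else
    return 0;               /* non-x86 build: the instruction cannot appear anyway */
#endif
}
```

A binary could call this once at startup and exit with a clear message instead of dying mid-FFT with an illegal-instruction signal.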
[QUOTE=ldesnogu;348115]It looks like roundpd is an SSE4.1 instruction which your Opteron 6124 doesn't support (it's not part of SSE4a; see [URL="http://en.wikipedia.org/wiki/SSE4"]Wikipedia[/URL]). I guess Ernst will have to explain why he claims that Mlucas is an SSE2 program :smile:[/QUOTE]
Been too long since I visited here, but note that the use of roundpd has been purged from the Mersenne-mod carry macros (the Fermat-mod ones still have it, but I'm the only one using those) in "SSE2"-build mode in all recent releases, and this will remain so. AVX mode of course makes free use of vroundpd, since there is no "which flavor of AVX do you have?" issue w.r.t. that instruction. |
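[Editorial aside: the standard SSE2-era replacement for roundpd is the add-a-big-constant rounding trick, which needs no SSE4.1 at all. A scalar C sketch of the idiom — whether Mlucas's carry macros use exactly this constant is an assumption on my part; the technique itself is the classic one:]

```c
/* Round-to-nearest-integer without roundpd: with the FPU in its default
   round-to-nearest mode, adding 3*2^51 pushes x into a range where the
   ulp is exactly 1.0, so the fractional bits are rounded away; subtracting
   the constant back leaves the nearest integer. Valid for |x| < 2^51. */
static const double RND = 3.0 * (double)(1ULL << 51);   /* = 1.5 * 2^52 */

static double dnint(double x)
{
    volatile double t = x + RND;   /* volatile blocks folding (x + RND) - RND to x */
    return t - RND;
}
```

Note the trick inherits round-half-to-even tie behavior (dnint(0.5) is 0.0, dnint(1.5) is 2.0), which is harmless for carry normalization since the ROE columns in the tables above show fractional parts staying well below 0.5.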
V14.1 is available - details via the readme-file link in the opening post.
|
How does the newer version compare with P95? I mean, I have read your "less than two times slower" stuff there, but I assume that is a figure of speech...
(hey, I am the guy who DC-ed Mike's work, remember? :razz:) |
[QUOTE=LaurV;389911]How does the newer version compares with P95? I mean, I have read your "less than two times slower" stuff there, but I assume that is a figure of speech...[/QUOTE]
Here are 4-thread per-iteration timings on my Haswell 4670K/ddr3-2400, all running at stock. These are all ~2% pessimistic due to startup/shutdown time (e.g. I get 82 msec/iter running @ 3072K in production mode): [code]FFT(K)  msec/iter (4-threaded)
------  ---------
 1024    2.65
 1152    3.15
 1280    3.43
 1408    4.01
 1536    4.19
 1664    4.61
 1792    4.81
 1920    5.29
 2048    5.35
 2304    6.07
 2560    6.51
 2816    7.54
 3072    8.40
 3328    8.74
 3584    9.13
 3840   10.16
 4096   10.54
 4608   11.98
 5120   13.80
 5632   15.92
 6144   17.54
 6656   18.62
 7168   19.69
 7680   22.00[/code] |
For comparison, [URL]http://mersenneforum.org/showpost.php?p=382227&postcount=633[/URL]
i5-4670K @ 3.8 GHz, Dual DDR3 1600 [code]Best time for 1024K FFT length: 1.336 ms., avg: 1.374 ms.
Best time for 1280K FFT length: 1.839 ms., avg: 1.865 ms.
Best time for 1536K FFT length: 2.333 ms., avg: 2.370 ms.
Best time for 1792K FFT length: 2.833 ms., avg: 3.277 ms.
Best time for 2048K FFT length: 3.350 ms., avg: 3.374 ms.
Best time for 2560K FFT length: 4.239 ms., avg: 4.276 ms.
Best time for 3072K FFT length: 5.124 ms., avg: 5.155 ms.
Best time for 3584K FFT length: 6.006 ms., avg: 6.042 ms.
Best time for 4096K FFT length: 6.970 ms., avg: 7.000 ms.
Best time for 5120K FFT length: 8.705 ms., avg: 8.745 ms.
Best time for 6144K FFT length: 10.496 ms., avg: 10.543 ms.
Best time for 7168K FFT length: 12.371 ms., avg: 12.451 ms.
Best time for 8192K FFT length: 14.673 ms., avg: 14.735 ms.[/code]

Nice result for Mlucas, congratulations :) |
[QUOTE=ldesnogu;389947]For comparison, [URL]http://mersenneforum.org/showpost.php?p=382227&postcount=633[/URL]
i5-4670K @ 3.8 GHz, Dual DDR3 1600 [code]Best time for 1024K FFT length: 1.336 ms., avg: 1.374 ms.
Best time for 1280K FFT length: 1.839 ms., avg: 1.865 ms.
Best time for 1536K FFT length: 2.333 ms., avg: 2.370 ms.
Best time for 1792K FFT length: 2.833 ms., avg: 3.277 ms.
Best time for 2048K FFT length: 3.350 ms., avg: 3.374 ms.
Best time for 2560K FFT length: 4.239 ms., avg: 4.276 ms.
Best time for 3072K FFT length: 5.124 ms., avg: 5.155 ms.
Best time for 3584K FFT length: 6.006 ms., avg: 6.042 ms.
Best time for 4096K FFT length: 6.970 ms., avg: 7.000 ms.
Best time for 5120K FFT length: 8.705 ms., avg: 8.745 ms.
Best time for 6144K FFT length: 10.496 ms., avg: 10.543 ms.
Best time for 7168K FFT length: 12.371 ms., avg: 12.451 ms.
Best time for 8192K FFT length: 14.673 ms., avg: 14.735 ms.[/code] Nice result for Mlucas, congratulations :)[/QUOTE]

Thanks - a lot of work went into that "within a factor of 2x". My system runs @3.3GHz (slower than above) but with ddr3-2400 (faster), so not sure how those 2 differences net out. I've been using [url=http://www.mersenneforum.org/showpost.php?p=343173&postcount=99]George's early Haswell results[/url] as my guide, since we bought identical hardware (Mobo, CPU, RAM) and those timings were before George OCed his system. I apply a 15% reduction to his timings, since he says that's roughly what he gained from use of FMA.

BTW, if anyone has access to a Broadwell system running Linux (or MinGW64 under Windoze), I'd very much appreciate timings on such, and have some special preprocessor-flags-to-try-for-Broadwell, as well. |
[QUOTE=ewmayer;389973]My system runs @3.3GHz (slower than above) but with ddr3-2400 (faster), so not sure how those 2 differences net out.[/QUOTE]
Do you mean your system is underclocked? Because 4670Ks are supposed to run at a 3.4 GHz base clock with turbo at 3.8 GHz (and I suppose the benchmark poster just stated the turbo speed; that might be a wrong assumption...). [quote]I've been using [URL="http://www.mersenneforum.org/showpost.php?p=343173&postcount=99"]George's early Haswell results[/URL] as my guide, since we bought identical hardware (Mobo, CPU, RAM) and those timings were before George OCed his system. I apply a 15% reduction to his timings, since he says that's roughly what he gained from use of FMA.[/quote]Silly question: why don't you run the latest Prime95 benchmark on your system? |
I gave Mlucas a try on my i7-4770K.
[code]gcc -c -Os -m64 -DUSE_AVX2 -DUSE_THREADS *.c
rm -f rng*.o util.o qfloat.o
gcc -c -O1 -m64 -DUSE_AVX2 -DUSE_THREADS rng*.c util.c qfloat.c
gcc -o Mlucas *.o -lm -lpthread -lrt

./Mlucas -fftlen 192 -iters 100 -radset 0 -nthread 2
...
100 iterations of M3888517 with FFT length 196608 = 192 K
Res64: 579D593FCE0707B2. AvgMaxErr = 0.274916295. MaxErr = 0.343750000. Program: E14.1
Res mod 2^36     = 67881076658
Res mod 2^35 - 1 = 21674900403
Res mod 2^36 - 1 = 42893438228[/code]The README page says this should be the output:

[code]This particular testcase should produce the following 100-iteration residues, with some platform-dependent variability in the roundoff errors :

100 iterations of M3888509 with FFT length 196608 = 192 K
Res64: 71E61322CCFB396C. AvgMaxErr = 0.226967076. MaxErr = 0.281250000. Program: E3.0x
Res mod 2^36     = 12028950892
Res mod 2^35 - 1 = 29259839105
Res mod 2^36 - 1 = 50741070790[/code]

I guess the README should be updated. How do you get output similar to the Prime95 benchmark? |
[QUOTE=ldesnogu;390014]I gave Mlucas a try on my i7-4770K.
[code]gcc -c -Os -m64 -DUSE_AVX2 -DUSE_THREADS *.c
rm -f rng*.o util.o qfloat.o
gcc -c -O1 -m64 -DUSE_AVX2 -DUSE_THREADS rng*.c util.c qfloat.c
gcc -o Mlucas *.o -lm -lpthread -lrt

./Mlucas -fftlen 192 -iters 100 -radset 0 -nthread 2
...
100 iterations of M3888517 with FFT length 196608 = 192 K
Res64: 579D593FCE0707B2. AvgMaxErr = 0.274916295. MaxErr = 0.343750000. Program: E14.1
Res mod 2^36     = 67881076658
Res mod 2^35 - 1 = 21674900403
Res mod 2^36 - 1 = 42893438228[/code]The README page says this should be output:

[code]This particular testcase should produce the following 100-iteration residues, with some platform-dependent variability in the roundoff errors :

100 iterations of M3888509 with FFT length 196608 = 192 K
Res64: 71E61322CCFB396C. AvgMaxErr = 0.226967076. MaxErr = 0.281250000. Program: E3.0x
Res mod 2^36     = 12028950892
Res mod 2^35 - 1 = 29259839105
Res mod 2^36 - 1 = 50741070790[/code]

I guess the README should be updated.[/QUOTE]

Ah, good catch - if you look closely you see the 2 exponents are slightly different (3888517 is the next-larger prime above 3888509). I believe I must have changed the self-test exponent computation formula sometime in the last year or so to take p as the smallest prime >= the number given by my continuous-function max_p(FFT length) formula, rather than the largest prime <= same. If you force a non-default self-test p via

./Mlucas -m 3888509 -fftlen 192 -iters 100 -radset 0 -nthread 2

you will see the result indicated on the webpage (which I have since corrected). Thanks for the catch.

[QUOTE]How do you get output similar to the Prime95 benchmark?[/QUOTE]

George and I do our self-tests differently ... 
If you want best-FFT-params timings (as determined by these self-tests) for a range of FFT lengths relevant to current GIMPS 'wavefront' and DC work, pause any other CPU-heavy tasks on your system and run the 'medium' self-test range:

./Mlucas -s m -iters 1000

1000 iters gives cleaner timings (and better roundoff testing) than the "quick look" 100-iter tests. With no #threads specified, the code will use all the physical cores on your system. The README page discusses all this stuff. |
[QUOTE=ldesnogu;390009]Do you mean your system is underclocked? Because 4670K are supposed to run at base 3.4 GHz with turbo at 3.8 GHz (and I supposed the benchmark poster just stated turbo speed, might be a wrong assumption...).[/QUOTE]
Ah, I mis-wrote - clock is indeed 3.40 GHz. Perusing the BIOS boot-menu info, I have Turbo Boost enabled (and Enhanced Turbo = [Auto], whatever that means). As I had not recently tried toggling Turbo Boost I tried disabling it - the current Mlucas build runs at the same speed (within the usual noise-based error bars) that way, so it seems to make no difference for my code, at least on my setup. [QUOTE]Silly question: why don't you run the latest Prime95 benchmark on your system?[/QUOTE] Like you say, 'tis a silly question. :) Here are 4-threaded results for my Haswell system: [i] [Worker #1 Dec 19 16:21] Timing FFTs using 4 threads. [Worker #1 Dec 19 16:21] Timing 39 iterations of 1024K FFT length. Best time: 1.293 ms., avg time: 1.344 ms. [Worker #1 Dec 19 16:21] Timing 31 iterations of 1280K FFT length. Best time: 1.825 ms., avg time: 1.850 ms. [Worker #1 Dec 19 16:21] Timing 26 iterations of 1536K FFT length. Best time: 1.993 ms., avg time: 2.305 ms. [Worker #1 Dec 19 16:21] Timing 25 iterations of 1792K FFT length. Best time: 2.317 ms., avg time: 2.356 ms. [Worker #1 Dec 19 16:21] Timing 25 iterations of 2048K FFT length. Best time: 2.766 ms., avg time: 2.785 ms. [Worker #1 Dec 19 16:21] Timing 25 iterations of 2560K FFT length. Best time: 3.462 ms., avg time: 3.500 ms. [Worker #1 Dec 19 16:21] Timing 25 iterations of 3072K FFT length. Best time: 4.141 ms., avg time: 4.190 ms. [Worker #1 Dec 19 16:21] Timing 25 iterations of 3584K FFT length. Best time: 4.957 ms., avg time: 5.009 ms. [Worker #1 Dec 19 16:21] Timing 25 iterations of 4096K FFT length. Best time: 5.639 ms., avg time: 5.722 ms. [Worker #1 Dec 19 16:21] Timing 25 iterations of 5120K FFT length. Best time: 7.151 ms., avg time: 7.202 ms. [Worker #1 Dec 19 16:21] Timing 25 iterations of 6144K FFT length. Best time: 8.471 ms., avg time: 8.639 ms. [Worker #1 Dec 19 16:21] Timing 25 iterations of 7168K FFT length. Best time: 10.197 ms., avg time: 10.272 ms. 
[Worker #1 Dec 19 16:21] Timing 25 iterations of 8192K FFT length. Best time: 11.917 ms., avg time: 11.952 ms. [/i] Now assembling the average times for 4-threaded Prime95 and Mlucas (update of previous table, now using 10000-iter timings run after a reboot, right after which I ran the above Prime95 timing test) at the above FFT lengths (plus the intermediate radix-9/11/13/15-based ones supported by Mlucas) and supplementing with the resulting [Mlucas/Prime95] timing ratio (for cases where the FFT length in question is not supported by Prime95, use its timing at the next-higher length as the denominator): [code] FFTlen Prime95 Mlucas Timing Ratio (Kdbl) msec/iter msec/iter [Mlucas/P95] ------ --------- --------- ------------ 1024 1.344 2.60 1.93 1152 3.13 1.69 1280 1.850 3.56 1.92 1408 3.98 1.73 1536 2.305 4.02 1.74 1664 4.63 1.97 1792 2.356 4.70 1.99 1920 5.29 1.90 2048 2.785 5.29 1.90 2304 6.00 1.71 2560 3.500 6.44 1.84 2816 7.47 1.78 3072 4.190 8.25 1.97 3328 8.84 1.76 3584 5.009 9.02 1.80 3840 10.06 1.76 4096 5.722 10.46 1.83 4608 11.78 1.64 5120 7.202 13.47 1.87 5632 15.52 1.80 6144 8.639 17.40 2.01 6656 18.48 1.80 7168 10.272 19.02 1.85 7680 21.49 1.80 8192 11.952 22.33 1.87[/code] So George still kicks my butt, but now maybe with just one leg, rather than both. :) |
Here is the head-to-head comparison on my new Xyzzy-built Broadwell (i3) NUC, both programs run 4-threaded on the 2 physical cores of the system (that setup gives best per-iteration timing for both on this system) - these timings and ratios can be compared to the Haswell ones in the above post:
[code] FFTlen Prime95 Mlucas Timing Ratio (Kdbl) msec/iter msec/iter [Mlucas/P95] Comments ------ --------- --------- ------------ ------------ 1024 3.894 6.869 1.76 1152 4.634 8.294 1.79 1280 4.990 8.702 1.74 1408 5.502 10.118 1.84 [Prime95 1440K] 1536 6.203 10.298 1.66 1664 6.506 11.562 1.78 [Prime95: average of the 1600K and 1728K timings] 1792 7.473 11.904 1.59 1920 7.843 13.186 1.68 2048 7.898 13.946 1.77 2304 8.889 15.846 1.78 2560 9.930 17.281 1.74 2816 11.369 19.931 1.75 [Prime95 2880K] 3072 12.465 22.373 1.79 3328 13.688 23.541 1.72 [Prime95 3360K] 3584 14.567 25.318 1.74 3840 16.079 27.987 1.74 4096 16.917 29.488 1.74 4608 19.762 34.077 1.72 5120 21.736 37.573 1.73 5632 25.657 43.197 1.68 [Prime95 5760K] 6144 26.867 50.179 1.87 6656 30.958 51.091 1.65 [Prime95 6720K] 7168 32.399 54.929 1.70 7680 34.025 60.411 1.78 8192 34.791 65.911 1.89 Avg: 1.75[/code] |
Thanks to Alex Vong, who earlier this year spurred me to help him with a proposed Linux packaging of the code (Debian first, after which we shall see about other distros), we now have a beta version of a standard Linux-style automated build-tools setup. Feedback from [strike]human guinea pigs[/strike] beta testers appreciated - details at top of the 'News' section of the usual online [url=http://hogranch.com/mayer/README.html]README page[/url]. (And please let me know if the description there could be improved.)
|
Nitpicking:
The page is misleading in 2 places: 1. You should specify in the beginning the fact that the program is slower than P95, and "if you have windoze and intel cpu and you can run p95, then do so!". You specify that your program is intended for non-P95 users, but it is not transparent why (until reading much down, there is a comment about versioning where it specifies why you stop and resume the versioning numbers, and there you say something about speed). 2. Where you say about MLucas having no factoring abilities you suggest that the user who got a "not enough factored" assignment to use P95 to TF to the optimum limit. We know today that this is extremely inefficient, you should point the user to GPU factoring (possibly mfaktX and GPU72). |
[QUOTE=LaurV;409001]Nitpicking:
The page is misleading in 2 places: 1. You should specify in the beginning the fact that the program is slower than P95, and "if you have windoze and intel cpu and you can run p95, then do so!". You specify that your program is intended for non-P95 users, but it is not transparent why (until reading much down, there is a comment about versioning where it specifies why you stop and resume the versioning numbers, and there you say something about speed). 2. Where you say about MLucas having no factoring abilities you suggest that the user who got a "not enough factored" assignment to use P95 to TF to the optimum limit. We know today that this is extremely inefficient, you should point the user to GPU factoring (possibly mfaktX and GPU72).[/QUOTE] Thanks, LaurV: I actually told Alex to include just such a note re. 'slower than P95' in his manpage: [i] Mlucas is an open-source program for primality testing of Mersenne numbers in search of a world-record prime. You may use it to test any suitable number as you wish, but it is preferable that you do so in a coordinated fashion, as part of the Great Internet Mersenne Prime Search (GIMPS). For more information on GIMPS, see the Great Internet Mersenne Prime Search subsection within the NOTES section and SEE ALSO section. Note that Mlucas is not (yet) as efficient as the main GIMPS client, George Woltman's Prime95 program (a.k.a. mprime for the linux version), but that program is not truly open-source, and requires the user to abide by the prize-sharing rules set by its author, should a user be lucky enough to find a new prime eligible for one of the monetary prizes offered by the Electronic Frontier Foundation. [/i] But I neglected to add similar comments to the README.html updates. Will do, and also update the out-of-date P95-for-TF note. Is there a central link I can include for GPU-TF info, for all the various clients and OSes? |
Updated README.html posted, which also adds a link to the GPU72 homepage.
|
Did anyone try to build Mlucas in Windows on MSYS/MSYS2 with Mingw64 ?
The site appears to be down? [URL="http://hogranch.com/mayer/README.html"]http://hogranch.com/mayer/README.html[/URL] |
[QUOTE=ATH;422312]Did anyone try to build Mlucas in Windows on MSYS/MSYS2 with Mingw64 ?
The site appears to be down? [URL="http://hogranch.com/mayer/README.html"]http://hogranch.com/mayer/README.html[/URL][/QUOTE] Yah, I just noticed that as well - note whois shows no contact info, found the domain owner John Pierce's old e-mail [last name AT the same oinkish domain] in one of my saved online address books, as I note I hope he has his e-mail to the hogranch.com address aliased to some other newer address: [i] Begin forwarded message: From: "E. Mayer" Date: January 13, 2016 4:10:21 PM PST To: ************ Subject: hogranch.com down? Hi, John: Trying to reach you by the only e-mail I have for you ... which will fail if the hogranch.com server also handles your e-mail - anyhoo, hogranch.com appears to be down at the moment. I used my account there to stash some Mersenne-related stuff a few hours ago, went to do a few other cleanups just now, no go - ping also reports 100% packet loss. Hoping you have this e-mail aliased to at least one other account! Cheers, -Ernst[/i] [b]Edit:[/b] Just heard back from John, who thankfully does not use his ftp server to also handle his e-mail: [quote]last week, my landline and DSL was nearly totally down for the better part of the week awaiting AT&T service, who finally came saturday, and moved my trunk onto a different pair in the cable. it appears I'm still getting packet loss and the landline is hissy/staticy, so this pair has issues too, and I believe the telco guy said there aren't any more working pairs in the trunk on the street here, so if it goes down again, they'll need to pull a whole new trunk from the RT box 1/2 mile down the hill. ugh. I do have a alternate email, **** but I don't check it as often as I should because I have it forwarding to [above 'oink' address] I should move your stuff that's on hogranch.com to my /real/ server, freescruz.com ... this is colocated at a real ISP and is just all around a better server.[/quote] I ftp'ed in successfully, but 'get' of a sample file ended up hanging ... 
likely due to same issue he describes. Anyone who wants the latest release but fails to get it at the official ftp site, PM or e-mail me and I'll gladly provide you a copy. |
More from John regarding housing those bits. I asked "Would the migration you propose allow the URLs to remain intact?"; his reply:
[quote]if you're using hogranch.com/..... then no, it would be freescruz.com/..... if you're using a DNS alias, then we can point that to the new server. its been SO long since I set this up, hah, I don't remember what all you'er doing and kinda forgot you were even there :) I'm trying to transfer your files now, and having problems going out the DSL, so what I think I need to do is move them to my PC over my LAN, then use my cable (yes, I have two internets here now!) to move them to freescruz [a few minutes later] oh. I see a link to anonymous FTP in your directory? I've killed off using ftp entirely on new.freescruz.com, we'd have to switch to http rather than ftp access for that. already onto phase two of the 2-step I just described, copied hogranch.com -> PC locally, and am now copying PC -> new.freescruz.com. your user account should be active on new.freescruz.com, password is copied straight across (its hashed so I dunno what it is).[/quote] Suggestions re. link-rot-resistant web-hosting, URLs, etc, appreciated. |
Mike has an excellent record of keeping mersenneforum.org up, together with a long list of passengers:
[URL]http://mersenneforum.org/mmff-gfn/[/URL] [URL]http://mersenneforum.org/mmff/[/URL] [URL]http://mersenneforum.org/mfaktc/[/URL] and on and on (and of course the mirror of [url]http://mersenneforum.org/gimps/[/url] ) |
[QUOTE=Batalov;422337]Mike has an excellent record of keeping mersenneforum.org up, together with a long list of passengers:
[URL]http://mersenneforum.org/mmff-gfn/[/URL] [URL]http://mersenneforum.org/mmff/[/URL] [URL]http://mersenneforum.org/mfaktc/[/URL] and on and on (and of course the mirror of [url]http://mersenneforum.org/gimps/[/url] )[/QUOTE] Yes, that thought had occurred to me - my more immediate concern is to how do do such a move without every later user who tries to access one of the old pages getting a 404 error. I suppose I could replace all the html files on hogranch.com with a simple redirect-stub file, perhaps including a tasteful piece of adult erotica (of the sexy-but-non-sexist variety: We would use only shirtless Hoff-pics which also include at least one Baywatch fem-babe) or funny cat-animation for the user to watch while waiting to be redirected. [i]If you experience a redirection lasting 4 hours or more, contact your doctor.[/i] |
[QUOTE=ewmayer;422338]Yes, that thought had occurred to me - my more immediate concern is how to do such a move without every later user who tries to access one of the old pages getting a 404 error. I suppose I could replace all the html files on hogranch.com with a simple redirect-stub file, perhaps including a tasteful piece of adult erotica (of the sexy-but-non-sexist variety: We would use only shirtless Hoff-pics which also include at least one Baywatch fem-babe) or funny cat-animation for the user to watch while waiting to be redirected.
[i]If you experience a redirection lasting 4 hours or more, contact your doctor.[/i][/QUOTE] :davar55: |
All Ernst needs to do is (consistently) use relative links instead of absolute links.
[URL]http://www.mersenneforum.org/mayer/[/URL] :mike: |
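A sketch of how such a link-relativization pass might look - the filename and sed pattern here are illustrative assumptions, not commands from this thread:

```shell
# Sketch of Mike's suggestion: a one-time rewrite of absolute hogranch.com
# links into relative ones, so the pages keep working wherever the directory
# is hosted.  Run against a scratch copy here; the file content is a made-up
# example.
mkdir -p /tmp/linkfix && cd /tmp/linkfix
cat > index.html <<'EOF'
<a href="http://hogranch.com/mayer/README.html">readme</a>
EOF
# Rewrite the absolute prefix to a relative path (keeps a .bak backup):
sed -i.bak 's|http://hogranch\.com/mayer/|./|g' index.html
cat index.html    # -> <a href="./README.html">readme</a>
```

Pages fixed this way survive a host move with no redirect stubs needed.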
[QUOTE=Xyzzy;422420]All Ernst needs to do is (consistently) use relative links instead of absolute links.
[URL]http://www.mersenneforum.org/mayer/[/URL] :mike:[/QUOTE] Thanks, Mike - I will do the needful (RIP Mally) when next I get round to updating my readme.html file. |
1 Attachment(s)
I successfully installed Gentoo 64bits on my Raspberry PI 3.
The system works like a charm with my code, though it does not like threads in Perl: [code] This Perl not built to support threads Compilation failed in require at Pgap_seg.pl line 8. BEGIN failed--compilation aborted at Pgap_seg.pl line 8. [/code] I downloaded the 3.4 MB tar.gz on Ernst's page (the code worked fine on Raspbian 32bit) and [B]./configure[/B]d it. But when I try to make, I get the following: [code] demouser@pi64 ~/luigi/mlucas-14.1 $ make make all-am make[1]: Entering directory '/home/demouser/luigi/mlucas-14.1' CC $NORMAL_O $THREADS_O make[1]: *** [Makefile:2986: NORMAL_O-THREADS_O.stamp] Error 1 make[1]: Leaving directory '/home/demouser/luigi/mlucas-14.1' make: *** [Makefile:2084: all] Error 2 [/code] uname -a gives the following: [code] Linux pi64 4.10.0-rc5-v8 #1 SMP PREEMPT Wed Jan 25 20:13:50 GMT 2017 aarch64 GNU/Linux [/code] The config files are attached. Luigi |
[QUOTE=ET_;457660]I successfully installed Gentoo 64bits on my Raspberry PI 3.[/QUOTE]
How many months of compiling did that take? |
[QUOTE=Mark Rose;457673]How many months of compiling did that take?[/QUOTE]
Not much. The 64-bit OS makes the PI run about 4-5 times slower than a normal PC. One core of a PC nearly equals 4 PI cores. Consider that I have a Gentoo image installed in a matter of minutes on the microSD. |
I see several errors in config.log:
[i] configure:3324: gcc -V >&5 gcc: error: unrecognized command line option '-V' gcc: fatal error: no input files compilation terminated. configure:3324: gcc -qversion >&5 gcc: error: unrecognized command line option '-qversion' gcc: fatal error: no input files compilation terminated. [/i] Not sure what those are about ... I need to ping config-script author Alex Vong what input the '>&5' are intended to feed to the above 2 gcc flags. Nonetheless the config script continues, here is the next batch of errors: [i] configure:4053: checking for library containing ceil, log, pow, sqrt, sincos, floor, lrint, atan configure:4084: gcc -o conftest conftest.c >&5 conftest.c:18:6: warning: built-in function 'ceil' declared as non-function char ceil, log, pow, sqrt, sincos, floor, lrint, atan (); ^ conftest.c:18:12: warning: built-in function 'log' declared as non-function char ceil, log, pow, sqrt, sincos, floor, lrint, atan (); ^ conftest.c:18:17: warning: built-in function 'pow' declared as non-function char ceil, log, pow, sqrt, sincos, floor, lrint, atan (); ^ conftest.c:18:22: warning: built-in function 'sqrt' declared as non-function char ceil, log, pow, sqrt, sincos, floor, lrint, atan (); ^ conftest.c:18:28: warning: built-in function 'sincos' declared as non-function char ceil, log, pow, sqrt, sincos, floor, lrint, atan (); ^ conftest.c:18:36: warning: built-in function 'floor' declared as non-function char ceil, log, pow, sqrt, sincos, floor, lrint, atan (); ^ conftest.c:18:43: warning: built-in function 'lrint' declared as non-function char ceil, log, pow, sqrt, sincos, floor, lrint, atan (); ^ conftest.c:18:50: warning: conflicting types for built-in function 'atan' char ceil, log, pow, sqrt, sincos, floor, lrint, atan (); ^ /tmp/ccSv91Yj.o: In function `main': conftest.c:(.text+0x8): undefined reference to `atan' collect2: error: ld returned 1 exit status [/i] That looks like something related to the math libs. Again, I'll ask Alex to have a look. 
At some point in the not-too-distant future I need to take ownership (or at least curatorship) of the config/make system, but now is not a good time for that effort. |
Ernst, if you run configure on your x86 machine you'll get the same errors in your config.log file ;)
|
[QUOTE=ewmayer;457734]I see several errors in config.log:
[...] That looks like something related to the math libs. Again, I'll ask Alex to have a look. At some point in the not-too-distant future I need to take ownership (or at least curatorship) of the config/make system, but now is not a good time for that effort.[/QUOTE] I will be in New York and Boston for vacation and come back after May 7th. My PI system is actually sieving for the DoubleMersennes project, and the Pico-Cluster is still in its box. Take your time... :smile: Luigi |
[QUOTE=ldesnogu;457756]Ernst, if you run configure on your x86 machine you'll get the same errors in your config.log file ;)[/QUOTE]
Indeed - before posting them to the ftp site I used the install scripts to install on my Intel Broadwell NUC, but since that install was successful I never looked at the resulting config.log until now. As you say, same errors, all the way down [there are around a dozen in total.] |
My understanding is that it's the normal process: configure scripts test various ways to do things until they find one that works. So I guess there's nothing wrong to be found in these logs.
|
[QUOTE=ldesnogu;457864]My understanding is that it's the normal process: configure scripts test various ways to do things until they find one that works. So I guess there's nothing wrong to be found in these logs.[/QUOTE]
Yes, that makes sense. Luigi, could you try going directly into the src directory of the install tree and trying a manual build as per the README page? |
[QUOTE=ewmayer;457920]Yes, that makes sense.
Luigi, could you try going directly into the src directory of the install tree and trying a manual build as per the README page?[/QUOTE] Sure, as soon as I get back home next week. |
David Stanfill (airsquirrels) kindly gave me a user account on his Ryzen in order to do Mlucas builds/tests using my current code snapshot which I am preparing for release. Here are the first 2 sets of timing results, unthreaded builds (what I call '0-thread' to differentiate from multithread-capable builds run with just 1 thread).
Note the radices in the rightmost columns are *complex* FFT radices, thus their product in each case equals one-half the real-vector length (in Kdoubles) in the leftmost column. There was no AMD-specific optimization involved - this is all code developed and tuned for Intel CPUs. [b][Edit: See ||-build notes below about the 100 iters used for these timings likely being insufficient][/b] [code]Ryzen, AVX/0-thread: 1024 msec/iter = 16.143230 ROE[avg,max] = [0.237048340, 0.269531250] radices = 32 16 32 32 1152 msec/iter = 18.393270 ROE[avg,max] = [0.273577009, 0.312500000] radices = 36 16 32 32 1280 msec/iter = 20.434270 ROE[avg,max] = [0.278939383, 0.343750000] radices = 40 16 32 32 1408 msec/iter = 23.969040 ROE[avg,max] = [0.311523438, 0.406250000] radices = 44 16 32 32 1536 msec/iter = 23.938600 ROE[avg,max] = [0.251722935, 0.281250000] radices = 48 16 32 32 1664 msec/iter = 28.809070 ROE[avg,max] = [0.308928571, 0.375000000] radices = 52 16 32 32 1792 msec/iter = 30.127000 ROE[avg,max] = [0.351534598, 0.437500000] radices = 56 16 32 32 1920 msec/iter = 33.393400 ROE[avg,max] = [0.297321429, 0.406250000] radices = 60 16 32 32 2048 msec/iter = 34.487110 ROE[avg,max] = [0.240848214, 0.281250000] radices = 64 16 32 32 2304 msec/iter = 40.226720 ROE[avg,max] = [0.249302455, 0.281250000] radices = 36 32 32 32 2560 msec/iter = 44.287860 ROE[avg,max] = [0.256849888, 0.312500000] radices = 160 16 16 32 2816 msec/iter = 50.539970 ROE[avg,max] = [0.281724330, 0.328125000] radices = 176 16 16 32 3072 msec/iter = 52.569620 ROE[avg,max] = [0.245962960, 0.281250000] radices = 48 32 32 32 3328 msec/iter = 60.861210 ROE[avg,max] = [0.316964286, 0.375000000] radices = 52 32 32 32 3584 msec/iter = 62.958160 ROE[avg,max] = [0.286432757, 0.343750000] radices = 224 16 16 32 3840 msec/iter = 69.900850 ROE[avg,max] = [0.253655134, 0.281250000] radices = 240 16 16 32 4096 msec/iter = 73.305030 ROE[avg,max] = [0.259765625, 0.312500000] radices = 256 16 16 32 4608 msec/iter = 82.375850 
ROE[avg,max] = [0.279478237, 0.375000000] radices = 288 16 16 32 5120 msec/iter = 92.422200 ROE[avg,max] = [0.303348214, 0.375000000] radices = 160 16 32 32 5632 msec/iter = 103.692050 ROE[avg,max] = [0.287374442, 0.343750000] radices = 176 16 32 32 6144 msec/iter = 114.081960 ROE[avg,max] = [0.279017857, 0.312500000] radices = 192 16 32 32 6656 msec/iter = 141.714380 ROE[avg,max] = [0.347767857, 0.375000000] radices = 52 16 16 16 16 7168 msec/iter = 131.530090 ROE[avg,max] = [0.286830357, 0.328125000] radices = 224 16 32 32 7680 msec/iter = 140.589520 ROE[avg,max] = [0.265318080, 0.312500000] radices = 240 16 32 32[/code] [code]Ryzen, AVX2/0-thread: 1024 msec/iter = 14.473480 ROE[avg,max] = [0.249674770, 0.312500000] radices = 32 16 32 32 1152 msec/iter = 16.941660 ROE[avg,max] = [0.304101562, 0.375000000] radices = 36 16 32 32 1280 msec/iter = 18.400400 ROE[avg,max] = [0.285825893, 0.375000000] radices = 40 16 32 32 1408 msec/iter = 21.812400 ROE[avg,max] = [0.299107143, 0.375000000] radices = 44 16 32 32 1536 msec/iter = 22.641650 ROE[avg,max] = [0.264965820, 0.312500000] radices = 48 16 32 32 1664 msec/iter = 26.051310 ROE[avg,max] = [0.303417969, 0.375000000] radices = 52 16 32 32 1792 msec/iter = 27.311240 ROE[avg,max] = [0.305301339, 0.375000000] radices = 56 16 32 32 1920 msec/iter = 30.567500 ROE[avg,max] = [0.323883929, 0.437500000] radices = 60 16 32 32 2048 msec/iter = 31.450460 ROE[avg,max] = [0.258858817, 0.312500000] radices = 64 16 32 32 2304 msec/iter = 35.497940 ROE[avg,max] = [0.365848214, 0.437500000] radices = 144 16 16 32 2560 msec/iter = 39.911440 ROE[avg,max] = [0.294642857, 0.375000000] radices = 40 32 32 32 2816 msec/iter = 46.300510 ROE[avg,max] = [0.286802455, 0.343750000] radices = 176 16 16 32 3072 msec/iter = 48.691550 ROE[avg,max] = [0.235825893, 0.281250000] radices = 48 32 32 32 3328 msec/iter = 55.515420 ROE[avg,max] = [0.278913225, 0.343750000] radices = 208 16 16 32 3584 msec/iter = 55.566890 ROE[avg,max] = [0.286143276, 
0.328125000] radices = 224 16 16 32 3840 msec/iter = 62.801760 ROE[avg,max] = [0.288204520, 0.347656250] radices = 240 16 16 32 4096 msec/iter = 64.375370 ROE[avg,max] = [0.295214844, 0.343750000] radices = 256 16 16 32 4608 msec/iter = 72.954530 ROE[avg,max] = [0.311607143, 0.375000000] radices = 288 16 16 32 5120 msec/iter = 82.275550 ROE[avg,max] = [0.306975446, 0.375000000] radices = 160 16 32 32 5632 msec/iter = 95.040700 ROE[avg,max] = [0.255600412, 0.281250000] radices = 176 16 32 32 6144 msec/iter = 103.228320 ROE[avg,max] = [0.273018973, 0.343750000] radices = 192 16 32 32 6656 msec/iter = 115.045360 ROE[avg,max] = [0.268750000, 0.312500000] radices = 208 16 32 32 7168 msec/iter = 114.919310 ROE[avg,max] = [0.273074777, 0.312500000] radices = 224 16 32 32 7680 msec/iter = 128.601060 ROE[avg,max] = [0.289223807, 0.343750000] radices = 240 16 32 32[/code] |
Here are benchmark timings for multithreaded builds of Mlucas on Ryzen. Some notes:
[b]1.[/b] My above 'unthreaded' timings were for 100-iteration runs. It seems that was insufficient on Ryzen, because when I went to 1000-iter timings to allow for the timing decreases which accompany use of more than 1 thread, even the 1-thread timings drop significantly versus the 100-iteration ones. For example, the per-iteration time for the AVX build @7168K drops from the 131 msec in the unthreaded-build-100-iter table to just 91 msec in the 1-thread column of the threaded-build-1000-iter table which follows. [b]2.[/b] Again due to the deeper 1000-iter runs, the roundoff errors captured in the table are larger. It's clear that I also need to fiddle my timing-test code to omit results having ROEs appreciably > 0.4 from the best-radix-set entries that get printed to the mlucas.cfg file. 0.40625 is probably OK (though maybe not for 100-iter runs), but 0.4375 is dangerously high, and e.g. 0.46875 is "right out", as the Monty Pythons would say. [Cf. Holy Hand Grenade scene in MP & The Holy Grail.] [b]3.[/b] Mlucas allows non-power-of-2 threadcounts but greatly prefers the power-of-2 ones, so I only did the latter. [b]4.[/b] AMD apparently has a different core numbering scheme than Intel - when I ran the first 2-thread benchmarks using the '-nthread 2' option, which sets affinities to cores 0 and 1, the timings were slower than 1-thread. Using the new-in-the-coming-release -cpu option I forced affinities to cores 0 and 2 via '-cpu 0,2', and got the expected 2-thread speedup. For 4 and 8 threads I used '-cpu 0:7:2' [equivalent to '-cpu 0,2,4,6'] and '-cpu 0:15:2' [equivalent to '-cpu 0,2,4,6,8,10,12,14'], respectively. [b]5.[/b] The 8-thread timings, especially for the smaller FFT lengths, are likely pessimistic, since startup overhead is non-negligible for that many threads even using 1000 iterations. 
Ryzen, AVX build, msec/iter vs FFT length (Kdouble) for various threadcounts: [code]FFTlen 1-thr 2-thr 4-thr 8-thr 1024 11.67 6.24 3.77 2.40 ROE[avg,max] = [0.242096600, 0.312500000] 1152 13.47 7.14 4.13 2.96 ROE[avg,max] = [0.275115778, 0.375000000] 1280 14.81 7.88 4.64 3.24 ROE[avg,max] = [0.284061770, 0.406250000] 1408 16.94 8.88 5.26 3.51 ROE[avg,max] = [0.310743194, 0.468750000] 1536 17.75 9.34 5.20 3.57 ROE[avg,max] = [0.252182723, 0.343750000] 1664 20.45 10.74 6.30 4.29 ROE[avg,max] = [0.310800580, 0.406250000] 1792 21.24 11.13 6.10 4.20 ROE[avg,max] = [0.348934528, 0.468750000] 1920 23.62 12.28 6.87 4.74 ROE[avg,max] = [0.295699098, 0.406250000] 2048 24.02 12.59 6.95 4.83 ROE[avg,max] = [0.248437626, 0.320312500] 2304 27.91 14.72 7.96 5.43 ROE[avg,max] = [0.248899291, 0.312500000] 2560 30.55 16.07 8.90 5.94 ROE[avg,max] = [0.302806862, 0.375000000] 2816 34.92 18.18 10.09 6.61 ROE[avg,max] = [0.284329255, 0.375000000] 3072 36.52 19.12 10.83 7.23 ROE[avg,max] = [0.244108896, 0.312500000] 3328 42.00 22.02 12.74 8.42 ROE[avg,max] = [0.316897552, 0.437500000] 3584 43.51 22.70 12.51 8.34 ROE[avg,max] = [0.289033555, 0.437500000] 3840 48.30 25.20 13.63 8.83 ROE[avg,max] = [0.301240335, 0.375000000] 4096 50.49 26.29 14.41 10.01 ROE[avg,max] = [0.293798325, 0.437500000] 4608 57.54 29.73 16.26 10.75 ROE[avg,max] = [0.301216173, 0.406250000] 5120 64.50 33.36 18.01 12.04 ROE[avg,max] = [0.321669620, 0.406250000] 5632 72.71 37.62 20.33 13.39 ROE[avg,max] = [0.284785005, 0.375000000] 6144 77.17 40.38 22.42 14.81 ROE[avg,max] = [0.254623948, 0.343750000] 6656 88.26 46.11 27.05 18.87 ROE[avg,max] = [0.353221649, 0.437500000] 7168 90.96 47.11 25.80 16.98 ROE[avg,max] = [0.289598351, 0.375000000] 7680 99.00 50.93 27.62 18.63 ROE[avg,max] = [0.267126056, 0.437500000[/code] |
Ryzen, AVX2/FMA3 build, msec/iter vs FFT length (Kdouble) for various threadcounts:
[code]FFTlen 1-thr 2-thr 4-thr 8-thr 1024 10.42 5.34 3.36 2.20 ROE[avg,max] = [0.249404939, 0.328125000] 1152 12.14 6.40 3.72 2.80 ROE[avg,max] = [0.302253644, 0.375000000] 1280 13.23 6.84 4.07 2.88 ROE[avg,max] = [0.285753262, 0.375000000] 1408 15.40 8.01 4.87 3.09 ROE[avg,max] = [0.300879913, 0.375000000] 1536 15.96 8.31 4.80 3.11 ROE[avg,max] = [0.265940841, 0.375000000] 1664 18.57 9.60 5.64 3.92 ROE[avg,max] = [0.310388813, 0.406250000] 1792 18.67 9.77 5.47 3.83 ROE[avg,max] = [0.310203065, 0.437500000] 1920 21.53 11.29 6.25 4.26 ROE[avg,max] = [0.324257007, 0.437500000] 2048 21.68 11.34 6.39 4.39 ROE[avg,max] = [0.241334140, 0.312500000] 2304 25.47 13.26 7.37 5.02 ROE[avg,max] = [0.234688230, 0.281250000] 2560 27.57 14.42 8.03 5.35 ROE[avg,max] = [0.297289787, 0.406250000] 2816 32.14 16.67 9.26 6.24 ROE[avg,max] = [0.241656117, 0.343750000] 3072 33.18 17.27 9.93 6.80 ROE[avg,max] = [0.234802388, 0.289062500] 3328 38.69 20.04 10.95 7.34 ROE[avg,max] = [0.308062178, 0.375000000] 3584 39.13 20.18 11.10 7.49 ROE[avg,max] = [0.287800268, 0.375000000] 3840 44.07 22.50 12.36 8.48 ROE[avg,max] = [0.288700568, 0.355468750] 4096 44.67 23.29 13.33 9.33 ROE[avg,max] = [0.284906635, 0.359375000] 4608 51.83 26.58 14.70 9.78 ROE[avg,max] = [0.294995369, 0.375000000] 5120 56.91 29.36 16.57 11.11 ROE[avg,max] = [0.340822043, 0.437500000] 5632 66.01 34.16 18.99 12.39 ROE[avg,max] = [0.296337954, 0.406250000] 6144 68.74 35.72 20.63 13.88 ROE[avg,max] = [0.303176707, 0.390625000] 6656 79.48 40.54 22.12 15.02 ROE[avg,max] = [0.270511965, 0.375000000] 7168 80.03 40.97 23.16 15.74 ROE[avg,max] = [0.272298848, 0.343750000] 7680 89.57 45.75 25.08 17.17 ROE[avg,max] = [0.287253405, 0.375000000] [/code] In particular note the AVX2-mode 2816K timings - 2-threaded I benchmark at 16.7 msec/iter. After my benchmarks finished I fired up 4 exponents near the upper limit of 53.8M for 2816K. 
With all four 2-threaded jobs running and thus all 8 physical cores busy, I get ~20 msec/iter for each of the 4 side-by-side runs. Will play with thread counts and affinities some more in coming days to see if I can improve on that. Off to bed ... |
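For concreteness, a hypothetical launcher for that four-jobs setup - everything besides the -cpu pinning is elided, and the core pairs follow the even-numbered-logical-CPU physical-core scheme noted in the previous post:

```shell
# Hypothetical launcher (not from the post): four 2-threaded Mlucas jobs,
# each pinned to its own pair of physical cores, which on this Ryzen are the
# even-numbered logical CPUs 0,2,...,14.  Other Mlucas arguments elided.
for i in 0 1 2 3; do
    lo=$((4*i)); hi=$((lo+2))
    echo "job$i: ./Mlucas ... -cpu $lo,$hi &"
done
# job0: ./Mlucas ... -cpu 0,2 &
# job1: ./Mlucas ... -cpu 4,6 &
# job2: ./Mlucas ... -cpu 8,10 &
# job3: ./Mlucas ... -cpu 12,14 &
```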
[QUOTE=ewmayer;458889]Here are benchmark timings for multithreaded builds of Mlucas on Ryzen. Some notes:
[b]1.[/b] My above 'unthreaded' timings were for 100-iteration runs. It seems that was insufficient on Ryzen, because when I went to 1000-iter timings to allow for the timing decreases which accompany use of more than 1 thread, even the 1-thread timings drop significantly versus the 100-iteration ones. For example, the per-iteration time for the AVX build @7168K drops from the 131 msec in the unthreaded-build-100-iter table to just 91 msec in the 1-thread column of the threaded-build-1000-iter table which follows.[/QUOTE] Zen uses a neural network in its branch predictor. A lot of people benchmarking when Ryzen came out found that second, third, and additional runs, often resulted in better times. You may get better timings still using longer iterations. |
In relation to the verify runs of the new M-prime candidate, forumites Andreas Höglund [ATH] and Gord Palameta [GP2] both hit errors in building 17.1 for avx-512 - turns out some preprocessor-logic I added in relation to supporting ARMv8 SIMD (see the "ARM builds..." thread) broke an assumption implicit in several of the carry-radix files when built in avx-512 mode. Clearly, I need to do more thorough QA work going forward.
Patched 17.1 version has been successfully built by Andreas and uploaded by me. |
[QUOTE=Mark Rose;458909]Zen uses a neural network in its branch predictor.[/QUOTE]
There was a wonderful quote on BBC radio: "I'm less concerned about Artificial Intelligence than Artificial Stupidity." |
compiling mlucas 17.1 fails on my raspberry pi 3 running raspbian stretch:
[CODE]../src/util.c: In function ‘has_asimd’:
../src/util.c:1806:16: error: ‘HWCAP_ASIMD’ undeclared (first use in this function)
  if (hwcaps & HWCAP_ASIMD) {
               ^~~~~~~~~~~
../src/util.c:1806:16: note: each undeclared identifier is reported only once for each function it appears in[/CODE]
Any hint what might be wrong? |
Is raspbian 64-bit? If not then it's possible HWCAP_ASIMD might not be defined.
|
No, it's 32-bit. I've read that Raspbian is sticking with 32-bit, so no 64-bit Raspbian in the near future.
|
[QUOTE=heliosh;475280]No it's 32-Bit. I've read that Raspbian sticks to 32-Bit, so no 64-Bit Raspbian in near future.[/QUOTE]
I am using 64-bit gentoo from here: [url]https://github.com/sakaki-/gentoo-on-rpi3-64bit[/url] It allowed me to compile Mlucas, and helped another forumite do the same. |
[QUOTE=ldesnogu;475279]Is raspbian 64-bit? If not then it's possible HWCAP_ASIMD might not be defined.[/QUOTE]
Yes, that reminds me that I should have included the patch for this issue - already in my dev-branch code as of a few months ago - in the patched 17.1 tarball, but I posted the latter specifically to fix build issues for avx-512 code. @heliosh: Quick patch - in util.c, replace the has_asimd() function (at line 1882) with the following one, which adds a bit of preprocessor-hackery:
[code]
int has_asimd(void)
{
	unsigned long hwcaps = getauxval(AT_HWCAP);
#ifndef HWCAP_ASIMD	// This is not def'd on pre-ASIMD platforms
	const unsigned long HWCAP_ASIMD = 0;
#endif
	if (hwcaps & HWCAP_ASIMD) {
		return 1;
	}
	return 0;
}
[/code]
Will post an updated patched tarball shortly - I want to be ready in case we get a bunch of new downloader/builders as a result of the imminent new-prime announcement. |
That did the job, thank you!
|
Hello!) Not sure ... but it looks like something is wrong with the txz archive. I see only a src folder here. Nothing else to compile :confused2:
|
[QUOTE=Lorenzo;475480]Hello!) Not sure ... bu looks like something wrong with txz archive. I see here only src folder. Nothing else for compile :confused2:[/QUOTE]
Lorenzo, you need to look *inside* the src folder. |
Ahhh, sorry. I just read the instructions again. Got it - now I know how to install. :smile:
|
Another day, another problem :smile:
mlucas -s m -cpu 0:3 results in an error with both the self-compiled 17.1 and the Raspbian-precompiled 14.1-2. It runs fine until:
[CODE]NTHREADS = 4
M32156581: using FFT length 1664K = 1703936 8-byte floats. this gives an average 18.871941786545975 bits per digit
Using complex FFT radices 26 8 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
pthread_create:: Cannot allocate memory
pthread_join: : Cannot allocate memory
pthread_join: : Cannot allocate memory
pthread_join: : Cannot allocate memory
ERROR: at line 1473 of file ../src/mers_mod_square.c
Assertion failed: threadpool_init failed![/CODE]
Full log: [url]https://pastebin.com/raw/XvNsn2wD[/url]
There was plenty of free RAM at the time. Single-threaded it is running fine so far. And a minor issue: the URL printed at startup isn't working. |
[QUOTE=heliosh;475560]Another day, another problem :smile:
mlucas -s m -cpu 0:3 results in an error with both the self-compiled 17.1 and the Raspbian-precompiled 14.1-2. It runs fine until:
[CODE]NTHREADS = 4
M32156581: using FFT length 1664K = 1703936 8-byte floats. this gives an average 18.871941786545975 bits per digit
Using complex FFT radices 26 8 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
pthread_create:: Cannot allocate memory
pthread_join: : Cannot allocate memory
pthread_join: : Cannot allocate memory
pthread_join: : Cannot allocate memory
ERROR: at line 1473 of file ../src/mers_mod_square.c
Assertion failed: threadpool_init failed![/CODE]
Full log: [url]https://pastebin.com/raw/XvNsn2wD[/url]
There was plenty of free RAM at the time. Single-threaded it is running fine so far. And a minor issue: the URL printed at startup isn't working.[/QUOTE]
I haven't seen this particular alloc error before, but have encountered not-dissimilar errors during self-tests on some systems, apparently due to the OS not being able to properly recover memory freed up by completed tests. Does re-running just the 1664K self-test (./Mlucas -fftlen 1664 -cpu 0:3) work? If so, on your system you may simply have to complete the self-tests in this one-length-at-a-time fashion, e.g. after 1664 finishes, paste the block below into your shell:
[code]./Mlucas -fftlen 1792 -cpu 0:3
./Mlucas -fftlen 1920 -cpu 0:3
./Mlucas -fftlen 2048 -cpu 0:3
./Mlucas -fftlen 2304 -cpu 0:3
./Mlucas -fftlen 2560 -cpu 0:3
./Mlucas -fftlen 2816 -cpu 0:3
./Mlucas -fftlen 3072 -cpu 0:3
./Mlucas -fftlen 3328 -cpu 0:3
./Mlucas -fftlen 3584 -cpu 0:3
./Mlucas -fftlen 3840 -cpu 0:3
./Mlucas -fftlen 4096 -cpu 0:3
./Mlucas -fftlen 4608 -cpu 0:3
./Mlucas -fftlen 5120 -cpu 0:3
./Mlucas -fftlen 5632 -cpu 0:3
./Mlucas -fftlen 6144 -cpu 0:3
./Mlucas -fftlen 6656 -cpu 0:3
./Mlucas -fftlen 7168 -cpu 0:3
./Mlucas -fftlen 7680 -cpu 0:3[/code]
Each should add 1 line to your mlucas.cfg file. |
Yep. Manually running the individual tests seems to work. Thanks again.
|
[QUOTE=heliosh;475564]Yep. Manually running the individual tests seems to work. Thanks again.[/QUOTE]
Can you post your final cfg file and basic system details here? Note that in single-fft-length-selftest mode the code appends a bunch of stuff to the right of the usual cfg-file line, as shown below; I suggest stripping off the non-bolded fields of each line created via a single-length test:
[code][b] 4608  msec/iter =    3.07  ROE[avg,max] = [0.238615989, 0.312500000]  radices = 144 16 32 32  0  0  0  0  0  0[/b] 10000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 13BB5C9DDF0CD3D6, 15982066709, 51703797107[/code] |
Voilà:
[url]https://pastebin.com/raw/tMesEMUi[/url] It's a Raspberry Pi 3 running raspbian 32 with a steady background load of about 80% on one core that I don't want to stop right now. mlucas was compiled with -O3 -mcpu=cortex-a53 -mfpu=neon-fp-armv8 |
[QUOTE=heliosh;475706]Voilà:
[url]https://pastebin.com/raw/tMesEMUi[/url] It's a Raspberry Pi 3 running raspbian 32 with a steady background load of about 80% on one core that I don't want to stop right now. mlucas was compiled with -O3 -mcpu=cortex-a53 -mfpu=neon-fp-armv8[/QUOTE] Thanks - your 1664K timing looks highly anomalous; could you redo that one and see if you get a significantly different result on attempt 2? [Details: 1664K uses a front-end radix (52 or 208) based on a prime-radix-13 DFT macro, but the opcount for that is actually decently low - and the next-larger radix-13-based FFT length, 3328K, only shows a modestly worse runtime than the 2 lengths bracketing it. Also, 3328K should give a runtime >= 2x that of 1664K, but the ratio is quite a bit less than 2, again pointing to a re-do of the 1664K timing.] [b]Edit:[/b] Just as a point of comparison, my A53 (odroid c2) is DCing right now, 120 ms/iter @2560K using all 4 cores. So if at all possible, I urge you to switch to a 64-bit version of Raspbian and a SIMD build at your earliest convenience. |
Yep, this one seems more reasonable.
[CODE]1664 msec/iter = 541.97 ROE[avg,max] = [0.237590681, 0.312500000] radices = 208 16 16 16 0 0 0 0 0 0[/CODE] |
I've installed debian:arm64, recompiled mlucas and ran the tests again (which now went through without memory allocation problem). Still with 80% background load on one core.
On average, the tests were sped up by a factor of 1.5 compared to 32-bit raspbian. Possibly some thermal throttling was involved, since I don't have a heatsink yet. The clock was always at 1.2GHz when I checked, though with temps around 75-77°C. I've read that thermal throttling starts at 80°C. 64-bit mlucas.cfg: [url]https://pastebin.com/raw/H2H9dkWH[/url] |
[QUOTE=heliosh;475882]I've installed debian:arm64, recompiled mlucas and ran the tests again (which now went through without memory allocation problem). Still with 80% background load on one core.
On average, the tests were sped up by a factor of 1.5 compared to 32-bit raspbian. Possibly some thermal throttling was involved, since I don't have a heatsink yet. The clock was always at 1.2GHz when I checked, though with temps around 75-77°C. I've read that thermal throttling starts at 80°C. 64-bit mlucas.cfg: [url]https://pastebin.com/raw/H2H9dkWH[/url][/QUOTE] Thanks! I've not looked much at the Raspberry Pi series of micro-PCs ... what do you think accounts for the large speed difference between your Pi3 and my [url=http://www.hardkernel.com/main/products/prdt_info.php?g_code=G145457216438]Odroid C2[/url]? Clock speed difference is only 1.2 vs 1.5 GHz - does the Pi3 use a substantially different implementation-in-Silicon of the A53 processor? Or perhaps more pertinently, compare your timings vs ET_'s [url=http://mersenneforum.org/showpost.php?p=472110&postcount=129]on his Pi3[/url] - his timings are only modestly slower - roughly 70% the throughput - than my C2. |
[QUOTE=ewmayer;475907]Thanks! I've not looked much at the Raspberry Pi series of micro-PCs ... what do you think accounts for the large speed difference between your Pi3 and my [url=http://www.hardkernel.com/main/products/prdt_info.php?g_code=G145457216438]Odroid C2[/url]? Clock speed difference is only 1.2 vs 1.5 GHz - does the Pi3 use a substantially different implementation-in-Silicon of the A53 processor?
Or perhaps more pertinently, compare your timings vs ET_'s [url=http://mersenneforum.org/showpost.php?p=472110&postcount=129]on his Pi3[/url] - his timings are only modestly slower - roughly 70% the throughput - than my C2.[/QUOTE] AFAIK, the Odroid-C2 has 2GB of RAM instead of just 1 (but I have no idea about the controller), and faster access to the "disk". |
I have rerun the 4096k-Test without any background load and it is still a lot slower than ET_'s:
[CODE]4096  msec/iter =  683.82  ROE[avg,max] = [0.254464286, 0.312500000]  radices = 256  8  8  8 16  0  0  0  0  0[/CODE]
I can only think of two things the significant discrepancy could be coming from:
- my compiler flags were horribly wrong (plus I use GCC-6, while ET_ used GCC-5)
- a thermal issue
Neither sounds like a plausible explanation to me, but who knows.
[B]Edit:[/B] I was checking pi64-config for the CPU frequency and it said: "Throttling occured [under-voltage throttled], your RPI doesn't perform well under load. This usually happens because of a suboptimal power supply cable." I'll check that. |
[QUOTE=heliosh;475997]I have rerun the 4096k-Test without any background load and it is still a lot slower than ET_'s:
[CODE]4096 msec/iter = 683.82 ROE[avg,max] = [0.254464286, 0.312500000] radices = 256 8 8 8 16 0 0 0 0 0[/CODE] I can only think of two things where the significant discrepancy is coming from: - my compiler-flags were horribly wrong (plus i use GCC-6, while ET_ has used GCC-5) - Thermal issue None of those sound like a plausible explanation to me, but who knows.[/QUOTE] Did you use the following command?
[code] gcc -c -O3 -DUSE_ARM_V8_SIMD -DUSE_THREADS ../src/*.c >& build.log [/code]
Did you start from a clean environment, with no precompiled object files from previous build tests? Luigi |
I've had "-O3 -mcpu=cortex-a53"
I've compiled it now with -DUSE_ARM_V8_SIMD, but I instantly get a segfault when running the tests. And yes, the object files were deleted. |
If I were you I'd use the gentoo distro ET and I use. I initially tried a debian 64 distro and encountered very similar problems to yours, down to the seg faults (check the other thread). The timings you posted are slower probably due to being scalar, but they're also much lower than my scalar timings, so the undervolting is a big part of the issue.
Definitely replace the power supply, it could be a cheap cable or an insufficient transformer. If you're using an old phone charger that's probably the issue, I have one that's only rated for 5V @ 300mA, which fails at ~1000mA. Fine for light use, but under full load my pi 3 oscillates around 1000mA +-250mA. Any modern phone transformer is probably fine, most seem to be rated for 2000mA or 2400mA. Go for a heatsink on the SoC, I have an aluminium one which I think is just about keeping up but would get copper if doing it again. You don't need to put a heatsink on the io chip, but I would recommend one on the RAM chip on the underside, or at least drilling a hole in the case (if you use one), as otherwise it's getting nearly no airflow. |
I'm using a 1.5A psu that came with a Raspi2 I've had earlier. I've now ordered a 5.1V 2.5A "official Raspberry Pi 3 Power supply" and a copper heatsink.
I have USB devices attached which also draw a significant amount of power, so 1.5A might be a bit tight. Thermal imaging shows that the RAM isn't getting very hot, just the SoC is glowing. But it's getting offtopic here. I'll post an update if I get a significant improvement. |
[code]Victor@PCVICTOR MINGW64 ~
$ pacman -S mingw-w64-x86_64-gcc
resolving dependencies...
looking for conflicting packages...

Packages (2) mingw-w64-x86_64-gcc-libs-7.3.0-1  mingw-w64-x86_64-gcc-7.3.0-1

Total Installed Size:  116,40 MiB
Net Upgrade Size:       14,16 MiB

:: Proceed with installation? [Y/n] y
(2/2) checking keys in keyring              [#####################] 100%
(2/2) checking package integrity            [#####################] 100%
(2/2) loading package files                 [#####################] 100%
(2/2) checking for file conflicts           [#####################] 100%
(2/2) checking available disk space         [#####################] 100%
(1/2) upgrading mingw-w64-x86_64-gcc-libs   [#####################] 100%
(2/2) upgrading mingw-w64-x86_64-gcc        [#####################] 100%

Victor@PCVICTOR MINGW64 ~
$ which gcc
/mingw64/bin/gcc

Victor@PCVICTOR MINGW64 ~
$ gcc -v
Using built-in specs.
COLLECT_GCC=C:\msys64\mingw64\bin\gcc.exe
COLLECT_LTO_WRAPPER=C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/7.3.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../gcc-7.3.0/configure --prefix=/mingw64 --with-local-prefix=/mingw64/local --build=x86_64-w64-mingw32 --host=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --with-native-system-header-dir=/mingw64/x86_64-w64-mingw32/include --libexecdir=/mingw64/lib --enable-bootstrap --with-arch=x86-64 --with-tune=generic --enable-languages=c,lto,c++,objc,obj-c++,fortran,ada --enable-shared --enable-static --enable-libatomic --enable-threads=posix --enable-graphite --enable-fully-dynamic-string --enable-libstdcxx-time=yes --enable-libstdcxx-filesystem-ts=yes --disable-libstdcxx-pch --disable-libstdcxx-debug --disable-isl-version-check --enable-lto --enable-libgomp --disable-multilib --enable-checking=release --disable-rpath --disable-win32-registry --disable-nls --disable-werror --disable-symvers --with-libiconv --with-system-zlib --with-gmp=/mingw64 --with-mpfr=/mingw64 --with-mpc=/mingw64 --with-isl=/mingw64 --with-pkgversion='Rev1, Built by MSYS2 project' --with-bugurl=https://sourceforge.net/projects/msys2 --with-gnu-as --with-gnu-ld
Thread model: posix
gcc version 7.3.0 (Rev1, Built by MSYS2 project)

Victor@PCVICTOR MINGW64 ~
$ cd ..

Victor@PCVICTOR MINGW64 /home
$ cd mlucas_v17.1-20180123/

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123
$ cd AVX

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123/AVX
$ gcc -c -O3 -DUSE_AVX *.c>& build.log

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123/AVX
$ grep -i error build.log

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123/AVX
$ gcc -o Mlucas *.o -lm -lpthread -lrt
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/7.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lrt
collect2.exe: error: ld returned 1 exit status

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123/AVX
$ gcc -o Mlucas *.o -lm -lpthread -lrt

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123/AVX
$ cd ..

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123
$ cd SSE2/

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123/SSE2
$ gcc -c -O3 -DUSE_SSE2 *.c>& build.log

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123/SSE2
$ grep -i error build.log

Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123/SSE2
$ gcc -o Mlucas *.o -lm -lpthread -lrt[/code]
Compiled without errors (only lots of warnings) with your excellent guide ([URL]http://www.mersenneforum.org/mayer/README.html#windows[/URL]): MSYS2 + MINGW64, GCC version 7.3.0 (Rev1, Built by MSYS2 project), Win7 64-bit, Intel Core i5 2500k @4.0GHz. Somehow librt wasn't in the folder gcc was looking in, but that was a simple copy-paste fix: I copied it over from a (separate) mingw64-6.3.0 installation.

SSE2 (1 core)
[code]17.1
  128  msec/iter =    1.80  ROE[avg,max] = [0.243858937, 0.312500000]  radices = 16 16 16 16 0 0 0 0 0 0
  160  msec/iter =    2.30  ROE[avg,max] = [0.275809152, 0.312500000]  radices = 20 16 16 16 0 0 0 0 0 0
  192  msec/iter =    2.70  ROE[avg,max] = [0.255859375, 0.304687500]  radices = 24 16 16 16 0 0 0 0 0 0
  208  msec/iter =    3.20  ROE[avg,max] = [0.287562779, 0.343750000]  radices = 208 16 32 0 0 0 0 0 0 0
  224  msec/iter =    3.50  ROE[avg,max] = [0.302427455, 0.375000000]  radices = 28 16 16 16 0 0 0 0 0 0
  240  msec/iter =    3.90  ROE[avg,max] = [0.259737723, 0.312500000]  radices = 60 8 16 16 0 0 0 0 0 0
  256  msec/iter =    3.70  ROE[avg,max] = [0.303571429, 0.375000000]  radices = 32 16 16 16 0 0 0 0 0 0
  288  msec/iter =    4.60  ROE[avg,max] = [0.246065848, 0.312500000]  radices = 144 32 32 0 0 0 0 0 0 0
  320  msec/iter =    4.80  ROE[avg,max] = [0.275948661, 0.375000000]  radices = 40 16 16 16 0 0 0 0 0 0
  352  msec/iter =    5.70  ROE[avg,max] = [0.292622811, 0.375000000]  radices = 44 16 16 16 0 0 0 0 0 0
  384  msec/iter =    5.70  ROE[avg,max] = [0.260909598, 0.312500000]  radices = 24 16 16 32 0 0 0 0 0 0
  416  msec/iter =    6.70  ROE[avg,max] = [0.264285714, 0.296875000]  radices = 52 16 16 16 0 0 0 0 0 0
  448  msec/iter =    6.90  ROE[avg,max] = [0.290206473, 0.343750000]  radices = 28 16 16 32 0 0 0 0 0 0
  480  msec/iter =    7.60  ROE[avg,max] = [0.280245536, 0.375000000]  radices = 60 16 16 16 0 0 0 0 0 0
  512  msec/iter =    7.60  ROE[avg,max] = [0.248214286, 0.312500000]  radices = 16 16 32 32 0 0 0 0 0 0
  576  msec/iter =    9.10  ROE[avg,max] = [0.263337054, 0.375000000]  radices = 36 16 16 32 0 0 0 0 0 0
  640  msec/iter =    9.70  ROE[avg,max] = [0.261049107, 0.312500000]  radices = 20 16 32 32 0 0 0 0 0 0
  704  msec/iter =   11.50  ROE[avg,max] = [0.299386161, 0.359375000]  radices = 44 16 16 32 0 0 0 0 0 0
  768  msec/iter =   11.70  ROE[avg,max] = [0.285895647, 0.375000000]  radices = 48 16 16 32 0 0 0 0 0 0
  832  msec/iter =   13.40  ROE[avg,max] = [0.267006138, 0.328125000]  radices = 52 16 16 32 0 0 0 0 0 0
  896  msec/iter =   14.20  ROE[avg,max] = [0.291106306, 0.343750000]  radices = 56 16 16 32 0 0 0 0 0 0
  960  msec/iter =   15.40  ROE[avg,max] = [0.285044643, 0.375000000]  radices = 60 16 16 32 0 0 0 0 0 0
 1024  msec/iter =   15.80  ROE[avg,max] = [0.271428571, 0.375000000]  radices = 32 16 32 32 0 0 0 0 0 0
 1152  msec/iter =   18.80  ROE[avg,max] = [0.259458705, 0.312500000]  radices = 36 16 32 32 0 0 0 0 0 0
 1280  msec/iter =   20.50  ROE[avg,max] = [0.265569196, 0.328125000]  radices = 40 16 32 32 0 0 0 0 0 0
 1408  msec/iter =   25.10  ROE[avg,max] = [0.302511161, 0.375000000]  radices = 44 32 32 16 0 0 0 0 0 0
 1536  msec/iter =   24.40  ROE[avg,max] = [0.287500000, 0.343750000]  radices = 48 16 32 32 0 0 0 0 0 0
 1664  msec/iter =   28.70  ROE[avg,max] = [0.254003906, 0.281250000]  radices = 52 16 32 32 0 0 0 0 0 0
 1792  msec/iter =   29.80  ROE[avg,max] = [0.288364955, 0.343750000]  radices = 56 16 32 32 0 0 0 0 0 0
 1920  msec/iter =   33.20  ROE[avg,max] = [0.258398438, 0.312500000]  radices = 60 16 32 32 0 0 0 0 0 0
 2048  msec/iter =   34.00  ROE[avg,max] = [0.246616908, 0.312500000]  radices = 32 32 32 32 0 0 0 0 0 0
 2304  msec/iter =   40.00  ROE[avg,max] = [0.298158482, 0.375000000]  radices = 144 16 16 32 0 0 0 0 0 0
 2560  msec/iter =   43.51  ROE[avg,max] = [0.264843750, 0.312500000]  radices = 40 32 32 32 0 0 0 0 0 0
 2816  msec/iter =   51.00  ROE[avg,max] = [0.317815290, 0.375000000]  radices = 176 16 16 32 0 0 0 0 0 0
 3072  msec/iter =   51.90  ROE[avg,max] = [0.243532017, 0.296875000]  radices = 48 32 32 32 0 0 0 0 0 0
 3328  msec/iter =   60.11  ROE[avg,max] = [0.252553013, 0.312500000]  radices = 52 32 32 32 0 0 0 0 0 0
 3584  msec/iter =   63.40  ROE[avg,max] = [0.292243304, 0.375000000]  radices = 56 32 32 32 0 0 0 0 0 0
 3840  msec/iter =   69.50  ROE[avg,max] = [0.267271205, 0.375000000]  radices = 240 16 16 32 0 0 0 0 0 0
 4096  msec/iter =   69.00  ROE[avg,max] = [0.244712612, 0.281250000]  radices = 16 16 16 16 32 0 0 0 0 0
 4608  msec/iter =   81.20  ROE[avg,max] = [0.268268694, 0.343750000]  radices = 288 16 16 32 0 0 0 0 0 0
 5120  msec/iter =   90.60  ROE[avg,max] = [0.344419643, 0.375000000]  radices = 20 16 16 16 32 0 0 0 0 0
 5632  msec/iter =  105.40  ROE[avg,max] = [0.324665179, 0.375000000]  radices = 176 16 32 32 0 0 0 0 0 0
 6144  msec/iter =  106.60  ROE[avg,max] = [0.252887835, 0.289062500]  radices = 24 16 16 16 32 0 0 0 0 0
 6656  msec/iter =  122.59  ROE[avg,max] = [0.281138393, 0.312500000]  radices = 208 16 32 32 0 0 0 0 0 0
 7168  msec/iter =  132.00  ROE[avg,max] = [0.289226423, 0.343750000]  radices = 28 16 16 16 32 0 0 0 0 0
 7680  msec/iter =  145.20  ROE[avg,max] = [0.260156250, 0.312500000]  radices = 240 16 32 32 0 0 0 0 0 0
 8192  msec/iter =  146.80  ROE[avg,max] = [0.244656808, 0.281250000]  radices = 16 16 16 32 32 0 0 0 0 0
 9216  msec/iter =  170.30  ROE[avg,max] = [0.254994420, 0.316406250]  radices = 36 16 16 16 32 0 0 0 0 0
10240  msec/iter =  185.20  ROE[avg,max] = [0.284905134, 0.343750000]  radices = 40 16 16 16 32 0 0 0 0 0
11264  msec/iter =  215.90  ROE[avg,max] = [0.284776088, 0.328125000]  radices = 44 16 16 16 32 0 0 0 0 0
12288  msec/iter =  222.20  ROE[avg,max] = [0.247877720, 0.312500000]  radices = 48 16 16 16 32 0 0 0 0 0
13312  msec/iter =  251.81  ROE[avg,max] = [0.304910714, 0.343750000]  radices = 52 16 16 16 32 0 0 0 0 0
14336  msec/iter =  279.70  ROE[avg,max] = [0.275892857, 0.312500000]  radices = 28 16 16 32 32 0 0 0 0 0
15360  msec/iter =  295.60  ROE[avg,max] = [0.288839286, 0.343750000]  radices = 60 16 16 16 32 0 0 0 0 0
16384  msec/iter =  306.90  ROE[avg,max] = [0.253431920, 0.312500000]  radices = 32 16 16 32 32 0 0 0 0 0
18432  msec/iter =  365.40  ROE[avg,max] = [0.261997768, 0.296875000]  radices = 36 16 16 32 32 0 0 0 0 0
20480  msec/iter =  397.50  ROE[avg,max] = [0.269196429, 0.312500000]  radices = 40 16 16 32 32 0 0 0 0 0
22528  msec/iter =  465.71  ROE[avg,max] = [0.284232003, 0.343750000]  radices = 44 16 16 32 32 0 0 0 0 0
24576  msec/iter =  479.50  ROE[avg,max] = [0.293080357, 0.343750000]  radices = 48 16 16 32 32 0 0 0 0 0
26624  msec/iter =  547.20  ROE[avg,max] = [0.267187500, 0.312500000]  radices = 52 16 16 32 32 0 0 0 0 0
28672  msec/iter =  576.00  ROE[avg,max] = [0.309486607, 0.343750000]  radices = 56 16 16 32 32 0 0 0 0 0
30720  msec/iter =  633.21  ROE[avg,max] = [0.264899554, 0.312500000]  radices = 60 16 16 32 32 0 0 0 0 0
32768  msec/iter =  666.00  ROE[avg,max] = [0.255109515, 0.312500000]  radices = 32 32 32 16 32 0 0 0 0 0
36864  msec/iter =  757.10  ROE[avg,max] = [0.273688616, 0.312500000]  radices = 144 16 16 16 32 0 0 0 0 0
40960  msec/iter =  857.64  ROE[avg,max] = [0.262755476, 0.296875000]  radices = 160 16 16 16 32 0 0 0 0 0
45056  msec/iter =  960.50  ROE[avg,max] = [0.295835658, 0.343750000]  radices = 176 16 16 16 32 0 0 0 0 0
49152  msec/iter = 1057.21  ROE[avg,max] = [0.280859375, 0.312500000]  radices = 48 16 32 32 32 0 0 0 0 0
53248  msec/iter = 1205.40  ROE[avg,max] = [0.258314732, 0.312500000]  radices = 52 16 32 32 32 0 0 0 0 0
57344  msec/iter = 1231.01  ROE[avg,max] = [0.282653373, 0.312500000]  radices = 224 16 16 16 32 0 0 0 0 0
61440  msec/iter = 1322.60  ROE[avg,max] = [0.264676339, 0.343750000]  radices = 240 16 16 16 32 0 0 0 0 0
[/code]
AVX (1 core)
[code]17.1
  128  msec/iter =    1.40  ROE[avg,max] = [0.278125000, 0.375000000]  radices = 16 16 16 16 0 0 0 0 0 0
  144  msec/iter =    1.60  ROE[avg,max] = [0.257686942, 0.328125000]  radices = 144 16 32 0 0 0 0 0 0 0
  160  msec/iter =    1.80  ROE[avg,max] = [0.283258929, 0.343750000]  radices = 160 32 16 0 0 0 0 0 0 0
  192  msec/iter =    2.10  ROE[avg,max] = [0.276339286, 0.343750000]  radices = 48 8 16 16 0 0 0 0 0 0
  224  msec/iter =    2.50  ROE[avg,max] = [0.285142299, 0.343750000]  radices = 28 16 16 16 0 0 0 0 0 0
  240  msec/iter =    2.80  ROE[avg,max] = [0.259054129, 0.312500000]  radices = 240 16 32 0 0 0 0 0 0 0
  256  msec/iter =    2.80  ROE[avg,max] = [0.247427150, 0.281250000]  radices = 32 16 16 16 0 0 0 0 0 0
  288  msec/iter =    3.30  ROE[avg,max] = [0.294754464, 0.375000000]  radices = 36 16 16 16 0 0 0 0 0 0
  320  msec/iter =    3.50  ROE[avg,max] = [0.256869071, 0.312500000]  radices = 20 16 16 32 0 0 0 0 0 0
  384  msec/iter =    4.20  ROE[avg,max] = [0.259472656, 0.312500000]  radices = 24 16 16 32 0 0 0 0 0 0
  416  msec/iter =    5.00  ROE[avg,max] = [0.258949498, 0.312500000]  radices = 208 32 32 0 0 0 0 0 0 0
  448  msec/iter =    5.00  ROE[avg,max] = [0.279471261, 0.328125000]  radices = 28 16 16 32 0 0 0 0 0 0
  480  msec/iter =    5.70  ROE[avg,max] = [0.268457031, 0.312500000]  radices = 60 16 16 16 0 0 0 0 0 0
  512  msec/iter =    5.80  ROE[avg,max] = [0.243409947, 0.312500000]  radices = 32 16 16 32 0 0 0 0 0 0
  576  msec/iter =    6.70  ROE[avg,max] = [0.302343750, 0.375000000]  radices = 36 16 16 32 0 0 0 0 0 0
  640  msec/iter =    7.20  ROE[avg,max] = [0.281138393, 0.375000000]  radices = 40 16 16 32 0 0 0 0 0 0
  768  msec/iter =    8.70  ROE[avg,max] = [0.252845982, 0.296875000]  radices = 48 16 16 32 0 0 0 0 0 0
  832  msec/iter =   10.20  ROE[avg,max] = [0.299107143, 0.375000000]  radices = 52 16 16 32 0 0 0 0 0 0
  896  msec/iter =   10.70  ROE[avg,max] = [0.280482701, 0.375000000]  radices = 28 16 32 32 0 0 0 0 0 0
  960  msec/iter =   11.60  ROE[avg,max] = [0.266210938, 0.312500000]  radices = 60 16 16 32 0 0 0 0 0 0
 1024  msec/iter =   12.13  ROE[avg,max] = [0.237806920, 0.312500000]  radices = 32 16 32 32 0 0 0 0 0 0
 1152  msec/iter =   14.00  ROE[avg,max] = [0.277790179, 0.312500000]  radices = 36 16 32 32 0 0 0 0 0 0
 1280  msec/iter =   15.41  ROE[avg,max] = [0.286830357, 0.343750000]  radices = 40 16 32 32 0 0 0 0 0 0
 1408  msec/iter =   18.15  ROE[avg,max] = [0.308140346, 0.390625000]  radices = 176 16 16 16 0 0 0 0 0 0
 1536  msec/iter =   18.38  ROE[avg,max] = [0.254910714, 0.343750000]  radices = 48 16 32 32 0 0 0 0 0 0
 1664  msec/iter =   21.22  ROE[avg,max] = [0.282310268, 0.343750000]  radices = 208 16 16 16 0 0 0 0 0 0
 1792  msec/iter =   22.84  ROE[avg,max] = [0.271777344, 0.312500000]  radices = 28 32 32 32 0 0 0 0 0 0
 1920  msec/iter =   24.30  ROE[avg,max] = [0.296428571, 0.375000000]  radices = 60 16 32 32 0 0 0 0 0 0
 2048  msec/iter =   25.74  ROE[avg,max] = [0.247865513, 0.312500000]  radices = 32 32 32 32 0 0 0 0 0 0
 2304  msec/iter =   28.84  ROE[avg,max] = [0.275669643, 0.312500000]  radices = 144 16 16 32 0 0 0 0 0 0
 2560  msec/iter =   32.86  ROE[avg,max] = [0.300000000, 0.375000000]  radices = 40 32 32 32 0 0 0 0 0 0
 2816  msec/iter =   36.79  ROE[avg,max] = [0.291238839, 0.343750000]  radices = 176 16 16 32 0 0 0 0 0 0
 3072  msec/iter =   38.61  ROE[avg,max] = [0.245962960, 0.281250000]  radices = 48 32 32 32 0 0 0 0 0 0
 3328  msec/iter =   43.28  ROE[avg,max] = [0.284221540, 0.343750000]  radices = 208 16 16 32 0 0 0 0 0 0
 3584  msec/iter =   46.51  ROE[avg,max] = [0.290764509, 0.343750000]  radices = 224 16 16 32 0 0 0 0 0 0
 3840  msec/iter =   49.49  ROE[avg,max] = [0.258475167, 0.296875000]  radices = 240 16 16 32 0 0 0 0 0 0
 4096  msec/iter =   53.52  ROE[avg,max] = [0.284402902, 0.312500000]  radices = 16 16 16 16 32 0 0 0 0 0
 4608  msec/iter =   59.81  ROE[avg,max] = [0.249079241, 0.281250000]  radices = 144 16 32 32 0 0 0 0 0 0
 5120  msec/iter =   66.84  ROE[avg,max] = [0.257080078, 0.312500000]  radices = 20 16 16 16 32 0 0 0 0 0
 5632  msec/iter =   76.25  ROE[avg,max] = [0.282209124, 0.343750000]  radices = 176 16 32 32 0 0 0 0 0 0
 6144  msec/iter =   79.79  ROE[avg,max] = [0.277678571, 0.312500000]  radices = 24 16 16 16 32 0 0 0 0 0
 6656  msec/iter =   88.71  ROE[avg,max] = [0.264644950, 0.312500000]  radices = 208 16 32 32 0 0 0 0 0 0
 7168  msec/iter =   96.36  ROE[avg,max] = [0.294782366, 0.375000000]  radices = 224 16 32 32 0 0 0 0 0 0
 7680  msec/iter =  103.47  ROE[avg,max] = [0.267142160, 0.312500000]  radices = 240 16 32 32 0 0 0 0 0 0
 8192  msec/iter =  112.06  ROE[avg,max] = [0.257198661, 0.312500000]  radices = 256 16 32 32 0 0 0 0 0 0
 9216  msec/iter =  124.63  ROE[avg,max] = [0.293973214, 0.343750000]  radices = 36 16 16 16 32 0 0 0 0 0
10240  msec/iter =  136.20  ROE[avg,max] = [0.280468750, 0.375000000]  radices = 40 16 16 16 32 0 0 0 0 0
11264  msec/iter =  160.42  ROE[avg,max] = [0.283238002, 0.328125000]  radices = 44 16 16 16 32 0 0 0 0 0
12288  msec/iter =  162.44  ROE[avg,max] = [0.261104911, 0.312500000]  radices = 48 16 16 16 32 0 0 0 0 0
13312  msec/iter =  188.98  ROE[avg,max] = [0.289564732, 0.343750000]  radices = 208 32 32 32 0 0 0 0 0 0
14336  msec/iter =  197.28  ROE[avg,max] = [0.287133789, 0.343750000]  radices = 56 16 16 16 32 0 0 0 0 0
15360  msec/iter =  218.04  ROE[avg,max] = [0.262025670, 0.296875000]  radices = 60 16 16 16 32 0 0 0 0 0
16384  msec/iter =  232.43  ROE[avg,max] = [0.239365932, 0.281250000]  radices = 32 16 16 32 32 0 0 0 0 0
18432  msec/iter =  269.14  ROE[avg,max] = [0.246674456, 0.281250000]  radices = 288 32 32 32 0 0 0 0 0 0
20480  msec/iter =  297.32  ROE[avg,max] = [0.325000000, 0.375000000]  radices = 40 16 16 32 32 0 0 0 0 0
22528  msec/iter =  345.23  ROE[avg,max] = [0.304185268, 0.367187500]  radices = 176 16 16 16 16 0 0 0 0 0
24576  msec/iter =  352.60  ROE[avg,max] = [0.257749721, 0.312500000]  radices = 48 16 16 32 32 0 0 0 0 0
26624  msec/iter =  414.03  ROE[avg,max] = [0.284179688, 0.343750000]  radices = 52 16 16 32 32 0 0 0 0 0
28672  msec/iter =  420.70  ROE[avg,max] = [0.302594866, 0.343750000]  radices = 56 16 16 32 32 0 0 0 0 0
30720  msec/iter =  466.00  ROE[avg,max] = [0.291629464, 0.375000000]  radices = 240 16 16 16 16 0 0 0 0 0
32768  msec/iter =  498.50  ROE[avg,max] = [0.267689732, 0.343750000]  radices = 128 16 16 16 32 0 0 0 0 0
36864  msec/iter =  535.63  ROE[avg,max] = [0.254352679, 0.312500000]  radices = 144 16 16 16 32 0 0 0 0 0
40960  msec/iter =  594.90  ROE[avg,max] = [0.297098214, 0.343750000]  radices = 160 16 16 16 32 0 0 0 0 0
45056  msec/iter =  674.78  ROE[avg,max] = [0.299944196, 0.343750000]  radices = 176 16 16 16 32 0 0 0 0 0
49152  msec/iter =  784.09  ROE[avg,max] = [0.254603795, 0.281250000]  radices = 192 16 16 16 32 0 0 0 0 0
53248  msec/iter =  904.71  ROE[avg,max] = [0.271316964, 0.312500000]  radices = 52 16 32 32 32 0 0 0 0 0
57344  msec/iter =  857.03  ROE[avg,max] = [0.319642857, 0.375000000]  radices = 224 16 16 16 32 0 0 0 0 0
61440  msec/iter =  897.44  ROE[avg,max] = [0.276255580, 0.312500000]  radices = 240 16 16 16 32 0 0 0 0 0
[/code]
Strangely, the scalar build (without AVX/SSE2) also builds without errors (lots of warnings though), but in the self-test the round-off errors at every single iteration are HUGE - in the millions (with all the tests run by -s m). |
Thanks, Victor - you clearly spent a lot of time running the full self-test ranges, is this an otherwise-idle AVX system of yours?
The huge-roundoff-errors-in-scalar-build issue sounds like a bad nearest-int emulation ... if you recompile just a single small file (say br.c) and add -DVERBOSE_HEADERS to the compile command, that will tell you which version of gcc's rint() is being used, e.g. on my Core macbook:
[code]In file included from ../br.c:23:
In file included from ../Mlucas.h:29:
In file included from ../align.h:29:
[b]../types.h:225:3: warning: #warning Using lrint() for DNINT [-W#warnings]
#warning Using lrint() for DNINT[/b][/code]
If you're interested, I can guide you in adding some simple printf's to the scalar-double-mode carry macro which could help pinpoint the issue. Do all radix combos at all FFT lengths in your build suffer the huge ROEs, or do some run OK? |
I haven't tried all the FFT sizes with the scalar build, but all the ones I tried failed. It happens when compiling with MINGW64 for Windows, which could introduce some strange behaviour. I wouldn't put it high on the priority list, as the AVX and SSE2 versions build successfully, and even an old Pentium4 has SSE2.
Anyway, I ran the verbose-header build on br.c:
[code] Victor@PCVICTOR MINGW64 /home/mlucas_v17.1-20180123/SCALAR
$ gcc -c -DVERBOSE_HEADERS br.c >& build.log[/code]
This is the resulting build.log:
[code]In file included from types.h:30:0,
                 from align.h:29,
                 from Mlucas.h:29,
                 from br.c:23:
platform.h:1518:3: warning: #warning platform.h: Defining both X64_ASM and X32_ASM [-Wcpp]
 #warning platform.h: Defining both X64_ASM and X32_ASM
   ^~~~~~~
In file included from align.h:29:0,
                 from Mlucas.h:29,
                 from br.c:23:
types.h:225:3: warning: #warning Using lrint() for DNINT [-Wcpp]
 #warning Using lrint() for DNINT
   ^~~~~~~
In file included from imul_macro.h:29:0,
                 from mi64.h:30,
                 from Mdata.h:31,
                 from carry.h:29,
                 from Mlucas.h:30,
                 from br.c:23:
imul_macro0.h:309:3: warning: #warning X86_64-type CPU detected [-Wcpp]
 #warning X86_64-type CPU detected
   ^~~~~~~
[/code] |
@Victor:
Thanks - same preprocessor diagnostics and lrint() inlining as in my Core2 64-bit OS X build, so you're probably right about the MinGW64-based scalar-code compilation introducing the strange behavior. As long as the SIMD builds work, it's a low-priority issue, to be sure. |
Hello! I tried to compile Mlucas on an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz with AVX-512, but without success. For now I don't have access to this system, but I would like to post my experience:
[CODE]processor : 55
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
stepping : 4
microcode : 0x200003a
cpu MHz : 2200.000
cache size : 19712 KB
siblings : 28
cpu cores : 14
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt spec_ctrl ibpb_support tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bogomips : 4404.53
address sizes : 46 bits physical, 48 bits virtual

[root@lorenzo3 src]# gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[root@lorenzo3 src]# gcc -c -O3 [B]-DUSE_AVX512 -march=skylake-avx512[/B] -DUSE_THREADS ../src/*.c >& build.log
[root@lorenzo3 src]# cat build.log
../src/br.c:1:0: error: [B]bad value (skylake-avx512) for -march= switch[/B]
 /*******************************************************************************
 ^
../src/dft_macro.c:1:0: error: bad value (skylake-avx512) for -march= switch
 /*******************************************************************************
 ^
[... the identical "bad value (skylake-avx512) for -march= switch" error is repeated for every remaining .c file in ../src, from factor.c through util.c ...][/CODE] |
Is that version of GCC too old to support avx512 on skylake xeons?
|
I'm pretty sure your gcc doesn't support -march=skylake-avx512, as I get similar output with -march=randomjunk.
You probably need to update to the latest gcc, I don't know if they backport tuning for new architectures. |
Skylake AVX-512 support was introduced in GCC 6, according to the release notes:
[url]https://gcc.gnu.org/gcc-6/changes.html[/url] |
It may not help with pre-v5 gcc, but it might be worth trying -march=knl instead ... the code's performance is dominated by my custom asm macros, which are all restricted to the AVX-512 Foundation-instructions subset supported by the KNL, so whether the compiler thinks it's compiling for skylake-avx512 or KNL shouldn't matter, as long as the arch flag enables avx512 instruction generation in the back end.
FYI, the version of gcc on the KNL I used for avx-512 code development is 5.4, i.e. pre-v6. |
It's no longer the newest version. Please change thread title.
|
[QUOTE=kriesel;528037]It's no longer the newest version. Please change thread title.[/QUOTE]
Alas, I don't have thread-edit permissions, not even in this particular subforum. I've e-mailed Mike to ask if that is intended. |
[QUOTE=ewmayer;528102]Alas, I don't have thread-edit permissions, not even in this particular subforum. I've e-mailed Mike to ask if that is intended.[/QUOTE]It now appears the V17.1 thread title and V18 sticky have been handled. (Now if Mike would just authorize 30-hour-day lengths, V19 could soon be ready, and repeat.)
|