mersenneforum.org  

Old 2017-05-13, 01:12   #56
ewmayer

David Stanfill (airsquirrels) kindly gave me a user account on his Ryzen in order to do Mlucas builds/tests using my current code snapshot which I am preparing for release. Here are the first 2 sets of timing results, unthreaded builds (what I call '0-thread' to differentiate from multithread-capable builds run with just 1 thread).

Note the radices in the rightmost columns are *complex* FFT radices, thus their product in each case equals one-half the real-vector length (in Kdoubles) in the leftmost column. There was no AMD-specific optimization involved - this is all code developed and tuned for Intel CPUs.
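
The radix relation is easy to sanity-check against any row of the tables. A minimal C sketch (the function name is illustrative and not part of Mlucas), assuming 1 Kdouble = 1024 doubles:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the product of the complex FFT radices equals one-half the
 * real-vector length, where the length is given in Kdoubles
 * (assumed here to mean 1024 doubles); returns 0 otherwise. */
int radices_match_fft_length(uint64_t fft_kdoubles,
                             const unsigned *radices, size_t n_radices)
{
    assert(radices != NULL || n_radices == 0);

    uint64_t product = 1;
    for (size_t i = 0; i < n_radices; ++i)
        product *= radices[i];

    /* Complex length = real length / 2 = (K * 1024) / 2 = K * 512. */
    return product == fft_kdoubles * 512;
}
```

For the 1024K row, 32 * 16 * 32 * 32 = 524288 = 1024 * 512, so the check passes; the same holds for the five-radix 6656K entry (52 16 16 16 16).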

[Edit: See ||-build notes below about the 100 iters used for these timings likely being insufficient]

Code:
Ryzen, AVX/0-thread:
  1024  msec/iter =  16.143230  ROE[avg,max] = [0.237048340, 0.269531250]  radices =  32 16 32 32
  1152  msec/iter =  18.393270  ROE[avg,max] = [0.273577009, 0.312500000]  radices =  36 16 32 32
  1280  msec/iter =  20.434270  ROE[avg,max] = [0.278939383, 0.343750000]  radices =  40 16 32 32
  1408  msec/iter =  23.969040  ROE[avg,max] = [0.311523438, 0.406250000]  radices =  44 16 32 32
  1536  msec/iter =  23.938600  ROE[avg,max] = [0.251722935, 0.281250000]  radices =  48 16 32 32
  1664  msec/iter =  28.809070  ROE[avg,max] = [0.308928571, 0.375000000]  radices =  52 16 32 32
  1792  msec/iter =  30.127000  ROE[avg,max] = [0.351534598, 0.437500000]  radices =  56 16 32 32
  1920  msec/iter =  33.393400  ROE[avg,max] = [0.297321429, 0.406250000]  radices =  60 16 32 32
  2048  msec/iter =  34.487110  ROE[avg,max] = [0.240848214, 0.281250000]  radices =  64 16 32 32
  2304  msec/iter =  40.226720  ROE[avg,max] = [0.249302455, 0.281250000]  radices =  36 32 32 32
  2560  msec/iter =  44.287860  ROE[avg,max] = [0.256849888, 0.312500000]  radices = 160 16 16 32
  2816  msec/iter =  50.539970  ROE[avg,max] = [0.281724330, 0.328125000]  radices = 176 16 16 32
  3072  msec/iter =  52.569620  ROE[avg,max] = [0.245962960, 0.281250000]  radices =  48 32 32 32
  3328  msec/iter =  60.861210  ROE[avg,max] = [0.316964286, 0.375000000]  radices =  52 32 32 32
  3584  msec/iter =  62.958160  ROE[avg,max] = [0.286432757, 0.343750000]  radices = 224 16 16 32
  3840  msec/iter =  69.900850  ROE[avg,max] = [0.253655134, 0.281250000]  radices = 240 16 16 32
  4096  msec/iter =  73.305030  ROE[avg,max] = [0.259765625, 0.312500000]  radices = 256 16 16 32
  4608  msec/iter =  82.375850  ROE[avg,max] = [0.279478237, 0.375000000]  radices = 288 16 16 32
  5120  msec/iter =  92.422200  ROE[avg,max] = [0.303348214, 0.375000000]  radices = 160 16 32 32
  5632  msec/iter = 103.692050  ROE[avg,max] = [0.287374442, 0.343750000]  radices = 176 16 32 32
  6144  msec/iter = 114.081960  ROE[avg,max] = [0.279017857, 0.312500000]  radices = 192 16 32 32
  6656  msec/iter = 141.714380  ROE[avg,max] = [0.347767857, 0.375000000]  radices =  52 16 16 16 16
  7168  msec/iter = 131.530090  ROE[avg,max] = [0.286830357, 0.328125000]  radices = 224 16 32 32
  7680  msec/iter = 140.589520  ROE[avg,max] = [0.265318080, 0.312500000]  radices = 240 16 32 32
Code:
Ryzen, AVX2/0-thread:
  1024  msec/iter =  14.473480  ROE[avg,max] = [0.249674770, 0.312500000]  radices =  32 16 32 32
  1152  msec/iter =  16.941660  ROE[avg,max] = [0.304101562, 0.375000000]  radices =  36 16 32 32
  1280  msec/iter =  18.400400  ROE[avg,max] = [0.285825893, 0.375000000]  radices =  40 16 32 32
  1408  msec/iter =  21.812400  ROE[avg,max] = [0.299107143, 0.375000000]  radices =  44 16 32 32
  1536  msec/iter =  22.641650  ROE[avg,max] = [0.264965820, 0.312500000]  radices =  48 16 32 32
  1664  msec/iter =  26.051310  ROE[avg,max] = [0.303417969, 0.375000000]  radices =  52 16 32 32
  1792  msec/iter =  27.311240  ROE[avg,max] = [0.305301339, 0.375000000]  radices =  56 16 32 32
  1920  msec/iter =  30.567500  ROE[avg,max] = [0.323883929, 0.437500000]  radices =  60 16 32 32
  2048  msec/iter =  31.450460  ROE[avg,max] = [0.258858817, 0.312500000]  radices =  64 16 32 32
  2304  msec/iter =  35.497940  ROE[avg,max] = [0.365848214, 0.437500000]  radices = 144 16 16 32
  2560  msec/iter =  39.911440  ROE[avg,max] = [0.294642857, 0.375000000]  radices =  40 32 32 32
  2816  msec/iter =  46.300510  ROE[avg,max] = [0.286802455, 0.343750000]  radices = 176 16 16 32
  3072  msec/iter =  48.691550  ROE[avg,max] = [0.235825893, 0.281250000]  radices =  48 32 32 32
  3328  msec/iter =  55.515420  ROE[avg,max] = [0.278913225, 0.343750000]  radices = 208 16 16 32
  3584  msec/iter =  55.566890  ROE[avg,max] = [0.286143276, 0.328125000]  radices = 224 16 16 32
  3840  msec/iter =  62.801760  ROE[avg,max] = [0.288204520, 0.347656250]  radices = 240 16 16 32
  4096  msec/iter =  64.375370  ROE[avg,max] = [0.295214844, 0.343750000]  radices = 256 16 16 32
  4608  msec/iter =  72.954530  ROE[avg,max] = [0.311607143, 0.375000000]  radices = 288 16 16 32
  5120  msec/iter =  82.275550  ROE[avg,max] = [0.306975446, 0.375000000]  radices = 160 16 32 32
  5632  msec/iter =  95.040700  ROE[avg,max] = [0.255600412, 0.281250000]  radices = 176 16 32 32
  6144  msec/iter = 103.228320  ROE[avg,max] = [0.273018973, 0.343750000]  radices = 192 16 32 32
  6656  msec/iter = 115.045360  ROE[avg,max] = [0.268750000, 0.312500000]  radices = 208 16 32 32
  7168  msec/iter = 114.919310  ROE[avg,max] = [0.273074777, 0.312500000]  radices = 224 16 32 32
  7680  msec/iter = 128.601060  ROE[avg,max] = [0.289223807, 0.343750000]  radices = 240 16 32 32

Last fiddled with by ewmayer on 2017-05-14 at 04:47
Old 2017-05-13, 06:34   #57
ewmayer

Here are benchmark timings for multithreaded builds of Mlucas on Ryzen. Some notes:

1. My above 'unthreaded' timings were for 100-iteration runs. It seems that was insufficient on Ryzen, because when I went to 1000-iter timings to allow for the timing decreases which accompany use of more than 1 thread, even the 1-thread timings drop significantly versus the 100-iteration ones. For example, the per-iteration time for the AVX build @7168K drops from the 131 msec in the unthreaded-build-100-iter table to just 91 msec in the 1-thread column of the threaded-build-1000-iter table which follows.

2. Again due to the deeper 1000-iter runs, the roundoff errors captured in the table are larger. It's clear that I also need to fiddle my timing-test code to omit results having ROEs appreciably > 0.4 from the best-radix-set entries that get printed to the mlucas.cfg file. 0.40625 is probably OK (though maybe not for 100-iter runs), but 0.4375 is dangerously high, and e.g. 0.46875 is "right out", as the Monty Pythons would say. [Cf. Holy Hand Grenade scene in MP & The Holy Grail.]

3. Mlucas allows non-power-of-2 threadcounts but greatly prefers the power-of-2 ones, so I only did the latter.

4. AMD apparently has a different core numbering scheme than Intel - when I ran the first 2-thread benchmarks using the '-nthread 2' option, which sets affinities to cores 0 and 1, the timings were slower than 1-thread. Using the new-in-the-coming-release -cpu option I forced affinities to cores 0 and 2 via '-cpu 0,2', and got the expected 2-thread speedup. For 4 and 8-threads I used '-cpu 0:7:2' [equivalent to '-cpu 0,2,4,6'] and '-cpu 0:15:2' [equivalent to '-cpu 0,2,4,6,8,10,12,14'], respectively.

5. The 8-thread timings, especially for the smaller FFT lengths, are likely pessimistic, since startup overhead is non-negligible for that many threads even using 1000 iterations.
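
To make the 'lo:hi:stride' notation from note 4 concrete, here is a rough C sketch that expands one such triplet into a core list. This is a hypothetical helper written for illustration; it is not Mlucas's own -cpu parser, and it handles only the three-field form, not comma-separated lists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (NOT the Mlucas parser): expands a single
 * "lo:hi:stride" triplet such as "0:7:2" into lo, lo+stride, ... (<= hi).
 * Writes at most max_cores entries into cores[] and returns the number
 * written, or -1 if the spec is malformed. */
int expand_cpu_triplet(const char *spec, int *cores, size_t max_cores)
{
    assert(spec != NULL && cores != NULL);

    int lo, hi, stride;
    char trailing;
    /* Exactly three integer fields must match; a 4th conversion means
     * there was trailing junk after the stride. */
    if (sscanf(spec, "%d:%d:%d%c", &lo, &hi, &stride, &trailing) != 3)
        return -1;
    if (lo < 0 || hi < lo || stride <= 0)
        return -1;

    size_t n = 0;
    for (int c = lo; c <= hi && n < max_cores; c += stride)
        cores[n++] = c;
    return (int)n;
}
```

With this sketch, "0:7:2" yields cores 0,2,4,6 and "0:15:2" yields the eight even-numbered cores 0 through 14, matching the equivalences given in note 4.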

Ryzen, AVX build, msec/iter vs FFT length (Kdouble) for various threadcounts:
Code:
FFTlen	1-thr	2-thr	4-thr	8-thr
  1024	11.67	 6.24	 3.77	 2.40	ROE[avg,max] = [0.242096600, 0.312500000]
  1152	13.47	 7.14	 4.13	 2.96	ROE[avg,max] = [0.275115778, 0.375000000]
  1280	14.81	 7.88	 4.64	 3.24	ROE[avg,max] = [0.284061770, 0.406250000]
  1408	16.94	 8.88	 5.26	 3.51	ROE[avg,max] = [0.310743194, 0.468750000]
  1536	17.75	 9.34	 5.20	 3.57	ROE[avg,max] = [0.252182723, 0.343750000]
  1664	20.45	10.74	 6.30	 4.29	ROE[avg,max] = [0.310800580, 0.406250000]
  1792	21.24	11.13	 6.10	 4.20	ROE[avg,max] = [0.348934528, 0.468750000]
  1920	23.62	12.28	 6.87	 4.74	ROE[avg,max] = [0.295699098, 0.406250000]
  2048	24.02	12.59	 6.95	 4.83	ROE[avg,max] = [0.248437626, 0.320312500]
  2304	27.91	14.72	 7.96	 5.43	ROE[avg,max] = [0.248899291, 0.312500000]
  2560	30.55	16.07	 8.90	 5.94	ROE[avg,max] = [0.302806862, 0.375000000]
  2816	34.92	18.18	10.09	 6.61	ROE[avg,max] = [0.284329255, 0.375000000]
  3072	36.52	19.12	10.83	 7.23	ROE[avg,max] = [0.244108896, 0.312500000]
  3328	42.00	22.02	12.74	 8.42	ROE[avg,max] = [0.316897552, 0.437500000]
  3584	43.51	22.70	12.51	 8.34	ROE[avg,max] = [0.289033555, 0.437500000]
  3840	48.30	25.20	13.63	 8.83	ROE[avg,max] = [0.301240335, 0.375000000]
  4096	50.49	26.29	14.41	10.01	ROE[avg,max] = [0.293798325, 0.437500000]
  4608	57.54	29.73	16.26	10.75	ROE[avg,max] = [0.301216173, 0.406250000]
  5120	64.50	33.36	18.01	12.04	ROE[avg,max] = [0.321669620, 0.406250000]
  5632	72.71	37.62	20.33	13.39	ROE[avg,max] = [0.284785005, 0.375000000]
  6144	77.17	40.38	22.42	14.81	ROE[avg,max] = [0.254623948, 0.343750000]
  6656	88.26	46.11	27.05	18.87	ROE[avg,max] = [0.353221649, 0.437500000]
  7168	90.96	47.11	25.80	16.98	ROE[avg,max] = [0.289598351, 0.375000000]
  7680	99.00	50.93	27.62	18.63	ROE[avg,max] = [0.267126056, 0.437500000]
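
The ROE filter proposed in note 2 could look something like the C sketch below. The 0.40625 cutoff is an assumption read off that note ("probably OK", with 0.4375 called dangerously high); the names are illustrative and nothing here is taken from the actual Mlucas timing-test code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cutoff, taken from note 2: 0.40625 is tolerated,
 * 0.4375 and above is not. */
#define MAX_ACCEPTABLE_ROE 0.40625

/* Returns 1 if a radix set's maximum roundoff error is acceptable
 * for recording as a best-radix-set entry, 0 otherwise. */
int roe_acceptable(double max_roe)
{
    return max_roe <= MAX_ACCEPTABLE_ROE;
}

/* Counts how many entries in max_roe[] would be omitted by the filter. */
size_t count_rejected(const double *max_roe, size_t n)
{
    assert(max_roe != NULL || n == 0);

    size_t rejected = 0;
    for (size_t i = 0; i < n; ++i)
        if (!roe_acceptable(max_roe[i]))
            ++rejected;
    return rejected;
}
```

Applied to the max-ROE column of the AVX table above, this cutoff would drop the entries at 1408K and 1792K (0.46875) and at 3328K, 3584K, 4096K, 6656K and 7680K (0.4375), i.e. 7 of the 24 FFT lengths would need a different radix set.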
Old 2017-05-13, 09:43   #58
ewmayer

Ryzen, AVX2/FMA3 build, msec/iter vs FFT length (Kdouble) for various threadcounts:
Code:
FFTlen	1-thr	2-thr	4-thr	8-thr
  1024	10.42	 5.34	 3.36	 2.20	ROE[avg,max] = [0.249404939, 0.328125000]
  1152	12.14	 6.40	 3.72	 2.80	ROE[avg,max] = [0.302253644, 0.375000000]
  1280	13.23	 6.84	 4.07	 2.88	ROE[avg,max] = [0.285753262, 0.375000000]
  1408	15.40	 8.01	 4.87	 3.09	ROE[avg,max] = [0.300879913, 0.375000000]
  1536	15.96	 8.31	 4.80	 3.11	ROE[avg,max] = [0.265940841, 0.375000000]
  1664	18.57	 9.60	 5.64	 3.92	ROE[avg,max] = [0.310388813, 0.406250000]
  1792	18.67	 9.77	 5.47	 3.83	ROE[avg,max] = [0.310203065, 0.437500000]
  1920	21.53	11.29	 6.25	 4.26	ROE[avg,max] = [0.324257007, 0.437500000]
  2048	21.68	11.34	 6.39	 4.39	ROE[avg,max] = [0.241334140, 0.312500000]
  2304	25.47	13.26	 7.37	 5.02	ROE[avg,max] = [0.234688230, 0.281250000]
  2560	27.57	14.42	 8.03	 5.35	ROE[avg,max] = [0.297289787, 0.406250000]
  2816	32.14	16.67	 9.26	 6.24	ROE[avg,max] = [0.241656117, 0.343750000]
  3072	33.18	17.27	 9.93	 6.80	ROE[avg,max] = [0.234802388, 0.289062500]
  3328	38.69	20.04	10.95	 7.34	ROE[avg,max] = [0.308062178, 0.375000000]
  3584	39.13	20.18	11.10	 7.49	ROE[avg,max] = [0.287800268, 0.375000000]
  3840	44.07	22.50	12.36	 8.48	ROE[avg,max] = [0.288700568, 0.355468750]
  4096	44.67	23.29	13.33	 9.33	ROE[avg,max] = [0.284906635, 0.359375000]
  4608	51.83	26.58	14.70	 9.78	ROE[avg,max] = [0.294995369, 0.375000000]
  5120	56.91	29.36	16.57	11.11	ROE[avg,max] = [0.340822043, 0.437500000]
  5632	66.01	34.16	18.99	12.39	ROE[avg,max] = [0.296337954, 0.406250000]
  6144	68.74	35.72	20.63	13.88	ROE[avg,max] = [0.303176707, 0.390625000]
  6656	79.48	40.54	22.12	15.02	ROE[avg,max] = [0.270511965, 0.375000000]
  7168	80.03	40.97	23.16	15.74	ROE[avg,max] = [0.272298848, 0.343750000]
  7680	89.57	45.75	25.08	17.17	ROE[avg,max] = [0.287253405, 0.375000000]
In particular note the AVX2-mode 2816K timings - 2-threaded I benchmark at 16.7 msec/iter. After my benchmarks finished I fired up 4 exponents near the upper limit of 53.8M for 2816K. With all four 2-threaded jobs running and thus all 8 physical cores busy, I get ~20 msec/iter for each of the 4 side-by-side runs. Will play with thread counts and affinities some more in coming days to see if I can improve on that.
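
For comparing the two ways of using all 8 cores, total throughput is the more useful figure. A small C sketch of the arithmetic (function name is illustrative), assuming all jobs run at the same FFT length and per-iteration time:

```c
#include <assert.h>

/* Total iterations per second delivered by n_jobs identical jobs,
 * each needing msec_per_iter milliseconds per iteration. */
double aggregate_iters_per_sec(int n_jobs, double msec_per_iter)
{
    assert(n_jobs > 0 && msec_per_iter > 0.0);
    return n_jobs * 1000.0 / msec_per_iter;
}
```

Four 2-thread jobs at ~20 msec/iter give 4 * 1000 / 20 = 200 iter/sec in total, whereas a single 8-thread job at the 6.24 msec/iter listed for 2816K in the table gives about 160 iter/sec. So even with the slowdown from 16.7 to ~20 msec/iter, the four-job setup comes out ahead on total throughput at this FFT length.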

Off to bed ...

Last fiddled with by ewmayer on 2017-05-14 at 04:48
Old 2017-05-13, 14:54   #59
Mark Rose
 

Quote:
Originally Posted by ewmayer:
Here are benchmark timings for multithreaded builds of Mlucas on Ryzen. Some notes:

1. My above 'unthreaded' timings were for 100-iteration runs. It seems that was insufficient on Ryzen, because when I went to 1000-iter timings to allow for the timing decreases which accompany use of more than 1 thread, even the 1-thread timings drop significantly versus the 100-iteration ones. For example, the per-iteration time for the AVX build @7168K drops from the 131 msec in the unthreaded-build-100-iter table to just 91 msec in the 1-thread column of the threaded-build-1000-iter table which follows.

Zen uses a neural network in its branch predictor. Many people benchmarking Ryzen when it came out found that second, third, and subsequent runs often gave better times. You may get still better timings with longer runs.
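
One common way to keep start-up effects out of a per-iteration figure is to discard the first few samples before averaging. A minimal C sketch of that idea follows; it is an illustration only, not how Mlucas computes its msec/iter numbers, and the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of msec[warmup .. n-1], i.e. the average per-iteration time after
 * skipping the first `warmup` samples. Returns -1.0 if no samples remain. */
double mean_after_warmup(const double *msec, size_t n, size_t warmup)
{
    assert(msec != NULL || n == 0);

    if (warmup >= n)
        return -1.0;

    double sum = 0.0;
    for (size_t i = warmup; i < n; ++i)
        sum += msec[i];
    return sum / (double)(n - warmup);
}
```

With made-up samples {30, 20, 10, 10} msec, the plain mean is 17.5 msec while the mean after skipping 2 warm-up samples is 10 msec, which shows how a short run can overstate the steady-state time.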
Old 2017-12-29, 03:01   #60
ewmayer

In relation to the verify runs of the new M-prime candidate, forumites Andreas Höglund [ATH] and Gord Palameta [GP2] both hit errors in building 17.1 for avx-512 - turns out some preprocessor-logic I added in relation to supporting ARMv8 SIMD (see the "ARM builds..." thread) broke an assumption implicit in several of the carry-radix files when built in avx-512 mode. Clearly, I need to do more thorough QA work going forward.

Patched 17.1 version has been successfully built by Andreas and uploaded by me.
Old 2017-12-29, 03:25   #61
chalsall

Quote:
Originally Posted by Mark Rose:
Zen uses a neural network in its branch predictor.

There was a wonderful quote on BBC radio: "I'm less concerned about Artificial Intelligence than Artificial Stupidity."
Old 2017-12-29, 08:27   #62
heliosh
 

Compiling Mlucas 17.1 fails on my Raspberry Pi 3 running Raspbian Stretch:

Code:
../src/util.c: In function ‘has_asimd’:
../src/util.c:1806:16: error: ‘HWCAP_ASIMD’ undeclared (first use in this function)
   if (hwcaps & HWCAP_ASIMD) {
                ^~~~~~~~~~~
../src/util.c:1806:16: note: each undeclared identifier is reported only once for each function it appears in
Any hint what might be wrong?
Old 2017-12-29, 09:11   #63
ldesnogu
 

Is raspbian 64-bit? If not then it's possible HWCAP_ASIMD might not be defined.
Old 2017-12-29, 09:28   #64
heliosh
 

No, it's 32-bit. I've read that Raspbian is sticking with 32-bit, so there will be no 64-bit Raspbian in the near future.
Old 2017-12-29, 11:31   #65
ET_

Quote:
Originally Posted by heliosh:
No it's 32-Bit. I've read that Raspbian sticks to 32-Bit, so no 64-Bit Raspbian in near future.

I am using 64-bit Gentoo from here:
https://github.com/sakaki-/gentoo-on-rpi3-64bit

It let me compile Mlucas, and it helped another forumite do the same.
Old 2017-12-29, 23:26   #66
ewmayer

Quote:
Originally Posted by ldesnogu:
Is raspbian 64-bit? If not then it's possible HWCAP_ASIMD might not be defined.

Yes, that reminds me that I should have included the patch for this issue - already in my dev-branch code as of a few months ago - in the patched 17.1 tarball, but I posted the latter specifically to fix build issues for avx-512 code. @heliosh: Quick patch - in util.c, replace the has_asimd() function (at line 1882) with the following one, which adds a bit of preprocessor-hackery:
Code:
	int has_asimd(void)
	{
		unsigned long hwcaps = getauxval(AT_HWCAP);
	#ifndef HWCAP_ASIMD	// This is not def'd on pre-ASIMD platforms
		const unsigned long HWCAP_ASIMD = 0;
	#endif
		if (hwcaps & HWCAP_ASIMD) {
			return 1;
		}
		return 0;
	}
Will post updated patched tarball shortly - I want to be ready in case we get a bunch of new downloader/builders as a result of the imminent new-prime announcement.
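
Why the patch works: when the system headers do not define HWCAP_ASIMD, the mask becomes 0, so (hwcaps & 0) is always 0 and the function simply reports "no ASIMD" instead of failing to compile. The standalone C sketch below shows the same idea with a fallback macro instead of a local constant; the names HWCAP_ASIMD_OR_ZERO and has_asimd_sketch are illustrative, and the non-Linux branch is an added assumption so the snippet builds anywhere.

```c
#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#endif

/* Use the real flag where the headers provide it, else a zero mask. */
#ifdef HWCAP_ASIMD
#define HWCAP_ASIMD_OR_ZERO HWCAP_ASIMD
#else
#define HWCAP_ASIMD_OR_ZERO 0UL   /* flag unknown here: never matches */
#endif

/* Returns 1 if the kernel reports ASIMD support, 0 otherwise
 * (including on platforms where the flag is not defined at all). */
int has_asimd_sketch(void)
{
#if defined(__linux__)
    unsigned long hwcaps = getauxval(AT_HWCAP);
    return (hwcaps & HWCAP_ASIMD_OR_ZERO) != 0;
#else
    return 0;   /* assumption: no auxv-based detection off Linux */
#endif
}
```

On a 32-bit Raspbian build this should compile and return 0, which is the behaviour the quick patch above is aiming for.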

Last fiddled with by ewmayer on 2017-12-29 at 23:27

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
