mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Mlucas (https://www.mersenneforum.org/forumdisplay.php?f=118)
-   -   Mlucas version 17.1 (https://www.mersenneforum.org/showthread.php?t=2977)

ewmayer 2014-09-20 00:10

[QUOTE=ldesnogu;348115]It looks like roundpd is an SSE4.1 instruction which your Opteron 6124 doesn't seem to support (it's not part of SSE4a; see [URL="http://en.wikipedia.org/wiki/SSE4"]Wikipedia[/URL]). I guess Ernst will have to explain why he pretends that Mlucas is an SSE2 program :smile:[/QUOTE]

Been too long since I visited here, but note that the use of roundpd has been purged from the Mersenne-mod carry macros (the Fermat-mod ones still have it, but I'm the only one using those) in "SSE2"-build-mode in all recent releases, and this will remain so. AVX mode of course makes free use of vroundpd, since there is no "which flavor of AVX do you have?" issue w.r.t. that instruction.

ewmayer 2014-12-13 05:05

V14.1 is available - details via the readme-file link in the opening post.

LaurV 2014-12-13 06:02

How does the newer version compare with P95? I mean, I have read your "less than two times slower" stuff there, but I assume that is a figure of speech...

(hey, I am the guy who DC-ed Mike's work, remember? :razz:)

ewmayer 2014-12-13 07:39

[QUOTE=LaurV;389911]How does the newer version compare with P95? I mean, I have read your "less than two times slower" stuff there, but I assume that is a figure of speech...[/QUOTE]

Here are 4-thread per-iteration timings on my Haswell 4670K/ddr3-2400, all running at stock. These are all ~2% pessimistic due to startup/shutdown time (e.g. I get 8.2 msec/iter running @ 3072K in production mode):
[code]FFT(K) msec/iter (4-threaded)
---- ---------
1024 2.65
1152 3.15
1280 3.43
1408 4.01
1536 4.19
1664 4.61
1792 4.81
1920 5.29
2048 5.35
2304 6.07
2560 6.51
2816 7.54
3072 8.40
3328 8.74
3584 9.13
3840 10.16
4096 10.54
4608 11.98
5120 13.80
5632 15.92
6144 17.54
6656 18.62
7168 19.69
7680 22.00[/code]
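
As a rough sketch of the ~2% correction mentioned above: the 0.02 overhead fraction is only an approximation taken from the "~2% pessimistic" remark, and `adjust_for_overhead` is a hypothetical helper name, not anything in Mlucas.

```python
# Short benchmark runs include startup/shutdown time, so they overstate the
# steady-state per-iteration cost. Assumption: the overhead is about 2%.
OVERHEAD_FRACTION = 0.02

def adjust_for_overhead(msec_per_iter, overhead=OVERHEAD_FRACTION):
    """Estimate a production-mode timing from a short benchmark timing."""
    return msec_per_iter * (1.0 - overhead)

# A few rows copied from the table above: FFT length (K) -> msec/iter
short_run = {1024: 2.65, 2048: 5.35, 3072: 8.40, 4096: 10.54}

estimates = {k: round(adjust_for_overhead(t), 2) for k, t in short_run.items()}
for k, t in estimates.items():
    print(f"{k:5d}K  ~{t:.2f} msec/iter")
```

This is only an estimate; the real overhead presumably varies with FFT length and iteration count.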

ldesnogu 2014-12-13 12:32

For comparison, [URL]http://mersenneforum.org/showpost.php?p=382227&postcount=633[/URL]
i5-4670K @ 3.8 GHz, Dual DDR3 1600
[code]Best time for 1024K FFT length: 1.336 ms., avg: 1.374 ms.
Best time for 1280K FFT length: 1.839 ms., avg: 1.865 ms.
Best time for 1536K FFT length: 2.333 ms., avg: 2.370 ms.
Best time for 1792K FFT length: 2.833 ms., avg: 3.277 ms.
Best time for 2048K FFT length: 3.350 ms., avg: 3.374 ms.
Best time for 2560K FFT length: 4.239 ms., avg: 4.276 ms.
Best time for 3072K FFT length: 5.124 ms., avg: 5.155 ms.
Best time for 3584K FFT length: 6.006 ms., avg: 6.042 ms.
Best time for 4096K FFT length: 6.970 ms., avg: 7.000 ms.
Best time for 5120K FFT length: 8.705 ms., avg: 8.745 ms.
Best time for 6144K FFT length: 10.496 ms., avg: 10.543 ms.
Best time for 7168K FFT length: 12.371 ms., avg: 12.451 ms.
Best time for 8192K FFT length: 14.673 ms., avg: 14.735 ms.[/code]

Nice result for Mlucas, congratulations :)

ewmayer 2014-12-13 22:12

[QUOTE=ldesnogu;389947]For comparison, [URL]http://mersenneforum.org/showpost.php?p=382227&postcount=633[/URL]
i5-4670K @ 3.8 GHz, Dual DDR3 1600
Nice result for Mlucas, congratulations :)[/QUOTE]

Thanks - a lot of work went into that "within a factor of 2x". My system runs @3.3GHz (slower than above) but with ddr3-2400 (faster), so not sure how those 2 differences net out. I've been using [url=http://www.mersenneforum.org/showpost.php?p=343173&postcount=99]George's early Haswell results[/url] as my guide, since we bought identical hardware (Mobo, CPU, RAM) and those timings were before George OCed his system. I apply a 15% reduction to his timings, since he says that's roughly what he gained from use of FMA.

BTW, if anyone has access to a Broadwell system running Linux (or MinGW64 under Windoze), I'd very much appreciate timings on such, and have some special preprocessor-flags-to-try-for-Broadwell, as well.

ldesnogu 2014-12-14 11:03

[QUOTE=ewmayer;389973]My system runs @3.3GHz (slower than above) but with ddr3-2400 (faster), so not sure how those 2 differences net out.[/QUOTE]
Do you mean your system is underclocked? Because the 4670K is supposed to run at a 3.4 GHz base clock with turbo at 3.8 GHz (and I suppose the benchmark poster just stated the turbo speed, which might be a wrong assumption...).

[quote]I've been using [URL="http://www.mersenneforum.org/showpost.php?p=343173&postcount=99"]George's early Haswell results[/URL] as my guide, since we bought identical hardware (Mobo, CPU, RAM) and those timings were before George OCed his system. I apply a 15% reduction to his timings, since he says that's roughly what he gained from use of FMA.[/quote]Silly question: why don't you run the latest Prime95 benchmark on your system?

ldesnogu 2014-12-14 12:05

I gave Mlucas a try on my i7-4770K.
[code]gcc -c -Os -m64 -DUSE_AVX2 -DUSE_THREADS *.c
rm -f rng*.o util.o qfloat.o
gcc -c -O1 -m64 -DUSE_AVX2 -DUSE_THREADS rng*.c util.c qfloat.c
gcc -o Mlucas *.o -lm -lpthread -lrt
./Mlucas -fftlen 192 -iters 100 -radset 0 -nthread 2
...
100 iterations of M3888517 with FFT length 196608 = 192 K
Res64: 579D593FCE0707B2. AvgMaxErr = 0.274916295. MaxErr = 0.343750000. Program: E14.1
Res mod 2^36 = 67881076658
Res mod 2^35 - 1 = 21674900403
Res mod 2^36 - 1 = 42893438228[/code]The README page says this should be output:
[code]
This particular testcase should produce the following 100-iteration residues,
with some platform-dependent variability in the roundoff errors :

100 iterations of M3888509 with FFT length 196608 = 192 K
Res64: 71E61322CCFB396C. AvgMaxErr = 0.226967076. MaxErr = 0.281250000. Program: E3.0x
Res mod 2^36 = 12028950892
Res mod 2^35 - 1 = 29259839105
Res mod 2^36 - 1 = 50741070790[/code]

I guess the README should be updated.
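
When two self-test summaries disagree like this, one quick check is whether they refer to the same exponent and FFT length before comparing residues. The following is a minimal sketch that parses the two outputs shown above; the regular expressions are assumptions inferred from these two samples only, and `parse_result` and `compare` are hypothetical helper names.

```python
import re

# Summary lines copied from the two outputs above.
observed = """100 iterations of M3888517 with FFT length 196608 = 192 K
Res64: 579D593FCE0707B2. AvgMaxErr = 0.274916295. MaxErr = 0.343750000. Program: E14.1"""
expected = """100 iterations of M3888509 with FFT length 196608 = 192 K
Res64: 71E61322CCFB396C. AvgMaxErr = 0.226967076. MaxErr = 0.281250000. Program: E3.0x"""

def parse_result(text):
    """Extract iteration count, exponent, FFT length and Res64 from a summary."""
    head = re.search(
        r"(\d+) iterations of M(\d+) with FFT length (\d+) = (\d+) K", text)
    res = re.search(r"Res64: ([0-9A-F]{16})", text)
    if head is None or res is None:
        raise ValueError("unrecognized summary format")
    return {
        "iters": int(head.group(1)),
        "exponent": int(head.group(2)),
        "fft_doubles": int(head.group(3)),
        "fft_k": int(head.group(4)),
        "res64": res.group(1),
    }

def compare(a, b):
    """Only compare residues when both runs tested the same number."""
    if (a["exponent"], a["iters"]) != (b["exponent"], b["iters"]):
        return "different test: residues are not comparable"
    return "match" if a["res64"] == b["res64"] else "residue mismatch"

obs, exp = parse_result(observed), parse_result(expected)
verdict = compare(obs, exp)
print(obs["exponent"], exp["exponent"], verdict)
```

Here the two runs use different exponents, so the differing Res64 values do not by themselves indicate a computation error.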

How do you get an output similar to Prime95 benchmark?

ewmayer 2014-12-17 02:30

[QUOTE=ldesnogu;390014]I gave Mlucas a try on my i7-4770K.
I guess the README should be updated.[/QUOTE]
Ah, good catch - if you look closely you see the 2 exponents are slightly different (3888517 is the next-larger prime above 3888509). I believe I must have changed the self-test exponent computation formula sometime in the last year or so to take p as the smallest prime >= the number given by my continuous-function max_p(FFT length) formula, rather than the largest prime <= same. If you force a non-default self-test p via

./Mlucas -m 3888509 -fftlen 192 -iters 100 -radset 0 -nthread 2

you will see the result indicated on the webpage (which I have since corrected). Thanks for the catch.
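
To illustrate the difference between the two selection rules (largest prime <= bound versus smallest prime >= bound), here is a small sketch. The bound 3888512 is a made-up value between the two exponents, not the actual output of the max_p(FFT length) formula, and the trial-division helpers are for illustration only.

```python
def is_prime(n):
    """Trial division; adequate for exponents of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def largest_prime_le(x):
    """Largest prime p with p <= x (the older rule described above)."""
    if x < 2:
        raise ValueError("no prime <= x")
    n = x
    while not is_prime(n):
        n -= 1
    return n

def smallest_prime_ge(x):
    """Smallest prime p with p >= x (the newer rule described above)."""
    n = max(x, 2)
    while not is_prime(n):
        n += 1
    return n

bound = 3888512  # hypothetical stand-in for the max_p value at 192K
print(largest_prime_le(bound), smallest_prime_ge(bound))
```

With any bound strictly between the two primes, the two rules pick the two different exponents seen in this exchange.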

[QUOTE]How do you get an output similar to Prime95 benchmark?[/QUOTE]
George and I do our self-tests differently ... If you want best-FFT-params (as determined by these self-tests) timings for the range of FFT lengths relevant to current GIMPS 'wavefront' and DC work, pause any other CPU-heavy tasks on your system and run the 'medium' self-test range:

./Mlucas -s m -iters 1000

1000 iters gives cleaner timings (and better roundoff testing) than the "quick look" 100-iter tests. With no #threads specified the code will use all the physical cores on your system. The README page discusses all this stuff.

ewmayer 2014-12-21 07:34

[QUOTE=ldesnogu;390009]Do you mean your system is underclocked? Because the 4670K is supposed to run at a 3.4 GHz base clock with turbo at 3.8 GHz (and I suppose the benchmark poster just stated the turbo speed, which might be a wrong assumption...).[/QUOTE]
Ah, I mis-wrote - clock is indeed 3.40 GHz. Perusing the BIOS boot-menu info, I have Turbo Boost enabled (and Enhanced Turbo = [Auto], whatever that means). As I had not recently tried toggling Turbo Boost I tried disabling it - the current Mlucas build runs at the same speed (within the usual noise-based error bars) that way, so it seems to make no difference for my code, at least on my setup.

[QUOTE]Silly question: why don't you run the latest Prime95 benchmark on your system?[/QUOTE]
Like you say, 'tis a silly question. :)

Here are the 4-threaded results for my Haswell system:
[i]
[Worker #1 Dec 19 16:21] Timing FFTs using 4 threads.
[Worker #1 Dec 19 16:21] Timing 39 iterations of 1024K FFT length. Best time: 1.293 ms., avg time: 1.344 ms.
[Worker #1 Dec 19 16:21] Timing 31 iterations of 1280K FFT length. Best time: 1.825 ms., avg time: 1.850 ms.
[Worker #1 Dec 19 16:21] Timing 26 iterations of 1536K FFT length. Best time: 1.993 ms., avg time: 2.305 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 1792K FFT length. Best time: 2.317 ms., avg time: 2.356 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 2048K FFT length. Best time: 2.766 ms., avg time: 2.785 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 2560K FFT length. Best time: 3.462 ms., avg time: 3.500 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 3072K FFT length. Best time: 4.141 ms., avg time: 4.190 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 3584K FFT length. Best time: 4.957 ms., avg time: 5.009 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 4096K FFT length. Best time: 5.639 ms., avg time: 5.722 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 5120K FFT length. Best time: 7.151 ms., avg time: 7.202 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 6144K FFT length. Best time: 8.471 ms., avg time: 8.639 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 7168K FFT length. Best time: 10.197 ms., avg time: 10.272 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 8192K FFT length. Best time: 11.917 ms., avg time: 11.952 ms.
[/i]
Now assembling the average times for 4-threaded Prime95 and Mlucas (update of previous table, now using 10000-iter timings run after a reboot, right after which I ran the above Prime95 timing test) at the above FFT lengths (plus the intermediate radix-9/11/13/15-based ones supported by Mlucas) and supplementing with the resulting [Mlucas/Prime95] timing ratio (for cases where the FFT length in question is not supported by Prime95, use its timing at the next-higher length as the denominator):
[code]
FFTlen   Prime95     Mlucas      Timing Ratio
(Kdbl)   msec/iter   msec/iter   [Mlucas/P95]
------   ---------   ---------   ------------
  1024       1.344        2.60           1.93
  1152                    3.13           1.69
  1280       1.850        3.56           1.92
  1408                    3.98           1.73
  1536       2.305        4.02           1.74
  1664                    4.63           1.97
  1792       2.356        4.70           1.99
  1920                    5.29           1.90
  2048       2.785        5.29           1.90
  2304                    6.00           1.71
  2560       3.500        6.44           1.84
  2816                    7.47           1.78
  3072       4.190        8.25           1.97
  3328                    8.84           1.76
  3584       5.009        9.02           1.80
  3840                   10.06           1.76
  4096       5.722       10.46           1.83
  4608                   11.78           1.64
  5120       7.202       13.47           1.87
  5632                   15.52           1.80
  6144       8.639       17.40           2.01
  6656                   18.48           1.80
  7168      10.272       19.02           1.85
  7680                   21.49           1.80
  8192      11.952       22.33           1.87[/code]
So George still kicks my butt, but now maybe with just one leg, rather than both. :)
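
For anyone wanting to reproduce the ratio column, here is a minimal sketch of the rule described above: divide the Mlucas timing by the Prime95 timing at the same FFT length, or at the next-higher Prime95-supported length when there is no exact match. Only the first few table rows are included, and `timing_ratio` is a hypothetical helper name.

```python
import bisect

# Average msec/iter copied from the first rows of the table above.
prime95 = {1024: 1.344, 1280: 1.850, 1536: 2.305}
mlucas = {1024: 2.60, 1152: 3.13, 1280: 3.56, 1408: 3.98, 1536: 4.02}

P95_LENGTHS = sorted(prime95)

def timing_ratio(fft_k):
    """Mlucas/Prime95 ratio, using the next-higher Prime95 length if needed."""
    i = bisect.bisect_left(P95_LENGTHS, fft_k)
    if i == len(P95_LENGTHS):
        raise ValueError(f"no Prime95 timing at or above {fft_k}K")
    return mlucas[fft_k] / prime95[P95_LENGTHS[i]]

ratios = {k: round(timing_ratio(k), 2) for k in sorted(mlucas)}
for k, r in ratios.items():
    print(f"{k:5d}K  {r:.2f}")
```

Using the next-higher length as the denominator flatters the intermediate Mlucas lengths slightly, which is why their ratios tend to come out lower than those of their power-of-two-ish neighbors.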

ewmayer 2015-05-22 06:40

Here is the head-to-head comparison on my new Xyzzy-built Broadwell (i3) NUC, both programs run 4-threaded on the 2 physical cores of the system (that setup gives best per-iteration timing for both on this system) - these timings and ratios can be compared to the Haswell ones in the above post:
[code]
FFTlen   Prime95     Mlucas      Timing Ratio
(Kdbl)   msec/iter   msec/iter   [Mlucas/P95]   Comments
------   ---------   ---------   ------------   ------------
  1024       3.894       6.869           1.76
  1152       4.634       8.294           1.79
  1280       4.990       8.702           1.74
  1408       5.502      10.118           1.84   [Prime95 1440K]
  1536       6.203      10.298           1.66
  1664       6.506      11.562           1.78   [Prime95: average of the 1600K and 1728K timings]
  1792       7.473      11.904           1.59
  1920       7.843      13.186           1.68
  2048       7.898      13.946           1.77
  2304       8.889      15.846           1.78
  2560       9.930      17.281           1.74
  2816      11.369      19.931           1.75   [Prime95 2880K]
  3072      12.465      22.373           1.79
  3328      13.688      23.541           1.72   [Prime95 3360K]
  3584      14.567      25.318           1.74
  3840      16.079      27.987           1.74
  4096      16.917      29.488           1.74
  4608      19.762      34.077           1.72
  5120      21.736      37.573           1.73
  5632      25.657      43.197           1.68   [Prime95 5760K]
  6144      26.867      50.179           1.87
  6656      30.958      51.091           1.65   [Prime95 6720K]
  7168      32.399      54.929           1.70
  7680      34.025      60.411           1.78
  8192      34.791      65.911           1.89
  Avg:                                   1.75[/code]
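
As a quick check on the "Avg" line, the following sketch averages the ratio column of the Broadwell table above. It assumes a plain unweighted arithmetic mean over the 25 listed ratios, which is a guess at how the figure was computed.

```python
from statistics import mean

# Ratio column of the Broadwell table above, keyed by FFT length (K).
ratios = {
    1024: 1.76, 1152: 1.79, 1280: 1.74, 1408: 1.84, 1536: 1.66,
    1664: 1.78, 1792: 1.59, 1920: 1.68, 2048: 1.77, 2304: 1.78,
    2560: 1.74, 2816: 1.75, 3072: 1.79, 3328: 1.72, 3584: 1.74,
    3840: 1.74, 4096: 1.74, 4608: 1.72, 5120: 1.73, 5632: 1.68,
    6144: 1.87, 6656: 1.65, 7168: 1.70, 7680: 1.78, 8192: 1.89,
}

avg_ratio = mean(ratios.values())
best_k = min(ratios, key=ratios.get)   # length where Mlucas is closest to Prime95
worst_k = max(ratios, key=ratios.get)  # length where Mlucas is furthest behind

print(f"mean ratio: {avg_ratio:.2f}")
print(f"best:  {best_k}K ({ratios[best_k]:.2f})")
print(f"worst: {worst_k}K ({ratios[worst_k]:.2f})")
```

The unweighted mean agrees with the posted 1.75 to two decimals, and the spread (1.59 at 1792K to 1.89 at 8192K) shows how much the ratio varies with FFT length.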


All times are UTC.