
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Mlucas (https://www.mersenneforum.org/forumdisplay.php?f=118)
-   -   ARM builds and SIMD-assembler prospects (https://www.mersenneforum.org/showthread.php?t=21992)

ldesnogu 2017-03-13 20:08

[QUOTE=ewmayer;454769]I'm not familiar enough with ARM to understand why -m64 is unsupported in GCC, but correctly handling aarch64 in platform.h will cause the build to be in 64-bit mode. (I had assumed -m64 was needed to trigger the aarch64-related predefs, but your output from [1] will settle that.)[/QUOTE]
gcc for ARM comes in two flavors: one targets 64-bit code (aarch64) and the other targets 32-bit code, so there's no need for -m64 or -m32.

ldesnogu 2017-03-13 20:22

[QUOTE=ewmayer;454710]gcc -c -Os -m64 -DUSE_THREADS ../Mlucas.c[/QUOTE]
For that to succeed, you need this:
[code]$ diff platform.h~ platform.h
714a715,728
> #elif defined(__AARCH64EL__)
> #ifndef OS_BITS
> #define OS_BITS 32
> #endif
> #define CPU_TYPE
> #define CPU_IS_ARM_EABI
> #if(defined(__GNUC__) || defined(__GNUG__))
> #define COMPILER_TYPE
> #define COMPILER_TYPE_GCC
> #else
> #define COMPILER_TYPE
> #define COMPILER_TYPE_UNKNOWN
> #endif
>
[/code]And it compiles:
[code]$ aarch64-none-linux-gnu-gcc -Os -DUSE_THREADS -c *.c
$ aarch64-none-linux-gnu-gcc -Os -DUSE_THREADS *.o -o mlucas64 -lm -lpthread
$ file mlucas64
mlucas64: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, not stripped
[/code]Tested with QEMU, it starts but I have no clue how I should launch the binary to do something sensible that doesn't take forever :)

ewmayer 2017-03-13 21:41

[QUOTE=Lorenzo;454809]Ok! I have done it!
[CODE]ubuntu@pine64:~/Solaris2/mlucas-14.1$ gcc -dM -E - < /dev/null
[snip][/QUOTE]

Thanks! The key predefine there is __aarch64__, which is also the trigger in the .h file I posted ... so the latter should allow you to build. So I don't understand the raft of 'stray character' errors you get with that one - here are lines 88-90 of that header:
[code]#elif(defined(_AIX))
#define OS_TYPE
#define OS_TYPE_AIX[/code]
Can you open both the original and the new .h in an editor and compare the file encodings? If those are the same, can you diff your local copies of those 2 file versions? Maybe that will reveal something relevant to the stray-octal errors you are getting.

[QUOTE=ldesnogu;454816]For that to succeed, you need this:
[code]$ diff platform.h~ platform.h
714a715,728
> #elif defined(__AARCH64EL__)
> #ifndef OS_BITS
> #define OS_BITS 32
> #endif
> #define CPU_TYPE
> #define CPU_IS_ARM_EABI
> #if(defined(__GNUC__) || defined(__GNUG__))
> #define COMPILER_TYPE
> #define COMPILER_TYPE_GCC
> #else
> #define COMPILER_TYPE
> #define COMPILER_TYPE_UNKNOWN
> #endif
>
[/code]And it compiles:
[code]$ aarch64-none-linux-gnu-gcc -Os -DUSE_THREADS -c *.c
$ aarch64-none-linux-gnu-gcc -Os -DUSE_THREADS *.o -o mlucas64 -lm -lpthread
$ file mlucas64
mlucas64: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, not stripped
[/code]Tested with QEMU, it starts but I have no clue how I should launch the binary to do something sensible that doesn't take forever :)[/QUOTE]

That sets the wrong value of OS_BITS - for basic C-code Mlucas builds that won't matter much except for various utility functions which make heavy use of 64-bit-int math (e.g. the quad-float library used for high-precision inits of double constants), but for future asm-code builds we need the right bitness to be set. The predef section beginning at line 792 in the .h I posted should work just fine for Lorenzo, and you as well - did you try building with that, or did you just make your mod above and use it? Please try the unmodified .h file - the one with the __aarch64__ predef stuff at line 792 and let me know if you get the same unrecognized-char errors as Lorenzo.
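The bitness concern can be sanity-checked with a throwaway program: derive OS_BITS from the compiler predefines (a hypothetical, much-simplified stand-in for platform.h's actual logic) and compare it against the pointer width. Assumes a native gcc:

```shell
# Minimal mock of an OS_BITS selection, checked against sizeof(void*).
cat > bits.c <<'EOF'
#include <stdio.h>
#if defined(__aarch64__) || defined(__x86_64__)
  #define OS_BITS 64
#else
  #define OS_BITS 32
#endif
int main(void) {
    printf("OS_BITS = %d, pointer bits = %d\n", OS_BITS, (int)(8*sizeof(void*)));
    return 0;
}
EOF
gcc bits.c -o bits && ./bits
```

On a correct aarch64 build both numbers agree at 64; the __AARCH64EL__/OS_BITS 32 combination from the diff above would make them disagree.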

You can quick-test the binary by trying some timing runs at a specific FFT length, say

./Mlucas -fftlen 1024 -nthread 1

will try all radix combos available @1024K and write the best-timing one to the mlucas.cfg file. You can also play with the threadcount - note the default there is to try to use all available cores.

ldesnogu 2017-03-13 22:35

[QUOTE=ewmayer;454822]That sets the wrong value of OS_BITS - for basic C-code Mlucas builds that won't matter much except for various utility functions which make heavy use of 64-bit-int math (e.g. the quad-float library used for high-precision inits of double constants), but for future asm-code builds we need the right bitness to be set. The predef section beginning at line 792 in the .h I posted should work just fine for Lorenzo, and you as well - did you try building with that, or did you just make your mod above and use it? Please try the unmodified .h file - the one with the __aarch64__ predef stuff at line 792 and let me know if you get the same unrecognized-char errors as Lorenzo.[/QUOTE]Silly me, I had missed your attachment. It compiles fine with it. So Lorenzo's error comes from somewhere else.

[quote]You can quick-test the binary by trying some timing runs at a specific FFT length, say

./Mlucas -fftlen 1024 -nthread 1

will try all radix combos available @1024K and write the best-timing one to the mlucas.cfg file. You can also play with the threadcount - note the default there is to try to use all available cores.[/quote][code]/work/qemu/qemu/aarch64-linux-user/qemu-aarch64 -L /work/Cross/fsf-6.169/aarch64-none-linux-gnu/libc ./mlucas64 -fftlen 1024 -nthread 1 -iters 1

Mlucas 14.1

http://hogranch.com/mayer/README.html

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 6.3.1 20170118.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: System has 4 available processor cores.
INFO: testing FFT radix tables...[/code]All MaxErr are at 0.

ewmayer 2017-03-13 22:46

Thanks, Laurent - so I suspect a file-encoding issue with Lorenzo's .h file downloaded from my post, or perhaps his unzip utility inserted a bunch of garbage chars.

Lorenzo 2017-03-14 06:56

[QUOTE=ewmayer;454827]Thanks, Laurent - so I suspect a file-encoding issue, with Lorenzo's .h file downloaded from my post, or perhaps his unzip utility inserted a bunch of garbage chars.[/QUOTE]
Right! Sorry, I found the issue.
It's working nicely! :) So without SIMD optimization it looks like:
[CODE]ubuntu@pine64:~/Solaris2/mlucas-14.1$ cat mlucas.cfg
14.1
1024 msec/iter = 114.57 ROE[avg,max] = [0.250000000, 0.250000000] radices = 32 32 16 32 0 0 0 0 0 0
1152 msec/iter = 109.04 ROE[avg,max] = [0.206808036, 0.250000000] radices = 288 8 16 16 0 0 0 0 0 0
1280 msec/iter = 133.03 ROE[avg,max] = [0.236600167, 0.281250000] radices = 160 16 16 16 0 0 0 0 0 0
1408 msec/iter = 140.47 ROE[avg,max] = [0.273688616, 0.343750000] radices = 176 16 16 16 0 0 0 0 0 0
1536 msec/iter = 161.30 ROE[avg,max] = [0.223493304, 0.281250000] radices = 192 16 16 16 0 0 0 0 0 0
1664 msec/iter = 166.09 ROE[avg,max] = [0.246149554, 0.312500000] radices = 208 16 16 16 0 0 0 0 0 0
1792 msec/iter = 180.60 ROE[avg,max] = [0.220703125, 0.281250000] radices = 224 16 16 16 0 0 0 0 0 0
1920 msec/iter = 198.81 ROE[avg,max] = [0.222460938, 0.250000000] radices = 240 16 16 16 0 0 0 0 0 0
2048 msec/iter = 206.38 ROE[avg,max] = [0.278125000, 0.281250000] radices = 256 16 16 16 0 0 0 0 0 0
2304 msec/iter = 242.52 ROE[avg,max] = [0.208269392, 0.250000000] radices = 288 16 16 16 0 0 0 0 0 0
2560 msec/iter = 308.94 ROE[avg,max] = [0.243164062, 0.281250000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 329.54 ROE[avg,max] = [0.272896903, 0.343750000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 371.71 ROE[avg,max] = [0.225892857, 0.281250000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 388.66 ROE[avg,max] = [0.241322545, 0.281250000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 414.33 ROE[avg,max] = [0.220870536, 0.250000000] radices = 224 16 16 32 0 0 0 0 0 0
3840 msec/iter = 453.97 ROE[avg,max] = [0.213636998, 0.265625000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 472.52 ROE[avg,max] = [0.247321429, 0.250000000] radices = 256 16 16 32 0 0 0 0 0 0
4608 msec/iter = 544.08 ROE[avg,max] = [0.201870292, 0.222656250] radices = 288 16 16 32 0 0 0 0 0 0
5120 msec/iter = 673.79 ROE[avg,max] = [0.239508929, 0.312500000] radices = 160 16 32 32 0 0 0 0 0 0
5632 msec/iter = 693.38 ROE[avg,max] = [0.278264509, 0.343750000] radices = 176 16 32 32 0 0 0 0 0 0
6144 msec/iter = 776.30 ROE[avg,max] = [0.213504464, 0.250000000] radices = 192 16 32 32 0 0 0 0 0 0
6656 msec/iter = 814.97 ROE[avg,max] = [0.242299107, 0.281250000] radices = 208 16 32 32 0 0 0 0 0 0
7168 msec/iter = 870.94 ROE[avg,max] = [0.219768415, 0.312500000] radices = 224 16 32 32 0 0 0 0 0 0
7680 msec/iter = 955.79 ROE[avg,max] = [0.222209821, 0.250000000] radices = 240 16 32 32 0 0 0 0 0 0
[/CODE]
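As an aside, cfg lines like those above are easy to rank mechanically: field 1 is the FFT length in K and field 4 the msec/iter, so a one-liner shows which lengths time out of line with their neighbors (a sketch against a two-line sample file in the same field layout as the real mlucas.cfg):

```shell
# Build a tiny sample in mlucas.cfg's field layout, then sort by msec/iter.
cat > sample.cfg <<'EOF'
1024  msec/iter =  114.57  ROE[avg,max] = [0.250000000, 0.250000000]  radices = 32 32 16 32 0 0 0 0 0 0
1152  msec/iter =  109.04  ROE[avg,max] = [0.206808036, 0.250000000]  radices = 288 8 16 16 0 0 0 0 0 0
EOF
awk '/msec\/iter/ { print $1, $4 }' sample.cfg | sort -k2 -n
# prints the 1152K line first (109.04 < 114.57), even though 1152K is the
# larger FFT length - exactly the kind of anomaly worth a second look
```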

ewmayer 2017-03-14 07:22

[QUOTE=Lorenzo;454841]Right! Sorry, I found the issue.
It's working nicely! :) So without SIMD optimization it looks like:
[CODE]ubuntu@pine64:~/Solaris2/mlucas-14.1$ cat mlucas.cfg
14.1
1024 msec/iter = 114.57 ROE[avg,max] = [0.250000000, 0.250000000] radices = 32 32 16 32 0 0 0 0 0 0
1152 msec/iter = 109.04 ROE[avg,max] = [0.206808036, 0.250000000] radices = 288 8 16 16 0 0 0 0 0 0
1280 msec/iter = 133.03 ROE[avg,max] = [0.236600167, 0.281250000] radices = 160 16 16 16 0 0 0 0 0 0
[snip][/CODE][/QUOTE]
Glad to hear it - what was the issue with the updated .h file? I'd like to know in case another user hits something similar in future.

The only timing that really pops out is the anomalously low one @1152K ... but SIMD timings will be the ones of real interest.

How many threads did you run your self-test with? (Your screen output will indicate that, e.g. NTHREADS = {some value >= 1}.)

Lorenzo 2017-03-14 07:46

The issue was that the file was not unzipped correctly by me. So in general it's OK.

I ran ./mlucas -s m. So it looks like Mlucas used 4 cores (threads) correctly. I didn't play with the thread count yet.

So in general, very slow :mike:

ewmayer 2017-03-14 09:12

[QUOTE=Lorenzo;454845]I ran ./mlucas -s m. So it looks like Mlucas used 4 cores (threads) correctly. I didn't play with the thread count yet.

So in general, very slow :mike:[/QUOTE]

Yes - even with a 2-3x speedup from use of SIMD, the ARM will be more about performance per watt (and per hardware $) than speed-per-core.

ET_ 2017-03-14 10:19

[QUOTE=ewmayer;454847]Yes - even with a 2-3x speedup from use of SIMD, the ARM will be more about performance per watt (and per hardware $) than speed-per-core.[/QUOTE]

The following mlucas.cfg file was generated on a 2.8 GHz AMD Opteron running RedHat 64-bit linux.
[code]
2048 sec/iter = 0.134 ROE[min,max] = [0.281250000, 0.343750000] radices = 32 32 32 32 0 0 0 0 0 0 [Any text offset from the list-ending 0 by whitespace is ignored]
2304 sec/iter = 0.148 ROE[min,max] = [0.242187500, 0.281250000] radices = 36 8 16 16 16 0 0 0 0 0
2560 sec/iter = 0.166 ROE[min,max] = [0.281250000, 0.312500000] radices = 40 8 16 16 16 0 0 0 0 0
2816 sec/iter = 0.188 ROE[min,max] = [0.328125000, 0.343750000] radices = 44 8 16 16 16 0 0 0 0 0
3072 sec/iter = 0.222 ROE[min,max] = [0.250000000, 0.250000000] radices = 24 16 16 16 16 0 0 0 0 0
3584 sec/iter = 0.264 ROE[min,max] = [0.281250000, 0.281250000] radices = 28 16 16 16 16 0 0 0 0 0
4096 sec/iter = 0.300 ROE[min,max] = [0.250000000, 0.312500000] radices = 16 16 16 16 32 0 0 0 0 0
[/code]

The following mlucas.cfg file was generated on a 1.4 GHz ARM running 64-bit linux.
[code]
2048 msec/iter = 206.38 ROE[avg,max] = [0.278125000, 0.281250000] radices = 256 16 16 16 0 0 0 0 0 0
2304 msec/iter = 242.52 ROE[avg,max] = [0.208269392, 0.250000000] radices = 288 16 16 16 0 0 0 0 0 0
2560 msec/iter = 308.94 ROE[avg,max] = [0.243164062, 0.281250000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 329.54 ROE[avg,max] = [0.272896903, 0.343750000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 371.71 ROE[avg,max] = [0.225892857, 0.281250000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 388.66 ROE[avg,max] = [0.241322545, 0.281250000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 414.33 ROE[avg,max] = [0.220870536, 0.250000000] radices = 224 16 16 32 0 0 0 0 0 0
3840 msec/iter = 453.97 ROE[avg,max] = [0.213636998, 0.265625000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 472.52 ROE[avg,max] = [0.247321429, 0.250000000] radices = 256 16 16 32 0 0 0 0 0 0
[/code]

In other words, a 4-threaded ARM is about 1.5x slower than one core of a 2.8 GHz Opteron.
With a 3x SIMD speedup its efficiency would be 0.5x on a per-core comparison, and 1:1 on a per-core-and-GHz comparison with the Opteron.

That is to say, a minicluster of 20 ARM cores would be 20x faster on a per-GHz measurement and 10x faster on a per-core measurement, and also about as cheap as the single Opteron system. Not to mention the energy savings...
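The ratios above can be reproduced from the two 2048K lines quoted earlier (206.38 msec/iter for the 4-thread 1.4 GHz ARM; 0.134 sec/iter, i.e. 134 msec, for one 2.8 GHz Opteron core). A quick awk check:

```shell
# Raw slowdown of the ARM vs one Opteron core, and the same ratio after
# normalizing each time by its clock (time * GHz ~ cycles per iteration).
awk 'BEGIN {
    arm = 206.38; opt = 134.0                  # msec/iter at 2048K, from the post
    printf "raw:     %.2fx\n", arm / opt                   # ~1.54x slower
    printf "per-GHz: %.2fx\n", (arm * 1.4) / (opt * 2.8)   # ~0.77x, i.e. fewer cycles
}'
```

The ~1.54x raw figure matches the "about 1.5x slower" claim, and the sub-1.0 per-GHz ratio is what makes the per-clock efficiency argument work.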

VictordeHolland 2017-03-14 18:42

You got it working, nice!
That is a Pine64 with 4x ARM Cortex-A53 cores (@1.4GHz), right?

I'm a little bit surprised it is about as fast as my
Odroid-U2 (4x ARM Cortex-A9 cores @1.7GHz),
which is only 32-bit and a much older architecture.
[URL]http://mersenneforum.org/showpost.php?p=426575&postcount=94[/URL]
[code]
1024 msec/iter = 121.70 ROE[avg,max] = [0.298214286, 0.312500000] radices = 128 16 16 16 0 0 0 0 0 0
1152 msec/iter = 142.69 ROE[avg,max] = [0.225310407, 0.250000000] radices = 144 16 16 16 0 0 0 0 0 0
1280 msec/iter = 161.44 ROE[avg,max] = [0.251618304, 0.312500000] radices = 160 16 16 16 0 0 0 0 0 0
1408 msec/iter = 185.52 ROE[avg,max] = [0.297056362, 0.375000000] radices = 176 16 16 16 0 0 0 0 0 0
1536 msec/iter = 195.56 ROE[avg,max] = [0.234742955, 0.312500000] radices = 192 16 16 16 0 0 0 0 0 0
1664 msec/iter = 208.36 ROE[avg,max] = [0.254631696, 0.312500000] radices = 208 16 16 16 0 0 0 0 0 0
1792 msec/iter = 222.32 ROE[avg,max] = [0.234012277, 0.250000000] radices = 224 16 16 16 0 0 0 0 0 0
1920 msec/iter = 243.65 ROE[avg,max] = [0.235016741, 0.281250000] radices = 240 16 16 16 0 0 0 0 0 0
2048 msec/iter = 255.25 ROE[avg,max] = [0.310714286, 0.312500000] radices = 256 16 16 16 0 0 0 0 0 0
2304 msec/iter = 297.26 ROE[avg,max] = [0.228341239, 0.281250000] radices = 288 16 16 16 0 0 0 0 0 0
2560 msec/iter = 339.70 ROE[avg,max] = [0.256682478, 0.312500000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 384.56 ROE[avg,max] = [0.296219308, 0.375000000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 413.85 ROE[avg,max] = [0.239704241, 0.281250000] radices = 192 16 16 32 0 0 0 0 0 0
3584 msec/iter = 370.28 ROE[avg,max] = [0.231487165, 0.281250000] radices = 224 16 16 32 0 0 0 0 0 0
4096 msec/iter = 455.10 ROE[avg,max] = [0.282142857, 0.312500000] radices = 128 16 32 32 0 0 0 0 0 0
[/code]In that post I also made the comparison with an Intel Core2Duo E7400 @2.8GHz running Mprime 28.7. Looking back at it, that comparison might not have been entirely fair (Mlucas vs. Mprime).
So I dusted off the machine and also ran Mlucas:

Intel Core2Duo E7400 @2.8GHz
NTHREADS = 1
[code]
14.1
1024 msec/iter = 33.76 ROE[avg,max] = [0.264564732, 0.265625000] radices = 32 32 16 32 0 0 0 0 0 0
1152 msec/iter = 40.30 ROE[avg,max] = [0.237220982, 0.273437500] radices = 36 16 32 32 0 0 0 0 0 0
1280 msec/iter = 45.42 ROE[avg,max] = [0.251841518, 0.296875000] radices = 40 16 32 32 0 0 0 0 0 0
1408 msec/iter = 52.31 ROE[avg,max] = [0.285110910, 0.375000000] radices = 44 16 32 32 0 0 0 0 0 0
1536 msec/iter = 53.31 ROE[avg,max] = [0.239299665, 0.281250000] radices = 24 32 32 32 0 0 0 0 0 0
1664 msec/iter = 61.81 ROE[avg,max] = [0.261802455, 0.312500000] radices = 52 16 32 32 0 0 0 0 0 0
1792 msec/iter = 65.81 ROE[avg,max] = [0.267229353, 0.312500000] radices = 28 32 32 32 0 0 0 0 0 0
1920 msec/iter = 70.98 ROE[avg,max] = [0.243638393, 0.281250000] radices = 60 16 32 32 0 0 0 0 0 0
2048 msec/iter = 71.88 ROE[avg,max] = [0.257366071, 0.257812500] radices = 32 32 32 32 0 0 0 0 0 0
2304 msec/iter = 81.60 ROE[avg,max] = [0.236948940, 0.281250000] radices = 36 32 32 32 0 0 0 0 0 0
2560 msec/iter = 90.96 ROE[avg,max] = [0.255691964, 0.312500000] radices = 40 32 32 32 0 0 0 0 0 0
2816 msec/iter = 102.69 ROE[avg,max] = [0.283956473, 0.343750000] radices = 44 32 32 32 0 0 0 0 0 0
3072 msec/iter = 112.85 ROE[avg,max] = [0.233879743, 0.265625000] radices = 48 32 32 32 0 0 0 0 0 0
3328 msec/iter = 123.71 ROE[avg,max] = [0.267947824, 0.312500000] radices = 52 32 32 32 0 0 0 0 0 0
3584 msec/iter = 135.08 ROE[avg,max] = [0.267689732, 0.301757812] radices = 56 32 32 32 0 0 0 0 0 0
3840 msec/iter = 144.52 ROE[avg,max] = [0.242107282, 0.281250000] radices = 60 32 32 32 0 0 0 0 0 0
4096 msec/iter = 154.69 ROE[avg,max] = [0.263169643, 0.281250000] radices = 64 32 32 32 0 0 0 0 0 0
4608 msec/iter = 177.26 ROE[avg,max] = [0.236798968, 0.281250000] radices = 36 16 16 16 16 0 0 0 0 0
5120 msec/iter = 201.17 ROE[avg,max] = [0.257240513, 0.312500000] radices = 40 16 16 16 16 0 0 0 0 0
5632 msec/iter = 224.76 ROE[avg,max] = [0.291057478, 0.375000000] radices = 44 16 16 16 16 0 0 0 0 0
6144 msec/iter = 244.47 ROE[avg,max] = [0.233741978, 0.265625000] radices = 48 16 16 16 16 0 0 0 0 0
6656 msec/iter = 271.08 ROE[avg,max] = [0.264965820, 0.312500000] radices = 52 16 16 16 16 0 0 0 0 0
7168 msec/iter = 292.72 ROE[avg,max] = [0.274094936, 0.312500000] radices = 56 16 16 16 16 0 0 0 0 0
7680 msec/iter = 312.74 ROE[avg,max] = [0.249065290, 0.290039062] radices = 60 16 16 16 16 0 0 0 0 0
[/code]NTHREADS = 2
[code]
14.1
1024 msec/iter = 21.01 ROE[avg,max] = [0.273214286, 0.281250000] radices = 32 16 32 32 0 0 0 0 0 0
1152 msec/iter = 25.43 ROE[avg,max] = [0.237220982, 0.273437500] radices = 36 16 32 32 0 0 0 0 0 0
1280 msec/iter = 28.85 ROE[avg,max] = [0.259319196, 0.312500000] radices = 20 32 32 32 0 0 0 0 0 0
1408 msec/iter = 35.14 ROE[avg,max] = [0.280566406, 0.343750000] radices = 176 16 16 16 0 0 0 0 0 0
1536 msec/iter = 33.98 ROE[avg,max] = [0.239299665, 0.281250000] radices = 24 32 32 32 0 0 0 0 0 0
1664 msec/iter = 38.98 ROE[avg,max] = [0.261802455, 0.312500000] radices = 52 16 32 32 0 0 0 0 0 0
1792 msec/iter = 40.84 ROE[avg,max] = [0.267229353, 0.312500000] radices = 28 32 32 32 0 0 0 0 0 0
1920 msec/iter = 45.63 ROE[avg,max] = [0.243638393, 0.281250000] radices = 60 16 32 32 0 0 0 0 0 0
2048 msec/iter = 45.92 ROE[avg,max] = [0.257366071, 0.257812500] radices = 32 32 32 32 0 0 0 0 0 0
2304 msec/iter = 54.36 ROE[avg,max] = [0.236948940, 0.281250000] radices = 36 32 32 32 0 0 0 0 0 0
2560 msec/iter = 54.64 ROE[avg,max] = [0.255691964, 0.312500000] radices = 40 32 32 32 0 0 0 0 0 0
2816 msec/iter = 63.06 ROE[avg,max] = [0.283956473, 0.343750000] radices = 44 32 32 32 0 0 0 0 0 0
3072 msec/iter = 67.77 ROE[avg,max] = [0.233879743, 0.265625000] radices = 48 32 32 32 0 0 0 0 0 0
3328 msec/iter = 74.36 ROE[avg,max] = [0.267947824, 0.312500000] radices = 52 32 32 32 0 0 0 0 0 0
3584 msec/iter = 79.71 ROE[avg,max] = [0.267689732, 0.301757812] radices = 56 32 32 32 0 0 0 0 0 0
3840 msec/iter = 87.04 ROE[avg,max] = [0.242107282, 0.281250000] radices = 60 32 32 32 0 0 0 0 0 0
4096 msec/iter = 92.87 ROE[avg,max] = [0.263169643, 0.281250000] radices = 64 32 32 32 0 0 0 0 0 0
4608 msec/iter = 106.31 ROE[avg,max] = [0.238187081, 0.281250000] radices = 288 16 16 32 0 0 0 0 0 0
5120 msec/iter = 116.95 ROE[avg,max] = [0.241458566, 0.312500000] radices = 160 16 32 32 0 0 0 0 0 0
5632 msec/iter = 147.80 ROE[avg,max] = [0.278641183, 0.312500000] radices = 176 16 32 32 0 0 0 0 0 0
6144 msec/iter = 150.32 ROE[avg,max] = [0.247349330, 0.281250000] radices = 192 16 32 32 0 0 0 0 0 0
6656 msec/iter = 164.51 ROE[avg,max] = [0.250781250, 0.289062500] radices = 208 16 32 32 0 0 0 0 0 0
7168 msec/iter = 172.77 ROE[avg,max] = [0.277169364, 0.343750000] radices = 224 16 32 32 0 0 0 0 0 0
7680 msec/iter = 191.50 ROE[avg,max] = [0.253627232, 0.281250000] radices = 240 16 32 32 0 0 0 0 0 0

[/code]I also reran the Mprime 28.7 benchmark:
[code]
[Tue Mar 14 19:28:48 2017]
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
CPU speed: 2800.02 MHz, 2 cores
CPU features: Prefetch, SSE, SSE2, SSE4
L1 cache size: 32 KB
L2 cache size: 3 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 256
Prime95 64-bit version 28.7, RdtscTiming=1
Best time for 1024K FFT length: 16.199 ms., avg: 16.704 ms.
Best time for 1280K FFT length: 20.961 ms., avg: 21.575 ms.
Best time for 1536K FFT length: 26.163 ms., avg: 27.718 ms.
Best time for 1792K FFT length: 30.755 ms., avg: 32.141 ms.
Best time for 2048K FFT length: 34.946 ms., avg: 38.731 ms.
Best time for 2560K FFT length: 43.191 ms., avg: 46.909 ms.
Best time for 3072K FFT length: 53.965 ms., avg: 59.120 ms.
Best time for 3584K FFT length: 69.864 ms., avg: 83.959 ms.
Best time for 4096K FFT length: 71.973 ms., avg: 72.495 ms.
Best time for 5120K FFT length: 87.800 ms., avg: 88.870 ms.
Best time for 6144K FFT length: 110.473 ms., avg: 111.362 ms.
Best time for 7168K FFT length: 131.831 ms., avg: 132.743 ms.
Best time for 8192K FFT length: 146.812 ms., avg: 147.631 ms.
Timing FFTs using 2 threads.
Best time for 1024K FFT length: 15.401 ms., avg: 15.644 ms.
Best time for 1280K FFT length: 18.143 ms., avg: 19.026 ms.
Best time for 1536K FFT length: 21.927 ms., avg: 22.995 ms.
Best time for 1792K FFT length: 26.605 ms., avg: 27.481 ms.
Best time for 2048K FFT length: 30.460 ms., avg: 31.351 ms.
Best time for 2560K FFT length: 38.699 ms., avg: 39.689 ms.
Best time for 3072K FFT length: 47.988 ms., avg: 49.353 ms.
Best time for 3584K FFT length: 85.181 ms., avg: 85.865 ms.
Best time for 4096K FFT length: 62.209 ms., avg: 66.705 ms.
Best time for 5120K FFT length: 79.554 ms., avg: 80.260 ms.
Best time for 6144K FFT length: 92.489 ms., avg: 94.000 ms.
Best time for 7168K FFT length: 116.309 ms., avg: 119.709 ms.
Best time for 8192K FFT length: 125.236 ms., avg: 128.261 ms.

Timings for 1024K FFT length (1 cpu, 1 worker): 16.37 ms. Throughput: 61.08 iter/sec.
Timings for 1024K FFT length (2 cpus, 2 workers): 30.59, 31.69 ms. Throughput: 64.25 iter/sec.
Timings for 1280K FFT length (1 cpu, 1 worker): 21.24 ms. Throughput: 47.07 iter/sec.
Timings for 1280K FFT length (2 cpus, 2 workers): 37.86, 39.14 ms. Throughput: 51.96 iter/sec.
Timings for 1536K FFT length (1 cpu, 1 worker): 26.08 ms. Throughput: 38.34 iter/sec.
Timings for 1536K FFT length (2 cpus, 2 workers): 45.43, 47.68 ms. Throughput: 42.99 iter/sec.
Timings for 1792K FFT length (1 cpu, 1 worker): 31.05 ms. Throughput: 32.21 iter/sec.
Timings for 1792K FFT length (2 cpus, 2 workers): 52.50, 53.32 ms. Throughput: 37.81 iter/sec.
Timings for 2048K FFT length (1 cpu, 1 worker): 35.05 ms. Throughput: 28.53 iter/sec.
Timings for 2048K FFT length (2 cpus, 2 workers): 61.40, 63.17 ms. Throughput: 32.12 iter/sec.
Timings for 2560K FFT length (1 cpu, 1 worker): 43.36 ms. Throughput: 23.06 iter/sec.
Timings for 2560K FFT length (2 cpus, 2 workers): 77.50, 79.16 ms. Throughput: 25.54 iter/sec.
Timings for 3072K FFT length (1 cpu, 1 worker): 53.71 ms. Throughput: 18.62 iter/sec.
Timings for 3072K FFT length (2 cpus, 2 workers): 96.11, 97.25 ms. Throughput: 20.69 iter/sec.
Timings for 3584K FFT length (1 cpu, 1 worker): 67.86 ms. Throughput: 14.74 iter/sec.
Timings for 3584K FFT length (2 cpus, 2 workers): 164.50, 169.02 ms. Throughput: 12.00 iter/sec.
Timings for 4096K FFT length (1 cpu, 1 worker): 71.87 ms. Throughput: 13.91 iter/sec.
[Tue Mar 14 19:33:59 2017]
Timings for 4096K FFT length (2 cpus, 2 workers): 127.57, 128.14 ms. Throughput: 15.64 iter/sec.
Timings for 5120K FFT length (1 cpu, 1 worker): 87.87 ms. Throughput: 11.38 iter/sec.
Timings for 5120K FFT length (2 cpus, 2 workers): 153.62, 158.10 ms. Throughput: 12.83 iter/sec.
Timings for 6144K FFT length (1 cpu, 1 worker): 110.52 ms. Throughput: 9.05 iter/sec.
Timings for 6144K FFT length (2 cpus, 2 workers): 187.40, 186.73 ms. Throughput: 10.69 iter/sec.
Timings for 7168K FFT length (1 cpu, 1 worker): 132.18 ms. Throughput: 7.57 iter/sec.
Timings for 7168K FFT length (2 cpus, 2 workers): 236.89, 243.20 ms. Throughput: 8.33 iter/sec.
Timings for 8192K FFT length (1 cpu, 1 worker): 151.83 ms. Throughput: 6.59 iter/sec.
Timings for 8192K FFT length (2 cpus, 2 workers): 263.17, 260.16 ms. Throughput: 7.64 iter/sec.
[/code]

BTW: Is it possible to compile and run Mlucas on Windows 7/10? If so, I could try to run benchmarks on my i5 2500k and/or i7 3770k.

