mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Cloud Computing (https://www.mersenneforum.org/forumdisplay.php?f=134)
-   -   Skylake AVX-512: Google Cloud has announced general availability (https://www.mersenneforum.org/showthread.php?t=22367)

GP2 2017-06-03 21:56

Skylake AVX-512: Google Cloud has announced general availability
 
[url]https://cloudplatform.googleblog.com/2017/05/Compute-Engine-updates-bring-Skylake-GA-Extended-Memory-and-more-VM-flexibility.html[/url]

Skylake is available in Western US, Western Europe and Eastern Asia Pacific regions (i.e., that is where the servers themselves are located; customers can live anywhere :smile:)

Looks like Google beat Amazon to the punch here, but presumably the c5 instances will be available on AWS fairly soon.

ewmayer 2017-06-03 22:51

1 Attachment(s)
[QUOTE=GP2;460461][url]https://cloudplatform.googleblog.com/2017/05/Compute-Engine-updates-bring-Skylake-GA-Extended-Memory-and-more-VM-flexibility.html[/url]

Skylake is available in Western US, Western Europe and Eastern Asia Pacific regions (i.e., that is where the servers themselves are located; customers can live anywhere :smile:)

Looks like Google beat Amazon to the punch here, but presumably the c5 instances will be available on AWS fairly soon.[/QUOTE]

Anyone who wants to try out a beta of Mlucas v17 on Skylake Xeon, xzipped tarball attached (md5 = 9a97be71d623a7315ef360bf1ba2674b). After validating the checksum, xz -d *xz to uncompress, and tar xvf *tar to unpack the tarchive.
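For anyone unfamiliar with xz, the checksum/unpack sequence can be rehearsed on a throwaway tarball like so (the paths and filenames below are made up for illustration; the real attachment's name and md5 are as posted above):

```shell
# Rehearse the unpack sequence on a throwaway tarball under /tmp
# (substitute the real attachment name and posted md5 in practice).
mkdir -p /tmp/mlucas_demo/src && cd /tmp/mlucas_demo
echo 'placeholder' > src/Mlucas.c
tar cf mlucas.tar src && xz -f mlucas.tar   # make a .tar.xz to practice on
md5sum mlucas.tar.xz                        # compare against the posted checksum
xz -d mlucas.tar.xz                         # uncompress -> mlucas.tar
tar xvf mlucas.tar                          # unpack the tarchive
```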

AFAICT the code is quite solid, but I'm still working on updating the Linux auto-installer, so you'll want to follow the simple manual build procedure instead: create an obj_avx512 (or whatever) dir inside the src dir of the unzipped tarball, then
[i]
gcc -c -O3 -march=knl -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
grep -i error build.log
[if the above grep comes up empty] gcc -o Mlucas *.o -lm -lpthread -lrt
[/i]
Before building, you might also check whether your version of gcc supports -mavx512 -- I use -mavx2 for AVX2 builds, but I didn't see -mavx512 supported in the versions of gcc I use, only the above Knights-Landing-specific flag.
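One quick way to check such flag support, without involving any Mlucas sources, is to feed gcc an empty program from stdin (a sketch; adjust the flag list to whatever your gcc version is in question about):

```shell
# Probe which ISA flags this gcc accepts by compiling an empty
# translation unit from stdin; prints one verdict per flag.
for flag in -mavx2 -mavx512f -march=knl; do
  if echo 'int main(void){return 0;}' | gcc $flag -x c -c -o /dev/null - 2>/dev/null; then
    echo "$flag: supported"
  else
    echo "$flag: not supported"
  fi
done
```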

Note that in v17 the -nthread flag is deprecated in favor of the new -cpu flag; absent either flag, the default is to run 1 thread on logical core 0. Best throughput on a multicore machine will likely come from one thread per core, i.e. from within the various run-dirs do

./Mlucas -cpu 0
./Mlucas -cpu 1

etc, using Intel core-numbering scheme. But you're welcome to also try multiple threads, as described in the Performance-Tuning section of the [url=http://www.mersenneforum.org/mayer/README.html]Mlucas README[/url].
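As a dry-run sketch of that layout (the run-directory names run0..run3 are hypothetical; this just prints the command that would be issued per core):

```shell
# Dry run of the one-job-per-physical-core layout: print the launch
# command for each core, assuming run directories run0, run1, ...
for core in 0 1 2 3; do
  echo "cd run$core && ./Mlucas -cpu $core"
done
```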

v17 also has a simple Python script, primenet.py, in the src-dir for automated Primenet assignment management, though I'm not sure whether it will work from a cloud setup. It's easy to try: after you create however many run-subdirs you want to run jobs from and copy the relevant mlucas.cfg file to each, just cd into one such rundir and run
[i]
python primenet.py -d -t 0 -T 100 -u [primenet uid] -p [pwd]
[/i]
-t 0 means do a single-shot get-work-to-do; -d enables debug diagnostics; -T 100 means 'get smallest available first-time LL'. Grep the .py for 'worktype' to see the other options.

If that works for you, you'll want to do the same from each rundir before launching Mlucas from within it. Note that without '-t 0', the default is to check for 'results to submit/work to get?' every 6 hours.

Batalov 2017-06-03 22:59

1 Attachment(s)
Well, if you signed up for a "month of free trial" some time ago, ...there's good news:

CRGreathouse 2017-06-04 02:05

[QUOTE=Batalov;460470]Well, if you signed up for a "month of free trial" some time ago, ...there's good news:[/QUOTE]

Apparently they extended all the old free trial accounts by 305 days. :smile:

GP2 2017-06-04 05:07

[QUOTE=ewmayer;460469]Anyone who wants to try out a beta of Mlucas v17 on Skylake Xeon, xzipped tarball attached (md5 = 9a97be71d623a7315ef360bf1ba2674b). After validating the checksum, xz -d *xz to uncompress, and tar xvf *tar to unpack the tarchive.

AFAICT the code is quite solid, but I'm still working on updating the Linux auto-installer, so you'll want to follow the simple manual build procedure instead: create an obj_avx512 (or whatever) dir inside the src dir of the unzipped tarball, then
[i]
gcc -c -O3 -march=knl -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
grep -i error build.log
[if the above grep comes up empty] gcc -o Mlucas *.o -lm -lpthread -lrt
[/i]
Before building, you might also check whether your version of gcc supports -mavx512 -- I use -mavx2 for AVX2 builds, but I didn't see -mavx512 supported in the versions of gcc I use, only the above Knights-Landing-specific flag.
[/QUOTE]

As a shortcut, you can say tar xvJf *.tar.xz to extract in a single step.

I selected Ubuntu 17.04 as the OS, and the gcc on it does indeed support AVX-512.

However, instead of a single -mavx512 flag there are various separate flags; see below.

You need to select zones us-west1-a or us-west1-b (not us-west1-c, see [URL="https://cloud.google.com/compute/docs/regions-zones/regions-zones#available"]the Regions and Zones page[/URL]), and you do need to specify that you want Skylake when starting your virtual instances, otherwise you might get Broadwell instead.

After you start the virtual instance, run less /proc/cpuinfo to make sure various avx512 flags are listed, so you know it really is Skylake. The flags listed there are avx512f avx512dq avx512cd avx512bw avx512vl
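The flag check can be scripted; this sketch filters a cpuinfo-style flag string (hard-coded here with the flags reported above, so it runs anywhere; on the VM itself you'd substitute the output of grep -m1 '^flags' /proc/cpuinfo):

```shell
# Extract just the avx512* feature flags from a cpuinfo-style flag list.
flags="fpu sse2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
echo "$flags" | tr ' ' '\n' | grep '^avx512' | sort
```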

So I ran the command

[QUOTE]
gcc -c -O3 -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
[/QUOTE]

Grepping for "error" as suggested gave:

[QUOTE]
../factor.c:658:4: error: #error USE_AVX512 only meaningful if 64-bit GCC (or GCC-compatible) build and USE_FLOAT also defined at compile time!
#error USE_AVX512 only meaningful if 64-bit GCC (or GCC-compatible) build and USE_FLOAT also defined at compile time!
[/QUOTE]

I assume that doesn't really matter since factor.c is presumably factoring code?

I am using n1-highcpu-2, which has 2 vCPUs (i.e., one actual core).

Anyways, the program is running and here is some output:

[CODE]
INFO: no restart file found...starting run from scratch.
M45676327: using FFT length 2560K = 2621440 8-byte floats.
this gives an average 17.424135971069337 bits per digit
Using complex FFT radices 160 16 16 32
[Jun 04 04:42:41] M45676327 Iter# = 10000 [ 0.02% complete] clocks = 00:04:53.152 [ 0.0293 sec/iter] Res64: B092FAA91F90CCC1. AvgMaxErr = 0.048224542. MaxErr = 0.070312500.
[Jun 04 04:47:36] M45676327 Iter# = 20000 [ 0.04% complete] clocks = 00:04:54.746 [ 0.0295 sec/iter] Res64: 9B83E49764CD807E. AvgMaxErr = 0.048445035. MaxErr = 0.070312500.
[Jun 04 04:52:31] M45676327 Iter# = 30000 [ 0.07% complete] clocks = 00:04:55.124 [ 0.0295 sec/iter] Res64: BF44E167F14222DA. AvgMaxErr = 0.048418895. MaxErr = 0.070312500.
[/CODE]

I assume it's using the AVX-512 instructions -- is there any way to check? Should I have attempted to use flags corresponding to AVX512PF and AVX512ER, even though /proc/cpuinfo didn't list them?

By comparison, mprime benchmark has:

[CODE]
Best time for 2560K FFT length: 17.573 ms., avg: 17.625 ms.
[/CODE]


PS,
Skylake on Google Cloud is only 2.0 GHz.

In general, Google Cloud has lower clock speeds than AWS; for instance, it has 2.3 GHz Haswell instances versus 2.9 GHz Haswell for AWS. Also Google's preemptible instances cost 1.5 cents per hour (fixed price) versus AWS's spot market (fluctuating price) which varies greatly from region to region but is currently very steady at around 1.3 cents per hour in us-east-2 (Ohio). So Google Cloud is not competitive with AWS for running mprime at the present time. Since mprime doesn't yet make use of AVX-512 for LL testing, it will just run relatively slowly on the Skylake box.

ewmayer 2017-06-04 06:05

[QUOTE=GP2;460485]As a shortcut, you can say tar xvJf *.tar.xz to extract in a single step.[/QUOTE]
My version of tar (under MacOS) is older and only supports lowercase 'j' (bzip2) - good to know that 'J' works the same way for xz under newer versions of tar.

[QUOTE]After you start the virtual instance, run less /proc/cpuinfo to make sure various avx512 flags are listed, so you know it really is Skylake. The flags listed there are avx512f avx512dq avx512cd avx512bw avx512vl

So I ran the command
[QUOTE]
gcc -c -O3 -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
[/QUOTE][/QUOTE]
AVX-512F (512-bit vector foundation instructions) is the only instruction-subset Mlucas uses currently (and all that is on offer on the KNL, where I did my code-dev). OTOH, I wonder whether adding suitable arguments to -march and -mtune might help?

[QUOTE]Grepping for "error" as suggested gave:
[QUOTE]
../factor.c:658:4: error: #error USE_AVX512 only meaningful if 64-bit GCC (or GCC-compatible) build and USE_FLOAT also defined at compile time!
#error USE_AVX512 only meaningful if 64-bit GCC (or GCC-compatible) build and USE_FLOAT also defined at compile time!
[/QUOTE]
I assume that doesn't really matter since factor.c is presumably factoring code?[/QUOTE]
You assume correctly.

[QUOTE]Anyways, the program is running and here is some output:

[CODE]
INFO: no restart file found...starting run from scratch.
M45676327: using FFT length 2560K = 2621440 8-byte floats.
this gives an average 17.424135971069337 bits per digit
Using complex FFT radices 160 16 16 32
[Jun 04 04:42:41] M45676327 Iter# = 10000 [ 0.02% complete] clocks = 00:04:53.152 [ 0.0293 sec/iter] Res64: B092FAA91F90CCC1. AvgMaxErr = 0.048224542. MaxErr = 0.070312500.
[Jun 04 04:47:36] M45676327 Iter# = 20000 [ 0.04% complete] clocks = 00:04:54.746 [ 0.0295 sec/iter] Res64: 9B83E49764CD807E. AvgMaxErr = 0.048445035. MaxErr = 0.070312500.
[Jun 04 04:52:31] M45676327 Iter# = 30000 [ 0.07% complete] clocks = 00:04:55.124 [ 0.0295 sec/iter] Res64: BF44E167F14222DA. AvgMaxErr = 0.048418895. MaxErr = 0.070312500.
[/CODE]
I assume it's using the AVX-512 instructions -- is there any way to check? Should I have attempted to use flags corresponding to AVX512PF and AVX512ER, even though /proc/cpuinfo didn't list them?[/QUOTE]
By way of comparison, I get 54 msec/iter @2560K on one core of the 1.3GHz KNL. Re. AVX-512 usage, I doubt those extra flags would make a difference - but see my above note re. -march,-mtune - and on program launch you should look for the bolded line below:
[i]
Mlucas 17.0

[url]http://hogranch.com/mayer/README.html[/url]

INFO: testing qfloat routines...
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 5.4.0.
[b]INFO: Build uses AVX512 instruction set.[/b]
[/i]
[QUOTE]By comparison, mprime benchmark has:
[CODE]
Best time for 2560K FFT length: 17.573 ms., avg: 17.625 ms.
[/CODE][/QUOTE]
If that's on the same Skylake Xeon hardware, it beats me, obviously.

[QUOTE]PS,
Skylake on Google Cloud is only 2.0 GHz. [/QUOTE]
Factoring in the clock speed difference between that and the 1.3 GHz of the KNL we have

Skylake Xeon: (29.5 msec/iter * 2.0 mcycles/msec) = 59 mcycles/iter
Knights Landing: (54.0 msec/iter * 1.3 mcycles/msec) = 70 mcycles/iter ,

hence ~1.2x greater per-cycle throughput for Skylake Xeon vs KNL, running the same code. That seems low, though perhaps I was expecting too much?
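The normalization above as a one-liner, using the same msec/iter and clock figures (msec/iter times clock in GHz gives millions of cycles per iteration):

```shell
# Normalize both timings to cycles per iteration and take the ratio.
awk 'BEGIN {
  skx = 29.5 * 2.0   # Skylake Xeon: Mcycles/iter
  knl = 54.0 * 1.3   # Knights Landing: Mcycles/iter
  printf "SKX %.1f  KNL %.1f  ratio %.2f\n", skx, knl, knl / skx
}'
```

which prints a ratio of 1.19, matching the ~1.2x figure above.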

One last thing - would appreciate if you would be so kind as to also do an AVX2-build (just use -mavx2 for now, same as I use for AVX2 builds on both Haswell and KNL) and, with any running jobs paused, rerun the self-tests. I want to compare the AVX2/AVX512 timing ratios you get to the ~1.6x I see on the KNL. Thanks!

ewmayer 2017-06-04 06:42

[QUOTE=ewmayer;460488]One last thing - would appreciate if you would be so kind as to also do an AVX2-build (just use -mavx2 for now, same as I use for AVX2 builds on both Haswell and KNL) and, with any running jobs paused, rerun the self-tests. I want to compare the AVX2/AVX512 timing ratios you get to the ~1.6x I see on the KNL. Thanks![/QUOTE]

One more (i.e. last++) thing to try - for both avx2 and avx512 builds, check the effect of running on one versus both logical cores attached to your single-physical-core instance, using just a single FFT length and radix set for now - say, the same params used in the production run you excerpted above:

./Mlucas -fftlen 2560 -iters 100 -radset 0 -cpu 0
and
./Mlucas -fftlen 2560 -iters 100 -radset 0 -cpu 0:1

(Yes, I know I'm very demanding. :)

GP2 2017-06-04 10:30

[QUOTE=ewmayer;460488]AVX-512F (512-bit vector foundation instructions) is the only instruction-subset Mlucas uses currently (and all that is on offer on the KNL, where I did my code-dev). OTOH, I wonder whether adding suitable arguments to -march and -mtune might help?[/QUOTE]

I recompiled with

[CODE]
gcc -c -O3 -march=skylake-avx512 -mtune=skylake-avx512 -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
[/CODE]

but after re-running ./Mlucas -s m the mlucas.cfg file looked very similar to the earlier one (there's always a bit of noise with a virtual machine sharing the same physical hardware with other users). Running Mlucas on an exponent confirmed the timings are identical, at 29.3 to 29.4 ms/iter for 2560K FFT.

[QUOTE]
on program launch you should look for the bolded line below:

[b]INFO: Build uses AVX512 instruction set.[/b]
[/QUOTE]

Yes it's there, I should have noticed it earlier.

[QUOTE]
If that's on the same Skylake Xeon hardware, it beats me, obviously.
[/QUOTE]

Yes, it was. And on an AWS c4.large instance (2.9 GHz), 2560K FFT on mprime benchmarks at about 12 ms/iter.


Google doesn't provide any information about the specific Skylake model that they use, they only specify the frequency of 2.0 GHz. However, when mprime 29.1 runs it reports an L2 cache of 256 KB (this is on one core). I'm not sure what method it uses to detect that, but it must be detecting it dynamically because it reports the architecture as "Unknown Intel". I think I read that the higher-end Skylakes (the 7xxx series) are supposed to have 1 MB / core of L2 cache, so this must be in the 6xxx series.

gcc actually provides a parameter to specify the L2 cache size. I wonder if it would be worthwhile to try --param l2-cache-size=256

I'll try the AVX2 build and the other stuff a little later today.

GP2 2017-06-04 11:59

[QUOTE=ewmayer;460488]One last thing - would appreciate if you would be so kind as to also do an AVX2-build (just use -mavx2 for now, same as I use for AVX2 builds on both Haswell and KNL) and, with any running jobs paused, rerun the self-tests. I want to compare the AVX2/AVX512 timing ratios you get to the ~1.6x I see on the KNL. Thanks![/QUOTE]

I compiled with

[CODE]
gcc -c -O3 -mavx2 -DUSE_THREADS ../*.c >& build.log
[/CODE]

removing the -DUSE_AVX512 flag since it's presumably no longer applicable.

The self-test actually crashed in the middle of the FFT = 5120K section, with the error:

[CODE]
N = 5242880, radix_set = 6 : product of complex radices 0 != (FFT length/2)
ERROR: at line 2818 of file ../get_fft_radices.c
Assertion failed: 0
[/CODE]

The timings were about four times slower:

[CODE]
17.0
1024 msec/iter = 43.18 ROE[avg,max] = [0.255357143, 0.343750000] radices = 8 16 16 16 16 0 0 0 0 0
1152 msec/iter = 49.72 ROE[avg,max] = [0.227158901, 0.256347656] radices = 18 8 16 16 16 0 0 0 0 0
1280 msec/iter = 55.10 ROE[avg,max] = [0.249909319, 0.281250000] radices = 10 16 16 16 16 0 0 0 0 0
1408 msec/iter = 63.56 ROE[avg,max] = [0.231169782, 0.265625000] radices = 22 8 16 16 16 0 0 0 0 0
1536 msec/iter = 66.41 ROE[avg,max] = [0.226067243, 0.253906250] radices = 12 16 16 16 16 0 0 0 0 0
1664 msec/iter = 74.32 ROE[avg,max] = [0.255719866, 0.281250000] radices = 26 8 16 16 16 0 0 0 0 0
1792 msec/iter = 79.85 ROE[avg,max] = [0.232812500, 0.312500000] radices = 14 16 16 16 16 0 0 0 0 0
1920 msec/iter = 106.87 ROE[avg,max] = [0.244818987, 0.281250000] radices = 60 32 32 16 0 0 0 0 0 0
2048 msec/iter = 91.61 ROE[avg,max] = [0.255859375, 0.312500000] radices = 8 16 16 16 32 0 0 0 0 0
2304 msec/iter = 103.68 ROE[avg,max] = [0.231054687, 0.281250000] radices = 18 16 16 16 16 0 0 0 0 0
2560 msec/iter = 115.50 ROE[avg,max] = [0.256919643, 0.312500000] radices = 10 16 16 32 16 0 0 0 0 0
2816 msec/iter = 134.34 ROE[avg,max] = [0.226226153, 0.253906250] radices = 22 16 16 16 16 0 0 0 0 0
3072 msec/iter = 139.39 ROE[avg,max] = [0.229003906, 0.281250000] radices = 12 16 16 32 16 0 0 0 0 0
3328 msec/iter = 156.65 ROE[avg,max] = [0.255078125, 0.281250000] radices = 26 16 16 16 16 0 0 0 0 0
3584 msec/iter = 167.19 ROE[avg,max] = [0.234919085, 0.281250000] radices = 14 16 16 32 16 0 0 0 0 0
3840 msec/iter = 232.40 ROE[avg,max] = [0.260686384, 0.312500000] radices = 240 8 8 8 16 0 0 0 0 0
4096 msec/iter = 191.62 ROE[avg,max] = [0.242801339, 0.312500000] radices = 8 16 16 32 32 0 0 0 0 0
4608 msec/iter = 217.14 ROE[avg,max] = [0.226663644, 0.265625000] radices = 18 16 16 32 16 0 0 0 0 0
[/CODE]

compared to the AVX-512 version:

[CODE]
17.0
1024 msec/iter = 10.27 ROE[avg,max] = [0.234430804, 0.281250000] radices = 32 16 32 32 0 0 0 0 0 0
1152 msec/iter = 13.07 ROE[avg,max] = [0.274553571, 0.343750000] radices = 36 16 32 32 0 0 0 0 0 0
1280 msec/iter = 14.90 ROE[avg,max] = [0.290569196, 0.343750000] radices = 160 16 16 16 0 0 0 0 0 0
1408 msec/iter = 17.20 ROE[avg,max] = [0.262848772, 0.281250000] radices = 176 16 16 16 0 0 0 0 0 0
1536 msec/iter = 19.33 ROE[avg,max] = [0.250020926, 0.281250000] radices = 192 16 16 16 0 0 0 0 0 0
1664 msec/iter = 20.86 ROE[avg,max] = [0.264160156, 0.312500000] radices = 208 16 16 16 0 0 0 0 0 0
1792 msec/iter = 19.47 ROE[avg,max] = [0.282254464, 0.312500000] radices = 56 16 32 32 0 0 0 0 0 0
1920 msec/iter = 23.27 ROE[avg,max] = [0.256640625, 0.312500000] radices = 240 16 16 16 0 0 0 0 0 0
2048 msec/iter = 21.54 ROE[avg,max] = [0.238113839, 0.281250000] radices = 32 32 32 32 0 0 0 0 0 0
2304 msec/iter = 26.63 ROE[avg,max] = [0.266880580, 0.312500000] radices = 144 16 16 32 0 0 0 0 0 0
2560 msec/iter = 29.84 ROE[avg,max] = [0.257589286, 0.312500000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 34.61 ROE[avg,max] = [0.245047433, 0.312500000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 38.85 ROE[avg,max] = [0.275613839, 0.375000000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 42.24 ROE[avg,max] = [0.270535714, 0.312500000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 43.80 ROE[avg,max] = [0.269921875, 0.312500000] radices = 224 16 16 32 0 0 0 0 0 0
3840 msec/iter = 48.09 ROE[avg,max] = [0.252887835, 0.312500000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 49.58 ROE[avg,max] = [0.245026507, 0.281250000] radices = 32 16 16 16 16 0 0 0 0 0
4608 msec/iter = 59.24 ROE[avg,max] = [0.236941964, 0.281250000] radices = 144 16 32 32 0 0 0 0 0 0
5120 msec/iter = 65.80 ROE[avg,max] = [0.297656250, 0.375000000] radices = 160 16 32 32 0 0 0 0 0 0
5632 msec/iter = 75.59 ROE[avg,max] = [0.234268624, 0.281250000] radices = 176 16 32 32 0 0 0 0 0 0
6144 msec/iter = 85.82 ROE[avg,max] = [0.258161272, 0.281250000] radices = 192 16 32 32 0 0 0 0 0 0
6656 msec/iter = 92.11 ROE[avg,max] = [0.250704738, 0.312500000] radices = 208 16 32 32 0 0 0 0 0 0
7168 msec/iter = 94.87 ROE[avg,max] = [0.264208984, 0.312500000] radices = 224 16 32 32 0 0 0 0 0 0
7680 msec/iter = 103.35 ROE[avg,max] = [0.266294643, 0.312500000] radices = 240 16 32 32 0 0 0 0 0 0
[/CODE]
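Spot-checking the AVX2 : AVX-512 timing ratio at a few FFT lengths, using the msec/iter values from the two listings above:

```shell
# Ratio of AVX2 to AVX-512 msec/iter at three FFT lengths.
awk 'BEGIN {
  printf "1024K %.2f\n",  43.18 / 10.27
  printf "2560K %.2f\n", 115.50 / 29.84
  printf "4608K %.2f\n", 217.14 / 59.24
}'
```

i.e. roughly 3.7x-4.2x, consistent with "about four times slower".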

I double-checked to make sure there were no running jobs in the background.

GP2 2017-06-04 12:22

[QUOTE=ewmayer;460489]./Mlucas -fftlen 2560 -iters 100 -radset 0 -cpu 0[/QUOTE]

[CODE]
NTHREADS = 1
M49005071: using FFT length 2560K = 2621440 8-byte floats.
this gives an average 18.693951034545897 bits per digit
Using complex FFT radices 160 16 16 32
mers_mod_square: Complex-roots arrays have 1024, 1280 elements.
Mers_mod_square: Init threadpool of 1 threads
radix16_dif_dit_pass pfetch_dist = 4096
radix16_wrapper_square: pfetch_dist = 4096
Using 1 threads in carry step
100 iterations of M49005071 with FFT length 2621440 = 2560 K
Res64: 07EFE3EF1F78E763. AvgMaxErr = 0.257589286. MaxErr = 0.312500000. Program: E17.0
Res mod 2^36 = 64952526691
Res mod 2^35 - 1 = 22407816581
Res mod 2^36 - 1 = 54111649274
Clocks = 00:00:02.357
Done ...
[/CODE]

[QUOTE]./Mlucas -fftlen 2560 -iters 100 -radset 0 -cpu 0:1[/QUOTE]

[CODE]
NTHREADS = 2
M49005071: using FFT length 2560K = 2621440 8-byte floats.
this gives an average 18.693951034545897 bits per digit
Using complex FFT radices 160 16 16 32
mers_mod_square: Complex-roots arrays have 1024, 1280 elements.
Mers_mod_square: Init threadpool of 2 threads
radix16_dif_dit_pass pfetch_dist = 4096
radix16_wrapper_square: pfetch_dist = 4096
Using 2 threads in carry step
100 iterations of M49005071 with FFT length 2621440 = 2560 K
Res64: 07EFE3EF1F78E763. AvgMaxErr = 0.257589286. MaxErr = 0.312500000. Program: E17.0
Res mod 2^36 = 64952526691
Res mod 2^35 - 1 = 22407816581
Res mod 2^36 - 1 = 54111649274
Clocks = 00:00:01.216
Done ...
[/CODE]


Hmmmmmm....

That was for the AVX-512, the version with -march and -mtune.

I retried running worktodo.ini with ./Mlucas -cpu 0:1 versus the version with no option flag, and [B]instead of 29.3–29.4 ms/iter it's down to 18.8–18.9 ms/iter[/B]. Wow.

I'm a little confused here because I'm pretty sure that on Google Cloud, 2 vCPUs = 1 actual core. Looking at /proc/cpuinfo, I think these lines confirm it:

[CODE]
processor : 0
physical id : 0
siblings : 2
core id : 0
cpu cores : 1

processor : 1
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
[/CODE]
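The arithmetic behind that reading of cpuinfo, with the values from the excerpt above: 'siblings' counts logical CPUs per physical package and 'cpu cores' counts physical cores per package, so their quotient is hardware threads per core.

```shell
# siblings / cpu-cores = hardware threads per physical core;
# values taken from the cpuinfo excerpt above.
siblings=2
cores=1
echo "threads per core: $((siblings / cores))"
```

Two threads on one core confirms that the 2-vCPU instance is a single hyperthreaded core.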

I don't think it's worth repeating the exercise with AVX2 since it seems to perform so much worse than AVX-512, but let me know if it still matters.

GP2 2017-06-04 13:11

Meanwhile, mprime 29.1 on the same box:

[B]EDIT: oops, this is a misleading comparison, the below figures are for a 2400K FFT, not 2560K.

As mentioned earlier, 2560K FFT in the benchmarks takes about 17.6 ms/iter[/B]


with CoresPerTest=1

[CODE]
[Work thread Jun 4 12:32] Iteration: 230000 / 45106307 [0.50%], ms/iter: 16.682, ETA: 8d 15:56
[Work thread Jun 4 12:35] Iteration: 240000 / 45106307 [0.53%], ms/iter: 16.697, ETA: 8d 16:05
[Work thread Jun 4 12:37] Iteration: 250000 / 45106307 [0.55%], ms/iter: 16.676, ETA: 8d 15:47
[Work thread Jun 4 12:40] Iteration: 260000 / 45106307 [0.57%], ms/iter: 16.684, ETA: 8d 15:49
[/CODE]

and with CoresPerTest=2

[CODE]
[Work thread Jun 4 12:46] Iteration: 280000 / 45106307 [0.62%], ms/iter: 16.678, ETA: 8d 15:39
[Work thread Jun 4 12:49] Iteration: 290000 / 45106307 [0.64%], ms/iter: 16.667, ETA: 8d 15:29
[Work thread Jun 4 12:52] Iteration: 300000 / 45106307 [0.66%], ms/iter: 16.697, ETA: 8d 15:48
[Work thread Jun 4 12:54] Iteration: 310000 / 45106307 [0.68%], ms/iter: 16.660, ETA: 8d 15:18
[/CODE]

There is very little if any difference between the two. I didn't try HyperthreadLL=1 since this seems to be more or less deprecated in version 29.

The slight variability from one set of 10000 to the next is explained by the fact that on a virtual machine, other users are sharing the same physical server, and among other things, they're competing for the L3 cache.

In any case, this confirms that the n1-highcpu-2 virtual machine type with "2 vCPUs" really is only one core. In other words, the same obfuscation as on AWS.

But the main revelation here is that Mlucas is competitive with mprime on this platform. The difference between 18.8 ms/iter and 16.8 ms/iter is only about 12%. And playing with compiler flags or tinkering further with the code might yield more improvements, perhaps more readily than George can tinker with assembler to implement AVX-512 for mprime LL testing (he's already said that it will take some time).

So it might be worthwhile to try out Mlucas on Google Cloud on Skylake. It's only 1.5 cents per hour for a preemptible instance. If there's interest, I might try to write a how-to guide. And hopefully real soon Amazon AWS will be ready with their c5 instances (also Skylake).

Now to wait for the exponents to run to completion, hopefully they will yield good verified results.
