 2017-11-07, 17:57 #1 GP2     Sep 2003 29×89 Posts c5 instances are now available https://aws.amazon.com/blogs/aws/now...or-amazon-ec2/ They run on 3.0 GHz Intel Xeon Platinum 8000-series processors, which have AVX-512. Amazon claims a 25% price/performance improvement over c4. Many technical details will be provided at AWS re:Invent at the end of this month. They are not available yet in us-east-2 (Ohio), which usually has the cheapest spot prices.
 2017-11-07, 18:00 #2 Batalov     "Serge" Mar 2008 Phi(4,2^7658614+1)/2 22·2,281 Posts Cool! Just 1 year has passed since they announced that they will be deploying these "soon" - and there! The naysayers were shamed.
 2017-11-07, 18:29 #3 Mark Rose     "/X\(‘-‘)/X\" Jan 2013 23·359 Posts I'm benchmarking a c5.large, c5.xlarge, c5.2xlarge, and a c5.18xlarge now. The mprime 29.4 isn't setting affinities properly on the c5.18xlarge.
 2017-11-07, 19:12 #7 Mark Rose     "/X\(‘-‘)/X\" Jan 2013 23×359 Posts
c5.18xlarge

I wouldn't trust these as affinities weren't being set properly due to a bug. I've removed the error output. Because the affinities are messed, I decided not to benchmark all the FFTs to find the fastest.

[Worker #1 Nov 7 18:28] Timing 2048K all-complex FFT, 36 cores, 1 worker. Average times: 1.25 ms. Total throughput: 797.24 iter/sec.
[Worker #1 Nov 7 18:28] Timing 2048K all-complex FFT, 36 cores, 2 workers. Average times: 1.08, 1.28 ms. Total throughput: 1706.70 iter/sec.
[Worker #1 Nov 7 18:29] Timing 2048K all-complex FFT, 36 cores, 4 workers. Average times: 2.22, 2.22, 0.84, 1.41 ms. Total throughput: 2799.02 iter/sec.
[Worker #1 Nov 7 18:29] Timing 2048K all-complex FFT, 36 cores, 36 workers. Average times: 34.21, 33.66, 33.35, 33.35, 33.83, 33.93, 33.23, 33.96, 32.88, 33.05, 33.53, 33.75, 33.47, 34.09, 33.56, 33.74, 33.43, 33.99, 8.53, 8.49, 8.76, 8.58, 8.67, 8.67, 8.76, 8.67, 8.74, 19.16, 18.99, 19.29, 19.05, 18.99, 19.06, 19.10, 19.04, 19.16 ms. Total throughput: 2047.25 iter/sec.
[Worker #1 Nov 7 18:43] Timing 4096K all-complex FFT, 36 cores, 1 worker. Average times: 2.01 ms. Total throughput: 498.58 iter/sec.
[Worker #1 Nov 7 18:43] Timing 4096K all-complex FFT, 36 cores, 2 workers. Average times: 2.15, 2.15 ms. Total throughput: 930.42 iter/sec.
[Worker #1 Nov 7 18:43] Timing 4096K all-complex FFT, 36 cores, 4 workers. Average times: 6.26, 6.28, 1.45, 3.59 ms. Total throughput: 1285.26 iter/sec.
[Worker #1 Nov 7 18:44] Timing 4096K all-complex FFT, 36 cores, 36 workers. Average times: 66.77, 69.46, 66.12, 69.89, 67.33, 68.40, 66.98, 69.07, 68.37, 69.44, 68.03, 68.91, 67.69, 69.14, 66.91, 69.37, 67.80, 70.73, 18.66, 18.45, 18.44, 18.66, 18.64, 18.64, 18.74, 18.73, 18.65, 37.09, 37.40, 36.93, 37.02, 37.17, 37.54, 37.17, 37.14, 37.33 ms. Total throughput: 988.67 iter/sec.
[Worker #1 Nov 7 18:46] Timing 8192K all-complex FFT, 36 cores, 1 worker. Average times: 3.06 ms. Total throughput: 326.87 iter/sec.
[Worker #1 Nov 7 18:47] Timing 8192K all-complex FFT, 36 cores, 2 workers. Average times: 6.13, 3.93 ms. Total throughput: 417.28 iter/sec.
[Worker #1 Nov 7 18:47] Timing 8192K all-complex FFT, 36 cores, 4 workers. Average times: 15.32, 15.17, 3.53, 8.20 ms. Total throughput: 536.48 iter/sec.
[Worker #1 Nov 7 18:47] Timing 8192K all-complex FFT, 36 cores, 36 workers. Average times: 137.61, 139.49, 134.09, 136.52, 135.62, 140.54, 137.09, 141.09, 137.18, 139.40, 136.97, 139.04, 134.56, 137.79, 135.77, 138.37, 136.77, 139.72, 37.54, 37.76, 37.79, 37.67, 38.11, 37.71, 37.95, 37.74, 37.68, 74.31, 73.64, 74.08, 74.22, 74.43, 73.80, 74.92, 74.60, 74.30 ms. Total throughput: 490.27 iter/sec.
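The "Total throughput" figures in the log above look consistent with summing each worker's rate, where a worker averaging t ms per iteration contributes 1000/t iter/sec. That relationship is an inference from the numbers, not something the log states, and the small differences (800.00 computed against 797.24 reported for one worker) are presumably due to rounding of the displayed times. A minimal sketch, where `throughput` is a hypothetical helper name:

```shell
# Hypothetical helper: sum per-worker iteration rates.
# Assumes a worker averaging t ms per iteration runs 1000/t iter/sec.
throughput() {
    printf '%s\n' "$@" | awk '{ sum += 1000 / $1 } END { printf "%.2f\n", sum }'
}

throughput 1.25        # 2048K FFT, 1 worker: prints 800.00
throughput 1.08 1.28   # 2048K FFT, 2 workers: prints 1707.18
```

Feeding in the 36-worker times the same way makes the imbalance visible: the workers in the 8 to 9 ms group contribute several times the rate of those in the 33 to 34 ms group.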
 2017-11-08, 00:35 #8 GP2     Sep 2003 29·89 Posts c5.large seems to be about 30% faster than c4.large, using mprime 29.4b3. I didn't do a proper benchmark; I just started two LL tests at the same time, for nearly identical exponents in the 47.09M range, both in new subdirectories. The c5.large subdirectory had HyperthreadLL=1 in local.txt, as recommended by Mark Rose; the c4.large subdirectory did not, since my own earlier tests indicated that it doesn't help. Note that mprime has not yet been modified to use AVX-512 instructions, so further speed improvements may be available. Mlucas v17 does use AVX-512, but there's a compile error at the moment... Last fiddled with by GP2 on 2017-11-08 at 02:35
 Originally Posted by GP2 c5.large seems to be about 30% faster than c4.large, using mprime 29.4b3. I didn't do a proper benchmark; I just started two LL tests at the same time, for nearly identical exponents in the 47.09M range, both in new subdirectories. The c5.large subdirectory had HyperthreadLL=1 in local.txt, as recommended by Mark Rose; the c4.large subdirectory did not, since my own earlier tests indicated that it doesn't help. Note that mprime has not yet been modified to use AVX-512 instructions, so further speed improvements may be available. Mlucas v17 does use AVX-512, but there's a compile error at the moment...
c4.large already makes use of FMA3 instructions, and should run at 2.9 GHz: where does that 30% come from, if not from AVX-512? Does the new platform handle memory latency/throughput better?

 Originally Posted by ET_ c4.large already makes use of FMA3 instructions, and should run at 2.9 GHz: where does that 30% come from, if not from AVX-512? Does the new platform handle memory latency/throughput better?
The new platform has six rather than four memory channels, and 1MB rather than 256kB L2 caches.

 2017-11-09, 06:36 #11 GP2     Sep 2003 29·89 Posts
Compiling code in Amazon Linux on c5 instances

First of all, if you use Amazon Linux, you should use version 2017.09 or later. The instance launch page should propose this as one of the options, but if not, the AMI IDs are listed here for the various regions. In this table, we care mostly about the first column, because for c4 or c5 instances you can only use HVM (not PV) and EBS-Backed (not Instance Store), as shown in the type matrix.

By default, Amazon Linux only supplies a minimum set of packages. If you want a compiler, you have to install it. As described in the Preparing to Compile Software documentation page, you can install the compiler and associated tools with the command

Code:
sudo yum groupinstall "Development Tools"

But... this will install gcc version 4.8.5, which is very old, and doesn't know how to optimize for Skylake. However, as described in the Amazon Linux AMI 2017.09 Release Notes, gcc version 6.4 is available as a separate download:

Code:
sudo yum install gcc64

You have to invoke this compiler using gcc64, because the default gcc will be the 4.8.5 version. Furthermore, you should invoke the compiler with the -march=skylake-avx512 flag to generate code that takes advantage of Skylake. This is documented in the man gcc64 page.

For example, to compile Mlucas, as described in the README page, you would:

Fetch http://www.mersenneforum.org/mayer/src/C/mlucas_v17.txz and then run:

Code:
tar xJf mlucas_v17.txz
cd mlucas_v17/src
mkdir obj
cd obj
gcc64 -c -O3 -DUSE_AVX512 -DUSE_THREADS -march=skylake-avx512 ../*.c >& build.log
grep -i error build.log
gcc64 -o Mlucas *.o -lm -lpthread -lrt

While hyperthreading did not benefit programs on c4 instances, it is beneficial on c5 instances. So to create an mlucas.cfg file:

Code:
./Mlucas -s m -iters 1000 -cpu 0:1 >& selftest.log

(this will take a long time to complete).
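A binary built with -DUSE_AVX512 and -march=skylake-avx512 only makes sense on a CPU that supports AVX-512, so it can be worth checking before building or launching on a given instance. The following is a minimal sketch, not part of the Mlucas instructions: `has_avx512` is a hypothetical helper that looks for the `avx512f` flag the Linux kernel lists in /proc/cpuinfo.

```shell
# Hypothetical helper: succeed if the given cpuinfo file
# (default /proc/cpuinfo) lists the avx512f CPU flag.
has_avx512() {
    grep -qw avx512f "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_avx512; then
    echo "AVX-512 found: the -DUSE_AVX512 build should be usable here"
else
    echo "no AVX-512 flag found: this is probably not a c5 instance"
fi
```

The optional file argument exists only so the check can be exercised against a saved copy of cpuinfo from another machine.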
Once the self-test completes, copy the Mlucas executable and the mlucas.cfg file to an empty working directory, and create a worktodo.ini file with the usual Test= or DoubleCheck= lines, which you can get from the Manual Assignment page. Or you can use the primenet.py auxiliary program, as described in the README file, to populate worktodo.ini.

When invoking the program in the working directory, use

Code:
nohup ./Mlucas -cpu 0:1 &

You can monitor the progress of the program by looking at the .stat file corresponding to the exponent being tested. When the test completes, the fourth-last line in the file will contain the Res64 residue, which can be submitted manually via the Manual Results page, or once again by using the primenet.py auxiliary program.
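Pulling out the fourth-last line of the .stat file can be scripted. This is a minimal sketch that relies only on the line position described above; `fourth_last_line` is a hypothetical helper name, and the demo file holds placeholder text rather than real Mlucas output.

```shell
# Hypothetical helper: print the fourth-last line of a file.
# For a finished test's .stat file this should be the Res64 line.
fourth_last_line() {
    tail -n 4 "$1" | head -n 1
}

# Demo on a throwaway file with placeholder content.
demo=$(mktemp)
printf 'line %d\n' 1 2 3 4 5 6 > "$demo"
fourth_last_line "$demo"   # prints "line 3"
rm -f "$demo"
```

In practice you would pass the .stat file for your exponent instead of the demo file. If the file has fewer than four lines, the helper prints its first line, so check that the test has actually finished first.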

