#12
(loop (#_fork))
Feb 2006
Cambridge, England
7²×131 Posts
Some timings for a Skylake Xeon machine, using 10 cores, manual prefetch and 256-bit vector size, for the whole Lanczos step for a 2.61M matrix
Code:
la_block  time
1024      4901
2048      4140
4096      3769
8192      3928
16384     4288
With la_block=4096, manual prefetch and 256-bit vectors, I'm getting less than 130 hours expected runtime for a 20.83M matrix.
I strongly recommend that people doing linear algebra for nfs@home, particularly on the many-core Xeon machines that a number of them have recently picked up, try some different la_block values. I'm doing an la_block sweep on the Ivy Bridge machine tomorrow.
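A sweep like the one above is easy to rank with a few lines of Python. This is only a sketch: the numbers are copied from the table in this post, the unit of the time column is not stated there, and `rank_block_sizes` is a made-up helper name, not anything from msieve.

```python
# Timings copied from the table above: la_block -> time for the whole Lanczos step.
# The unit is not stated in the post, so only ratios are computed here.
timings = {1024: 4901, 2048: 4140, 4096: 3769, 8192: 3928, 16384: 4288}


def rank_block_sizes(timings):
    """Return (la_block, time, slowdown_vs_best) tuples, fastest first."""
    best = min(timings.values())
    return sorted(
        ((block, t, t / best) for block, t in timings.items()),
        key=lambda row: row[1],
    )


ranking = rank_block_sizes(timings)
for block, t, ratio in ranking:
    print(f"la_block={block:<5}  time={t}  {ratio:.2f}x best")
```

On these numbers the fastest setting is la_block=4096 and the slowest (1024) is about 30% slower, which is why a quick sweep on each machine seems worth the effort.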
#13
"Curtis"
Feb 2005
Riverside, CA
4,861 Posts
Thanks for the advice!
I tried two la_block settings on my new LA run, 145_89 from 15e. The matrix is 11.9M with TD 136, and I have an old-ish precompiled msieve running on a 10-core Sandy Bridge (or Ivy Bridge? I don't know) Xeon.
la_block=4096: ETA 102h after 1% of the job was complete.
Restarting with -nc2 (not -ncr) and la_block=8192: ETA 89h after 1%.
The default on this msieve binary is 8192 on this chip. I didn't think to try 16384 until the matrix was 5% complete, and didn't care to restart. If you find 256-bit vectors are faster on your Ivy Bridge, I may attempt to compile my own msieve for this machine.
Last fiddled with by VBCurtis on 2018-08-10 at 19:27
#14
"Victor de Hollander"
Aug 2011
the Netherlands
2230₈ Posts
Stupid question, but how do you compile Msieve with the longer vectors and/or with the manual prefetch? Is it a compile flag like "-O3 -march=native"?
On an Ubuntu box I usually just do:
Code:
make all ECM=1 NO_ZLIB=0
or on a Windows box I use MSYS2/mingw64 with:
Code:
make all WIN=1 WIN64=1 ECM=1 NO_ZLIB=0
I've got GCC 8.2.0 in MSYS2/mingw64, so I would like to give it a shot too:
Code:
Victor@PCVICTOR MINGW64 ~
$ gcc -v
Using built-in specs.
COLLECT_GCC=C:\msys64\mingw64\bin\gcc.exe
COLLECT_LTO_WRAPPER=C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.2.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../gcc-8.2.0/configure --prefix=/mingw64 --with-local-prefix=/mingw64/local --build=x86_64-w64-mingw32 --host=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --with-native-system-header-dir=/mingw64/x86_64-w64-mingw32/include --libexecdir=/mingw64/lib --enable-bootstrap --with-arch=x86-64 --with-tune=generic --enable-languages=ada,c,lto,c++,objc,obj-c++,fortran --enable-shared --enable-static --enable-libatomic --enable-threads=posix --enable-graphite --enable-fully-dynamic-string --enable-libstdcxx-filesystem-ts=yes --enable-libstdcxx-time=yes --disable-libstdcxx-pch --disable-libstdcxx-debug --disable-isl-version-check --enable-lto --enable-libgomp --disable-multilib --enable-checking=release --disable-rpath --disable-win32-registry --disable-nls --disable-werror --disable-symvers --with-libiconv --with-system-zlib --with-gmp=/mingw64 --with-mpfr=/mingw64 --with-mpc=/mingw64 --with-isl=/mingw64 --with-pkgversion='Rev1, Built by MSYS2 project' --with-bugurl=https://sourceforge.net/projects/msys2 --with-gnu-as --with-gnu-ld
Thread model: posix
gcc version 8.2.0 (Rev1, Built by MSYS2 project)
#15
(loop (#_fork))
Feb 2006
Cambridge, England
7²·131 Posts
Check out the msieve-lacuda branch (svn co http://svn.code.sf.net/p/msieve/code.../msieve-lacuda)
Edit makefile to add '-DMANUAL_PREFETCH' after -DVBITS=$(VBITS)
Edit makefile to set CC=/wherever/you/installed/gcc-8.2.0/bin/gcc
Then build with make VBITS=256 all to get 256-bit vectors (64 or 128 also work)
If you use an older gcc, the VBITS setting will still work, but it will produce code that doesn't use long-vector instructions and is significantly slower.
#16
(loop (#_fork))
Feb 2006
Cambridge, England
7²×131 Posts
Code:
VBITS \ la_block  4096  8192  16384
64                2.92  3.15  3.33
128               2.77  4.22  3.89
256               3.00  3.93  3.75
#17
"Victor de Hollander"
Aug 2011
the Netherlands
2³×3×7² Posts
Thanks for the instructions, looks like I'll be using a new Msieve executable for my next NFS@home post-processing job.
Code:
read 7232348 cycles
matrix is 7232171 x 7232348 (2213.5 MB) with weight 685597573 (94.80/col)
sparse part has weight 493466507 (68.23/col)
msieve was run with: -nc2 -t 4
I ran the trunk version for the entire LA phase as a benchmark to compare the different versions against. The others were each run for about an hour, and I divided the dimensions done by the elapsed time to get dim/day. Of course this matrix is much smaller, so the results can't be compared directly to fivemack's.
Code:
Version                                       Elapsed   Dim done  Dim/day  Compared to trunk
Msieve1022-trunk                              23:54:45  7232123   7258587  1.00x
Msievelacuda1022-VBITS=64                     01:02:20  308519    7127284  0.98x
Msievelacuda1022-VBITS=128                    01:51:21  607933    7861909  1.08x
Msievelacuda1022-VBITS=256                    01:03:30  317035    7189455  0.99x
Msievelacuda1022-VBITS=64 -DMANUAL_PREFETCH   01:01:56  323898    7530890  1.04x
Msievelacuda1022-VBITS=128 -DMANUAL_PREFETCH  01:01:14  351660    8269849  1.14x
Msievelacuda1022-VBITS=256 -DMANUAL_PREFETCH  01:01:54  314466    7315526  1.01x
Code:
using block size 8192 and superblock size 294912 for processor cache size 6144 kB
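For anyone wanting to reproduce the Dim/day column from their own runs, the calculation described in this post (dimensions done divided by elapsed time, scaled to a day) can be sketched as follows. `dims_per_day` is just an illustrative helper name, and the inputs are two rows copied from the table.

```python
def dims_per_day(elapsed, dims_done):
    """Convert an elapsed time 'HH:MM:SS' and a dimension count to dim/day."""
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    elapsed_seconds = hours * 3600 + minutes * 60 + seconds
    return dims_done * 86400 / elapsed_seconds


# Two rows from the table: the trunk baseline and VBITS=128 with manual prefetch.
trunk = dims_per_day("23:54:45", 7232123)
prefetch128 = dims_per_day("01:01:14", 351660)

print(round(trunk), round(prefetch128), f"{prefetch128 / trunk:.2f}x")
```

Both values round to the figures in the table, and their ratio gives the 1.14x in the last column.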
#18
(loop (#_fork))
Feb 2006
Cambridge, England
1100100010011₂ Posts
Quote:
la_superblock is determined as 3/4 of the largest-level cache, which I think makes sense. I haven't tried varying it; you're welcome to try and see if anything interesting happens :)
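As a quick arithmetic check of the 3/4 rule against the log line posted earlier in the thread (superblock size 294912 for a 6144 kB cache), here is a small sketch. The unit of the superblock size is not stated anywhere in this thread, so the factor that falls out is only an observation about these two numbers, not a statement about msieve's source; `three_quarters_of_cache_bytes` is a made-up helper name.

```python
def three_quarters_of_cache_bytes(cache_kb):
    """Return 3/4 of a cache size given in kB, expressed in bytes."""
    return cache_kb * 1024 * 3 // 4


budget = three_quarters_of_cache_bytes(6144)  # cache size from the quoted log line
reported_superblock = 294912                  # superblock size from the same log line

# The ratio hints at how many bytes one superblock unit would correspond to
# if the 3/4 rule is applied exactly (an inference, not confirmed here).
print(budget, budget / reported_superblock)
```

The 3/4 budget comes to 4718592 bytes, which is exactly 16 times the reported superblock size, so the log line is at least consistent with the rule.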
#19
(loop (#_fork))
Feb 2006
Cambridge, England
1913₁₆ Posts
Because you have to rebuild the matrix for the longer vectors, I found it important to time between 'commencing Lanczos iteration' and 'lanczos halted' - my runs varied by up to twenty minutes in the time spent in the 'read %d relations' phase, because I'm reading rather a lot of relations off a busy NFS server, and that variability swamped the linear-algebra timing variation.
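A rough way to pull just that interval out of a log is sketched below. Treat the details as assumptions: the sample log lines are invented for illustration, and the timestamp layout ("Fri Aug 10 08:20:00 2018" at the start of each line) should be checked against a real log file before relying on it. `lanczos_seconds` is a hypothetical helper, not part of msieve.

```python
from datetime import datetime

# Hypothetical log excerpt; the timestamp layout is an assumption and the
# iteration count is invented. Adjust both to match the real log file.
SAMPLE_LOG = """\
Fri Aug 10 08:00:00 2018  commencing linear algebra
Fri Aug 10 08:20:00 2018  commencing Lanczos iteration (4 threads)
Fri Aug 10 09:25:30 2018  lanczos halted after 5000 iterations (dim = 317035)
"""

STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"
STAMP_LEN = len("Fri Aug 10 08:00:00 2018")


def lanczos_seconds(log_text):
    """Seconds between 'commencing Lanczos iteration' and 'lanczos halted'."""
    start = end = None
    for line in log_text.splitlines():
        if "commencing Lanczos iteration" in line:
            start = datetime.strptime(line[:STAMP_LEN], STAMP_FORMAT)
        elif "lanczos halted" in line and start is not None:
            end = datetime.strptime(line[:STAMP_LEN], STAMP_FORMAT)
            break
    if start is None or end is None:
        raise ValueError("Lanczos start or halt marker not found")
    return (end - start).total_seconds()


print(lanczos_seconds(SAMPLE_LOG))
```

Timing only this window leaves out the relation-reading and matrix-building phases, which is the part that varies with disk or network load.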
#20
"Victor de Hollander"
Aug 2011
the Netherlands
2³×3×7² Posts
#21
Tribal Bullet
Oct 2004
6725₈ Posts
Looks like I should be pulling the non-CUDA changes in that branch to the trunk for a wider audience.
#22
"Victor de Hollander"
Aug 2011
the Netherlands
2³·3·7² Posts
As long as it doesn't break older compilers :). We don't want to force people to upgrade to gcc-8.
Maybe something in the makefile like:
# for long vectors (VBITS) gcc-X or later is required
# VBITS options are 64/128/256, larger values are not always beneficial
# e.g. Sandy Bridge and Ivy Bridge are fastest with VBITS=128