mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Msieve

Old 2018-08-09, 23:29   #12
fivemack

Some timings for a Skylake Xeon machine, using 10 cores, manual prefetch and a 256-bit vector size, for the whole Lanczos step on a 2.61M matrix:

Code:
la_block  time
 1024     4901
 2048     4140
 4096     3769
 8192     3928
16384     4288
So that's quite a significant effect: the machine has 1 MB of L2 cache per core, which could hold 32768 256-bit vector words, yet the fastest runtime comes from a significantly smaller block size.
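A back-of-envelope check of that cache arithmetic (a sketch; the 32-byte word size is just VBITS/8 for 256-bit vectors, and the "room for the rest of the working set" reading is one plausible explanation, not something stated in the post):

```python
# How many 256-bit (32-byte) vector words fit in the 1 MB per-core L2,
# versus the la_block values tried above.
l2_bytes = 1 << 20            # 1 MB L2 per core
word_bytes = 256 // 8         # one 256-bit vector word
capacity = l2_bytes // word_bytes
print(capacity)               # 32768

# The fastest la_block (4096) touches only 128 kB of vector data per
# block, an eighth of L2, leaving room for the rest of the working set.
print(4096 * word_bytes // 1024)  # 128 (kB)
```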

With la_block=4096, manual prefetch and 256-bit vectors, I'm getting less than 130 hours expected runtime for a 20.83M matrix.

I strongly recommend that people doing linear algebra for nfs@home, particularly on the many-core Xeon machines that a number of them have recently picked up, try some different la_block values.

I'm doing an la_block sweep on the Ivy Bridge machine tomorrow.
Old 2018-08-10, 19:24   #13
VBCurtis
 

Thanks for the advice!

I tried two la_block settings on my new LA run, 145_89 from 15e. The matrix is 11.9M with TD 136, and I have an old-ish precompiled msieve running on a SB (IB? I don't know) 10-core Xeon.
la_block=4096: ETA 102h after 1% of the job was complete.
Restarting with -nc2 (not -ncr) and la_block=8192: ETA 89h after 1%.
The default in this msieve binary is 8192 on this chip.
I didn't think to try 16384 until the matrix was 5% complete, and didn't care to restart.
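For reference, the two ETAs above imply roughly a 15% speedup from the block-size change alone (a rough figure, since both ETAs were read at only 1% done):

```python
# Speedup implied by the two ETAs quoted above (102h vs 89h)
eta_4096_h = 102
eta_8192_h = 89
print(f"{(eta_4096_h / eta_8192_h - 1) * 100:.0f}% faster")  # 15% faster
```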
If you find 256bit vectors are faster on your Ivy Bridge, I may attempt to compile my own msieve for this machine.

Old 2018-08-10, 21:38   #14
VictordeHolland
 

Stupid question, but how do you compile Msieve with the longer vectors and/or the manual prefetch? Is it a compile flag like "-O3 -march=native"?

On a Ubuntu box I usually just do:

Code:
make all ECM=1 NO_ZLIB=0
and leaving the "-march=native" in the makefile should let gcc do the majority of the optimization work.

or on a Windows box I use MSYS2/mingw64 with:
Code:
 make all WIN=1 WIN64=1 ECM=1 NO_ZLIB=0
I usually change the "-march=native" to the respective architecture then (otherwise mingw sometimes uses generic).


I've got GCC 8.2.0 in MSYS2/mingw64, so I would like to give it a shot as well:

Code:
Victor@PCVICTOR MINGW64 ~
$ gcc -v
Using built-in specs.
COLLECT_GCC=C:\msys64\mingw64\bin\gcc.exe
COLLECT_LTO_WRAPPER=C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.2.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../gcc-8.2.0/configure --prefix=/mingw64 --with-local-prefix=/mingw64/local --build=x86_64-w64-mingw32 --host=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --with-native-system-header-dir=/mingw64/x86_64-w64-mingw32/include --libexecdir=/mingw64/lib --enable-bootstrap --with-arch=x86-64 --with-tune=generic --enable-languages=ada,c,lto,c++,objc,obj-c++,fortran --enable-shared --enable-static --enable-libatomic --enable-threads=posix --enable-graphite --enable-fully-dynamic-string --enable-libstdcxx-filesystem-ts=yes --enable-libstdcxx-time=yes --disable-libstdcxx-pch --disable-libstdcxx-debug --disable-isl-version-check --enable-lto --enable-libgomp --disable-multilib --enable-checking=release --disable-rpath --disable-win32-registry --disable-nls --disable-werror --disable-symvers --with-libiconv --with-system-zlib --with-gmp=/mingw64 --with-mpfr=/mingw64 --with-mpc=/mingw64 --with-isl=/mingw64 --with-pkgversion='Rev1, Built by MSYS2 project' --with-bugurl=https://sourceforge.net/projects/msys2 --with-gnu-as --with-gnu-ld
Thread model: posix
gcc version 8.2.0 (Rev1, Built by MSYS2 project)
Old 2018-08-11, 20:20   #15
fivemack

Check out the msieve-lacuda branch (svn co http://svn.code.sf.net/p/msieve/code.../msieve-lacuda)

Edit the makefile to add '-DMANUAL_PREFETCH' after -DVBITS=$(VBITS)
Edit the makefile to set CC=/wherever/you/installed/gcc-8.2.0/bin/gcc

Then build with
Code:
make VBITS=256 all
to get 256-bit vectors (64 or 128 also work).

If you use an older gcc, the VBITS setting will still work, but it will produce code that doesn't use long-vector instructions and is significantly slower.
Old 2018-08-11, 20:22   #16
fivemack
la_block sweep on IVB machine

Code:
VBITS \ la_block   4096  8192  16384
 64                2.92  3.15   3.33
128                2.77  4.22   3.89
256                3.00  3.93   3.75
So block size 8192 looks optimal for 128-bit and 256-bit vectors, and I should run la_block=32768 for 64-bit when the machine is next idle.
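Transcribing the sweep into code makes the conclusion easy to check. The units aren't stated in the post, but the conclusions only make sense if the figures are a throughput (higher is better), so that's the reading assumed here:

```python
# la_block sweep on the Ivy Bridge machine: keys of the outer dict are
# VBITS (vector width in bits), keys of the inner dicts are la_block;
# values transcribed from the table above, read as throughput.
sweep = {
    64:  {4096: 2.92, 8192: 3.15, 16384: 3.33},
    128: {4096: 2.77, 8192: 4.22, 16384: 3.89},
    256: {4096: 3.00, 8192: 3.93, 16384: 3.75},
}
best = {vbits: max(row, key=row.get) for vbits, row in sweep.items()}
print(best)  # {64: 16384, 128: 8192, 256: 8192}
```

For VBITS=64 the figure is still rising at the edge of the sweep, hence the plan to try la_block=32768 next.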
Old 2018-08-12, 21:15   #17
VictordeHolland
 
Sandy likes VBITS=128 -DMANUAL_PREFETCH

Thanks for the instructions; it looks like I'll be using a new Msieve executable for my next NFS@home post-processing job.

Code:
read 7232348 cycles
matrix is 7232171 x 7232348 (2213.5 MB) with weight 685597573 (94.80/col)
 sparse part has weight 493466507 (68.23/col)
This is an Intel Core i5 2500K @ 4.0 GHz (Sandy Bridge) running Win10. Compiling was done with MSYS2/mingw64 using gcc 8.2.0.



msieve was run with: -nc2 -t 4
I ran the trunk version for the entire LA phase as a benchmark to compare the different versions against. Those were each run for about an hour, and I divided the dimensions done by the elapsed time to get dim/day. Of course this matrix is much smaller, so it can't be compared directly to fivemack's.
Code:
                                               Elapsed    Dim done   Dim/day  Compared to trunk
Msieve1022-trunk                               23:54:45   7232123    7258587   1.00x

Msievelacuda1022-VBITS=64                      01:02:20    308519    7127284   0.98x
Msievelacuda1022-VBITS=128                     01:51:21    607933    7861909   1.08x
Msievelacuda1022-VBITS=256                     01:03:30    317035    7189455   0.99x
    
Msievelacuda1022-VBITS=64  -DMANUAL_PREFETCH   01:01:56    323898    7530890   1.04x
Msievelacuda1022-VBITS=128 -DMANUAL_PREFETCH   01:01:14    351660    8269849   1.14x
Msievelacuda1022-VBITS=256 -DMANUAL_PREFETCH   01:01:54    314466    7315526   1.01x
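As a sanity check of the dim/day method described above (dimensions done divided by elapsed time, scaled to a day), here's the VBITS=128 -DMANUAL_PREFETCH row recomputed:

```python
# dim/day = dimensions completed / elapsed seconds * 86400,
# using the VBITS=128 -DMANUAL_PREFETCH row from the table above.
elapsed_s = 1 * 3600 + 1 * 60 + 14   # 01:01:14
dims_done = 351660
print(round(dims_done * 86400 / elapsed_s))  # 8269849
```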
The i5 has 256 KB L2 cache per core and 6 MB shared L3 cache. With VBITS=128 and -DMANUAL_PREFETCH it uses:
Code:
 using block size 8192 and superblock size 294912 for processor cache size 6144 kB
Should I try different (manual) la_block and/or la_superblock sizes?
Old 2018-08-13, 07:18   #18
fivemack

Quote:
Originally Posted by VictordeHolland
The i5 has 256KB L2 cache per core and 6MB shared L3 cache. With VBITS=128 and DMANUAL_PREFETCH it uses:
Code:
 using block size 8192 and superblock size 294912 for processor cache size 6144 kB
Should I try different (manual) la_block and/or la_superblock sizes?
You can if you want, but I think the L2 architecture on your machine is very similar to that of my Ivy Bridge machine, on which I didn't see improvements from changing la_block away from the default for 128-bit vectors.

la_superblock is determined as 3/4 of the largest-level cache, which I think makes sense; I haven't tried varying it, but you're welcome to try and see if anything interesting happens :)
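The 3/4 rule reproduces the log line quoted above exactly, if the superblock size is counted in VBITS-bit words (an inference from the numbers, not something stated in the post):

```python
# "using ... superblock size 294912 for processor cache size 6144 kB"
# with VBITS=128, i.e. 16-byte vector words.
cache_bytes = 6144 * 1024        # 6 MB shared L3
word_bytes = 128 // 8            # one 128-bit vector word
superblock = (cache_bytes * 3 // 4) // word_bytes
print(superblock)  # 294912
```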
Old 2018-08-13, 07:22   #19
fivemack

Quote:
Originally Posted by VictordeHolland
Those were run for about an hour and I divided the dim done by the time to get to dim/day.
Because you have to rebuild the matrix for the longer vectors, I found it important to time between 'commencing Lanczos iteration' and 'lanczos halted': my runs varied by up to twenty minutes in the time spent in the 'read %d relations' phase, because I'm reading rather a lot of relations off a busy NFS server, and that variability swamped the linear-algebra timing variation.
Old 2018-08-13, 10:01   #20
VictordeHolland
 

Quote:
Originally Posted by fivemack
I found it important to time between 'commencing Lanczos iteration' and 'lanczos halted'
I already timed it that way.

Still, it is a big enough improvement (10-15%) to justify recompiling on any machine running NFS post-processing jobs.
Old 2018-08-13, 11:41   #21
jasonp

Looks like I should be pulling the non-CUDA changes in that branch to the trunk for a wider audience.
Old 2018-08-13, 14:50   #22
VictordeHolland
 

As long as it doesn't break older compilers :). We don't want to force people to upgrade to gcc-8.

Maybe something in the makefile like:
Code:
# for long vectors (VBITS) gcc-X or later is required
# VBITS options are 64/128/256; larger values are not always beneficial
# e.g. Sandy Bridge and Ivy Bridge are fastest with VBITS=128
