Go Back   mersenneforum.org > Factoring Projects > Msieve

2018-08-13, 18:04   #23
CRGreathouse

Quote (Originally posted by VictordeHolland):
Still it is a big enough improvement (10-15%) to justify recompiling on any machine running NFS post-processing jobs.
That's huge!
2018-08-19, 02:44   #24
jasonp

Merged into the trunk, try switching away from the branch and see if anything explodes.
2018-08-19, 05:42   #25
frmky

In the LACUDA branch, there's still a bug when using MPI. LA completes successfully, but the dependencies can't be read by the sqrt code.

Also, let's add Tom's 512-bit version. I'll try it with icc on Skylake soon.
2018-08-19, 17:17   #26
jasonp

It might be a problem postprocessing the iteration result; that went through a lot of churn in the branch to make it easier to substitute GPU vectors. What does the square root complain about?
2019-07-05, 21:50   #27
fivemack

Successful build with zmm registers; slower than ymm

OK, I've built gcc-9.1, built msieve with it, and discovered -mprefer-vector-width=512, which produces a binary that actually uses ZMM registers for mul_packed_core:

Code:
  454e6e:       8d 78 01                lea    0x1(%rax),%edi
  454e71:       4c 8d 3c b9             lea    (%rcx,%rdi,4),%r15
  454e75:       41 0f b7 3f             movzwl (%r15),%edi
  454e79:       45 0f b7 7f 02          movzwl 0x2(%r15),%r15d
  454e7e:       48 c1 e7 06             shl    $0x6,%rdi
  454e82:       49 c1 e7 06             shl    $0x6,%r15
  454e86:       62 b1 fe 48 6f 14 3a    vmovdqu64 (%rdx,%r15,1),%zmm2
  454e8d:       48 01 df                add    %rbx,%rdi
  454e90:       62 f1 ed 48 ef 07       vpxorq (%rdi),%zmm2,%zmm0
  454e96:       62 f1 fe 48 7f 07       vmovdqu64 %zmm0,(%rdi)
On Skylake Xeon Silver, this is a lot slower than the VBITS=256 version: about 56% of its speed. I will try it on an i9 system with the dual AVX-512 pipes later.
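For readers without a disassembler handy, the three vector instructions above (vmovdqu64 / vpxorq / vmovdqu64) amount to a single 512-bit XOR-accumulate of a table entry into a destination vector. A plain-C sketch of that operation, using the v_t/VWORDS naming from msieve's linear algebra code (the exact definitions here are illustrative):

```c
#include <stdint.h>

#define VWORDS 8   /* 8 x 64-bit words = one 512-bit (ZMM) vector */

typedef struct { uint64_t w[VWORDS]; } v_t;

/* XOR-accumulate one table entry into a destination vector: the
   scalar equivalent of the vmovdqu64 / vpxorq / vmovdqu64 triple
   in the disassembly above. */
static void v_xor_acc(v_t *dst, const v_t *src)
{
    for (int i = 0; i < VWORDS; i++)
        dst->w[i] ^= src->w[i];
}
```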

One question for jasonp: in lanczos_matmul1.c and lanczos_matmul2.c, why do we divide by VWORDS in

Code:
		#ifdef MANUAL_PREFETCH
		PREFETCH(entries + i + 48 / VWORDS);
		#endif
The 48 is a count of *matrix index records*, and the idea is to have the indices for the next-but-three pass through the unrolled loop already fetched by the time that pass starts, so I don't see why we would want to prefetch less far ahead just because the vectors are wider.
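The point can be sketched in isolation: the records are fixed-size index entries, so a lookahead of 48 records is 48 records no matter how wide the vector type is. This is a hypothetical loop shape (sum_entries and PREFETCH_DIST are made up for illustration; msieve's real loop differs), not the actual matmul code:

```c
#include <stdint.h>

#define PREFETCH_DIST 48   /* lookahead in *records*, not vector words */

/* Walk an array of fixed-size index records, prefetching a constant
   record distance ahead.  Because the record size does not depend on
   the vector width, the distance should not be divided by VWORDS. */
static uint32_t sum_entries(const uint16_t *entries, int n)
{
    uint32_t sum = 0;
    for (int i = 0; i < n; i++) {
        /* prefetch instructions do not fault, so running a little
           past the end of the array here is harmless */
        __builtin_prefetch(entries + i + PREFETCH_DIST);
        sum += entries[i];
    }
    return sum;
}
```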
2019-07-06, 00:25   #28
Prime95

Quote (Originally posted by fivemack):
On Skylake Xeon Silver, this is a lot slower than the VBITS=256 version; 56% the speed. I will try on an i9 system with the dual AVX-512 pipes later.
LLR is also slower using AVX-512 on CPUs with only one AVX-512 execution unit. See https://www.primegrid.com/forum_thre...ap=true#129729
2019-07-06, 21:46   #29
fivemack

I'm having another look at VBITS=512. It's a lot slower than VBITS=256 even when built with zmm and running on a two-AVX512-unit machine.

The perf profile (admittedly on a one-AVX512-unit machine) looks like

Code:
  40.88%  msieve   msieve              [.] core_NxB_BxB_acc 
  22.47%  msieve   msieve              [.] core_BxN_NxB    
  18.96%  msieve   msieve              [.] mul_packed_core 
  15.34%  msieve   msieve              [.] mul_trans_packed_core
which is obviously awful.

The problem is that the compiler generates awful code for the heavily-unrolled loops in lanczos_vv.c. To be more specific, the compiler decides not to inline v_xor, and then finds that it has to copy the large v_t structure onto the stack in order to call v_xor, which it does using sets of eight 64-bit loads and stores.

I've re-rolled the loops in the hope that modern gcc will be able to unroll them by the right amount, and am running some more timings overnight.
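One conventional fix for the copy problem described above is to force inlining and pass v_t by pointer, so the compiler never has to materialize the struct on the stack at a call boundary. A sketch only, under the assumption that the call sites can be changed to match; msieve's actual v_xor signature may differ:

```c
#include <stdint.h>

#define VWORDS 8
typedef struct { uint64_t w[VWORDS]; } v_t;

/* Pass-by-pointer, always-inline variant: no v_t value ever crosses
   a call boundary, so there is nothing for the compiler to spill as
   eight 64-bit loads and stores. */
static inline __attribute__((always_inline))
void v_xor(v_t *r, const v_t *a, const v_t *b)
{
    for (int i = 0; i < VWORDS; i++)
        r->w[i] = a->w[i] ^ b->w[i];
}
```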

2019-07-07, 15:04   #30
fivemack

I've been unable to convince the compiler to generate reasonable code, so I've sent a problem report to the gcc mailing list and written something by hand using AVX-512 intrinsics. The profile now looks more like what I expect:

Code:
  40.49%  msieve-V512-PF4  msieve-V512-PF48                      [.] mul_one_block.isra.0
  27.95%  msieve-V512-PF4  msieve-V512-PF48                      [.] mul_trans_one_block.isra.0
  17.43%  msieve-V512-PF4  msieve-V512-PF48                      [.] mul_BxN_NxB
  11.15%  msieve-V512-PF4  msieve-V512-PF48                      [.] core_NxB_BxB_acc
but it's still about 25% slower than the best VBITS=256 binary ... multiplying by the 512x512 bit-matrix is decidedly cache-unfriendly, with an access to a random entry in each of the 64 256x512-bit slices of the precomputed product-by-byte table. I'm seeing whether I can do better using streaming-load intrinsics so the cache doesn't get so completely flattened.
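The cache-unfriendliness is easy to quantify from the figures above: 64 slices, each 256 entries of one VBITS-bit vector. A back-of-envelope calculation (table_bytes is a made-up helper, and the layout is assumed to be exactly the slicing described above):

```c
#include <stdint.h>

/* Size of the precomputed product-by-byte table for a VBITS x VBITS
   bit-matrix multiply: one byte-indexed slice per input byte
   (VBITS/8 slices), 256 entries per slice, VBITS/8 bytes per entry. */
static uint64_t table_bytes(unsigned vbits)
{
    uint64_t slices        = vbits / 8;  /* 64 slices when VBITS = 512 */
    uint64_t slice_entries = 256;        /* one entry per byte value   */
    uint64_t entry_bytes   = vbits / 8;  /* one VBITS-bit vector       */
    return slices * slice_entries * entry_bytes;
}
```

So going from VBITS=256 to VBITS=512 quadruples the table from 256 KiB to 1 MiB, which no longer fits comfortably in L1 or much of L2 while random entries of every slice are being touched, consistent with the slowdown reported above.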