#24 |
|
Tribal Bullet
Oct 2004
3,541 Posts |
Merged into the trunk; try switching away from the branch and see if anything explodes.
#25 |
|
Jul 2003
So Cal
2·3⁴·13 Posts |
In the LACUDA branch, there's still a bug when using MPI. LA completes successfully, but the dependencies can't be read by the sqrt code.
Also, let's add Tom's 512-bit version. I'll try it with icc on Skylake soon. |
#26 |
|
Tribal Bullet
Oct 2004
6725₈ Posts |
It might be a problem postprocessing the iteration result; that went through a lot of churn in the branch to make it easier to substitute GPU vectors. What does the square root complain about?
#27 |
|
(loop (#_fork))
Feb 2006
Cambridge, England
1100100010011₂ Posts |
OK, I've built gcc-9.1 and built msieve with it, and discovered -mprefer-vector-width=512, which produces a binary that actually uses ZMM registers for mul_packed_core:
Code:
  454e6e: 8d 78 01                lea    0x1(%rax),%edi
  454e71: 4c 8d 3c b9             lea    (%rcx,%rdi,4),%r15
  454e75: 41 0f b7 3f             movzwl (%r15),%edi
  454e79: 45 0f b7 7f 02          movzwl 0x2(%r15),%r15d
  454e7e: 48 c1 e7 06             shl    $0x6,%rdi
  454e82: 49 c1 e7 06             shl    $0x6,%r15
  454e86: 62 b1 fe 48 6f 14 3a    vmovdqu64 (%rdx,%r15,1),%zmm2
  454e8d: 48 01 df                add    %rbx,%rdi
  454e90: 62 f1 ed 48 ef 07       vpxorq (%rdi),%zmm2,%zmm0
  454e96: 62 f1 fe 48 7f 07       vmovdqu64 %zmm0,(%rdi)

One question for jasonp: in lanczos_matmul1.c and lanczos_matmul2.c, why do we divide by VWORDS in
Code:
#ifdef MANUAL_PREFETCH
PREFETCH(entries + i + 48 / VWORDS);
#endif |
#29 |
|
(loop (#_fork))
Feb 2006
Cambridge, England
7²·131 Posts |
I'm having another look at VBITS=512. It's a lot slower than VBITS=256 even when built with zmm and running on a two-AVX512-unit machine.

The perf profile (admittedly on a one-AVX512-unit machine) looks like
Code:
40.88%  msieve  msieve  [.] core_NxB_BxB_acc
22.47%  msieve  msieve  [.] core_BxN_NxB
18.96%  msieve  msieve  [.] mul_packed_core
15.34%  msieve  msieve  [.] mul_trans_packed_core

The problem is that the compiler generates awful code for the heavily-unrolled loops in lanczos_vv.c. More specifically, the compiler decides not to inline v_xor, and then finds that it has to copy the large v_t structure onto the stack in order to call v_xor, which it does using sets of eight 64-bit loads and stores. I've re-rolled the loops in the hope that modern gcc will unroll them by the right amount, and am running some more timings overnight.

Last fiddled with by fivemack on 2019-07-06 at 21:47 |
#30 |
|
(loop (#_fork))
Feb 2006
Cambridge, England
7²×131 Posts |
I've been unable to convince the compiler to generate reasonable code, so I've sent a problem report to the gcc mailing list and written something by hand using AVX512 intrinsics. The profile now looks more like I expect:
Code:
40.49%  msieve-V512-PF4  msieve-V512-PF48  [.] mul_one_block.isra.0
27.95%  msieve-V512-PF4  msieve-V512-PF48  [.] mul_trans_one_block.isra.0
17.43%  msieve-V512-PF4  msieve-V512-PF48  [.] mul_BxN_NxB
11.15%  msieve-V512-PF4  msieve-V512-PF48  [.] core_NxB_BxB_acc |
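Not fivemack's actual patch, just a sketch of the shape such a hand-written replacement typically takes: one VBITS=512 xor-accumulate done as a single load/vpxorq/store that stays in a ZMM register when AVX-512 is available, with a portable scalar fallback (v_t and v_xor_acc are illustrative names, not msieve's).

```c
#include <stdint.h>

#define VWORDS 8                           /* VBITS = 512 */
typedef struct { uint64_t w[VWORDS]; } v_t;

#if defined(__AVX512F__)
#include <immintrin.h>
/* dst ^= src as one 512-bit operation: unaligned load, vpxorq,
   unaligned store, never touching the stack. */
static inline void v_xor_acc(v_t *dst, const v_t *src) {
    __m512i a = _mm512_loadu_si512((const void *)dst->w);
    __m512i b = _mm512_loadu_si512((const void *)src->w);
    _mm512_storeu_si512((void *)dst->w, _mm512_xor_si512(a, b));
}
#else
/* Scalar fallback for builds without AVX-512. */
static inline void v_xor_acc(v_t *dst, const v_t *src) {
    for (int i = 0; i < VWORDS; i++)
        dst->w[i] ^= src->w[i];
}
#endif
```

Writing the operation this way sidesteps the inliner entirely: the intrinsic is guaranteed to compile to the vector instruction, so the codegen no longer depends on the compiler's cost model for struct arguments.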