#1
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England
3³×239 Posts
Has anyone got thoughts about the NxB times BxB multiply once VBITS grows big?
My profile at the moment looks like this for the VBITS=256 build:

Code:
26.39%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_BxN_NxB
26.09%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_trans_packed_core
21.01%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_packed_core
17.99%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] core_NxB_BxB_acc
 5.58%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_packed_small_core

and for the VBITS=128 build:

Code:
37.72%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_trans_packed_core
30.40%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_packed_core
14.74%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_BxN_NxB
 7.40%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] core_NxB_BxB_acc
 6.50%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_packed_small_core

The matrix itself is VBITS^2 bits, which is 8kB and fits nicely in L1 for VBITS=256, but is already getting inconvenient (32kB) for VBITS=512. I wonder how much slower it is to do eight times as many accesses, in a perfectly uniform pattern so no address computation, to a table of 1/32 the size? I suppose I should also try VBITS/2 tables of 4*VBITS bits and VBITS/4 tables of 16*VBITS bits, which are 16kB and 32kB for VBITS=256.

Last fiddled with by fivemack on 2021-06-27 at 19:16
#2
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England
3³·239 Posts
This is driving me slightly mad: I made the obvious tweaks to use a 2-bit rather than an 8-bit table, and it slowed down immensely; this turned out to be because the compiler wasn't using YMM operations at all, but was instead carrying the vector around in four 64-bit registers.
When I recoded to use absolutely explicit YMM operations:

Code:
zero = _mm256_xor_si256(zero, zero);
_mm256_zeroupper();
for (i = 0; i < n; i++) {
    accum = zero;
    jig = 0;
    for (j = 0; j < VWORDS; j++) {
        uint64 g = v[i].w[j];
        for (k = 0; k < 64; k += 2) {
            /* XOR in the table entry selected by the next two bits of v[i] */
            accum = _mm256_xor_si256(accum, *(__m256i*)(&(c[jig + (g & 3)])));
            jig += 4;
            g >>= 2;
        }
    }
    __m256i* yy = (__m256i*)(&(y[i]));
    *yy = _mm256_xor_si256(*yy, accum);
}

the compiler still generates VPXOR xmm0,xmm0,xmm0 (note the "x") from the intrinsic

Code:
accum = _mm256_xor_si256(accum,accum);

Last fiddled with by fivemack on 2021-06-26 at 22:23
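For reference, one way to build the 2-bit tables c[] that is consistent with the jig + (g & 3) indexing above: group t covers matrix rows 2t and 2t+1, and entry b of the group is the XOR of the rows selected by the two bits of b. This is only a sketch under that assumption (vec_t and the layout are illustrative, not msieve's code):

Code:
/* Sketch: build the 2-bit lookup tables c[] so that group t covers matrix
   rows 2t and 2t+1, and entry b in the group is the XOR of the rows
   selected by the two bits of b (entry 0 is all-zero). */
#include <stdint.h>
#include <string.h>

#define VBITS  256
#define VWORDS (VBITS / 64)

typedef struct { uint64_t w[VWORDS]; } vec_t;

void build_2bit_tables(vec_t *c, const vec_t *m)   /* c[] has 4*(VBITS/2) entries */
{
    for (int t = 0; t < VBITS / 2; t++) {
        vec_t *grp = &c[4 * t];
        memset(&grp[0], 0, sizeof(vec_t));          /* 00 -> zero            */
        grp[1] = m[2 * t];                          /* 01 -> row 2t          */
        grp[2] = m[2 * t + 1];                      /* 10 -> row 2t + 1      */
        for (int w = 0; w < VWORDS; w++)            /* 11 -> XOR of the two  */
            grp[3].w[w] = m[2 * t].w[w] ^ m[2 * t + 1].w[w];
    }
}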
#3
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England
3³×239 Posts
I am just downloading gcc-11.1 in the hope that it will be equipped with a sufficiently heavy-duty yak razor.
#4
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England
6453₁₀ Posts
Well, that's not actually the cause; the vpxor xmm0,xmm0,xmm0 instruction is in fact defined as clearing the top half of the 256-bit register (whilst the superficially similar legacy-encoded pxor xmm0,xmm0 is defined as leaving the top half intact, which is what causes false-dependency problems when SSE and AVX code are mixed).

Last fiddled with by fivemack on 2021-06-27 at 13:06
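Incidentally, since the xmm-width VEX zero idiom clears the whole register anyway, the explicit self-XOR can be written with the setzero intrinsic instead; compilers typically lower it to exactly that vpxor xmm0,xmm0,xmm0. A small stand-alone example (not taken from msieve):

Code:
/* Sketch: the conventional way to materialise an all-zero ymm register.
   Compilers usually emit the xmm-width VEX zero idiom for this, which
   (as noted above) architecturally clears all 256 bits. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256i z = _mm256_setzero_si256();      /* typically vpxor xmm0,xmm0,xmm0 */
    long long out[4];
    _mm256_storeu_si256((__m256i *)out, z);
    printf("%lld %lld %lld %lld\n", out[0], out[1], out[2], out[3]);
    return 0;
}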
#5
jasonp
Tribal Bullet
Oct 2004
5·709 Posts
If the working set is too large when biting off 8-bit chunks, then another option is to use a larger number of 6-bit chunks. With word size W bits and chunk size C bits, the tables total (W/8) * (W/C) * 2^C bytes.

The msieve-lacuda branch uses C=6 and it does make GPU runs faster, but the stock version uses W=64 only. Greg is spending time generalizing to larger word sizes and folding in MPI so that multiple GPUs can work together. We are also going very carefully through the BxN * NxB vector-vector operation, which is a playground of fun to implement in CUDA.

Last fiddled with by jasonp on 2021-07-02 at 17:07
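A quick stand-alone illustration of that formula (just the arithmetic, not code from either branch):

Code:
/* Print (W/8) * ceil(W/C) * 2^C, the total table footprint in bytes for
   word size W bits and chunk size C bits, as described above. */
#include <stdio.h>

int main(void)
{
    const long Ws[] = { 64, 256, 512 };
    const long Cs[] = { 2, 4, 6, 8 };
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 4; j++) {
            long W = Ws[i], C = Cs[j];
            long chunks = (W + C - 1) / C;          /* ceil(W/C) chunks per word */
            long bytes  = (W / 8) * chunks * (1L << C);
            printf("W=%3ld  C=%ld  ->  %7ld bytes\n", W, C, bytes);
        }
    }
    return 0;
}

For W=256 this gives 16kB at C=2, about 86kB at C=6 and 256kB at C=8.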
#6
Jul 2003
So Cal
5³·19 Posts
I reimplemented the NxB_BxB CUDA kernel to bite off 2 bits at a time and build 4 (or actually 3, since the 00 table isn't needed) lookup arrays in GPU shared memory, rather than doing this on the CPU and uploading the result to the GPU. This gave a 3-5x speedup depending on the GPU.

https://github.com/gchilders/msieve_...czos_kernel.cu

Other than merging the non-lacuda branch changes back into the lacuda branch, and testing the CUDA-aware MPI stuff once I have access to a cluster that supports it, I think I'm mostly done. CUDA and CUDA+MPI both work. As a test I used two (now ancient) Tesla K20 GPUs to solve a 3.36M x 3.36M matrix in 2h22m. Once I have access to them again, probably by Tuesday, I'll test it out on a couple of V100s.

In the CPU code, the explicit unrolling generally needs to be removed. GCC 10 and later don't auto-vectorize the unrolled loops, but are good about detecting and vectorizing the rolled versions. This really helps for ARM SVE. Also, on ARM SVE, adding the option -msve-vector-bits=512 helps significantly. I'm not sure whether there's an equivalent option for AVX2 or AVX-512.
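For comparison on the CPU side, a plain-C sketch of the same "skip the 00 entry" idea with three stored entries per 2-bit group (the real kernel at the link does this per thread block in CUDA shared memory; the names and vec_t layout here are illustrative only):

Code:
/* Sketch: 2-bit chunks with only three stored entries per group
   (row 2t, row 2t+1, and their XOR); the 00 case contributes nothing
   and is simply skipped. */
#include <stdint.h>

#define VBITS  256
#define VWORDS (VBITS / 64)

typedef struct { uint64_t w[VWORDS]; } vec_t;

void mul_NxB_BxB_2bit_3tab(vec_t *y, const vec_t *v, const vec_t *c3, int n)
{
    /* c3[] has 3*(VBITS/2) entries: group t holds row 2t, row 2t+1,
       and their XOR, in that order. */
    for (int i = 0; i < n; i++) {
        vec_t acc = { { 0 } };
        int t = 0;                                  /* 2-bit group index */
        for (int j = 0; j < VWORDS; j++) {
            uint64_t g = v[i].w[j];
            for (int k = 0; k < 64; k += 2, t++, g >>= 2) {
                unsigned b = (unsigned)(g & 3);
                if (b == 0)
                    continue;                       /* 00: nothing to XOR in */
                for (int w = 0; w < VWORDS; w++)
                    acc.w[w] ^= c3[3 * t + (b - 1)].w[w];
            }
        }
        for (int w = 0; w < VWORDS; w++)
            y[i].w[w] ^= acc.w[w];
    }
}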