mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

bsquared 2010-05-13 15:35

[QUOTE=TheJudger;214895]
sieve_table_ could be aligned (I think). I really mean [1-4] but this can be changed, too.
...
If helpful for alignment: I could move the content of sieve_table_[0] to e.g. sieve_table_[8] and then it would be [0..3] and [4..7].
[/QUOTE]

Having this aligned on a 16-byte boundary will make a big difference in performance. According to Agner Fog's [URL="http://www.agner.org/optimize/instruction_tables.pdf"]instruction tables[/URL], MOVDQA has a fixed latency of 3 cycles and a reciprocal throughput of 1. MOVDQU (for unaligned memory) has a variable latency and a reciprocal throughput anywhere from 4 to 13 times slower. If I understand this correctly, you'll be using two MOVDQAs (or MOVDQUs): one to read in sieve_table and another to write out ktab. Alignment should help a lot.

TheJudger 2010-05-13 16:28

Hi Kevin,

without the if it is still faster than the "old method" but slower than the current method, which includes the if. Moving the if to the 3rd or 5th position is a little bit slower, too.

Having sieve_table_[0] in a separate array is at least worth a try!


Hi bsquared,
I'm pretty sure that I know the ideas of SIMD and I know why aligned memory accesses are better than unaligned ones. I can look at instruction tables and understand basically what the ops do, but I'm unable to write inline assembly code. :sad:
So help is needed :smile:

Oliver

bsquared 2010-05-13 17:35

[QUOTE=TheJudger;214961]

Hi bsquared,
I'm pretty sure that I know the ideas of SIMD and I know why aligned memory accesses are better than unaligned ones. I can look at instruction tables and understand basically what the ops do, but I'm unable to write inline assembly code. :sad:

[/quote]
Sorry, I didn't know what level of background knowledge to assume.

[QUOTE=TheJudger;214961]
So help is needed :smile:

Oliver[/QUOTE]

Well, this (after modification for alignment):
[CODE]
ktab[k ]=ic+sieve_table_[0];
ktab[k+1]=ic+sieve_table_[1];
ktab[k+2]=ic+sieve_table_[2];
ktab[k+3]=ic+sieve_table_[3];
[/CODE]

might become something like this in gcc, also assuming that ktab is 16-byte aligned:
[CODE]
__asm__ (
"pshufd 0x00, (%0), %%xmm1 \n\t" /* broadcast ic to xmm1 */
"paddd (%1), %%xmm1 \n\t" /* add 4 copies of ic to 4 entries of sieve_table */
"movdqa %%xmm1, (%2) \n\t" /* mov result into 4 locations of ktab */
:
: "r"(&ic), "r"(sieve_table_), "r"(&ktab[k])
: "xmm1", "memory");
[/CODE]

Note also that paddd does not set a carry flag, and wraps around on overflow. So this is only useful if you can guarantee there will be no overflow in any of the adds. Since the original code doesn't account for it, I assume this is the case.

TheJudger 2010-05-13 22:58

Hi bsquared,

thank you! :smile:

[CODE]
__asm__ (
"pshufd $0x00, (%0), %%xmm1\n\t" /* broadcast ic to xmm1 */
"paddd (%1), %%xmm1\n\t" /* add 4 copies of ic to 4 entries of sieve_table */
"movdqu %%xmm1, (%2)\n\t" /* mov result into 4 locations of ktab */
:
: "r"(&ic), "r"(sieve_table_), "r"(&ktab[k])
: "xmm1", "memory");
[/CODE]
Took me a few minutes to figure out that there is a missing $ sign. :confused:
And I must use movdqu because ktab[k] can't be aligned on a 16-byte boundary all the time.
I haven't integrated this into mfaktc yet; this was just a small test for myself to see how it works.
The wraparound of paddd is not a problem: the results are < 2^24 unless something went really wrong.

Oliver

TheJudger 2010-05-13 23:56

OK, a quick test in the mfaktc code:

- putting sieve_table_[0] in a separate array: no effect on runtime
- using the asm code posted before instead of [CODE]
ktab[k ]=ic+sieve_table_[0];
ktab[k+1]=ic+sieve_table_[1];
ktab[k+2]=ic+sieve_table_[2];
ktab[k+3]=ic+sieve_table_[3];[/CODE]
reached only ~87% of the performance of the C code. :sad:

Oliver

kjaget 2010-05-14 00:04

[QUOTE=TheJudger;214961]Hi Kevin,

without the if it is still faster than the "old method" but slower than the current method, which includes the if. Moving the if to the 3rd or 5th position is a little bit slower, too.

Having sieve_table_[0] in a separate array is at least worth a try!
[/QUOTE]

Oh, my comments were about after you'd converted to SIMD code. I doubt they would make a difference in the C code (unless the compiler is somehow auto-vectorizing it for you, which is rare but possible).

More thoughts tomorrow when I get a chance to look at the ASM code.

Kevin

bsquared 2010-05-14 03:35

[quote=TheJudger;214979]

Took me a few minutes to figure out that there is a missing $ sign. :confused:
[/quote]

Sorry about that! I haven't tried to build your code at all, so that ASM snippet was untested when I posted it.

bsquared 2010-05-14 03:42

[quote=TheJudger;214980]OK, a quicky in mfaktc code:

- putting sieve_table_[0]: no effect on runtime
- using the asm code posted before instead of [code]
ktab[k ]=ic+sieve_table_[0];
ktab[k+1]=ic+sieve_table_[1];
ktab[k+2]=ic+sieve_table_[2];
ktab[k+3]=ic+sieve_table_[3];[/code]
reached only ~87% performance of the c code. :sad:

Oliver[/quote]

This is maybe not too surprising, considering that the C code's independent adds can be done 3 per clock tick and the compiler no doubt interleaves the reads and writes to hide latency, whereas the three SSE2 instructions are all in one dependency chain. Also, the variable latency of the last instruction in that chain likely disrupts the pipelining of subsequent writes.

This is where movdqa might really help, if ktab could somehow be made aligned.

frmky 2010-05-16 02:36

I tried this code (v 0.06-hack-2). I'm using 64-bit linux with gcc 4.3.2 and CUDA 3.0 on a 2.4GHz Core 2 Quad with a GTX 480. With
./mfaktc.exe 3321932839 66 71
I'm only getting about 38 million/sec. However, running 3 copies simultaneously gives about 31 million/sec for each, totaling about 93 million/sec. Here's a snippet of the output of one of 3 copies running simultaneously. BTW, a minor cosmetic bug... The GTX 480 has 32 cores/multiprocessor, not 8, so the number of "shader cores" is incorrect.

[CODE][cluster@node01 0.06]$ ./mfaktc.exe 3321932839 66 71
mfaktc v0.06
Compiletime Options
THREADS_PER_GRID 983040
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

Runtime Options
SievePrimes 25000
SievePrimesAdjust 2
CudaStreams 2

CUDA device info
name: GeForce GTX 480
compute capabilities: 2.0
maximum threads per block: 1024
number of multiprocessors: 15 (120 shader cores)
clock rate: 1401MHz

tf(3321932839, 66, 71);
k_min = 11106030600
k_max = 355392982921
class 0: tested 150405120 candidates in 5730ms (26248712/sec) (avg. wait: 11849usec)
class 5: tested 150405120 candidates in 5704ms (26368359/sec) (avg. wait: 11821usec)
sp = 52500, min = 5739
sp = 26250, min = 0, max = 100000, prev = 52500 5739
class 9: tested 160235520 candidates in 5835ms (27461100/sec) (avg. wait: 15431usec)
class 12: tested 160235520 candidates in 5267ms (30422540/sec) (avg. wait: 11899usec)
sp = 26250, min = 5285
sp = 13125, min = 0, max = 52500, prev = 26250 5285
class 20: tested 170065920 candidates in 5552ms (30631469/sec) (avg. wait: 14334usec)
class 21: tested 170065920 candidates in 5434ms (31296635/sec) (avg. wait: 13653usec)
sp = 13125, min = 5442
sp = 32812, min = 13125, max = 52500, prev = 26250 0
class 29: tested 157286400 candidates in 5386ms (29202822/sec) (avg. wait: 11929usec)
class 32: tested 157286400 candidates in 5376ms (29257142/sec) (avg. wait: 11907usec)
sp = 32812, min = 5397
sp = 22968, min = 13125, max = 52500, prev = 32812 5397
class 36: tested 162201600 candidates in 5448ms (29772687/sec) (avg. wait: 13131usec)
class 41: tested 162201600 candidates in 5260ms (30836806/sec) (avg. wait: 12016usec)
sp = 22968, min = 5275
sp = 18046, min = 13125, max = 32812, prev = 22968 5275
class 44: tested 165150720 candidates in 5323ms (31025872/sec) (avg. wait: 12653usec)
class 56: tested 165150720 candidates in 5277ms (31296327/sec) (avg. wait: 12394usec)
sp = 18046, min = 5289
sp = 25429, min = 18046, max = 32812, prev = 22968 0
class 57: tested 160235520 candidates in 5242ms (30567630/sec) (avg. wait: 11922usec)
class 60: tested 160235520 candidates in 5243ms (30561800/sec) (avg. wait: 11908usec)
sp = 25429, min = 5258
sp = 21737, min = 18046, max = 32812, prev = 25429 5258
class 65: tested 162201600 candidates in 5284ms (30696744/sec) (avg. wait: 12329usec)
class 69: tested 162201600 candidates in 5217ms (31090971/sec) (avg. wait: 12030usec)
sp = 21737, min = 5231
sp = 19891, min = 18046, max = 25429, prev = 21737 5231
class 72: tested 164167680 candidates in 5276ms (31115936/sec) (avg. wait: 12254usec)
class 77: tested 164167680 candidates in 5276ms (31115936/sec) (avg. wait: 12195usec)
sp = 19891, min = 5289
sp = 22660, min = 19891, max = 25429, prev = 21737 0
[/CODE]

TheJudger 2010-05-16 14:37

Hi frmky,

[QUOTE=frmky;215107]I tried this code (v 0.06-hack-2). I'm using 64-bit linux with gcc 4.3.2 and CUDA 3.0 on a 2.4GHz Core 2 Quad with a GTX 480. With
./mfaktc.exe 3321932839 66 71
I'm only getting about 38 million/sec. However, running 3 copies simultaneously gives about 31 million/sec for each, totaling about 93 million/sec. Here's a snippet of the output of one of 3 copies running simultaneously. BTW, a minor cosmetic bug... The GTX 480 has 32 cores/multiprocessor, not 8, so the number of "shader cores" is incorrect.
[/QUOTE]

OK, with 3 copies the GPU performance is as expected; I have (limited) access to a GTX 480, too. Your performance with a single copy seems to be too low (I have no idea why).
Btw.: did you modify the compile script to enable sm_20 code? I've noticed that when the GPU part is compiled for compute capability 1.x, the GF100 chip just ignores the code and does nothing! I know that you're familiar with CUDA, so I think you've adjusted this.
Just in case: one possibility is to add "-arch=sm_20" to the nvcc command line in the compile script. (My todo list contains a check for this problem...)

The number of shader cores is already fixed in the (unfinished, unreleased) 0.07 version.

If you want to spend some time: rerun with more CudaStreams (3, 4 or 5) and perhaps with SievePrimesAdjust=1.

I never ran Kevin's modified versions of 0.06 myself (they didn't compile on my system out of the box). Of course I've read his modifications and implemented some of his stuff in 0.07.

Oliver

frmky 2010-05-16 17:42

I did not need to enable sm_20 code. It ran fine with the default sm_10 code and found the appropriate factors. (The GF100 cannot run binary code generated by CUDA 2.x, but it can run PTX code from 2.x or binary code from CUDA 3.0 for any sm.) I did a little playing around. Adjusting CudaStreams above 2 had little effect on the speed. SievePrimesAdjust=2 was significantly faster than 1, but both were much slower than expected, so I'm not sure what is happening there.

