[QUOTE=TheJudger;214895]
sieve_table_ could be aligned (I think). I really mean [1-4] but this can be changed, too. ... If helpful for alignment: I could move the content of sieve_table_[0] to e.g. sieve_table_[8] and then it would be [0..3] and [4..7]. [/QUOTE] Having this aligned on a 16-byte boundary will make a big difference in performance. According to Agner Fog's [URL="http://www.agner.org/optimize/instruction_tables.pdf"]instruction tables[/URL], MOVDQA has a fixed latency of 3 cycles and a reciprocal throughput of 1. MOVDQU (for unaligned memory) has a variable latency and a reciprocal throughput anywhere from 4 to 13 times slower. If I understand this correctly, you'll be using two MOVDQAs (or MOVDQUs): one to bring in sieve_table and another to write out ktab. Alignment should help a lot.
Hi Kevin,
without the if it is still faster than the "old method" but slower than the current method, which includes the if. Moving the if to the 3rd or 5th position is a little bit slower, too. Having sieve_table_[0] in a separate array is at least worth a try!

Hi bsquared, I'm pretty sure that I know the ideas of SIMD and I know why aligned memory accesses are better than unaligned ones. I can look at instruction tables and understand basically what the ops do, but I'm unable to write inline assembly code. :sad: So help is needed :smile:

Oliver
[QUOTE=TheJudger;214961]
Hi bsquared, I'm pretty sure that I know the ideas of SIMD and I know why aligned memory accesses are better than unaligned ones. I can look at instruction tables and understand basically what the ops do, but I'm unable to write inline assembly code. :sad: [/quote] Sorry, I didn't know what level of background knowledge to assume. [QUOTE=TheJudger;214961] So help is needed :smile: Oliver[/QUOTE] Well, this (after modification for alignment): [CODE] ktab[k ]=ic+sieve_table_[0]; ktab[k+1]=ic+sieve_table_[1]; ktab[k+2]=ic+sieve_table_[2]; ktab[k+3]=ic+sieve_table_[3]; [/CODE] might become something like this in gcc, also assuming that ktab is 16-byte aligned: [CODE] __asm__ ( "pshufd 0x00, (%0), %%xmm1 \n\t" /* broadcast ic to xmm1 */ "paddd (%1), %%xmm1 \n\t" /* add 4 copies of ic to 4 entries of sieve_table */ "movdqa %%xmm1, (%2) \n\t" /* mov result into 4 locations of ktab */ : : "r"(&ic), "r"(sieve_table_), "r"(&ktab[k]) : "xmm1", "memory"); [/CODE] Note also that paddd does not set a carry flag and wraps around on overflow, so this is only useful if you can guarantee there will be no overflow in any of the adds. Since the original code doesn't account for that, I assume this is the case.
Hi bsquared,
thank you! :smile: [CODE] __asm__ ( "pshufd $0x00, (%0), %%xmm1\n\t" /* broadcast ic to xmm1 */ "paddd (%1), %%xmm1\n\t" /* add 4 copies of ic to 4 entries of sieve_table */ "movdqu %%xmm1, (%2)\n\t" /* mov result into 4 locations of ktab */ : : "r"(&ic), "r"(sieve_table_), "r"(&ktab[k]) : "xmm1", "memory"); [/CODE] Took me a few minutes to figure out that there was a missing $ sign. :confused: And I must use movdqu because ktab[k] can't be aligned on a 16-byte boundary all the time. I haven't integrated this into mfaktc yet; this was just a small test for myself to see how it works. The wrap-around of paddd is not a problem: the results are < 2^24 unless something went really wrong.

Oliver
OK, a quicky in mfaktc code:
- putting sieve_table_[0] in a separate array: no effect on runtime
- using the asm code posted before instead of [CODE] ktab[k ]=ic+sieve_table_[0]; ktab[k+1]=ic+sieve_table_[1]; ktab[k+2]=ic+sieve_table_[2]; ktab[k+3]=ic+sieve_table_[3];[/CODE] reached only ~87% of the C code's performance. :sad:

Oliver
[QUOTE=TheJudger;214961]Hi Kevin,
without the if it is still faster than the "old method" but slower than the current method which includes the if. Moving the if to the 3rd or 5th position is a little bit slower, too. Having sieve_table_[0] in a separate array is at least worth a try! [/QUOTE] Oh, my comments were about what happens after you'd converted to SIMD code. I doubt they would make a difference in the C code (unless the compiler is somehow automatically vectorizing the code for you, which is rare but possible). More thoughts tomorrow when I get a chance to look at the ASM code. Kevin
[quote=TheJudger;214979]
Took me a few minutes to figure out that there is a missing $ sign. :confused: [/quote] Sorry about that! I haven't tried to build your code at all, so that ASM snippet was untested when I posted it.
[quote=TheJudger;214980]OK, a quicky in mfaktc code:
- putting sieve_table_[0] in a separate array: no effect on runtime
- using the asm code posted before instead of [code] ktab[k ]=ic+sieve_table_[0]; ktab[k+1]=ic+sieve_table_[1]; ktab[k+2]=ic+sieve_table_[2]; ktab[k+3]=ic+sieve_table_[3];[/code] reached only ~87% of the C code's performance. :sad: Oliver[/quote]

This is maybe not too surprising: the C code's independent adds can be done three per clock tick, and the compiler no doubt interleaves the reads and writes to hide latency, whereas the three SSE2 instructions form a single dependency chain. Also, the variable latency of the last instruction in that chain likely disrupts the pipelining of subsequent writes. Here is where movdqa might really help, if ktab could somehow be made aligned.
I tried this code (v 0.06-hack-2). I'm using 64-bit linux with gcc 4.3.2 and CUDA 3.0 on a 2.4GHz Core 2 Quad with a GTX 480. With
./mfaktc.exe 3321932839 66 71 I'm only getting about 38 million/sec. However, running 3 copies simultaneously gives about 31 million/sec for each, totaling about 93 million/sec. Here's a snippet of the output of one of 3 copies running simultaneously. BTW, a minor cosmetic bug... The GTX 480 has 32 cores/multiprocessor, not 8, so the number of "shader cores" is incorrect. [CODE]
[cluster@node01 0.06]$ ./mfaktc.exe 3321932839 66 71
mfaktc v0.06

Compiletime Options
  THREADS_PER_GRID   983040
  THREADS_PER_BLOCK  256
  SIEVE_SIZE_LIMIT   32kiB
  SIEVE_SIZE         230945bits
  USE_PINNED_MEMORY  enabled
  USE_ASYNC_COPY     enabled
  VERBOSE_TIMING     disabled
  SELFTEST           disabled
  MORE_CLASSES       disabled

Runtime Options
  SievePrimes        25000
  SievePrimesAdjust  2
  CudaStreams        2

CUDA device info
  name: GeForce GTX 480
  compute capabilities: 2.0
  maximum threads per block: 1024
  number of multiprocessors: 15 (120 shader cores)
  clock rate: 1401MHz

tf(3321932839, 66, 71);
k_min = 11106030600
k_max = 355392982921
class 0: tested 150405120 candidates in 5730ms (26248712/sec) (avg. wait: 11849usec)
class 5: tested 150405120 candidates in 5704ms (26368359/sec) (avg. wait: 11821usec)
sp = 52500, min = 5739
sp = 26250, min = 0, max = 100000, prev = 52500 5739
class 9: tested 160235520 candidates in 5835ms (27461100/sec) (avg. wait: 15431usec)
class 12: tested 160235520 candidates in 5267ms (30422540/sec) (avg. wait: 11899usec)
sp = 26250, min = 5285
sp = 13125, min = 0, max = 52500, prev = 26250 5285
class 20: tested 170065920 candidates in 5552ms (30631469/sec) (avg. wait: 14334usec)
class 21: tested 170065920 candidates in 5434ms (31296635/sec) (avg. wait: 13653usec)
sp = 13125, min = 5442
sp = 32812, min = 13125, max = 52500, prev = 26250 0
class 29: tested 157286400 candidates in 5386ms (29202822/sec) (avg. wait: 11929usec)
class 32: tested 157286400 candidates in 5376ms (29257142/sec) (avg. wait: 11907usec)
sp = 32812, min = 5397
sp = 22968, min = 13125, max = 52500, prev = 32812 5397
class 36: tested 162201600 candidates in 5448ms (29772687/sec) (avg. wait: 13131usec)
class 41: tested 162201600 candidates in 5260ms (30836806/sec) (avg. wait: 12016usec)
sp = 22968, min = 5275
sp = 18046, min = 13125, max = 32812, prev = 22968 5275
class 44: tested 165150720 candidates in 5323ms (31025872/sec) (avg. wait: 12653usec)
class 56: tested 165150720 candidates in 5277ms (31296327/sec) (avg. wait: 12394usec)
sp = 18046, min = 5289
sp = 25429, min = 18046, max = 32812, prev = 22968 0
class 57: tested 160235520 candidates in 5242ms (30567630/sec) (avg. wait: 11922usec)
class 60: tested 160235520 candidates in 5243ms (30561800/sec) (avg. wait: 11908usec)
sp = 25429, min = 5258
sp = 21737, min = 18046, max = 32812, prev = 25429 5258
class 65: tested 162201600 candidates in 5284ms (30696744/sec) (avg. wait: 12329usec)
class 69: tested 162201600 candidates in 5217ms (31090971/sec) (avg. wait: 12030usec)
sp = 21737, min = 5231
sp = 19891, min = 18046, max = 25429, prev = 21737 5231
class 72: tested 164167680 candidates in 5276ms (31115936/sec) (avg. wait: 12254usec)
class 77: tested 164167680 candidates in 5276ms (31115936/sec) (avg. wait: 12195usec)
sp = 19891, min = 5289
sp = 22660, min = 19891, max = 25429, prev = 21737 0
[/CODE]
Hi frmky,
[QUOTE=frmky;215107]I tried this code (v 0.06-hack-2). I'm using 64-bit linux with gcc 4.3.2 and CUDA 3.0 on a 2.4GHz Core 2 Quad with a GTX 480. With ./mfaktc.exe 3321932839 66 71 I'm only getting about 38 million/sec. However, running 3 copies simultaneously gives about 31 million/sec for each, totaling about 93 million/sec. Here's a snippet of the output of one of 3 copies running simultaneously. BTW, a minor cosmetic bug... The GTX 480 has 32 cores/multiprocessor, not 8, so the number of "shader cores" is incorrect. [/QUOTE] OK, with 3 copies the GPU performance is as expected; I have (limited) access to a GTX 480, too. Your performance with a single copy seems to be too low (I have no idea why). Btw.: did you modify the compile script to enable sm_20 code? I've noticed that when the GPU part was compiled for compute capability 1.x, the GF100 chip just ignores the code and does nothing! I know that you're familiar with CUDA, so I think you've adjusted this. Just in case: one possibility is to add "-arch=sm_20" to the nvcc command line in the compile script. (My todo list contains a check for this problem...) The number of shader cores is already fixed in the (unfinished, unreleased) 0.07 version. If you want to spend some time: rerun with more CudaStreams (3, 4 or 5) and perhaps with SievePrimesAdjust=1. I never ran Kevin's modified versions of 0.06 myself (they didn't compile on my system out of the box). Of course I've read his modifications and implemented some of his stuff in 0.07.

Oliver
I did not need to enable sm_20 code. It ran fine with the default sm_10 code, and found appropriate factors. (The GF100 cannot run binary code generated by CUDA 2.x, but can run PTX code from 2.x or binary code from CUDA 3.0 for any sm.) I did a little playing around. Adjusting CudaStreams above 2 had little effect on the speed. SievePrimesAdjust=2 was significantly faster than 1, but both were much slower than expected so I'm not sure what is happening there.
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.